Welcome to the CCMAR Computational Cluster Facility: GYRA - gyra.ualg.pt

System

The GYRA cluster facility consists of:

  • Frontend: 16-core 2.3GHz 32GB DELL PowerEdge R715
  • compute-0-1: 8-core 2.6GHz 8GB DELL PowerEdge SC1435
  • compute-0-2: 32-core 2.4GHz 64GB DELL PowerEdge R815
  • compute-0-3: 16-core 2.0GHz 32GB DELL PowerEdge R715
  • compute-0-4: 32-core 2.4GHz 64GB DELL PowerEdge R815
  • compute-0-5: 16-core 3.0GHz 128GB DELL PowerEdge R815

The clustering software is Rocks 5.3 (Rolled Tacos) with an Open Grid Scheduler/Sun Grid Engine queuing system. A total of 128 slots/cores are available in the queue. The cluster supports the mpi, mpich, and orte parallel computing environments.

Access/Support

The cluster facility is available to members of CCMAR and collaborators. To request an account on GYRA, or for general enquiries, email Cymon.

(If you are looking for a free-to-access online bioinformatics platform, you could try BioPortal; for phylogenetics, try CIPRES.)

If you are trying to use Microsoft Windows to connect to GYRA, you could try these instructions for installing the necessary software.

Specifying memory requirements

  • All jobs submitted to the cluster must specify the maximum amount of virtual memory to be used (h_vmem). By default this is 2GB.
  • If you need more than 2GB for your job, specify the maximum amount using “-l h_vmem=NG”, where N is an integer > 2.
  • The h_vmem value applies per slot/CPU, so adjust it accordingly if you are running a parallel process.

To submit a job to the queue that requires 11G of memory, use:

[user@gyra admin]$ qsub -l h_vmem=11G job.sh

or include “#$ -l h_vmem=11G” in the submission script.
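Because h_vmem is counted per slot, the total memory a parallel job needs has to be divided by the number of slots requested. The following is a minimal sketch of that arithmetic. The helper name per_slot_vmem and the figures used (22 GB across 4 slots) are illustrative assumptions, not site defaults. The qsub line is only printed, not run.

```shell
#!/bin/sh
# Hypothetical helper: turn a total memory requirement (in GB) into
# a per-slot h_vmem value, rounding up so the slots together cover it.
per_slot_vmem() {
    total_gb=$1
    slots=$2
    echo $(( (total_gb + slots - 1) / slots ))
}

# Example: a 4-slot job that needs about 22 GB in total.
N=$(per_slot_vmem 22 4)

# Print the submission command rather than running it.
echo "qsub -pe orte 4 -l h_vmem=${N}G job.sh"
# prints: qsub -pe orte 4 -l h_vmem=6G job.sh
```

Rounding up is a deliberate choice here: rounding down could leave the job short of memory and cause it to be killed when it reaches the limit.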

Interactive job submission

An interactive session can be requested by issuing the command ‘qrsh’ at the prompt. If a slot is available in the queue, an interactive session will be started on an available node. The session remains active and consumes a single slot until the exit command is issued at the prompt. Do not leave interactive sessions idle, as other users will be deprived of resources.

Batch submission

This is the usual job submission procedure, the advantage being that jobs can be queued and run when resources are available. Jobs are submitted to the queue using the command ‘qsub’, which takes a small shell script describing the requested resources and job configuration. Typically these scripts are very simple; for all available options see the SGE Users Manual.

The following describes a simple job submission script and typical configuration options:

#!/bin/bash

# Give the job a name that will appear in the queue (max 9 chars)
#$ -N myJob

# Request all output be placed in the current working directory
#$ -cwd

# Re-direct the 'standard out' and 'standard error' messages to the single file named log
#$ -o log
#$ -j y

# Request a bash shell
#$ -S /bin/bash

# Send email notification when job (e)nds or is (a)borted
#$ -M myEmail@my.account.domain
#$ -m ea

# Optional arguments
# Specify a named queue and node:
#$ -q all.q@compute-0-1.local
# Specify the maximum required memory per process
#$ -l h_vmem=10G

# IMPORTANT: source your bash shell profile
source ~/.bash_profile

#Run this command:
paup myAnalysis.nex

All lines beginning with #$ are interpreted as options by the SGE queue; other lines beginning with # are comments.

The above script, if in the file named ‘mySub.sh’, would be submitted to the queue using the following command:

  • [user@gyra ~]$ qsub mySub.sh
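When many similar jobs are submitted, the script above can be generated rather than edited by hand. The following is a rough sketch only. The function name make_sub and its argument order are hypothetical. The 9-character check simply follows the comment in the example script. The generated text goes to standard output, so nothing is submitted.

```shell
#!/bin/sh
# Hypothetical helper: print a minimal SGE submission script.
# Usage: make_sub <job_name> <h_vmem_in_GB> <command...>
make_sub() {
    name=$1
    mem=$2
    shift 2
    # The example script above limits job names to 9 characters.
    if [ "${#name}" -gt 9 ]; then
        echo "job name longer than 9 characters: ${name}" >&2
        return 1
    fi
    # In an unquoted here-document, "#$ " and "~" are kept literally.
    cat <<EOF
#!/bin/bash
#$ -N ${name}
#$ -cwd
#$ -o log
#$ -j y
#$ -S /bin/bash
#$ -l h_vmem=${mem}G
source ~/.bash_profile
$*
EOF
}

# Example: print a script for the PAUP run shown above.
make_sub myJob 4 paup myAnalysis.nex
```

Redirecting the output to a file (for example `make_sub myJob 4 paup myAnalysis.nex > mySub.sh`) gives a script that can be passed to qsub as described above.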


Submitting parallel jobs

Some software is parallelised and able to run a single job on multiple processors. To make this work correctly you need to indicate the number of CPUs to the software AND tell the queue how many slots the job will use.

Submission script requesting 2 CPUs (-np 2) with the software command “mb runfile.nex”:

#! /bin/bash
#$ -cwd
[...]
mpirun -np 2 mb runfile.nex

If the above submission script is called “mbsub.sh”, when submitting to the queue request the “orte” parallel environment (“-pe orte”) and 2 slots:

  • [user@gyra ~]$ qsub -pe orte 2 mbsub.sh

Mira assemblies: if you indicate SK:not=4 in your Mira command line (i.e. 4 threads for the SKIM algorithm), submit the job as follows:

  • [user@gyra ~]$ qsub -pe orte 4 my_mira.sh
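The examples above state the CPU count twice, once for the software and once for the queue, and the two can drift apart. Grid Engine exports the number of granted slots to a parallel job as the NSLOTS environment variable, so one possible approach is to derive the software's count from it. The sketch below assumes that variable is set inside the job. The function name mpi_command is hypothetical, and the command is only printed so the snippet can be tried outside the queue.

```shell
#!/bin/sh
# Hypothetical helper: build an mpirun command line whose -np value
# follows the slots granted by the queue (NSLOTS). Outside a queued
# job NSLOTS is normally unset, so fall back to 1.
mpi_command() {
    slots=${NSLOTS:-1}
    echo "mpirun -np ${slots} $*"
}

# Example: the MrBayes run from above. Submitted with "-pe orte 2",
# this would print: mpirun -np 2 mb runfile.nex
mpi_command mb runfile.nex
```

The same variable could be used to set a thread count for multi-threaded programs, so that only the -pe request needs changing when the job is resized.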

Other programmes (e.g. codonPHYML) have OpenMP support, which dynamically adjusts the number of threads used. Consequently, the number of threads must be limited to the number of CPUs requested in the parallel environment: this is done by setting the environment variable ‘OMP_NUM_THREADS’ in the submission script, like so for 12 threads:

[...]

#Parallel environment
#$ -pe orte 12

export OMP_NUM_THREADS=12
<OMP software command to run>

Monitoring the queue

The command ‘qstat’ describes the current status of the queue:

queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
all.q@compute-0-1.local        BIP   0/8/8          9.00     lx26-amd64
    hc:h_vmem=275.720M
  23194 0.55167 smlbrF     cymon        r     07/10/2015 11:29:34     8
---------------------------------------------------------------------------------
all.q@compute-0-2.local        BIP   0/31/32        31.04    lx26-amd64
    hc:h_vmem=0.000
  22502 0.59167 dpplmbrc1  cymon        r     06/09/2015 10:05:18    14
  22505 0.59167 dppmbrc1   cymon        r     06/09/2015 10:11:48    14
  23391 0.60500 muitos_2he leonor       r     07/28/2015 13:27:33     3
---------------------------------------------------------------------------------
all.q@compute-0-3.local        BIP   0/15/16        15.03    lx26-amd64
    hc:h_vmem=1.540G
  23379 0.50500 montipora  regina       r     07/27/2015 12:46:33     1
  23384 0.59167 tmlbr1     cymon        r     07/27/2015 15:33:48    14
---------------------------------------------------------------------------------
all.q@compute-0-4.local        BIP   0/27/32        27.06    lx26-amd64
    hc:h_vmem=7.451G
  22504 0.59167 dpplmbrc2  cymon        r     06/09/2015 10:10:18    14
  23391 0.60500 muitos_2he leonor       r     07/28/2015 13:27:33    13
---------------------------------------------------------------------------------
all.q@compute-0-5.local        BP    0/0/16         4.00     lx26-amd64    d
    hc:h_vmem=84.415G
---------------------------------------------------------------------------------
class@compute-0-5.local        BIP   0/0/12         4.00     lx26-amd64
    hc:h_vmem=84.415G
---------------------------------------------------------------------------------
assembly@compute-0-5.local     BP    0/4/16         4.00     lx26-amd64    d
    hc:h_vmem=84.415G
  23046 0.50500 spCV4r1    cymon        r     07/06/2015 11:33:49     1
  23049 0.50500 spCV4r2    cymon        r     07/06/2015 11:51:34     1
  23051 0.50500 spCV8r1    cymon        r     07/06/2015 11:52:49     1
  23052 0.50500 spCV8r2    cymon        r     07/06/2015 11:53:04     1
---------------------------------------------------------------------------------
head@gyra.local                BP    0/2/12         1.33     lx26-amd64
    hc:h_vmem=22.077G
  23381 0.50500 p4Sr2      cymon        r     07/27/2015 14:35:18     1
  23393 0.50500 blastdb    cymon        r     07/28/2015 15:31:48     1

############################################################################
 - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
  23392 0.50500 mramos     miguel       qw    07/28/2015 14:42:02     1

To narrow the output:

  • [user@gyra ~]$ qstat -u <username> - display only those jobs for ‘username’
  • [user@gyra ~]$ qstat -j <jobnumber> - display details of a particular ‘jobnumber’ in the queue
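The full listing can be long, so it is sometimes handy to summarise it. The following is a small sketch that totals the slots held by each user's running jobs, assuming output in the layout shown above (job id first, user fourth, state fifth, slot count last). The function name slots_by_user is hypothetical, and the sample lines are copied from the listing above so the snippet runs without access to the queue.

```shell
#!/bin/sh
# Hypothetical helper: read qstat-style output on stdin and print
# "user slots" for running jobs (state "r"), sorted by user name.
slots_by_user() {
    awk '$1 ~ /^[0-9]+$/ && $5 == "r" { used[$4] += $NF }
         END { for (u in used) print u, used[u] }' | sort
}

# Sample lines taken from the listing above; on the cluster one
# would pipe the real qstat output into the function instead.
slots_by_user <<'EOF'
all.q@compute-0-3.local        BIP   0/15/16        15.03    lx26-amd64
    hc:h_vmem=1.540G
  23379 0.50500 montipora  regina       r     07/27/2015 12:46:33     1
  23384 0.59167 tmlbr1     cymon        r     07/27/2015 15:33:48    14
  23392 0.50500 mramos     miguel       qw    07/28/2015 14:42:02     1
EOF
# prints:
# cymon 14
# regina 1
```

Queue header lines and pending jobs (state qw) are skipped because the first field is not a job number or the state is not ‘r’.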

Deleting jobs from the queue

Issuing the commands:

  • [user@gyra ~]$ qdel -u <username> - will delete all jobs of the user ‘username’
  • [user@gyra ~]$ qdel <jobnumber> - will delete ‘jobnumber’ from the queue

Next-generation sequence assembly

Due to the large volumes of data, NGS assembly can require large computational resources, especially RAM. Consequently, NGS assembly jobs run within a restricted environment in a special queue called assembly.

Note: to use the assembly queue, users must first request to be added to the access group for the queue.

Note that assembly jobs run in the *all.q* will be summarily killed.

Software

The following software is available on the cluster (in no particular order):

Motif and pattern searching

NCBI BLAST (2.2.21(legacy blastall) and 2.3.0+)
Other BLAST databases and custom databases on request. BLAST+ documentation.
  • nt
  • nr
  • refseq_genomic
  • refseq_protein
  • refseq_rna
  • swissprot
  • taxdb
  • bonyfish_proteins (56,707 records)
  • plastid_proteins (83,168 records)
  • plastid_genomes (511 records)
MpiBLAST (1.6.0) See Submitting parallel jobs
mpiBLAST is a freely available, open-source, parallel implementation of NCBI BLAST. By efficiently utilizing distributed computational resources through database fragmentation, query segmentation, intelligent scheduling, and parallel I/O, mpiBLAST improves NCBI BLAST performance by several orders of magnitude while scaling to hundreds of processors.
  • nr
  • nt
  • bonyfish_proteins (56,707 records)
  • plastid_proteins (83,168 records)
    (other databases available on request)

HMMER (3.1b2 and 2.3.2)
HMMER is used for searching sequence databases for homologs of protein sequences, and for making protein sequence alignments. It implements methods using probabilistic models called “profile hidden Markov models” (profile HMMs). Documentation
Infernal (1.1rc3)
Infernal (“INFERence of RNA ALignment”) is for searching DNA sequence databases for RNA structure and sequence similarities. Documentation
TAMO (1.0_120321)
TAMO: a flexible, object-oriented framework for analyzing transcriptional regulation using DNA-sequence motifs.
SignalP (4.1c)
SignalP 4.1 server predicts the presence and location of signal peptide cleavage sites in amino acid sequences from different organisms: Gram-positive prokaryotes, Gram-negative prokaryotes, and eukaryotes. The method incorporates a prediction of cleavage sites and a signal peptide/non-signal peptide prediction based on a combination of several artificial neural networks.
TMHMM (2.0c)
Prediction of transmembrane helices in proteins.
RNAmmer (1.2)
The RNAmmer 1.2 server predicts 5s/8s, 16s/18s, and 23s/28s ribosomal RNA in full genome sequences.
MEME-suite (4.10.2)
Motif-based sequence analysis tools.

Multiple and pair-wise sequence alignment

ClustalW (1.1.18 (clustalw) and 2.0.12 (clustalw2))
ClustalW2 is a general purpose multiple sequence alignment program for DNA or proteins.
T_Coffee (8.14)
A collection of tools for computing, evaluating and manipulating multiple alignments of DNA, RNA, protein sequences and structures.
Muscle (3.8.31)
“Faster and more accurate than CLUSTALW”...
Uclust (1.0.50 and 1.2.22q (Qiime))
“Search and clustering hundreds of times faster than BLAST”...
Usearch (5.2.32)
USEARCH is a unique high-throughput sequence analysis tool. It supports a variety of algorithms for sequence searching, clustering, and filtering.
Vsearch (1.1.3) and vsearch-bz (bzip2 compression version)
The aim of this project is to create an alternative to the USEARCH tool. vsearch-bz can directly read input query and database files that are compressed in bzip2 format.
Mafft (6.833b)
MAFFT is a multiple sequence alignment program for unix-like operating systems. It offers a range of multiple alignment methods: L-INS-i (accurate; for alignments of fewer than about 200 sequences), FFT-NS-2 (fast; for alignments of fewer than about 10,000 sequences), etc.
GBlocks (0.91b)
Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis.
Exonerate (2.2.0)
Exonerate is a generic tool for pairwise sequence comparison.
TranslatorX
A perl script for nucleotide sequence alignment and alignment cleaning based on amino acid information. Uses ReadSeq and GBlocks. Alignments via Muscle, Clustalw, MAFFT, and T-Coffee.
Blat
BLAT (the BLAST-Like Alignment Tool) is a software program developed by Jim Kent at UCSC to identify similarities between DNA sequences and protein sequences.
SEED
SEED is a software for clustering large sets of Next Generation Sequences (NGS) with hundreds of millions of reads in a time and memory efficient manner. Its algorithm joins highly similar sequences into clusters that can differ by up to three mismatches and three overhanging residues. Article.
LASTZ (1.02.00)
LASTZ is a program for aligning DNA sequences, a pairwise aligner. Originally designed to handle sequences the size of human chromosomes and from different species, it is also useful for sequences produced by NGS sequencing technologies such as Roche 454. Documentation
GMAP (13-02-15)
A Genomic Mapping and Alignment Program for mRNA and EST Sequences.

Population genetics / coalescent

PopABC (1.0)
PopABC is an Approximate Bayesian Computation (ABC) method to estimate historical demographic parameters (e.g. population size, migration rate, mutation rate, recombination rate, splitting events) within an Isolation with Migration (IM) population model.
Migrate-n
Estimation of population sizes and gene flow using the coalescent.
Lamarc (2.1.6)
LAMARC is a program which estimates population-genetic parameters such as population size, population growth rate, recombination rate, and migration rates. It approximates a summation over all possible genealogies that could explain the observed sample, which may be sequence, SNP, microsatellite, or electrophoretic data. LAMARC and its sister program Migrate are successor programs to the older programs Coalesce, Fluctuate, and Recombine, which are no longer being supported. Documentation
IMa2 (8/26/2011)
The program implements a method for generating posterior probabilities for complex demographic population genetic models. IMa2 works similarly to the older IMa program, with some important additions. IMa2 can handle data and implement a model for multiple populations (for numbers of sampled populations between one and ten) – not just two populations (as was the case with the original IM and IMa programs).
Bayesian Phylogenetics and Phylogeography (BPP; vers. 3)
Coalescent analysis on a species tree (BP&P and 3s) Documentation

MP-EST (1.5)

The MP-EST method estimates species trees from a set of gene trees by maximizing a pseudo-likelihood function. The program is written in C. The parallel version of the program can run independent searches (chains) in parallel. Each chain starts with a different seed. The program will find the estimate of the species tree with the largest likelihood score across chains. To use MP-EST, you need to create two files; a gene tree file and a control file. Documentation
ASTRAL (github)

ASTRAL is a Java program for estimating a species tree given a set of unrooted gene trees. ASTRAL is statistically consistent under multi-species coalescent model (and thus is useful for handling ILS). It finds the tree that maximizes the number of induced quartet trees in the set of gene trees that are shared by the species tree.

Documentation

Structurama2

Structurama is a program for inferring population structure from genetic data. The program assumes that the sampled loci are in linkage equilibrium and that the allele frequencies for each population are drawn from a Dirichlet probability distribution.

Documentation

ANGSD
ANGSD is a software for analyzing next generation sequencing data.
BayeScan (2.1)
BayeScan aims at identifying candidate loci under natural selection from genetic data, using differences in allele frequencies between populations.

Phylogenetic analyses

MrBayes MPI (3.2.6) and MrBayes v3.2-svn (r517, development version) with Beagle-lib
MrBayes is a program for the Bayesian estimation of phylogeny. (See Submitting parallel jobs) Documentation
Phyml (dev)
A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Guindon S., Gascuel O. Systematic Biology, 52(5):696-704, 2003. Documentation
RAxML (7.0.4) and (7.8.4 - github 4th Nov 2013)
Maximum likelihood estimation of phylogenies. (See Submitting parallel jobs). Documentation (7.0.4) Documentation (newest vers 8+) About parallelization.
ExaML (1.0.0)
Exascale Maximum Likelihood (ExaML) code for phylogenetic inference using MPI. This code implements the popular RAxML search algorithm for maximum likelihood based inference of phylogenetic trees. It uses a radically new MPI parallelization approach that yields improved parallel efficiency, in particular on partitioned multi-gene or whole-genome datasets. Documentation (7.0.4)
FastTree / FastTreeMP (2.1.5)

FastTree infers approximately-maximum-likelihood phylogenetic trees from alignments of nucleotide or protein sequences. FastTree can handle alignments with up to a million sequences in a reasonable amount of time and memory. For large alignments, FastTree is 100-1,000 times faster than PhyML 3.0 or RAxML 7.

If using FastTreeMP, set the environment variable OMP_NUM_THREADS in the submission script. Documentation

qmmraxmlHPC (1.0)
Uses a class-frequency (cF) mixture model to model site-specific distributions for phylogenetic inference.
Phylip (3.6.8)
All things Felsenstein.
Phylobayes (3.3f)
PhyloBayes is a Bayesian Monte Carlo Markov Chain (MCMC) sampler for phylogenetic reconstruction using protein alignments. Compared to other phylogenetic MCMC samplers, the main distinguishing feature of PhyloBayes is the underlying probabilistic model, CAT (Lartillot and Philippe, 2004). CAT is a mixture model especially devised to account for site-specific features of protein evolution. It is particularly well suited for large multigene alignments, such as those used in phylogenomics. Documentation
PhyloBayes MPI (1.5a)
PhyloBayes-MPI is a Bayesian Markov chain Monte Carlo (MCMC) sampler for phylogenetic inference exploiting a message-passing-interface system for multi-core computing. Documentation
NH Phylobayes (0.2.3)
A Bayesian compound stochastic process for modeling nonstationary and nonhomogeneous sequence evolution. Some notes on running nhpb
p4 (1.0) (python2.7.9 : numpy-1.9.1 : gsl-1.14)

P4 is a Python package that does maximum likelihood and Bayesian phylogenetic analyses on molecular sequences. Its specialty is that you can use heterogeneous models, where the model parameters can differ in different parts of the tree, or over different parts of the data.

Includes Qdist module - Installation Documentation

DendroPy (3.12.0)
DendroPy is a Python library for phylogenetic computing. It provides classes and functions for the simulation, processing, and manipulation of phylogenetic trees and character matrices, and supports the reading and writing of phylogenetic data in a range of formats, such as NEXUS, NEWICK, NeXML, Phylip, FASTA, etc. Application scripts for performing some useful phylogenetic operations, such as data conversion and tree posterior distribution summarization, are also distributed and installed as part of the library. DendroPy can thus function as a stand-alone library for phylogenetics, a component of more complex multi-library phyloinformatic pipelines, or as a scripting “glue” that assembles and drives such pipelines. Tutorial
BayesTraits
BayesTraits is a computer package for performing analyses of trait evolution among groups of species for which a phylogeny or sample of phylogenies is available. This new package incorporates our earlier and separate programs Multistate, Discrete and Continuous. BayesTraits can be applied to the analysis of traits that adopt a finite number of discrete states, or to the analysis of continuously varying traits. Hypotheses can be tested about models of evolution, about ancestral states and about correlations among pairs of traits.
BayesPhylogenies (1.0)
BayesPhylogenies is a general package for inferring phylogenetic trees using Bayesian Markov Chain Monte Carlo (MCMC) or Metropolis-coupled Markov chain Monte Carlo (MCMCMC) methods. The program allows a range of models of gene sequence evolution, models for morphological traits, models for rooted trees, gamma and beta distributed rate-heterogeneity, and implements a ‘mixture model’ (Pagel and Meade, 2004) that allows the user to fit more than one model of sequence evolution, without partitioning the data.
Beast (2.1.3) with Beagle-lib
BEAST is a cross-platform program for Bayesian MCMC analysis of molecular sequences. It is entirely orientated towards rooted, time-measured phylogenies inferred using strict or relaxed molecular clock models. It can be used as a method of reconstructing phylogenies but is also a framework for testing evolutionary hypotheses without conditioning on a single tree topology. BEAST uses MCMC to average over tree space, so that each tree is weighted proportional to its posterior probability. Installation and use of Beagle-lib with Beast Plugins: SNAPP (1.1.5) Templates StarBEAST tutorial BUG in *BEAST
Modelgenerator (85)
ModelGenerator is a model selection program that selects optimal amino acid and nucleotide substitution models from Fasta or Phylip alignments. ModelGenerator supports 56 nucleotide and 96 amino acid substitution models.
Jmodeltest (0.1.1)
jModelTest is a tool to carry out statistical selection of best-fit models of nucleotide substitution. It implements five different model selection strategies: hierarchical and dynamical likelihood ratio tests (hLRT and dLRT), Akaike and Bayesian information criteria (AIC and BIC), and a decision theory method (DT). It also provides estimates of model selection uncertainty, parameter importances and model-averaged parameter estimates, including model-averaged phylogenies.
MrModeltest2 (2.3)
C program for selecting DNA substitution models using PAUP*.
ModelOMatic (1.0)
ModelOMatic is a C++ program designed for rapid phylogenetic model selection on protein coding genes. Please see the manual for details of the program and its settings. Documentation. Examples.
Garli (vers 1.0 and vers 2.0 MPI)
GARLI (Genetic Algorithm for Rapid Likelihood Inference) performs phylogenetic searches on aligned nucleotide, codon and amino acid data sets using the maximum likelihood criterion. On a practical level, the program is able to perform maximum-likelihood tree searches on large data sets in a number of hours.
Prottest (2.4)
PROTTEST (ModelTest’s relative) is a program for selecting the model of protein evolution that best fits a given set of sequences (alignment). This java program is based on the Phyml program (for maximum likelihood calculations and optimization of parameters) and uses the PAL library as well. Models included are empirical substitution matrices (such as WAG, LG, mtREV, Dayhoff, DCMut, JTT, VT, Blosum62, CpREV, RtREV, MtMam, MtArt, HIVb, and HIVw) that indicate relative rates of amino acid replacement, and specific improvements (+I: invariable sites, +G: rate heterogeneity among sites, +F: observed amino acid frequencies) to account for the evolutionary constraints imposed by conservation of protein structure and function. ProtTest uses the Akaike Information Criterion (AIC) and other statistics (AICc and BIC) to find which of the candidate models best fits the data at hand.
PAUP (4.0b10)
Needs no introduction. - 4.0 final release expected any day now...
PAML (4.6)
PAML is a package of programs for phylogenetic analyses of DNA or protein sequences using maximum likelihood. It is maintained and distributed for academic use free of charge by Ziheng Yang. ANSI C source codes are distributed for UNIX/Linux/Mac OSX, and executables are provided for MS Windows. PAML is not good for tree making. It may be used to estimate parameters and test hypotheses to study the evolutionary process, when you have reconstructed trees using other programs such as PAUP*, PHYLIP, MOLPHY, PhyML, RaxML, etc.
Tree-Puzzle (5.2)
TREE-PUZZLE is a computer program to reconstruct phylogenetic trees from molecular sequence data by maximum likelihood. It implements a fast tree search algorithm, quartet puzzling, that allows analysis of large data sets and automatically assigns estimations of support to each internal branch. TREE-PUZZLE also computes pairwise maximum likelihood distances as well as branch lengths for user specified trees. Branch lengths can be calculated with and without the molecular-clock assumption. In addition, TREE-PUZZLE offers likelihood mapping, a method to investigate the support of a hypothesized internal branch without computing an overall tree and to visualize the phylogenetic content of a sequence alignment. TREE-PUZZLE also conducts a number of statistical tests on the data set (chi-square test for homogeneity of base composition, likelihood ratio to test the clock hypothesis, one and two-sided Kishino-Hasegawa test, Shimodaira-Hasegawa test, Expected Likelihood Weights).
Consel (0.20)
CONSEL is a program package consisting of small programs written in C. It calculates the probability value (i.e., p-value) to assess the confidence in the selection problem. Although CONSEL is applicable to any selection problem, it is mainly designed for phylogenetic tree selection. CONSEL calculates the p-value using several testing procedures: the bootstrap probability, the Kishino-Hasegawa test, the Shimodaira-Hasegawa test, and the weighted Shimodaira-Hasegawa test. In addition to these conventional tests, CONSEL calculates the p-value based on the approximately unbiased test using the multi-scale bootstrap technique. Documentation
Tree Congruence Tester (tct.py) - (Requires p4)
Tree Congruence Test(er): reads two rooted trees (NEXUS and/or PHYLIP format), reciprocally prunes each tree of missing taxa (automatic), deletes any taxa passed to the programme (via -d), and tests topological congruence among the remaining clades supported by a value greater than the set threshold.
combineNexus.py - (Requires p4 and >= Python 2.7)
Reads Nexus formatted matrices and combines them into a single matrix with blank sequences for genes that are missing from individual matrices.
calculatePhylogeneticDiversity.py - (Requires p4 and >= Python 2.7)
Calculate the Phylogenetic Diversity (PD: Faith 1992) of a group of taxa on a tree. PD is the minimum total length of all the phylogenetic branches required to span a given set of taxa on the phylogenetic tree (and does not include the stem branch of a clade).
makeConsensusTree.py - (Requires p4 and >= Python 2.7)
Write a majority rule consensus tree from 1 or more Nexus or Newick formatted tree files. Each tree file can have a specified burnin and/or step count. The data file (in Phylip or Nexus format) from which the trees are derived must be supplied.
Update_BEAST_operators.py
Update the BEAST operator tuning values in the XML run file based on the suggestions output at the end of a previous MCMC in the log file. Only BEAST version 2.3.0 supported.
Concaterpillar (1.4)
- (Requires SciPy and pyMPI) A hierarchical likelihood ratio test for phylogenetic congruence. Documentation (See Submitting parallel jobs)
minmax-chisq
Reduced amino acid alphabets for phylogenetic inference. Documentation
Crux (1.2.0)
Crux is a software toolkit for molecular phylogenetic inference. Incl: Bayesian Markov chain Monte Carlo (MCMC) methods (with Metropolis coupling and MPI support for parallel computation) can sample among non-nested models using reversible model jumps. Polytomous trees can be sampled, also via reversible jumps. In fact, every non-essential model parameter that Crux’s MCMC implementation estimates can be expunged via reversible jumps. Notes on running Crux.
Figtree (1.3.1)
FigTree is designed as a graphical viewer of phylogenetic trees and as a program for producing publication-ready figures.
Misfits (1.0)
MISFITS is a program to evaluate the goodness of fit of a model to an alignment in phylogeny reconstruction. Documentation
Compass (1.0)
COMPASS is a UNIX program for identifying and removing fast evolving sites from morphological or molecular data matrices using a number of compatibility-based methods. Documentation
AIS: Almost Invariant Sets
The goal is to identify sets of amino acids with a high probability of change between elements of the set but small probability of change between different sets by using amino acid replacement matrices and their eigenvectors. After identification of the subsets the quality of the partition is assessed with a conductance measure. Documentation
TreeFinder
TREEFINDER computes phylogenetic trees from molecular sequences. The program infers even large trees by maximum likelihood under a variety of models of sequence evolution. Documentation
LineageSpecificSeqgen (timestamp: Mar 16 2011)
LineageSpecificSeqgen is an extension to the seq-gen program that allows generation of sequences with both changes in the proportion of variable sites and changes in the rate at which sites switch between being variable and invariable.
INDELible (1.03)
INDELible is a new, portable, and flexible application for biological sequence simulation that combines many features in the same place for the first time. Using a length-dependent model of indel formation it can simulate evolution of multi-partitioned nucleotide, amino-acid, or codon data sets through the processes of insertion, deletion, and substitution in continuous time. Nucleotide simulations may use the general unrestricted model or the general time reversible model and its derivatives, and amino-acid simulations can be conducted using fifteen different empirical rate matrices. Substitution rate heterogeneity can be modelled via the continuous and discrete gamma distributions, with or without a proportion of invariant sites. INDELible can also simulate under non-homogenous and non-stationary conditions where evolutionary models are permitted to change across a phylogeny. Unique among indel simulation programs, INDELible offers the ability to simulate using codon models that exhibit nonsynonymous/synonymous rate ratio heterogeneity among sites and/or lineages. Documentation
PHASE (2.0)
This package is designed specifically for use with RNA sequences that have a conserved secondary structure, e.g., rRNA and tRNA. It is well known that compensatory substitutions occur in the paired regions of RNA secondary structures; this means that substitutions occurring on one side of a pair are correlated with substitutions on the other side. Most phylogenetic programs assume that each site in a molecule evolves independently of the others but this assumption is not valid for RNA genes. Documentation
BEST (2.3.1)
BEST is a free phylogenetics program written by Liang Liu to estimate the joint posterior distribution of gene trees and species tree using multilocus molecular data that accounts for deep coalescence but not for other issues such as horizontal transfer or gene duplication.
CodonPHYML (1.00 201306.18)
CodonPhyML uses Markovian codon models of evolution in phylogeny reconstruction. Given a set of species characterized by their DNA sequences as input, codonPhyML will return the phylogenetic tree that best describes their evolutionary relationship. OMP support, no BLAS/LAPACK. Documentation

SymTest

R functions: Test for Symmetry of Matched DNA Sequences, Overall Test for Marginal Symmetry
fastCodeML
FastCodeML is software to infer positive selection along positions of a protein coding gene using the Branch-Site model of evolution. By using a hybrid (OpenMP/MPI) strategy, FastCodeML can reach a speed-up of up to 10 times compared with codeml.
PhyML-4X
LG4X: Modeling Protein Evolution with Several Amino-Acid Replacement Matrices Depending on Site Rates
DPPDiv (1.0b)
DPPDiv is a program for estimating species divergence times and lineage-specific substitution rates on a fixed topology. The prior on branch rates is a Dirichlet process prior which clusters branches into distinct rate classes.
PLL-DPPDIV
We present a substantially improved and parallelized version of DPPDiv, a software tool for estimating species divergence times and lineage-specific substitution rates on a fixed tree topology.
FDPPDiv
Fossilised Birth-Death Model of T. Heath - development version of DPPDiv
RogueNaRok
A versatile and scalable algorithm for rogue taxon identification.
ccdprobs
Software to estimate probabilities of tree topologies using conditional clade distributions.
BUCKy
BUCKy estimates the dominant history of sampled individuals, and how much of the genome supports each relationship, using Bayesian concordance analysis. BUCKy does not assume that genes (or loci) all have the same topology. Instead, groups of genes sharing the same tree are detected (while accounting for uncertainty in gene tree estimates), and then combined to gain more resolution on their common tree. No assumption is made regarding the reason for discordance among gene trees. Documentation

Genome assembly

Roche 454 (FLX Titanium)

Roche 454 Data Analysis suite (2.9_All_20130530_1559)
Obtain biologically meaningful results from your sequence data quickly and affordably with the powerful suite of analysis tools provided with the Genome Sequencer FLX System, updated for the GS FLX Titanium series. AKA Newbler et al. (See NGS assembly)
Octupus (0.1.1)
OCTUPUS uses a novel method of sequence clustering and pairwise comparisons which reduces the influence of chimeras and intraspecific diversity on cluster generation. The clustering approach used to generate OCTUs is designed with the intent to reflect the expected pattern of diversity of rDNA genes. Additionally, OCTUPUS provides a method to screen clusters for evidence of chimera formation without the use of a reference database. OCTUPUS is optimized for speed, and does not require a computing cluster to analyze typical large scale datasets.
PRICE (1.2)
PRICE (Paired-Read Iterative Contig Extension) is a de novo genome assembler implemented in C++. Its name describes the strategy that it implements for genome assembly: PRICE uses paired-read information to iteratively increase the size of existing contigs. Initially, those contigs can be individual reads from a subset of the paired-read dataset, non-paired reads from sequencing technologies that provide non-paired data, or contigs that were output from a prior run of PRICE or any other assembler. Documentation

Illumina (HiSeq 2000, GAIIx, GAIIe)

ABySS (1.5.2)

(MPI with sparsehash - max kmer 96) ABySS is a de novo, parallel, paired-end sequence assembler that is designed for short reads. The single-processor version is useful for assembling genomes up to 100 Mbases in size. The parallel version is implemented using MPI and is capable of assembling larger genomes. (See NGS assembly)

  • The default ABySS binaries are compiled with the default maximum k-mer value of 64. A second set of binaries compiled with --enable-maxk=32 can be found at /share/apps/abyss32k/bin/*; these should be more memory efficient.
Trans-ABySS (1.4.4)
Trans-ABySS is a software pipeline for analyzing ABySS-assembled contigs from shotgun transcriptome data. The pipeline accepts assemblies that were generated across a wide range of k values in order to address variable transcript expression levels. It first filters and merges the multi-k assemblies, generating a much smaller set of nonredundant contigs. It contains scripts that map assembled contigs to known transcripts, currently supporting Blat and Exonerate contig-to-genome aligners. It identifies novel splicing events like exon-skipping, novel exons, retained introns, novel introns, and alternative splicing. Its scripts can also estimate gene expression levels, identify candidate polyadenylation sites, and identify candidate gene-fusion events. Documentation (See NGS assembly)
SOAP

SOAPaligner/soap2 is a member of SOAP (Short Oligonucleotide Analysis Package). It is an updated version of the SOAP software for short oligonucleotide alignment. The new program features super-fast and accurate alignment for the huge numbers of short reads generated by the Illumina/Solexa Genome Analyzer.

  • soap2sam.pl
  • 2bwt-builder
  • SOAPdenovo2 (r240): A de novo short reads assembler. - SOAPdenovo-127mer - SOAPdenovo-63mer
  • soap (SOAPaligner/soap2) (2.20): Short Oligonucleotide Analysis Package.
SOAPdenovo-Trans (1.03)
De novo transcriptome assembler based on the SOAPdenovo framework, adapted to alternative splicing and to different expression levels among transcripts. The assembler provides a more accurate, complete and faster way to construct full-length transcript sets.
Minia (2.0.2)

Minia is a short-read assembler based on a de Bruijn graph, capable of assembling a human genome on a desktop computer in a day. The output of Minia is a set of contigs. Minia produces results of similar contiguity and accuracy to other de Bruijn assemblers (e.g. Velvet).

This assembler is very memory efficient, but it does not use pairing information (it will still accept paired data).

Documentation FAQ
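
Minia, Velvet, SOAPdenovo and the IDBA family listed in this section are all described as de Bruijn graph assemblers. As a rough illustration of the idea only (not the data structure or algorithm of any of these programs), the sketch below builds a tiny de Bruijn graph from three made-up reads and spells a contig by following unbranched edges. The function names, the reads and the choice of k=4 are assumptions made for the example; real assemblers also handle reverse complements, sequencing errors and repeats.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the list of (k-1)-mers that follow it in some k-mer."""
    graph = defaultdict(list)
    seen = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in seen:
                continue  # each distinct k-mer contributes exactly one edge
            seen.add(kmer)
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


def walk_unbranched(graph, start):
    """Follow edges from `start` while every node has a single successor."""
    contig = start
    node = start
    visited = {start}
    while len(graph.get(node, [])) == 1:
        node = graph[node][0]
        if node in visited:
            break  # stop if the path loops back on itself
        visited.add(node)
        contig += node[-1]
    return contig


# Hypothetical overlapping reads sampled from one short sequence.
reads = ["ATGGCGT", "GCGTGCA", "GTGCAAT"]
graph = build_de_bruijn(reads, k=4)
contig = walk_unbranched(graph, "ATG")
print(contig)
```

Larger k values make the graph less tangled by repeats but need higher coverage, which is why several entries here state the maximum k-mer length their binaries were compiled with.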

Maq (0.7.1)
Maq is software that builds mapping assemblies from short reads generated by next-generation sequencing machines. It is particularly designed for the Illumina-Solexa 1G Genetic Analyzer, and has preliminary functions to handle ABI SOLiD data. Documentation
Velvet (1.2.10)

Velvet is a de novo genomic assembler specially designed for short read sequencing technologies, such as Solexa or 454, developed by Daniel Zerbino and Ewan Birney at the European Bioinformatics Institute (EMBL-EBI), near Cambridge, in the United Kingdom. Velvet currently takes in short read sequences, removes errors then produces high quality unique contigs. It then uses paired-end read and long read information, when available, to retrieve the repeated areas between contigs. MAXKMERLENGTH=64, LONGSEQUENCES=1 Documentation (See NGS assembly)

VelvetOptimiser.pl (2.2.4)
VelvetOptimiser is a multi-threaded Perl script for automatically optimising the three primary parameter options (K, -exp_cov, -cov_cutoff) for the Velvet de novo sequence assembler. Documentation
Trinity (2.0.6) - RNA-Seq De novo Assembly
A novel method for the efficient and robust de novo reconstruction of transcriptomes from RNA-seq data. Trinity combines three independent software modules: Inchworm, Chrysalis, and Butterfly, applied sequentially to process large volumes of RNA-seq reads. Trinity partitions the sequence data into many individual de Bruijn graphs, each representing the transcriptional complexity at a given gene or locus, and then processes each graph independently to extract full-length splicing isoforms and to tease apart transcripts derived from paralogous genes.
ALLPATHS-LG (42816 - 5th Sept 2012)

Short read genome assembler from the Computational Research and Development group at the Broad Institute. Documentation

Note that the libraries have to be of specific kinds... see documentation.

Spades (3.5.0)
SPAdes is an assembler for both single-cell and standard (multicell) data. Its authors report that it improves on the E+V-SC assembler (specialized for single-cell data) and on the popular assemblers Velvet and SOAPdenovo (for multicell data). Documentation
MetaVelvet (1.2.01)
A short read assembler for metagenomics. MAXKMERLENGTH=64 CATEGORIES=4
IDBA (1.1.1)
IDBA is a practical iterative De Bruijn Graph De Novo Assembler for sequence assembly in bioinformatics.
IDBA-UD (1.1.1)
IDBA-UD is an iterative De Bruijn Graph De Novo Assembler for short-read sequencing data with highly uneven sequencing depth (as in metagenomics).
IDBA-Tran (1.1.1)
IDBA-Tran is an iterative De Bruijn Graph De Novo short read assembler for transcriptomes. It is a purely de novo assembler based only on RNA sequencing reads.
IDBA-Hybrid (1.1.1)
IDBA-Hybrid is an iterative De Bruijn Graph De Novo Assembler for hybrid sequencing. It is an extension of the IDBA-UD algorithm.

Hybrid 454 and Illumina

Celera (7.0)
Celera Assembler is a de novo whole-genome shotgun (WGS) DNA sequence assembler. It reconstructs long sequences of genomic DNA from fragmentary data produced by whole-genome shotgun sequencing.
Mira (4.0.2)(+3rd party)

The mira genome fragment assembler is a specialised assembler for sequencing projects classified as ‘hard’ due to a high number of similar repeats. For EST transcripts, miraEST is specialised in reconstructing pristine mRNA transcripts while detecting and classifying single nucleotide polymorphisms (SNP) occurring in different variations thereof. Online wiki Documentation (See NGS assembly) The Definitive Guide to Mira3

Note that by default Mira uses 2 threads in the SKIM algorithm, so unless you configure more threads you would typically submit the job requesting 2 slots in the queue (-pe orte 2); see Submitting parallel jobs.

Ray (1.6.0)
Ray is a parallel genome assembler utilizing MPI. Ray is a single-executable program (the executable is Ray). Its aim is to assemble sequences on MPI-enabled computers or clusters. Ray assembles reads obtained with new sequencing technologies (Illumina, 454, SOLiD) using MPI 2.2, a message passing interface standard. (See NGS assembly) MAXKMERLENGTH=32
Oases (0.2.08)
Oases is a de novo transcriptome assembler designed to produce transcripts from short read sequencing technologies, such as Illumina, SOLiD, or 454 in the absence of any genomic assembly. MAXKMERLENGTH=64 Documentation (See NGS assembly)
AMOS (3.1.0)
Includes:

The AMOS consortium is committed to the development of open-source whole genome assembly software. The project acronym (AMOS) represents our primary goal – to produce A Modular, Open-Source whole genome assembler. Open-source so that everyone is welcome to contribute and help build outstanding assembly tools, and modular in nature so that new contributions can be easily inserted into an existing assembly pipeline. This modular design will foster the development of new assembly algorithms and allow the AMOS project to continually grow and improve in hopes of eventually becoming a widely accepted and deployed assembly infrastructure. In this sense, AMOS is both a design philosophy and a software system. Notes on installation.

LOCAS (0.1.7) and SUPERLOCAS (0.0.2)
LOCAS is a program to assemble short reads of second generation sequencing technologies. It explicitly handles low coverage data by allowing mismatches in the overlap alignment of reads. Documentation

Sequence read manipulation / cleaning / error corrections (kmer analysis)

sff_extract (0.3.0)
454 sequence reads are usually stored in sff files. These files store the information about each read: the sequence, the quality scores, and the quality and adapter clips. sff_extract extracts the reads from the sff files and stores them in fasta and xml or caf text files.
Smalt (0.4.1)
SMALT is a pairwise sequence alignment program designed for the efficient mapping of DNA sequencing reads onto genomic reference sequences. Reads from a range of sequencing platforms, for example Illumina-Solexa, Roche-454 or ABI-Sanger, can be processed including paired-end reads. Documentation
Lucy (1.20)
Lucy is a program for DNA sequence quality trimming and vector removal. Its purpose is to process DNA sequence data acquired from DNA sequencers to prepare the data for downstream processing applications such as genome assembly.
Amplicon Noise (1.2.1)
AmpliconNoise is a collection of programs for the removal of noise from 454 sequenced PCR amplicons. It involves two steps: the removal of noise from the sequencing itself and the removal of PCR point errors. This project also includes the Perseus algorithm for chimera removal.
clean_reads (0.2.3) and NGS Backbone (1.4.0)

clean_reads cleans NGS (Sanger, 454, Illumina and SOLiD) reads. It can trim:

  • bad quality regions,
  • adaptors,
  • vectors, and
  • regular expressions.

It also filters out reads that do not meet minimum quality criteria based on the sequence length and the mean quality. It uses several algorithms and third party tools to carry out the cleaning. The third party tools used are: lucy, blast, mdust and trimpoly. The functionality offered by clean_reads is similar to the cleaning capabilities of the ngs_backbone pipeline. In fact, both tools use the same code base and are just different interfaces on top of a Python library called franklin. Can be parallelised with psubprocess.

cutadapt (1.8.1)

cutadapt removes adapter sequences from high-throughput sequencing data. This is usually necessary when the read length of the sequencing machine is longer than the molecule that is sequenced, for example when sequencing microRNAs.

Documentation
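
To make the adapter-removal idea concrete, here is a minimal Python sketch of trimming a 3' adapter from a read. It is not cutadapt's algorithm: cutadapt tolerates mismatches and indels, whereas this sketch only handles exact matches. The function name, the `min_overlap` parameter and the example adapter sequence are assumptions made for illustration.

```python
def trim_3prime_adapter(read, adapter, min_overlap=3):
    """Remove an exact 3' adapter, or an adapter prefix that runs off the read end."""
    # Case 1: the whole adapter occurs in the read; keep only what precedes it.
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Case 2: the read ends part-way through the adapter; find the longest
    # adapter prefix (at least min_overlap bases) that matches the read's suffix.
    for overlap in range(min(len(adapter), len(read)) - 1, min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[:-overlap]
    return read


ADAPTER = "AGATCGGAAGAGC"  # example adapter sequence, assumed for this sketch
full = trim_3prime_adapter("TTGACCTA" + ADAPTER + "ACAC", ADAPTER)
partial = trim_3prime_adapter("TTGACCTA" + "AGATC", ADAPTER)
untouched = trim_3prime_adapter("TTGACCTA", ADAPTER)
print(full, partial, untouched)
```

The minimum overlap guards against trimming a few bases that match the start of the adapter purely by chance.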

prinseq-lite (0.14.4)
PRINSEQ is a publicly available tool that is able to filter, reformat and trim your sequences and to provide you summary statistics for your sequence data. Documentation
DeconSeq
Sequences obtained from impure nucleic acid preparations may contain DNA from sources other than the sample. Those sequence contaminations are a serious concern to the quality of the data used for downstream analysis, possibly causing misassembly of sequence contigs and erroneous conclusions. Therefore, the removal of sequence contaminants presents a necessary step for all metagenomic projects.
FastQC (0.10.1)
A quality control tool for high throughput sequence data.
ChimeraSlayer (MicrobiomeUtilities-r20110519)
A set of software utilities for processing and analyzing 16S rRNA genes including generating NAST alignments, chimera checking, and assembling paired 16S rRNA reads according to reference sequence homology.
Trim Galore! (0.3.2)
Trim Galore! is a wrapper script to automate quality and adapter trimming as well as quality control, with some added functionality to remove biased methylation positions for RRBS sequence files (for directional, non-directional (or paired-end) sequencing). Documentation
Seq_filter.pl
Filters sequences?
cutseq_fasta.pl
Takes a large fasta file and cuts a subset of sequences to make a second fasta file.
Bedtools (2.20.1)
Collectively, the bedtools utilities are a swiss-army knife of tools for a wide-range of genomics analysis tasks. The most widely-used tools enable genome arithmetic: that is, set theory on the genome. For example, bedtools allows one to intersect, merge, count, complement, and shuffle genomic intervals from multiple files in widely-used genomic file formats such as BAM, BED, GFF/GTF, VCF.
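
The "set theory on the genome" idea can be sketched in a few lines of Python. This is an illustration of merging and intersecting intervals on a single chromosome, not bedtools' implementation; the function names and coordinates are invented for the example, and intervals are assumed to be half-open (start, end) pairs as in BED files.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extend the previous interval instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def intersect_intervals(a, b):
    """Return the overlapping segments of two sorted, merged interval lists."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            result.append((start, end))
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return result


# Hypothetical features on one chromosome.
genes = merge_intervals([(100, 200), (150, 300), (500, 600)])
peaks = merge_intervals([(250, 550), (580, 700)])
overlaps = intersect_intervals(genes, peaks)
print(genes, overlaps)
```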
Pairfq
Sync paired-end FASTA/Q files and keep singleton reads
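
As a sketch of what syncing paired-end files means (this is not Pairfq's code), the Python below splits two sets of reads into proper pairs and singletons by read ID. It assumes the reads are already loaded into dictionaries keyed by an ID with any /1 or /2 suffix removed; the function name and the data are hypothetical.

```python
def sync_pairs(forward, reverse):
    """Split two {read_id: sequence} dicts into paired reads and singletons."""
    pairs = [(rid, forward[rid], reverse[rid]) for rid in forward if rid in reverse]
    singles_fwd = {rid: seq for rid, seq in forward.items() if rid not in reverse}
    singles_rev = {rid: seq for rid, seq in reverse.items() if rid not in forward}
    return pairs, singles_fwd, singles_rev


# After quality filtering, read r2 lost its mate and r4 lost its forward read.
forward = {"r1": "ACGT", "r2": "GGCA", "r3": "TTAG"}
reverse = {"r1": "TGCA", "r3": "CTAA", "r4": "AAAC"}
pairs, singles_fwd, singles_rev = sync_pairs(forward, reverse)
print(pairs, singles_fwd, singles_rev)
```

Many assemblers expect the two paired files to list reads in the same order, which is why files that have been filtered independently usually need this step.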
SOAPec (2.01)
The read correction package is a short-read correction tool and part of SOAPdenovo. It is specially designed to correct Illumina GA short reads.
kmergenie (1.6972)

KmerGenie estimates the best k-mer length for genome de novo assembly. Given a set of reads, KmerGenie first computes the k-mer abundance histogram for many values of k. Then, for each value of k, it predicts the number of distinct genomic k-mers in the dataset, and returns the k-mer length which maximizes this number. Experiments show that KmerGenie’s choices lead to assemblies that are close to the best possible over all k-mer lengths.
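
The procedure described above can be sketched in Python as follows. This is a simplification and not KmerGenie's method: the real program fits a statistical model to each histogram to separate genomic from erroneous k-mers, whereas this sketch simply treats any k-mer seen at least `min_abundance` times as genomic. All names and the toy reads are assumptions made for the example.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer occurring in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def abundance_histogram(counts):
    """Map abundance -> number of distinct k-mers seen that many times."""
    return dict(Counter(counts.values()))


def choose_k(reads, k_values, min_abundance=2):
    """Return the k with the most distinct k-mers at or above min_abundance.

    Crude stand-in for a model-based estimate; ties go to the first k tried.
    """
    best_k, best_n = None, -1
    for k in k_values:
        n_solid = sum(1 for c in kmer_counts(reads, k).values() if c >= min_abundance)
        if n_solid > best_n:
            best_k, best_n = k, n_solid
    return best_k


reads = ["ACGTACGT", "CGTACGTA"]  # toy data
histogram = abundance_histogram(kmer_counts(reads, 3))
best = choose_k(reads, [3, 4, 5])
print(histogram, best)
```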

FAQ

Alignment, mapping, and scaffolding

MUMmer (3.23)
MUMmer is a system for rapidly aligning entire genomes, whether in complete or draft form. Dependency for AMOS.
TopHat (2.0.13)
TopHat is a fast splice junction mapper for RNA-Seq reads. It aligns RNA-Seq reads to mammalian-sized genomes using the ultra high-throughput short read aligner Bowtie, and then analyzes the mapping results to identify splice junctions between exons. Documentation Installation
Bowtie (1.1.1)
Bowtie is an ultrafast, memory-efficient short read aligner. It aligns short DNA sequences (reads) to the human genome at a rate of over 25 million 35-bp reads per hour. Bowtie indexes the genome with a Burrows-Wheeler index to keep its memory footprint small: typically about 2.2 GB for the human genome (2.9 GB for paired-end). Documentation (See NGS assembly)
Bowtie2 (2.2.4)
Bowtie 2 is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. It is particularly good at aligning reads of about 50 up to 100s or 1,000s of characters, and particularly good at aligning to relatively long (e.g. mammalian) genomes. Bowtie 2 indexes the genome with an FM Index to keep its memory footprint small: for the human genome, its memory footprint is typically around 3.2 GB. Bowtie 2 supports gapped, local, and paired-end alignment modes. Manual
Cufflinks (2.2.1)
Transcript assembly, differential expression, and differential regulation for RNA-Seq. Cufflinks assembles transcripts, estimates their abundances, and tests for differential expression and regulation in RNA-Seq samples. It accepts aligned RNA-Seq reads and assembles the alignments into a parsimonious set of transcripts. Cufflinks then estimates the relative abundances of these transcripts based on how many reads support each one. Documentation
CAP3
CAP3 Sequence Assembly Program. Documentation
BWA (0.7.8-r455)
BWA is a fast, lightweight tool that aligns relatively short sequences (queries) to a sequence database (target), such as the human reference genome.
BFAST (0.6.4e)
BFAST facilitates the fast and accurate mapping of short reads to reference sequences. Some advantages of BFAST include:

  • Speed: enables billions of short reads to be mapped quickly.
  • Accuracy: a priori probabilities for mapping reads with a defined set of variants.
  • An easy way to measurably tune accuracy at the expense of speed.

Specifically, BFAST was designed to facilitate whole-genome resequencing, where mapping billions of short reads with variants is of utmost importance. BFAST supports both Illumina and ABI SOLiD data, as well as any other Next-Generation Sequencing Technology (454, Helicos), with particular emphasis on sensitivity towards errors, SNPs and especially indels. Other algorithms take short-cuts by ignoring errors, certain types of variants (indels), and even require further alignment, all to be the “fastest” (but still not complete). BFAST is able to be tuned to find variants regardless of the error-rate, polymorphism rate, or other factors.
PYNAST (1.1)
PyNAST is a python implementation of the NAST sequence alignment tool.
dDocent.FB
dDocent was designed as a scripted software pipeline made for analyzing double digest RAD or ezRAD data in highly polymorphic marine species.
SSPACE
SSPACE standard is a stand-alone program for scaffolding pre-assembled contigs using NGS paired-read data. It is unique in offering the possibility to manually control the scaffolding process. By using the distance information of paired-end and/or matepair data, SSPACE is able to assess the order, distance and orientation of your contigs and combine them into scaffolds. Manual Tutorial

Annotation

Glimmer (3.02)
Glimmer is a system for finding genes in microbial DNA, especially the genomes of bacteria, archaea, and viruses. Glimmer (Gene Locator and Interpolated Markov ModelER) uses interpolated Markov models (IMMs) to identify the coding regions and distinguish them from noncoding DNA. (See NGS assembly)
Blast2GO (2.3.5)
Command line version only: b2g4pipe, no visualisation. Blast2GO is an ALL in ONE tool for functional annotation of (novel) sequences and the analysis of annotation data. Documentation Configuration
martyr.py (1.0)
A Python version of MARTA. Annotates BLAST (x or n) XML output with the NCBI taxonomy. Requires: Biopython, blastdbcmd, NCBI taxonomy dump and the target BLAST DB in $BLASTDB, and a BioSQL database in PostGreSQL with the NCBI taxonomy loaded.
RDP Classifier
The RDP Classifier is a naive Bayesian classifier that can rapidly and accurately provide taxonomic assignments from domain to genus, with confidence estimates for each assignment. More information can be found at the Ribosomal Database Project.
Prodigal (2.61)
Prodigal (Prokaryotic Dynamic Programming Genefinding Algorithm) is a microbial (bacterial and archaeal) gene finding program. Prodigal can run in metagenomic mode and analyze sequences even when the organism is unknown.
Prokka (1.7.2)
Prokka is a software tool for the rapid annotation of prokaryotic genomes. A typical 4 Mbp genome can be fully annotated in less than 10 minutes on a quad-core computer, and it scales well to 64-core SMP systems. It produces GFF3, GBK and SQN files that are ready for editing in Sequin and ultimately for submission to Genbank/DDBJ/ENA. Documentation
Trinotate (2.0.2)

Trinotate is a comprehensive annotation suite designed for automatic functional annotation of transcriptomes, particularly de novo assembled transcriptomes, from model or non-model organisms. Trinotate makes use of a number of different well referenced methods for functional annotation including homology search to known sequence data (BLAST+/SwissProt/Uniref90), protein domain identification (HMMER/PFAM), protein signal peptide and transmembrane domain prediction (SignalP/TMHMM), and comparison to currently curated annotation databases (EMBL Uniprot eggNOG/GO Pathways databases). All functional annotation data derived from the analysis of transcripts is integrated into a SQLite database which allows fast efficient searching for terms with specific qualities related to a desired scientific hypothesis or a means to create a whole annotation report for a transcriptome.

Software specific databases in $BLASTDB: SwissProt Uniref90 and Pfam

TransDecoder (2.1.0)
TransDecoder identifies candidate coding regions within transcript sequences, such as those generated by de novo RNA-Seq transcript assembly using Trinity, or constructed based on RNA-Seq alignments to the genome using Tophat and Cufflinks.

Visualisation

Integrated Genome Browser (6.7.0)
The Integrated Genome Browser (IGB, pronounced Ig-Bee) is an interactive, zoomable, scrollable software program you can use to visualize and explore genome-scale data sets, such as tiling array data, next-generation sequencing results, genome annotations, microarray designs, and the sequence itself. Documentation
Tablet - Next Generation Sequence Assembly Visualization
Tablet is a lightweight, high-performance graphical viewer for next generation sequence assemblies and alignments.
Mauve (2.3.1) - Multiple Genome Alignment
Mauve is a system for efficiently constructing multiple genome alignments in the presence of large-scale evolutionary events such as rearrangement and inversion. Multiple genome alignment provides a basis for research into comparative genomics and the study of evolutionary dynamics. Aligning whole genomes is a fundamentally different problem than aligning short sequences. Documentation

Bioinformatics suites and libraries

Qiime (1.9.0)
QIIME (pronounced chime) stands for Quantitative Insights Into Microbial Ecology. QIIME is an open source software package for comparison and analysis of microbial communities, primarily based on high-throughput amplicon sequencing data (such as SSU rRNA) generated on a variety of platforms, but also supporting analysis of other types of data (such as shotgun metagenomic data). QIIME takes users from their raw sequencing output through initial analyses such as OTU picking, taxonomic assignment, and construction of phylogenetic trees from representative sequences of OTUs, and through downstream statistical analysis, visualization, and production of publication-quality graphics. QIIME has been applied to single studies based on billions of sequences from thousands of samples. Notes on installation.
Mothur (1.33.3)
Bioinformatics for the microbial ecology community. MPI version.
PyCogent (1.5.1)
PyCogent: A toolkit for making sense from sequence. PyCogent includes connectors to remote databases, built-in generalized probabilistic techniques for working with biological sequences, and controllers for 3rd party applications.
Emboss (6.4.0)
EMBOSS is “The European Molecular Biology Open Software Suite”.
FASTA
The FASTA programs find regions of local or global (new) similarity between Protein or DNA sequences, either by searching Protein or DNA databases, or by identifying local duplications within a sequence. Other programs provide information on the statistical significance of an alignment. Like BLAST, FASTA can be used to infer functional and evolutionary relationships between sequences as well as help identify members of gene families.
Biopython (1.65) - (Python2.7)
Biopython is a set of freely available tools for biological computation written in Python. Documentation
Wise2 (2.2.0)
Wise2 is a package focused on comparisons of bio polymers, commonly DNA sequence and protein sequence. There are many other packages which do this, probably the best known being BLAST package (from NCBI) and the FASTA package (from Bill Pearson).
NCL (2.1.14)
The NEXUS Class Library (NCL) is an integrated collection of C++ classes designed to allow the user to quickly write a program that reads NEXUS-formatted data files. It also allows easy extension of the NEXUS format to include new blocks of your own design.

R statistics (3.2.3)

R is a language and environment for statistical computing and graphics. Bioconductor provides tools for the analysis and comprehension of high-throughput genomic data. EdgeR - differential expression analysis of RNA-seq and digital gene expression profiles with biological replication. Uses empirical Bayes estimation and exact tests based on the negative binomial distribution. Also useful for differential signal analysis with other types of genome-scale count data.

BioPerl
BioPerl is an extensive set of bioinformatics libraries written in Perl.
Pysam (0.6)
Pysam is a python module for reading and manipulating Samfiles. It’s a lightweight wrapper of the samtools C-API.
Samtools (1.1)
SAM (Sequence Alignment/Map) format is a generic format for storing large nucleotide sequence alignments. SAM Tools provide various utilities for manipulating alignments in the SAM format, including sorting, merging, indexing and generating alignments in a per-position format.
SciPy (0.9.0)
SciPy (pronounced “Sigh Pie”) is open-source software for mathematics, science, and engineering. The SciPy library depends on NumPy, which provides convenient and fast N-dimensional array manipulation. The SciPy library is built to work with NumPy arrays, and provides many user-friendly and efficient numerical routines such as routines for numerical integration and optimization. Installation with ATLAS and complete LAPACK library
NCBI Sequence Read Archive (SRA) Toolkit (2.0rc5)
Stuff from NCBI to manipulate SRAs. Documentation
FASTX-toolkit (0.0.14)

The FASTX-Toolkit is a collection of command line tools for Short-Reads FASTA/FASTQ files preprocessing.

  • FASTQ-to-FASTA converter Convert FASTQ files to FASTA files.
  • FASTQ Information Chart Quality Statistics and Nucleotide Distribution
  • FASTQ/A Collapser Collapsing identical sequences in a FASTQ/A file into a single sequence (while maintaining reads counts)
  • FASTQ/A Trimmer Shortening reads in FASTQ or FASTA files (removing barcodes or noise).
  • FASTQ/A Renamer Renames the sequence identifiers in FASTQ/A file.
  • FASTQ/A Clipper Removing sequencing adapters / linkers
  • FASTQ/A Reverse-Complement Producing the Reverse-complement of each sequence in a FASTQ/FASTA file.
  • FASTQ/A Barcode splitter Splitting FASTQ/FASTA files containing multiple samples
  • FASTA Formatter Changes the width of sequence lines in a FASTA file
  • FASTA Nucleotide Changer Converts FASTA sequences from/to RNA/DNA
  • FASTQ Quality Filter Filters sequences based on quality
  • FASTQ Quality Trimmer Trims (cuts) sequences based on quality
  • FASTQ Masker Masks nucleotides with ‘N’ (or other character) based on quality
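
For readers unfamiliar with what these conversions involve, the Python sketch below mimics three of the operations listed above: FASTQ-to-FASTA conversion, reverse-complementing, and computing a mean quality score as a basis for filtering. It is an illustration, not FASTX-Toolkit code. It assumes strict four-line FASTQ records and Phred+33 quality encoding, and all names and data are invented for the example.

```python
import io

COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def read_fastq(handle):
    """Yield (name, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header[1:], seq, qual


def fastq_to_fasta(handle):
    """Return FASTA text for every record in the FASTQ stream."""
    return "".join(f">{name}\n{seq}\n" for name, seq, _ in read_fastq(handle))


def reverse_complement(seq):
    """Complement each base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def mean_phred(qual, offset=33):
    """Mean Phred score, assuming an ASCII offset of 33."""
    return sum(ord(c) - offset for c in qual) / len(qual)


fastq_text = "@r1\nACGTN\n+\nIIIII\n@r2\nGGCA\n+\n!!!!\n"
fasta_text = fastq_to_fasta(io.StringIO(fastq_text))
# Keep only reads whose mean quality is at least 20 (threshold chosen arbitrarily).
kept = [name for name, _, qual in read_fastq(io.StringIO(fastq_text)) if mean_phred(qual) >= 20]
print(fasta_text, reverse_complement("ACGTN"), kept)
```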
pyfasta (0.4.3)
Fast, memory-efficient, pythonic (and command-line) access to fasta sequence files.
Bio++ (core-2.0.1)
Bio++ is a set of C++ libraries for Bioinformatics, including sequence analysis, phylogenetics, molecular evolution and population genetics. Bio++ is fully Object Oriented and is designed to be both easy to use and computer efficient. Installation Documentation
PICARD (1.85)
Picard comprises Java-based command-line utilities that manipulate SAM files.
VCFTools (0.1.11)
VCFtools is a program package designed for working with VCF files, such as those generated by the 1000 Genomes Project.
Seqtk
Seqtk is a fast and lightweight tool for processing sequences in the FASTA or FASTQ format. It seamlessly parses both FASTA and FASTQ files which can also be optionally compressed by gzip.
CD-HIT
CD-HIT is a very widely used program for clustering and comparing protein or nucleotide sequences.
vcflib
A simple C++ library for parsing and manipulating VCF files, + many command-line utilities.
GATK
A Toolkit for Genome Analysis.
htslib
C library for high-throughput sequencing data formats.
seq_crumbs (0.1.9)
Seq_crumbs aims to be a collection of small sequence processing utilities.
Trimmomatic (0.33)
A flexible read trimming tool for Illumina NGS data Documentation
Bamtools (2.3.0)
C++ API & command-line toolkit for working with BAM data

Other stuff

Seqmagick (0.3.1)
Seqmagick is a kickass little utility built in the spirit of imagemagick to expose the file format conversion in Biopython in a convenient way.
cdbfasta
Use cdbfasta to create the index file for a multi-FASTA file and cdbyank to pull records based on that index file.
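
The index-then-fetch pattern can be illustrated with a short Python sketch. It does not reproduce cdbfasta's index format; it only shows why an index of byte offsets lets you pull one record from a large multi-FASTA file without scanning it. Function names and the example data are assumptions.

```python
import io


def build_index(handle):
    """Map each record ID to the byte offset of its '>' header line."""
    index = {}
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b">"):
            # The ID is the first whitespace-delimited word after '>'.
            record_id = line[1:].split()[0].decode()
            index[record_id] = offset
    return index


def fetch(handle, index, record_id):
    """Seek straight to a record and return its header and sequence lines."""
    handle.seek(index[record_id])
    lines = [handle.readline()]
    for line in iter(handle.readline, b""):
        if line.startswith(b">"):
            break
        lines.append(line)
    return b"".join(lines).decode()


# An in-memory stand-in for a multi-FASTA file opened in binary mode.
fasta = io.BytesIO(b">seq1 first\nACGT\nACGT\n>seq2\nGGGG\n")
index = build_index(fasta)
record = fetch(fasta, index, "seq2")
print(index, record)
```

With a real file you would open it with `open(path, "rb")` and save the index to disk so it only has to be built once.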
ACEAssemblySplitter.py (1.0)
Split an ACE format assembly file into multiple single ACE files each with a single contig. The resulting ACE contig files are formatted so that they can be read by CodonCode.
SOLiD SAGE Analysis Software (v1.10)
SOLiD™ SAGE™ Analysis Software v1.10 is a Linux-based program that takes the raw data files from SOLiD™ SAGE™ sequencing reads and matches them to known sequences in your reference database of choice. It is designed for use with the SOLiD™ SAGE™ Kit or the SOLiD™ SAGE™ Kit with Barcoding Adaptor Module, which generates libraries of 27-bp tags for all transcripts in a cell. Documentation
GRAPPA (2.0)
Genome Rearrangements Analysis under Parsimony and other Phylogenetic Algorithms
GRAPPA-IR
Genome Rearrangements Analysis under Parsimony and other Phylogenetic Algorithms - for chloroplast genomes with inverted repeats.
MGR (2.01)
A tool for constructing phylogenies based on gene order for unichromosomal and multichromosomal genomes
freebayes (v0.9.21-19-gc003c1e-dirty)
Bayesian haplotype-based polymorphism discovery and genotyping.
Stacks (1.19)
Stacks is a software pipeline for building loci from short-read sequences, such as those generated on the Illumina platform. Stacks was developed to work with restriction enzyme-based data, such as RAD-seq, for the purpose of building genetic maps and conducting population genomics and phylogeography.
Rainbow (2.0.3)
Efficient tool for clustering and assembling short reads, especially for RAD.
mergefq.pl
Part of dDocent.
VarScan (2.3.6)
Variant detection in next-generation sequencing data.

Molecular modeling

MODELLER (9v8)
MODELLER is used for homology or comparative modeling of protein three-dimensional structures (1,2). The user provides an alignment of a sequence to be modeled with known related structures and MODELLER automatically calculates a model containing all non-hydrogen atoms. MODELLER implements comparative protein structure modeling by satisfaction of spatial restraints (3,4), and can perform many additional tasks, including de novo modeling of loops in protein structures, optimization of various models of protein structure with respect to a flexibly defined objective function, multiple alignment of protein sequences and/or structures, clustering, searching of sequence databases, comparison of protein structures, etc.
NAMD (2.7 MPI)
A parallel molecular dynamics code designed for high-performance simulation of large biomolecular systems. Documentation (See Submitting parallel jobs)
Amber11 and AmberTools
Assisted Model Building with Energy Refinement. Documentation Amber11 and Documentation AmberTools
Gromacs (4.5.2)
GROMACS is a versatile package to perform molecular dynamics, i.e. simulate the Newtonian equations of motion for systems with hundreds to millions of particles. It is primarily designed for biochemical molecules like proteins, lipids and nucleic acids that have a lot of complicated bonded interactions, but since GROMACS is extremely fast at calculating the nonbonded interactions (that usually dominate simulations) many groups are also using it for research on non-biological systems, e.g. polymers. Documentation
PSIPRED V32
The PSIPRED Protein Structure Prediction Server aggregates several of our structure prediction methods into one location.