.. gyra-homepage documentation master file, created by
   sphinx-quickstart on Fri Apr 1 12:35:41 2011.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

Welcome to the `CCMAR `_ Computational Cluster Facility: GYRA - gyra.ualg.pt
=======================================================================================================

.. toctree::
   :maxdepth: 3

System
------
The GYRA cluster facility consists of:
- Frontend: 16-core 2.3GHz 32GB DELL PowerEdge R715
- compute-0-1: 8-core 2.6GHz 8GB DELL PowerEdge SC1435
- compute-0-2: 32-core 2.4GHz 64GB DELL PowerEdge R815
- compute-0-3: 16-core 2.0GHz 32GB DELL PowerEdge R715
- compute-0-4: 32-core 2.4GHz 64GB DELL PowerEdge R815
- compute-0-5: 16-core 3.0GHz 128GB DELL PowerEdge R815
The clustering software is
`Rocks 5.3 (Rolled Tacos) `_
with an `Open Grid Scheduler/Sun Grid Engine `_
queuing system.
A total of 128 slots/cores are available in the queue. The cluster
supports the mpi, mpich, and orte parallel computing environments.
.. The current system status can be viewed
   `here `_.
Access/Support
--------------
The cluster facility is available to members of CCMAR and collaborators. To
request an account on gyra, or for general enquiries, email Cymon.
(*If you are looking for a free to access online bioinformatics platform you
could try* `BioPortal `_ *, for phylogenetics try*
`CIPRES `_.)
If you are trying to use *Microsoft Windows* to connect to GYRA, you could try these
`instructions `_ for installing the necessary software.
.. _memory:
Specifying memory requirements
------------------------------
- All jobs submitted to the cluster must specify the maximum amount of virtual
  memory to be used (h_vmem). By default this is 2GB.
- If you need more than 2GB for your job, specify the maximum
  amount using "-l h_vmem=NG", where N is an integer > 2.
- The *h_vmem* memory value applies **per slot/CPU**, so adjust the *h_vmem*
  value accordingly if you are using a parallel process.

To submit a job to the queue that requires 11G of memory, use:

::

    [user@gyra admin]$ qsub -l h_vmem=11G job.sh

or include "#$ -l h_vmem=11G" in the submission script.
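
Because *h_vmem* is counted per slot, the total memory a parallel job needs
has to be divided by the number of slots it requests. The following is a
minimal sketch of that arithmetic, assuming a hypothetical job that needs 24G
in total spread over 4 slots; the figures and the script name ``job.sh`` are
illustrative only.

::

    #!/bin/bash
    # Hypothetical figures: total memory the whole job needs, and slots requested.
    total_gb=24
    slots=4

    # h_vmem applies per slot, so divide the total and round up to a whole GB.
    per_slot_gb=$(( (total_gb + slots - 1) / slots ))

    # Print the matching submission command (it is only printed, not run).
    echo "qsub -pe orte ${slots} -l h_vmem=${per_slot_gb}G job.sh"

With these figures the printed command requests ``h_vmem=6G`` for each of the 4 slots.
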
Interactive job submission
~~~~~~~~~~~~~~~~~~~~~~~~~~
An interactive session can be requested by issuing the command 'qrsh'
at the prompt. If a slot is available in the queue, an interactive
session will be started on an available node. The session will
remain active, consuming a single slot, until the exit command is
issued at the prompt. Do not leave interactive sessions idle, as
other users will be deprived of resources.
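
A typical interactive session might look like the sketch below. The memory
request is optional and the 4G value is only an assumption for illustration;
``qrsh`` accepts the same ``-l`` resource requests as ``qsub``.

::

    # Request an interactive session with up to 4G of virtual memory (illustrative value).
    qrsh -l h_vmem=4G

    # ...run commands on the compute node, then release the slot:
    exit
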
Batch submission
~~~~~~~~~~~~~~~~
This is the usual job submission procedure - the advantage being
that jobs can be queued and run when resources are available. Jobs
are submitted to the queue using the command 'qsub' which executes
a small shell-script describing the requested resources and job
configuration. Typically these scripts are very simple; however, for
all available options see the
`SGE Users Manual `_.
The following describes a simple job submission script and typical
configuration options:
::
#!/bin/bash
# Give the job a name that will appear in the queue (max 9 chars)
#$ -N myJob
# Request all output be placed in the current working directory
#$ -cwd
# Re-direct the 'standard out' and 'standard error' messages to the single file named log
#$ -o log
#$ -j y
# Request a bash shell
#$ -S /bin/bash
# Send email notification when job (e)nds or is (a)borted
#$ -M myEmail@my.account.domain
#$ -m ea
# Optional arguments
# Specify a named queue and node:
#$ -q all.q@compute-0-1.local
#Specify the maximum required memory per process
#$ -l h_vmem=10G
# IMPORTANT: source your bash shell profile
source ~/.bash_profile
#Run this command:
paup myAnalysis.nex
All lines beginning with #$ are interpreted as commands by the SGE
queue; all other lines beginning with # are comments.
The above script, if in the file named 'mySub.sh', would be
submitted to the queue using the following command:
* ``[user@gyra ~]$ qsub mySub.sh``
.. _submitting_parallel_jobs:
Submitting parallel jobs
~~~~~~~~~~~~~~~~~~~~~~~~
Some software is parallelised and able to run a single job on
multiple processors. To make this work correctly you need to
indicate the number of CPUs to the software *AND* tell the queue
how many slots the job will use.

Submission script requesting 2 CPUs (-np 2) with the software
command "mb runfile.nex":
::
#! /bin/bash
#$ -cwd
[...]
mpirun -np 2 mb runfile.nex
If the above submission script is called "mbsub.sh", when
submitting to the queue request the "orte" parallel environment
("-pe orte") and 2 slots:
* ``[user@gyra ~]$ qsub -pe orte 2 mbsub.sh``
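
The parallel environment can alternatively be requested inside the submission
script, which keeps the slot count and the ``-np`` value together. The sketch
below assumes Grid Engine's ``$NSLOTS`` variable, which the scheduler sets to
the number of slots granted to a parallel job; the job name and memory value
are placeholders.

::

    #!/bin/bash
    #$ -N mbRun
    #$ -cwd
    #$ -S /bin/bash
    # Request 2 slots in the "orte" parallel environment
    #$ -pe orte 2
    # Memory is requested per slot (placeholder value)
    #$ -l h_vmem=2G

    source ~/.bash_profile

    # $NSLOTS holds the number of slots granted, so -np always matches the request
    mpirun -np "$NSLOTS" mb runfile.nex

A script written this way would be submitted with a plain ``qsub mbsub.sh``.
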
Mira assemblies: if you indicate SK:not=4 in your Mira command line
(i.e. 4 threads for the SKIM algorithm), submit the job as follows:
* ``[user@gyra ~]$ qsub -pe orte 4 my_mira.sh``
Other programmes (e.g. codonPHYML) have OpenMP support, which will
dynamically adjust the number of threads used. Consequently, the number of
threads must be limited to the number of CPUs requested in the parallel
environment: this is done by setting the environment variable 'OMP_NUM_THREADS'
in the submission script, like so for 12 threads::
[...]
#Parallel environment
#$ -pe orte 12
export OMP_NUM_THREADS=12
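
To avoid keeping two numbers in step, the thread count can be derived from the
slots granted by the queue. This is a sketch rather than site policy: it
assumes Grid Engine's ``$NSLOTS`` variable and falls back to a single thread
when the script is run outside the queue.

::

    #!/bin/bash
    #$ -cwd
    #$ -pe orte 12

    # Use the number of slots granted by the queue; default to 1 outside the queue.
    export OMP_NUM_THREADS="${NSLOTS:-1}"
    echo "Running with ${OMP_NUM_THREADS} thread(s)"

    # ...the OpenMP-enabled program would be started here...
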
Monitoring the queue
--------------------
The command 'qstat' describes the current status of the queue:
::
queuename qtype resv/used/tot. load_avg arch states
---------------------------------------------------------------------------------
all.q@compute-0-1.local BIP 0/8/8 9.00 lx26-amd64
hc:h_vmem=275.720M
23194 0.55167 smlbrF cymon r 07/10/2015 11:29:34 8
---------------------------------------------------------------------------------
all.q@compute-0-2.local BIP 0/31/32 31.04 lx26-amd64
hc:h_vmem=0.000
22502 0.59167 dpplmbrc1 cymon r 06/09/2015 10:05:18 14
22505 0.59167 dppmbrc1 cymon r 06/09/2015 10:11:48 14
23391 0.60500 muitos_2he leonor r 07/28/2015 13:27:33 3
---------------------------------------------------------------------------------
all.q@compute-0-3.local BIP 0/15/16 15.03 lx26-amd64
hc:h_vmem=1.540G
23379 0.50500 montipora regina r 07/27/2015 12:46:33 1
23384 0.59167 tmlbr1 cymon r 07/27/2015 15:33:48 14
---------------------------------------------------------------------------------
all.q@compute-0-4.local BIP 0/27/32 27.06 lx26-amd64
hc:h_vmem=7.451G
22504 0.59167 dpplmbrc2 cymon r 06/09/2015 10:10:18 14
23391 0.60500 muitos_2he leonor r 07/28/2015 13:27:33 13
---------------------------------------------------------------------------------
all.q@compute-0-5.local BP 0/0/16 4.00 lx26-amd64 d
hc:h_vmem=84.415G
---------------------------------------------------------------------------------
class@compute-0-5.local BIP 0/0/12 4.00 lx26-amd64
hc:h_vmem=84.415G
---------------------------------------------------------------------------------
assembly@compute-0-5.local BP 0/4/16 4.00 lx26-amd64 d
hc:h_vmem=84.415G
23046 0.50500 spCV4r1 cymon r 07/06/2015 11:33:49 1
23049 0.50500 spCV4r2 cymon r 07/06/2015 11:51:34 1
23051 0.50500 spCV8r1 cymon r 07/06/2015 11:52:49 1
23052 0.50500 spCV8r2 cymon r 07/06/2015 11:53:04 1
---------------------------------------------------------------------------------
head@gyra.local BP 0/2/12 1.33 lx26-amd64
hc:h_vmem=22.077G
23381 0.50500 p4Sr2 cymon r 07/27/2015 14:35:18 1
23393 0.50500 blastdb cymon r 07/28/2015 15:31:48 1
############################################################################
- PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
23392 0.50500 mramos miguel qw 07/28/2015 14:42:02 1
* ``[user@gyra ~]$ qstat -u <username>`` - display only those jobs for 'username'
* ``[user@gyra ~]$ qstat -j <jobnumber>`` - display details of a particular 'jobnumber' in the queue
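
For a quick summary, the default ``qstat`` listing can be piped through
``awk``. The sketch below assumes the standard Grid Engine layout, with two
header lines and the job state in the fifth column; ``count_job_states`` is a
hypothetical helper name, not an installed command.

::

    #!/bin/bash
    # Count jobs per state (r = running, qw = queued and waiting, ...)
    # from qstat-style text read on standard input.
    count_job_states() {
        awk 'NR > 2 && NF >= 5 { n[$5]++ } END { for (s in n) print s, n[s] }'
    }

    # Summarise your own jobs if qstat is available on this machine.
    if command -v qstat >/dev/null 2>&1; then
        qstat -u "$USER" | count_job_states
    fi
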
Deleting jobs from the queue
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Use the following commands:

* ``[user@gyra ~]$ qdel -u <username>`` - will delete all jobs of the user 'username'
* ``[user@gyra ~]$ qdel <jobnumber>`` - will delete 'jobnumber' from the queue
.. _ngs:
Next-generation sequence assembly
---------------------------------
Due to the large volumes of data, NGS assembly can require substantial
computational resources, especially RAM. Consequently, NGS assembly jobs run
within a restricted environment in a special queue called *assembly*.

Note: To use the *assembly* queue, users must request to be added to the
access group for the queue.

**Note that assembly jobs run in all.q will be summarily killed.**
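
Once access has been granted, a job is directed to the queue with ``-q``. A
minimal sketch, in which the script name and the memory value are placeholders
to be adapted to the assembler and data set:

::

    # Submit to the assembly queue with 60G of virtual memory per slot (placeholder values).
    qsub -q assembly -l h_vmem=60G my_assembly.sh
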
Software
--------
- `SEQanswers software wiki `_
- `Broad Institute Software list `_
The following software is available on the cluster (in no
particular order):
Motif and pattern searching
~~~~~~~~~~~~~~~~~~~~~~~~~~~
`NCBI BLAST (2.2.21(legacy blastall) and 2.3.0+) `_
Other `BLAST databases `_ and custom databases on request.
`BLAST+ documentation. `_
- nt
- nr
- refseq\_genomic
- refseq\_protein
- refseq\_rna
- swissprot
- taxdb
- bonyfish_proteins (56,707 records)
- plastid_proteins (83,168 records)
- plastid_genomes (511 records)
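
As an illustration, a nucleotide search against *nt* with BLAST+ could be
submitted with a script along these lines. The query file name, slot count,
memory value and E-value cut-off are hypothetical, and the sketch assumes the
database location is already configured in your environment (for example
through ``BLASTDB``).

::

    #!/bin/bash
    #$ -N blastNt
    #$ -cwd
    #$ -S /bin/bash
    #$ -pe orte 4
    #$ -l h_vmem=4G

    source ~/.bash_profile

    # Tabular output (-outfmt 6); threads matched to the slots granted by the queue
    blastn -query myQuery.fasta -db nt -outfmt 6 -evalue 1e-5 \
        -num_threads "$NSLOTS" -out myQuery_vs_nt.tsv
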
`MpiBLAST (1.6.0) `_ See :ref:`submitting_parallel_jobs`
mpiBLAST is a freely available, open-source, parallel implementation of NCBI
BLAST. By efficiently utilizing distributed computational resources through
database fragmentation, query segmentation, intelligent scheduling, and
parallel I/O, mpiBLAST improves NCBI BLAST performance by several orders of
magnitude while scaling to hundreds of processors.
- nr
- nt
- bonyfish_proteins (56,707 records)
- plastid_proteins (83,168 records)
(other db's available on request)
`HMMER (3.1b2 and 2.3.2) `_
HMMER is used for searching sequence databases for homologs of protein
sequences, and for making protein sequence alignments. It implements methods
using probabilistic models called "profile hidden Markov models" (profile
HMMs). `Documentation <_static/Hmmer3_Userguide.pdf>`_
`Infernal (1.1rc3) `_
Infernal ("INFERence of RNA ALignment") is for searching DNA sequence databases for RNA structure and sequence similarities.
`Documentation <_static/Infernal-1.1rc3_Userguide.pdf>`_
`TAMO (1.0_120321) `_
TAMO: a flexible, object-oriented framework for analyzing transcriptional regulation using DNA-sequence motifs.
`SignalP (4.1c) `_
SignalP 4.1 server predicts the presence and location of signal peptide
cleavage sites in amino acid sequences from different organisms:
Gram-positive prokaryotes, Gram-negative prokaryotes, and eukaryotes. The
method incorporates a prediction of cleavage sites and a signal
peptide/non-signal peptide prediction based on a combination of several
artificial neural networks.
`TMHMM (2.0c) `_
Prediction of transmembrane helices in proteins.
`RNAmmer (1.2) `_
The RNAmmer 1.2 server predicts 5s/8s, 16s/18s, and 23s/28s ribosomal RNA in full genome sequences.
`MEME-suite (4.10.2) `_
Motif-based sequence analysis tools.
Multiple and pair-wise sequence alignment
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
`ClustalW (1.1.18 (clustalw) and 2.0.12 (clustalw2)) `_
ClustalW2 is a general purpose multiple sequence alignment program for DNA
or proteins.
`T\_Coffee (8.14) `_
A collection of tools for computing, evaluating and manipulating multiple
alignments of DNA, RNA, protein sequences and structures.
`Muscle (3.8.31) `_
"Faster and more accurate than CLUSTALW"...
`Uclust (1.0.50 and 1.2.22q (Qiime)) `_
"Search and clustering hundreds of times faster than BLAST"...
`Usearch (5.2.32) `_
USEARCH is a unique high-throughput sequence analysis tool. It supports a
`variety of algorithms `_ for
sequence searching, clustering, and filtering
`Vsearch (1.1.3) and vsearch-bz (bzip2 compression version) `_
The aim of this project is to create an alternative to the USEARCH tool.
vsearch-bz can directly read input query and database files that are compressed in bzip2 format.
`Mafft (6.833b) `_
MAFFT is a multiple sequence alignment program for unix-like operating
systems. It offers a range of multiple alignment methods, L-INS-i (accurate;
for alignment of fewer than about 200 sequences), FFT-NS-2 (fast; for alignment
of fewer than about 10,000 sequences), etc.
`GBlocks (0.91b) `_
Selection of conserved blocks from multiple alignments for their use in
phylogenetic analysis.
`Exonerate (2.2.0) `_
Exonerate is a generic tool for pairwise sequence comparison.
`TranslatorX `_
A perl script for nucleotide sequence alignment and alignment cleaning based
on amino acid information. Uses ReadSeq and GBlocks. Alignments via Muscle,
Clustalw, MAFFT, and T-Coffee.
`Blat `_
BLAT (the BLAST-Like Alignment Tool) is a software program developed by Jim
Kent at UCSC to identify similarities between DNA sequences and protein
sequences.
`SEED `_
SEED is a software for clustering large sets of Next Generation Sequences
(NGS) with hundreds of millions of reads in a time and memory efficient
manner. Its algorithm joins highly similar sequences into clusters that can
differ by up to three mismatches and three overhanging residues.
`Article. `_
`LASTZ (1.02.00) `_
LASTZ is a program for aligning DNA sequences, a pairwise aligner.
Originally designed to handle sequences the size of human chromosomes and
from different species, it is also useful for sequences produced by NGS
sequencing technologies such as Roche 454. `Documentation
`_
`GMAP (13-02-15) `_
A Genomic Mapping and Alignment Program for mRNA and EST Sequences.
Population genetics / coalescent
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
`PopABC (1.0) `_
PopABC is an Approximate Bayesian Computation (ABC) method to estimate
historical demographic parameters (e.g. population size, migration rate,
mutation rate, recombination rate, splitting events) within an Isolation with
migration (IM) population model.
`Migrate-n `_
Estimation of population sizes and gene flow using the coalescent.
`Lamarc (2.1.6) `_
LAMARC is a program which estimates population-genetic parameters such as
population size, population growth rate, recombination rate, and migration
rates. It approximates a summation over all possible genealogies that could
explain the observed sample, which may be sequence, SNP, microsatellite, or
electrophoretic data. LAMARC and its sister program Migrate are successor
programs to the older programs Coalesce, Fluctuate, and Recombine, which are
no longer being supported.
`Documentation `_
`IMa2 (8/26/2011) `_
The program implements a method for generating posterior probabilities for
complex demographic population genetic models. IMa2 works similarly to the
older IMa program, with some important additions. IMa2 can handle data and
implement a model for multiple populations (for numbers of sampled
populations between one and ten) – not just two populations (as was the case
with the original IM and IMa programs).
`Bayesian Phylogenetics and Phylogeography (BPP; vers. 3) `_
Coalescent analysis on a species tree (BP&P and 3s)
`Documentation <_static/bppDOC.pdf>`_
`MP-EST (1.5) `_
The MP-EST method estimates species trees from a set of gene trees by maximizing
a pseudo-likelihood function. The program is written in C. The parallel version
of the program can run independent searches (chains) in parallel. Each chain
starts with a different seed. The program will find the estimate of the species
tree with the largest likelihood score across chains. To use MP-EST, you need to
create two files; a gene tree file and a control file.
`Documentation <_static/MPEST-Manual-1.5.pdf>`_
`ASTRAL (github) `_
ASTRAL is a Java program for estimating a species tree given a set of
unrooted gene trees. ASTRAL is statistically consistent under multi-species
coalescent model (and thus is useful for handling ILS). It finds the tree
that maximizes the number of induced quartet trees in the set of gene trees
that are shared by the species tree.
`Documentation <_static/astral-tutorial.pdf>`_
`Structurama2 `_
Structurama is a program for inferring population structure from genetic
data. The program assumes that the sampled loci are in linkage equilibrium
and that the allele frequencies for each population are drawn from a
Dirichlet probability distribution.
`Documentation `_
`ANGSD `_
ANGSD is a software for analyzing next generation sequencing data.
`BayeScan (2.1) `_
BayeScan aims at identifying candidate loci under natural selection from genetic data, using differences in allele frequencies between populations.
Phylogenetic analyses
~~~~~~~~~~~~~~~~~~~~~
`MrBayes MPI (3.2.6) `_ and `MrBayes v3.2-svn(r517- development version) `_
with `Beagle-lib `_
MrBayes is a program for the Bayesian estimation of phylogeny.
(See :ref:`submitting_parallel_jobs`)
`Documentation <_static/Manual_MrBayes_v3.2.0.pdf>`_
`Phyml (dev) `_
A simple, fast, and accurate algorithm to estimate large phylogenies by
maximum likelihood. Guindon S., Gascuel O. Systematic Biology,
52(5):696-704, 2003.
`Documentation <_static/PhyML-3.1_manual.pdf>`_
`RAxML (7.0.4) `_ and `(7.8.4 - github 4th Nov 2013) `_
Maximum likelihood estimation of phylogenies.
(See :ref:`submitting_parallel_jobs`).
`Documentation (7.0.4) <_static/RAxML-Manual.7.0.4.pdf>`_
`Documentation (newest vers 8+) `_
`About parallelization. `_
`ExaML (1.0.0) `_
Exascale Maximum Likelihood (ExaML) code for phylogenetic inference using MPI.
This code implements the popular RAxML search algorithm for maximum
likelihood based inference of phylogenetic trees. It uses a radically new
MPI parallelization approach that yields improved parallel efficiency, in
particular on partitioned multi-gene or whole-genome datasets.
`Documentation (7.0.4) <_static/ExaML.pdf>`_
`FastTree / FastTreeMP (2.1.5) `_
FastTree infers approximately-maximum-likelihood phylogenetic trees from
alignments of nucleotide or protein sequences. FastTree can handle
alignments with up to a million sequences in a reasonable amount of time
and memory. For large alignments, FastTree is 100-1,000 times faster than
PhyML 3.0 or RAxML 7.
If using FastTreeMP, set the environment variable OMP_NUM_THREADS in the
submission script. `Documentation `_
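
A sketch of such a submission script for a nucleotide alignment follows. The
file names and slot count are placeholders; ``-nt`` and ``-gtr`` are
FastTree's options for nucleotide data and the GTR model, and ``$NSLOTS`` is
assumed to be set by Grid Engine for parallel jobs.

::

    #!/bin/bash
    #$ -N ftree
    #$ -cwd
    #$ -S /bin/bash
    #$ -pe orte 8

    source ~/.bash_profile

    # Limit OpenMP threads to the slots granted by the queue
    export OMP_NUM_THREADS="$NSLOTS"

    FastTreeMP -nt -gtr myAlignment.fasta > myAlignment.tre
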
`qmmraxmlHPC (1.0) `_
Uses a class-frequency (cF) mixture model to model site-specific
distributions for phylogenetic inference.
`Phylip (3.6.8) `_
All things Felsenstein.
`Phylobayes (3.3f) `_
PhyloBayes is a Bayesian Monte Carlo Markov Chain (MCMC) sampler for
phylogenetic reconstruction using protein alignments. Compared to other
phylogenetic MCMC samplers, the main distinguishing feature of PhyloBayes is
the underlying probabilistic model, CAT (Lartillot and Philippe, 2004). CAT
is a mixture model especially devised to account for site-specific features
of protein evolution. It is particularly well suited for large multigene
alignments, such as those used in phylogenomics.
`Documentation `_
`PhyloBayes MPI (1.5a) `_
PhyloBayes-MPI is a Bayesian Markov chain Monte Carlo (MCMC) sampler for
phylogenetic inference exploiting a message-passing-interface system for
multi-core computing.
`Documentation `_
`NH Phylobayes (0.2.3) `_
A Bayesian compound stochastic process for modeling nonstationary and
nonhomogeneous sequence evolution.
`Some notes on running nhpb `_
.. _p4:
`p4 (1.0) (python2.7.9 : numpy-1.9.1 : gsl-1.14) `_
P4 is a Python package that does maximum likelihood and Bayesian
phylogenetic analyses on molecular sequences. Its specialty is that you can
use heterogeneous models, where the model parameters can differ in different
parts of the tree, or over different parts of the data.
Includes `Qdist `_ module - `Installation `_
`Documentation `_
`DendroPy (3.12.0) `_
DendroPy is a Python library for phylogenetic computing. It provides classes
and functions for the simulation, processing, and manipulation of
phylogenetic trees and character matrices, and supports the reading and
writing of phylogenetic data in a range of formats, such as NEXUS, NEWICK,
NeXML, Phylip, FASTA, etc. Application scripts for performing some useful
phylogenetic operations, such as data conversion and tree posterior
distribution summarization, are also distributed and installed as part of
the library. DendroPy can thus function as a stand-alone library for
phylogenetics, a component of more complex multi-library phyloinformatic
pipelines, or as a scripting “glue” that assembles and drives such
pipelines. `Tutorial `_
`BayesTraits `_
BayesTraits is a computer package for performing analyses of trait evolution
among groups of species for which a phylogeny or sample of phylogenies is
available. This new package incorporates our earlier and separate programs
Multistate, Discrete and Continuous. BayesTraits can be applied to the
analysis of traits that adopt a finite number of discrete states, or to the
analysis of continuously varying traits. Hypotheses can be tested about
models of evolution, about ancestral states and about correlations among
pairs of traits.
`BayesPhylogenies (1.0) `_
BayesPhylogenies is a general package for inferring phylogenetic trees using
Bayesian Markov Chain Monte Carlo (MCMC) or Metropolis-coupled Markov chain
Monte Carlo (MCMCMC) methods. The program allows a range of models of gene
sequence evolution, models for morphological traits, models for rooted
trees, gamma and beta distributed rate-heterogeneity, and implements a
'mixture model' (Pagel and Meade, 2004) that allows the user to fit more
than one model of sequence evolution, without partitioning the data.
`Beast (2.1.3) `_ with `Beagle-lib `_
BEAST is a cross-platform program for Bayesian MCMC analysis of molecular
sequences. It is entirely orientated towards rooted, time-measured
phylogenies inferred using strict or relaxed molecular clock models. It can
be used as a method of reconstructing phylogenies but is also a framework
for testing evolutionary hypotheses without conditioning on a single tree
topology. BEAST uses MCMC to average over tree space, so that each tree is
weighted proportional to its posterior probability.
`Installation and use of Beagle-lib with Beast <_static/beast_beagle_notes.html>`_
Plugins: `SNAPP (1.1.5) `_
`Templates `_
`StarBEAST tutorial `_
`BUG in *BEAST `_
`Modelgenerator (85) `_
ModelGenerator is a model selection program that selects optimal amino acid
and nucleotide substitution models from Fasta or Phylip alignments.
ModelGenerator supports 56 nucleotide and 96 amino acid substitution models.
`Jmodeltest (0.1.1) `_
jModelTest is a tool to carry out statistical selection of best-fit models
of nucleotide substitution. It implements five different model selection
strategies: hierarchical and dynamical likelihood ratio tests (hLRT and
dLRT), Akaike and Bayesian information criteria (AIC and BIC), and a
decision theory method (DT). It also provides estimates of model selection
uncertainty, parameter importances and model-averaged parameter estimates,
including model-averaged phylogenies.
`MrModeltest2 (2.3) `_
C program for selecting DNA substitution models using PAUP\*.
`ModelOMatic (1.0) `_
ModelOMatic is a C++ program designed for rapid phylogenetic model selection
on protein coding genes. Please see the manual for details of the program
and its settings. `Documentation `_.
`Examples `_.
`Garli (vers 1.0 and vers 2.0 MPI) `_
GARLI (Genetic Algorithm for Rapid Likelihood Inference) performs
phylogenetic searches on aligned nucleotide, codon and amino acid data sets
using the maximum likelihood criterion. On a practical level, the program is
able to perform maximum-likelihood tree searches on large data sets in a
number of hours.
`Prottest (2.4) `_
PROTTEST (ModelTest's relative) is a program for selecting the model of
protein evolution that best fits a given set of sequences (alignment). This
java program is based on the Phyml program (for maximum likelihood
calculations and optimization of parameters) and uses the PAL library as
well. Models included are empirical substitution matrices (such as WAG, LG,
mtREV, Dayhoff, DCMut, JTT, VT, Blosum62, CpREV, RtREV, MtMam, MtArt, HIVb,
and HIVw) that indicate relative rates of amino acid replacement, and
specific improvements (+I:invariable sites, +G: rate heterogeneity among
sites, +F: observed amino acid frequencies) to account for the evolutionary
constraints imposed by conservation of protein structure and function.
ProtTest uses the Akaike Information Criterion (AIC) and other statistics
(AICc and BIC) to find which of the candidate models best fits the data at
hand.
`PAUP (4.0b10) `_
Needs no introduction. - 4.0 final release expected any day now...
`PAML (4.6) `_
PAML is a package of programs for phylogenetic analyses of DNA or protein
sequences using maximum likelihood. It is maintained and distributed for
academic use free of charge by Ziheng Yang. ANSI C source codes are
distributed for UNIX/Linux/Mac OSX, and executables are provided for MS
Windows. PAML is not good for tree making. It may be used to estimate
parameters and test hypotheses to study the evolutionary process, when you
have reconstructed trees using other programs such as PAUP\*, PHYLIP,
MOLPHY, PhyML, RaxML, etc.
`Tree-Puzzle (5.2) `_
TREE-PUZZLE is a computer program to reconstruct phylogenetic trees from
molecular sequence data by maximum likelihood. It implements a fast tree
search algorithm, quartet puzzling, that allows analysis of large data sets
and automatically assigns estimations of support to each internal branch.
TREE-PUZZLE also computes pairwise maximum likelihood distances as well as
branch lengths for user specified trees. Branch lengths can be calculated
with and without the molecular-clock assumption. In addition, TREE-PUZZLE
offers likelihood mapping, a method to investigate the support of a
hypothesized internal branch without computing an overall tree and to
visualize the phylogenetic content of a sequence alignment. TREE-PUZZLE
also conducts a number of statistical tests on the data set (chi-square test
for homogeneity of base composition, likelihood ratio to test the clock
hypothesis, one and two-sided Kishino-Hasegawa test, Shimodaira-Hasegawa
test, Expected Likelihood Weights).
`Consel (0.20) `_
CONSEL is a program package consisting of small programs written in C.
It calculates the probability value (i.e., p-value) to assess the
confidence in the selection problem. Although CONSEL is applicable to any
selection problem, it is mainly designed for the phylogenetic tree
selection. CONSEL calculates the p-value using several testing procedures;
the bootstrap probability, the Kishino-Hasegawa test, the
Shimodaira-Hasegawa test, and the weighted Shimodaira-Hasegawa test. In
addition to these conventional tests, CONSEL calculates the p-value based on
the approximately unbiased test using the multi-scale bootstrap technique.
`Documentation <_static/consel.pdf>`_
`Tree Congruence Tester (tct.py) <_static/tct.html>`_ - (Requires :ref:`p4 `)
Tree Congruence Test(er): reads two rooted trees (NEXUS and/or PHYLIP
format), reciprocally prunes each tree of missing taxa (automatic), deletes
any taxa passed to the programme (via -d), and tests topological congruence
among the remaining clades supported by a value greater than the set
threshold.
`combineNexus.py <_static/combineNexus.html>`_ - (Requires :ref:`p4 ` and >= Python 2.7)
Reads Nexus formatted matrices and combines them into a single matrix with
blank sequences for genes that are missing from individual matrices.
`calculatePhylogeneticDiversity.py <_static/calculatePhylogeneticDiversity.html>`_ - (Requires :ref:`p4 ` and >= Python 2.7)
Calculate the Phylogenetic Diversity (PD: Faith 1992) of a group of taxa on
a tree. PD is the minimum total length of all the phylogenetic branches
required to span a given set of taxa on the phylogenetic tree (and does not
include the stem branch of a clade).
`makeConsensusTree.py <_static/makeConsensusTree.html>`_ - (Requires :ref:`p4 ` and >= Python 2.7)
Write a majority rule consensus tree from 1 or more Nexus or Newick
formatted tree files. Each tree file can have a specified burnin and/or step
count. The data file (in Phylip or Nexus format) from which the trees are
derived must be supplied.
`Update_BEAST_operators.py <_static/Update_BEAST_operators.html>`_
Update the BEAST operator tuning values in the XML run file based on the
suggestions output at the end of a previous MCMC in the log file. Only BEAST
version 2.3.0 supported.
`Concaterpillar (1.4) `_
- (Requires SCIPY :ref:`SciPy ` and `pyMPI `_)
A hierarchical likelihood ratio test for phylogenetic congruence.
`Documentation `_
(See :ref:`submitting_parallel_jobs`)
`minmax-chisq `_
Reduced amino acid alphabets for phylogenetic inference.
`Documentation `_
`Crux (1.2.0) `_
Crux is a software toolkit for molecular phylogenetic inference. Incl:
Bayesian Markov chain Monte Carlo (MCMC) methods (with Metropolis coupling
and MPI support for parallel computation) can sample among non-nested models
using reversible model jumps. Polytomous trees can be sampled, also via
reversible jumps. In fact, every non-essential model parameter that Crux's
MCMC implementation estimates can be expunged via reversible jumps.
`Notes on running Crux. `_
`Figtree (1.3.1) `_
FigTree is designed as a graphical viewer of phylogenetic trees and as a
program for producing publication-ready figures.
`Misfits (1.0) `_
MISFITS is a program to evaluate the goodness of fit of a model to an
alignment in phylogeny reconstruction.
`Documentation `_
`Compass (1.0) `_
COMPASS is a UNIX program for identifying and removing fast evolving sites from
morphological or molecular data matrices using a number of compatibility-based
methods.
`Documentation `_
`AIS: Almost Invariant Sets `_
The goal is to identify sets of amino acids with a high probability of
change between elements of the set but small probability of change between
different sets by using amino acid replacement matrices and their
eigenvectors. After identification of the subsets the quality of the
partition is assessed with a conductance measure.
`Documentation `_
`TreeFinder `_
TREEFINDER computes phylogenetic trees from molecular sequences. The program
infers even large trees by maximum likelihood under a variety of models of
sequence evolution. `Documentation `_
`LineageSpecificSeqgen (timestamp: Mar 16 2011) `_
LineageSpecificSeqgen is an extension to the seq-gen program that allows
generation of sequences with both changes in the proportion of variable
sites and changes in the rate at which sites switch between being variable
and invariable.
`INDELible (1.03) `_
INDELible is a new, portable, and flexible application for biological
sequence simulation that combines many features in the same place for the
first time. Using a length-dependent model of indel formation it can
simulate evolution of multi-partitioned nucleotide, amino-acid, or codon
data sets through the processes of insertion, deletion, and substitution in
continuous time. Nucleotide simulations may use the general unrestricted
model or the general time reversible model and its derivatives, and
amino-acid simulations can be conducted using fifteen different empirical
rate matrices. Substitution rate heterogeneity can be modelled via the
continuous and discrete gamma distributions, with or without a proportion of
invariant sites. INDELible can also simulate under non-homogeneous and
non-stationary conditions where evolutionary models are permitted to change
across a phylogeny. Unique among indel simulation programs, INDELible
offers the ability to simulate using codon models that exhibit
nonsynonymous/synonymous rate ratio heterogeneity among sites and/or
lineages. `Documentation
`_
`PHASE (2.0) `_
This package is designed specifically for use with RNA sequences that have a
conserved secondary structure, e.g., rRNA and tRNA. It is well known that
compensatory substitutions occur in the paired regions of RNA secondary
structures; this means that substitutions occurring on one side of a pair
are correlated with substitutions on the other side. Most phylogenetic
programs assume that each site in a molecule evolves independently of the
others but this assumption is not valid for RNA genes.
`Documentation
`_
`BEST (2.3.1) `_
BEST is a free phylogenetics program written by Liang Liu to estimate the
joint posterior distribution of gene trees and species tree using multilocus
molecular data that accounts for deep coalescence but not for other issues
such as horizontal transfer or gene duplication.
`CodonPHYML (1.00 201306.18) `_
CodonPhyML uses Markovian codon models of evolution in phylogeny
reconstruction. Given a set of species characterized by their DNA sequences
as input, codonPhyML will return the phylogenetic tree that best describes
their evolutionary relationship. OMP support, no BLAS/LAPACK.
`Documentation <_static/codonPhyML_Manual.pdf>`_
`SymTest `_
R functions: Test for Symmetry of Matched DNA Sequences, Overall Test for Marginal Symmetry
`fastCodeML `_
FastCodeML is software for inferring positive selection along positions of a
protein-coding gene using the branch-site model of evolution. By using a
hybrid (OpenMP/MPI) strategy, FastCodeML can reach a speed-up of up to 10
times compared with codeml.
`PhyML-4X `_
LG4X: Modeling Protein Evolution with Several Amino-Acid Replacement
Matrices Depending on Site Rates
`DPPDiv (1.0b) `_
DPPDiv is a program for estimating species divergence times and
lineage-specific substitution rates on a fixed topology. The prior on branch
rates is a Dirichlet process prior which clusters branches into distinct
rate classes.
`PLL-DPPDIV `_
We present a substantially improved and parallelized version of DPPDiv, a
software tool for estimating species divergence times and lineage-specific
substitution rates on a fixed tree topology.
`FDPPDiv `_
Fossilised Birth-Death Model of T. Heath - development version of DPPDiv
`RogueNaRok `_
A versatile and scalable algorithm for rogue taxon identification.
`ccdprobs `_
Software to estimate probabilities of tree topologies using conditional
clade distributions.
`BUCKy `_
BUCKy estimates the dominant history of sampled individuals, and how much of
the genome supports each relationship, using Bayesian concordance analysis.
BUCKy does not assume that genes (or loci) all have the same topology.
Instead, groups of genes sharing the same tree are detected (while
accounting for uncertainty in gene tree estimates), and then combined to
gain more resolution on their common tree. No assumption is made regarding
the reason for discordance among gene trees. `Documentation `_
Genome assembly
~~~~~~~~~~~~~~~
Roche 454 (FLX Titanium)
++++++++++++++++++++++++
`Roche 454 Data Analysis suite (2.9_All_20130530_1559) `_
Obtain biologically meaningful results from your sequence data quickly and
affordably with the powerful suite of analysis tools provided with the
Genome Sequencer FLX System, updated for the GS FLX Titanium series. AKA
Newbler *et al.* (See :ref:`NGS assembly `)
Documentation:
- `SWManual-v2.9_Overview.pdf `_
- `SWManual-v2.9_PartB.pdf `_
- `SWManual-v2.9_PartC.pdf `_
- `SWManual-v2.9_PartD.pdf `_
`Octupus (0.1.1) `_
OCTUPUS uses a novel method of sequence clustering and pairwise comparisons
which reduces the influence of chimeras and intraspecific diversity on
cluster generation. The clustering approach used to generate OCTUs is
designed with the intent to reflect the expected pattern of diversity of
rDNA genes. Additionally, OCTUPUS provides a method to screen clusters for
evidence of chimera formation without the use of a reference database.
OCTUPUS is optimized for speed, and does not require a computing cluster to
analyze typical large scale datasets.
`PRICE (1.2) `_
PRICE (Paired-Read Iterative Contig Extension), a de novo genome assembler
implemented in C++. Its name describes the strategy that it implements for
genome assembly: PRICE uses paired-read information to iteratively increase
the size of existing contigs. Initially, those contigs can be individual
reads from a subset of the paired-read dataset, non-paired reads from
sequencing technologies that provide non-paired data, or contigs that were
output from a prior run of PRICE or any other assembler.
`Documentation `_
Illumina (HiSeq 2000, GAIIx, GAIIe)
+++++++++++++++++++++++++++++++++++
`ABySS (1.5.2) `_
(MPI with `sparsehash `_ - max kmer 96)
ABySS is a de novo, parallel, paired-end sequence assembler that is designed
for short reads. The single-processor version is useful for assembling
genomes up to 100 Mbases in size. The parallel version is implemented using
MPI and is capable of assembling larger genomes. (See :ref:`NGS assembly `)
- The default abyss binaries are compiled with the default maximum k-mer
value of 64. A second compiled version of the binaries, built with
``--enable-maxk=32``, can be found at ``/share/apps/abyss32k/bin/*`` and
should be more memory efficient.
`Trans-ABySS (1.4.4) `_
Trans-ABySS is a software pipeline for analyzing ABySS-assembled contigs
from shotgun transcriptome data. The pipeline accepts assemblies that were
generated across a wide range of k values in order to address variable
transcript expression levels. It first filters and merges the multi-k
assemblies, generating a much smaller set of nonredundant contigs. It
contains scripts that map assembled contigs to known transcripts, currently
supporting Blat and Exonerate contig-to-genome aligners. It identifies novel
splicing events like exon-skipping, novel exons, retained introns, novel
introns, and alternative splicing. Its scripts can also estimate gene
expression levels, identify candidate polyadenylation sites, and identify
candidate gene-fusion events.
`Documentation `_
(See :ref:`NGS assembly `)
`SOAP `_
SOAPaligner/soap2 is a member of SOAP (Short Oligonucleotide Analysis
Package). It is an updated version of the SOAP software for short
oligonucleotide alignment. The new program features super-fast and
accurate alignment of the huge numbers of short reads generated by the
Illumina/Solexa Genome Analyzer.
- soap2sam.pl
- 2bwt-builder
- SOAPdenovo2 (r240): A de novo short reads assembler.
- SOAPdenovo-127mer
- SOAPdenovo-63mer
- soap (SOAPaligner/soap2) (2.20): Short Oligonucleotide Analysis Package.
`SOAPdenovo-Trans (1.03) `_
A de novo transcriptome assembler based on the SOAPdenovo framework, adapted to
alternative splicing and differing expression levels among transcripts. The
assembler provides a more accurate, complete, and faster way to construct
full-length transcript sets.
`Minia (2.0.2) `_
Minia is a short-read assembler based on a de Bruijn graph, capable of
assembling a human genome on a desktop computer in a day. The output of
Minia is a set of contigs. Minia produces results of similar contiguity and
accuracy to other de Bruijn assemblers (e.g. Velvet).
Note: this assembler is very memory efficient, but it *doesn't* use the pairing
information (although it will accept paired data).
`Documentation `_
`FAQ `_
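As a rough illustration of the de Bruijn graph idea that Minia and similar
assemblers build on, the Python sketch below links each (k-1)-mer to the
(k-1)-mers that follow it in the reads. It is not Minia's memory-efficient
implementation; the function name ``build_de_bruijn`` and the toy reads are
hypothetical.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        # Slide a window of length k along the read; each k-mer is one edge.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


toy_reads = ["ACGTAC", "CGTACG"]  # hypothetical reads, for illustration only
graph = build_de_bruijn(toy_reads, k=3)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

Contigs correspond to unbranched paths through such a graph. A real assembler
also has to handle reverse complements and sequencing errors, which this
sketch ignores.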
`Maq (0.7.1) `_
Maq is software that builds mapping assemblies from short reads generated
by next-generation sequencing machines. It is designed particularly for the
Illumina-Solexa 1G Genetic Analyzer, and has preliminary functions to handle
ABI SOLiD data.
`Documentation `_
`Velvet (1.2.10) `_
Velvet is a de novo genomic assembler specially designed for short read
sequencing technologies, such as Solexa or 454, developed by Daniel Zerbino
and Ewan Birney at the European Bioinformatics Institute (EMBL-EBI), near
Cambridge, in the United Kingdom. Velvet currently takes in short read
sequences, removes errors then produces high quality unique contigs. It then
uses paired-end read and long read information, when available, to retrieve
the repeated areas between contigs. MAXKMERLENGTH=64, LONGSEQUENCES=1
`Documentation <_static/Velvet_Manual.pdf>`_
(See :ref:`NGS assembly `)
`VelvetOptimiser.pl (2.2.4) `_
VelvetOptimiser is a multi-threaded Perl script for automatically
optimising the three primary parameter options (K, -exp_cov,
-cov_cutoff) for the Velvet de novo sequence assembler.
`Documentation `_
`Trinity (2.0.6) - RNA-Seq De novo Assembly `_
A novel method for the efficient and robust de novo reconstruction of
transcriptomes from RNA-seq data. Trinity combines three independent
software modules: Inchworm, Chrysalis, and Butterfly, applied sequentially
to process large volumes of RNA-seq reads. Trinity partitions the sequence
data into many individual de Bruijn graphs, each representing the
transcriptional complexity at a given gene or locus, and then processes
each graph independently to extract full-length splicing isoforms and to
tease apart transcripts derived from paralogous genes.
`ALLPATHS-LG (42816 - 5th Sept 2012) `_
Short read genome assembler from the Computational Research and Development group at the Broad Institute.
`Documentation <_static/AllPaths-LG_Manual.pdf>`_
Note that the libraries have to be of specific kinds... see documentation.
`Spades (3.5.0) `_
SPAdes is an assembler for both single-cell and standard (multicell) data.
Its authors report that it improves on the E+V-SC assembler
(specialized for single-cell data) and on the popular assemblers Velvet and
SoapDeNovo (for multicell data).
`Documentation `_
`MetaVelvet (1.2.01) `_
A short read assembler for metagenomics.
MAXKMERLENGTH=64 CATEGORIES=4
`IDBA (1.1.1) `_
IDBA is a practical iterative De Bruijn Graph De Novo Assembler for sequence assembly in bioinformatics.
`IDBA-UD (1.1.1) `_
IDBA-UD is an iterative De Bruijn Graph De Novo Assembler for Short Reads
Sequencing data with Highly Uneven Sequencing Depth (as in metagenomics).
`IDBA-Tran (1.1.1) `_
IDBA-Tran is an iterative De Bruijn Graph De Novo short read assembler for
transcriptomes. It is a purely de novo assembler based only on RNA sequencing
reads.
`IDBA-Hybrid (1.1.1) `_
IDBA-Hybrid is an iterative De Bruijn Graph De Novo Assembler for hybrid
sequencing. It is an extension of IDBA-UD algorithm.
Hybrid 454 and Illumina
+++++++++++++++++++++++
`Celera (7.0) `_
Celera Assembler is a de novo whole-genome shotgun (WGS) DNA sequence
assembler. It reconstructs long sequences of genomic DNA from fragmentary
data produced by whole-genome shotgun sequencing.
`Mira (4.0.2)(+3rd party) `_
The mira genome fragment assembler is a specialised assembler for sequencing
projects classified as 'hard' due to a high number of similar repeats. For EST
transcripts, miraEST is specialised in reconstructing pristine mRNA
transcripts while detecting and classifying single nucleotide polymorphisms
(SNP) occurring in different variations thereof.
`Online wiki `_
`Documentation <_static/mira>`_ (See :ref:`NGS assembly `)
`The Definitive Guide to Mira3 <_static/DefinitiveGuideToMIRA.pdf>`_
Note that by default Mira uses 2 threads in the SKIM algorithm, so if no
more threads are requested, then typically you would submit a job requesting
2 slots in the queue ``-pe orte 2`` - see :ref:`submitting_parallel_jobs`
`Ray (1.6.0) `_
Ray is a parallel genome assembler utilizing MPI. Ray is a single-executable
program (the executable is Ray). Its aim is to assemble sequences on
MPI-enabled computers or clusters. Ray assembles reads obtained with new
sequencing technologies (Illumina, 454, SOLiD) using MPI 2.2 -- a message
passing interface standard. (See :ref:`NGS assembly `)
MAXKMERLENGTH=32
`Oases (0.2.08) `_
Oases is a *de novo* transcriptome assembler designed to produce transcripts
from short read sequencing technologies, such as Illumina, SOLiD, or 454 in
the absence of any genomic assembly. MAXKMERLENGTH=64
`Documentation `_
(See :ref:`NGS assembly `)
.. _amos:
`AMOS (3.1.0) `_
Includes:
* `Bambus2 `_
* `Hawkeye `_
The AMOS consortium is committed to the development of open-source whole
genome assembly software. The project acronym (AMOS) represents our primary
goal -- to produce A Modular, Open-Source whole genome assembler.
Open-source so that everyone is welcome to contribute and help build
outstanding assembly tools, and modular in nature so that new contributions
can be easily inserted into an existing assembly pipeline. This modular
design will foster the development of new assembly algorithms and allow the
AMOS project to continually grow and improve in hopes of eventually becoming
a widely accepted and deployed assembly infrastructure. In this sense, AMOS
is both a design philosophy and a software system.
`Notes on installation. <_static/Amos_installation.html>`_
`LOCAS (0.1.7) and SUPERLOCAS (0.0.2) `_
LOCAS is a program to assemble short reads of second generation sequencing
technologies. It explicitly handles low coverage data by allowing mismatches
in the overlap alignment of reads.
`Documentation `_
Sequence read manipulation / cleaning / error corrections (kmer analysis)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
`sff_extract (0.3.0) `_
454 sequence reads are usually stored in sff files. These files store the
information about each read: sequence, quality, and quality and
adapter clips. sff_extract extracts the reads from the sff files and
stores them in fasta and xml or caf text files.
`Smalt (0.4.1) `_
SMALT is a pairwise sequence alignment program designed for the efficient
mapping of DNA sequencing reads onto genomic reference sequences. Reads from
a range of sequencing platforms, for example Illumina-Solexa, Roche-454 or
ABI-Sanger, can be processed including paired-end reads.
`Documentation <_static/smalt-manual-0.4.2.pdf>`_
`Lucy (1.20) `_
Lucy is a program for DNA sequence quality trimming and vector removal. Its
purpose is to process DNA sequence data acquired from DNA sequencers to
prepare the data for downstream processing applications such as genome
assembly.
`Amplicon Noise (1.2.1) `_
AmpliconNoise is a collection of programs for the removal of noise from 454
sequenced PCR amplicons. It involves two steps: the removal of noise from the
sequencing itself and the removal of PCR point errors. This project also
includes the Perseus algorithm for chimera removal.
`clean_reads (0.2.3) and NGS Backbone (1.4.0) `_
clean_reads cleans NGS (Sanger, 454, Illumina and SOLiD) reads. It can trim:
- bad quality regions,
- adaptors,
- vectors, and
- regular expressions.
It also filters out the reads that do not meet minimum quality criteria
based on the sequence length and the mean quality. It uses several
algorithms and third party tools to carry out the cleaning. The third party
tools used are: lucy, blast, mdust and trimpoly. The functionality offered
by clean_reads is similar to the cleaning capabilities of the ngs_backbone
pipeline. In fact, both tools use the same code base and are just different
interfaces on top of a Python library called franklin.
Can be parallelised with `psubprocess `_.
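The length and mean-quality filter described above can be sketched in a few
lines of Python. This is an illustration of the idea only, not the
clean_reads code. The function name ``passes_filter`` and both default
thresholds are assumptions chosen for the example.

```python
def passes_filter(qualities, min_length=50, min_mean_quality=20):
    """Keep a read only if it is long enough and its mean quality is high enough.

    `qualities` is a list of per-base Phred scores; the read length is taken
    from the length of this list. Both thresholds are illustrative defaults.
    """
    if not qualities or len(qualities) < min_length:
        return False
    return sum(qualities) / len(qualities) >= min_mean_quality


print(passes_filter([30] * 60))  # long, good quality
print(passes_filter([30] * 10))  # too short
print(passes_filter([10] * 60))  # mean quality too low
```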
`cutadapt (1.8.1) `_
cutadapt removes adapter sequences from high-throughput sequencing data.
This is usually necessary when the read length of the sequencing machine is
longer than the molecule that is sequenced, for example when sequencing
microRNAs.
`Documentation `_
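To make the adapter problem concrete, here is a minimal Python sketch of 3'
adapter removal using exact string matching. cutadapt itself uses
error-tolerant alignment, so this is a simplification. The function name
``trim_3prime_adapter``, the ``min_overlap`` default, and the example
sequences are hypothetical.

```python
def trim_3prime_adapter(read, adapter, min_overlap=3):
    """Remove an exact 3' adapter match, or a partial adapter at the read end."""
    # Case 1: the whole adapter occurs inside the read.
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Case 2: only the start of the adapter fits before the read ends.
    longest = min(len(adapter), len(read)) - 1
    for overlap in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[:len(read) - overlap]
    return read


adapter = "AGATCGGAAG"  # example adapter sequence, for illustration only
print(trim_3prime_adapter("ACGTACGTAGATCGGAAGTT", adapter))
print(trim_3prime_adapter("ACGTACGTAGATC", adapter))
```

Both calls return the same 8-base insert: in the first read the full adapter
is present, in the second only its first five bases were sequenced.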
`prinseq-lite (0.14.4) `_
PRINSEQ is a publicly available tool that is able to filter, reformat and
trim your sequences and to provide you with summary statistics for your
sequence data. `Documentation `_
`DeconSeq `_
Sequences obtained from impure nucleic acid preparations may contain DNA
from sources other than the sample. Those sequence contaminations are a
serious concern to the quality of the data used for downstream analysis,
possibly causing misassembly of sequence contigs and erroneous conclusions.
Therefore, the removal of sequence contaminants presents a necessary step
for all metagenomic projects.
`FastQC (0.10.1) `_
A quality control tool for high throughput sequence data.
`ChimeraSlayer (MicrobiomeUtilities-r20110519) `_
A set of software utilities for processing and analyzing 16S rRNA genes
including generating NAST alignments, chimera checking, and assembling
paired 16S rRNA reads according to reference sequence homology.
`Trim Galore! (0.3.2) `_
Trim Galore! is a wrapper script to automate quality and adapter trimming as
well as quality control, with some added functionality to remove biased
methylation positions for RRBS sequence files (for directional,
non-directional (or paired-end) sequencing). `Documentation
`_
`Seq_filter.pl `_
Filters sequences?
`cutseq_fasta.pl `_
Takes a large fasta file and cuts a subset of sequences to make a second fasta file.
`Bedtools (2.20.1) `_
Collectively, the bedtools utilities are a swiss-army knife of tools for a
wide-range of genomics analysis tasks. The most widely-used tools enable
genome arithmetic: that is, set theory on the genome. For example, bedtools
allows one to intersect, merge, count, complement, and shuffle genomic
intervals from multiple files in widely-used genomic file formats such as
BAM, BED, GFF/GTF, VCF.
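The "set theory on the genome" idea can be illustrated with plain Python
tuples. The sketch below merges and intersects half-open ``(start, end)``
intervals on a single chromosome. It does not call bedtools and does not
reproduce its options; the function names are hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extend the previous interval instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def intersect_intervals(a, b):
    """Return the segments shared between two interval lists."""
    overlaps = []
    for a_start, a_end in a:
        for b_start, b_end in b:
            start, end = max(a_start, b_start), min(a_end, b_end)
            if start < end:
                overlaps.append((start, end))
    return sorted(overlaps)


print(merge_intervals([(10, 20), (15, 30), (40, 50)]))
print(intersect_intervals([(10, 30)], [(25, 45), (5, 12)]))
```

The nested loop in ``intersect_intervals`` is quadratic and only suitable for
toy inputs; tools such as bedtools use far more efficient strategies on real
data.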
`Pairfq `_
Sync paired-end FASTA/Q files and keep singleton reads
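A minimal Python sketch of what syncing paired-end reads means, assuming the
reads are already loaded into dictionaries keyed by a read ID with any
``/1`` or ``/2`` suffix removed. This is not Pairfq's implementation; the
function name ``sync_pairs`` and the toy data are hypothetical.

```python
def sync_pairs(forward, reverse):
    """Split two {read_id: sequence} dicts into synced pairs and singletons."""
    shared = forward.keys() & reverse.keys()
    # Pairs are emitted in a stable (sorted) order so both mates stay in sync.
    pairs = [(rid, forward[rid], reverse[rid]) for rid in sorted(shared)]
    singletons = {
        rid: seq
        for rid, seq in {**forward, **reverse}.items()
        if rid not in shared
    }
    return pairs, singletons


forward = {"r1": "ACGT", "r2": "GGCC"}  # toy forward reads
reverse = {"r2": "TTAA", "r3": "CCGG"}  # toy reverse reads
pairs, singletons = sync_pairs(forward, reverse)
print(pairs)
print(singletons)
```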
`SOAPec (2.01) `_
The read correction package is a short-read correction tool and part of SOAPdenovo.
It is specially designed to correct Illumina GA short reads.
`kmergenie (1.6972) `_
KmerGenie estimates the best k-mer length for genome de novo assembly. Given
a set of reads, KmerGenie first computes the k-mer abundance histogram for
many values of k. Then, for each value of k, it predicts the number of
distinct genomic k-mers in the dataset, and returns the k-mer length which
maximizes this number. Experiments show that KmerGenie's choices lead to
assemblies that are close to the best possible over all k-mer lengths.
`FAQ `_
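The procedure described above (an abundance histogram per k, then a count of
distinct k-mers) can be sketched in Python as follows. KmerGenie fits a
statistical model to separate genomic from erroneous k-mers; this sketch
substitutes a crude abundance cutoff, so it only illustrates the overall
shape of the computation. All function names and the ``min_abundance``
parameter are assumptions.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count how often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def abundance_histogram(counts):
    """Map abundance -> number of distinct k-mers seen that many times."""
    return Counter(counts.values())


def distinct_kmers_per_k(reads, k_values, min_abundance=2):
    """For each k, count distinct k-mers at or above the abundance cutoff."""
    scores = {}
    for k in k_values:
        counts = kmer_counts(reads, k)
        scores[k] = sum(1 for c in counts.values() if c >= min_abundance)
    return scores


toy_reads = ["ACGTACGT", "ACGTACGT"]  # hypothetical reads
print(dict(abundance_histogram(kmer_counts(toy_reads, 3))))
print(distinct_kmers_per_k(toy_reads, [3, 4, 5]))
```

Choosing the k with the largest score mirrors the "maximizes this number"
step; on real data the cutoff would need to be replaced by a proper error
model.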
Alignment, mapping, and scaffolding
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
`MUMmer (3.23) `_
MUMmer is a system for rapidly aligning entire genomes, whether in complete or draft form. Dependency for :ref:`AMOS `.
`TopHat (2.0.13) `_
TopHat is a fast splice junction mapper for RNA-Seq reads. It aligns RNA-Seq
reads to mammalian-sized genomes using the ultra high-throughput short read
aligner Bowtie, and then analyzes the mapping results to identify splice
junctions between exons.
`Documentation `_
`Installation `_
`Bowtie (1.1.1) `_
Bowtie is an ultrafast, memory-efficient short read aligner. It aligns short
DNA sequences (reads) to the human genome at a rate of over 25 million 35-bp
reads per hour. Bowtie indexes the genome with a Burrows-Wheeler index to
keep its memory footprint small: typically about 2.2 GB for the human genome
(2.9 GB for paired-end).
`Documentation `_
(See :ref:`NGS assembly `)
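For readers unfamiliar with the Burrows-Wheeler transform that underlies the
Bowtie index, a naive Python version is shown below. It sorts all rotations of
the text, which is far too slow and memory hungry for a genome; real indexes
are built with suffix-array based methods. The function name ``bwt`` is
hypothetical.

```python
def bwt(text):
    """Naive Burrows-Wheeler transform via sorted rotations (illustration only)."""
    text = text + "$"  # sentinel; sorts before the letters used here
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    # The transform is the last column of the sorted rotation matrix.
    return "".join(rotation[-1] for rotation in rotations)


print(bwt("banana"))
print(bwt("ACGT"))
```

The transform tends to group identical characters together, which is what
makes the index compressible and searchable in a small memory footprint.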
`Bowtie2 (2.2.4) `_
Bowtie 2 is an ultrafast and memory-efficient tool for aligning sequencing
reads to long reference sequences. It is particularly good at aligning reads
of about 50 up to 100s or 1,000s of characters, and particularly good at
aligning to relatively long (e.g. mammalian) genomes. Bowtie 2 indexes the
genome with an FM Index to keep its memory footprint small: for the human
genome, its memory footprint is typically around 3.2 GB. Bowtie 2 supports
gapped, local, and paired-end alignment modes.
`Manual `_
`Cufflinks (2.2.1) `_
Transcript assembly, differential expression, and differential regulation
for RNA-Seq. Cufflinks assembles transcripts, estimates their abundances,
and tests for differential expression and regulation in RNA-Seq samples. It
accepts aligned RNA-Seq reads and assembles the alignments into a
parsimonious set of transcripts. Cufflinks then estimates the relative
abundances of these transcripts based on how many reads support each one.
`Documentation `_
`CAP3 `_
CAP3 Sequence Assembly Program.
`Documentation `_
`BWA (0.7.8-r455) `_
BWA is a fast, lightweight tool that aligns relatively short sequences
(queries) to a sequence database (target), such as the human reference
genome.
`BFAST (0.6.4e) `_
BFAST facilitates the fast and accurate mapping of short reads to reference
sequences. Some advantages of BFAST include: \* Speed: enables billions of
short reads to be mapped quickly. \* Accuracy: A priori probabilities for
mapping reads with defined set of variants. \* An easy way to measurably
tune accuracy at the expense of speed. Specifically, BFAST was designed to
facilitate whole-genome resequencing, where mapping billions of short reads
with variants is of utmost importance. BFAST supports both Illumina and ABI
SOLiD data, as well as any other Next-Generation Sequencing Technology (454,
Helicos), with particular emphasis on sensitivity towards errors, SNPs and
especially indels. Other algorithms take short-cuts by ignoring errors,
certain types of variants (indels), and even require further alignment, all
to be the "fastest" (but still not complete). BFAST is able to be tuned to
find variants regardless of the error-rate, polymorphism rate, or other
factors.
`PYNAST (1.1) `_
PyNAST is a python implementation of the NAST sequence alignment tool.
`dDocent.FB `_
dDocent was designed as a scripted software pipeline made for analyzing
double digest RAD or ezRAD data in highly polymorphic marine species.
`SSPACE `_
SSPACE standard is a stand-alone program for scaffolding pre-assembled
contigs using NGS paired-read data. It is unique in offering the possibility
to manually control the scaffolding process. By using the distance
information of paired-end and/or matepair data, SSPACE is able to assess the
order, distance and orientation of your contigs and combine them into
scaffolds. `Manual <_static/F132-03_SSPACE_Standard_User_manual_v3.0.pdf>`_
`Tutorial <_static/F132-04_SSPACE_Standard_Tutorial_v3.0.pdf>`_
Annotation
~~~~~~~~~~
`Glimmer (3.02) `_
Glimmer is a system for finding genes in microbial DNA, especially the
genomes of bacteria, archaea, and viruses. Glimmer (Gene Locator and
Interpolated Markov ModelER) uses interpolated Markov models (IMMs) to
identify the coding regions and distinguish them from noncoding DNA. (See :ref:`NGS assembly `)
`Blast2GO (2.3.5) `_
Command line version only: b2g4pipe, no visualisation.
Blast2GO is an ALL in ONE tool for functional annotation of (novel)
sequences and the analysis of annotation data.
`Documentation <_static/b2gpipe_readme_v2_3_5.txt>`_
`Configuration <_static/b2gPipe.properties>`_
`martyr.py (1.0) <_static/martyr.html>`_
A Python version of `MARTA `_.
Annotates BLAST (x or n) XML output with the NCBI taxonomy. Requires:
Biopython, blastdbcmd, NCBI taxonomy dump and the target BLAST DB in
$BLASTDB, and a BioSQL database in PostgreSQL with the NCBI taxonomy loaded.
`RDP Classifier `_
The RDP Classifier is a naive Bayesian classifier that can rapidly and
accurately provide taxonomic assignments from domain to genus, with
confidence estimates for each assignment. More information can be found at
`Ribosomal Database Project `_
`Prodigal (2.61) `_
Prodigal (Prokaryotic Dynamic Programming Genefinding Algorithm) is a
microbial (bacterial and archaeal) gene finding program. Prodigal can run in
metagenomic mode and analyze sequences even when the organism is unknown.
`Prokka (1.7.2) `_
Prokka is a software tool for the rapid annotation of prokaryotic genomes. A
typical 4 Mbp genome can be fully annotated in less than 10 minutes on a
quad-core computer, and scales well to 64 core SMP systems. It produces
GFF3, GBK and SQN files that are ready for editing in Sequin and ultimately
submitted to GenBank/DDBJ/ENA.
`Documentation `_
`Trinotate (2.0.2) `_
Trinotate is a comprehensive annotation suite designed for automatic
functional annotation of transcriptomes, particularly de novo assembled
transcriptomes, from model or non-model organisms. Trinotate makes use of a
number of different well referenced methods for functional annotation
including homology search to known sequence data
(BLAST+/SwissProt/Uniref90), protein domain identification (HMMER/PFAM),
protein signal peptide and transmembrane domain prediction (SignalP/TMHMM),
and comparison to currently curated annotation databases (EMBL Uniprot
eggNOG/GO Pathways databases). All functional annotation data derived from
the analysis of transcripts is integrated into a SQLite database which
allows fast efficient searching for terms with specific qualities related to
a desired scientific hypothesis or a means to create a whole annotation
report for a transcriptome.
Software specific databases in $BLASTDB: `SwissProt
`_
`Uniref90
`_
and `Pfam `_
`TransDecoder (2.1.0) `_
TransDecoder identifies candidate coding regions within transcript
sequences, such as those generated by de novo RNA-Seq transcript assembly
using Trinity, or constructed based on RNA-Seq alignments to the genome
using Tophat and Cufflinks.
Visualisation
~~~~~~~~~~~~~
`Integrated Genome Browser (6.7.0) `_
The Integrated Genome Browser (IGB, pronounced Ig-Bee) is an interactive,
zoomable, scrollable software program you can use to visualize and explore
genome-scale data sets, such as tiling array data, next-generation
sequencing results, genome annotations, microarray designs, and the sequence
itself.
`Documentation `_
`Tablet - Next Generation Sequence Assembly Visualization `_
Tablet is a lightweight, high-performance graphical viewer for next
generation sequence assemblies and alignments.
`Mauve (2.3.1) - Multiple Genome Alignment `_
Mauve is a system for efficiently constructing multiple genome alignments in
the presence of large-scale evolutionary events such as rearrangement and
inversion. Multiple genome alignment provides a basis for research into
comparative genomics and the study of evolutionary dynamics. Aligning whole
genomes is a fundamentally different problem than aligning short sequences.
`Documentation <_static/mauve_user_guide.pdf>`_
Bioinformatics suites and libraries
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
`Qiime (1.9.0) `_
QIIME (pronounced chime) stands for Quantitative Insights Into Microbial
Ecology. QIIME is an open source software package for comparison and
analysis of microbial communities, primarily based on high-throughput
amplicon sequencing data (such as SSU rRNA) generated on a variety of
platforms, but also supporting analysis of other types of data (such as
shotgun metagenomic data). QIIME takes users from their raw sequencing
output through initial analyses such as OTU picking, taxonomic assignment,
and construction of phylogenetic trees from representative sequences of
OTUs, and through downstream statistical analysis, visualization, and
production of publication-quality graphics. QIIME has been applied to single
studies based on billions of sequences from thousands of samples.
`Notes on installation. <_static/qiime_installation.html>`_
`Mothur (1.33.3) `_
Bioinformatics for the microbial ecology community. MPI version.
`PyCogent (1.5.1) `_
PyCogent: A toolkit for making sense from sequence. PyCogent includes
connectors to remote databases, built-in generalized probabilistic
techniques for working with biological sequences, and controllers for 3rd
party applications.
.. `Galaxy `_
Galaxy is an open, web-based platform for accessible, reproducible, and
transparent computational biomedical research.
`Installing Galaxy on ROCKS. <_static/Galaxy_installation_on_ROCKS.html>`_
For an account to use the local installation of Galaxy, email Cymon.
`Emboss (6.4.0) `_
EMBOSS is "The European Molecular Biology Open Software Suite".
`FASTA `_
The FASTA programs find regions of local or global (new) similarity between
Protein or DNA sequences, either by searching Protein or DNA databases, or
by identifying local duplications within a sequence. Other programs provide
information on the statistical significance of an alignment. Like BLAST,
FASTA can be used to infer functional and evolutionary relationships between
sequences as well as help identify members of gene families.
`Biopython (1.65) `_ - (Python2.7)
Biopython is a set of freely available tools for biological computation
written in Python.
`Documentation `_
`Wise2 (2.2.0) `_
Wise2 is a package focused on comparisons of biopolymers, commonly DNA
sequence and protein sequence. There are many other packages which do this,
probably the best known being the BLAST package (from NCBI) and the FASTA
package (from Bill Pearson).
`NCL (2.1.14) `_
The NEXUS Class Library (NCL) is an integrated collection of C++ classes
designed to allow the user to quickly write a program that reads
NEXUS-formatted data files. It also allows easy extension of the NEXUS
format to include new blocks of your own design.
`R statistics (3.2.3) `_
- `Bioconductor `_
- `edgeR `_
- `deseq `_
- `phytools `_
- `NetCDF (ncdf) `_
- `Phybase `_ `Manual <_static/phybase1.4-manual.pdf>`_
- `ape `_
`Manual <_static/ape.pdf>`_
`List of installed packages `_
R is a language and environment for statistical computing and graphics.
Bioconductor provides tools for the analysis and comprehension of
high-throughput genomic data. edgeR performs differential expression analysis
of RNA-seq and digital gene expression profiles with biological replication.
It uses empirical Bayes estimation and exact tests based on the negative
binomial distribution, and is also useful for differential signal analysis
with other types of genome-scale count data.
`BioPerl `_
BioPerl is an extensive set of bioinformatics libraries written
in Perl.
`Pysam (0.6) `_
Pysam is a python module for reading and manipulating Samfiles. It's a
lightweight wrapper of the samtools C-API.
`Samtools (1.1) `_
SAM (Sequence Alignment/Map) format is a generic format for storing large
nucleotide sequence alignments. SAM Tools provide various utilities for
manipulating alignments in the SAM format, including sorting, merging,
indexing and generating alignments in a per-position format.
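As a small illustration of the SAM format itself, the Python sketch below
splits one alignment line into its eleven mandatory tab-separated fields and
decodes two flag bits. It uses only the standard library and is not a
substitute for Samtools or Pysam; the function name ``parse_sam_line`` and
the example record are hypothetical.

```python
SAM_FIELDS = [
    "qname", "flag", "rname", "pos", "mapq", "cigar",
    "rnext", "pnext", "tlen", "seq", "qual",
]


def parse_sam_line(line):
    """Parse the 11 mandatory fields of a SAM alignment line into a dict."""
    parts = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, parts[:11]))
    for key in ("flag", "pos", "mapq", "pnext", "tlen"):
        record[key] = int(record[key])
    # Flag bit 0x4 marks an unmapped read; 0x10 marks a reverse-strand alignment.
    record["is_unmapped"] = bool(record["flag"] & 0x4)
    record["is_reverse"] = bool(record["flag"] & 0x10)
    return record


example = "r1\t16\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII"  # toy record
record = parse_sam_line(example)
print(record["rname"], record["pos"], record["is_reverse"])
```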
.. _scipy:
`SciPy (0.9.0) `_
SciPy (pronounced "Sigh Pie") is open-source software for mathematics,
science, and engineering. The SciPy library depends on NumPy, which provides
convenient and fast N-dimensional array manipulation. The SciPy library is
built to work with NumPy arrays, and provides many user-friendly and
efficient numerical routines such as routines for numerical integration and
optimization.
`Installation with ATLAS and complete LAPACK library <_static/ATLAS_LAPACK_CENTOS5.4.text>`_
`NCBI Sequence Read Archive (SRA) Toolkit (2.0rc5) `_
Tools from NCBI for manipulating SRA files. `Documentation `_
`FASTX-toolkit (0.0.14) `_
The FASTX-Toolkit is a collection of command line tools for Short-Reads FASTA/FASTQ files preprocessing.
- FASTQ-to-FASTA converter
Convert FASTQ files to FASTA files.
- FASTQ Information
Chart Quality Statistics and Nucleotide Distribution
- FASTQ/A Collapser
Collapsing identical sequences in a FASTQ/A file into a single sequence (while maintaining read counts)
- FASTQ/A Trimmer
Shortening reads in a FASTQ or FASTA file (removing barcodes or noise).
- FASTQ/A Renamer
Renames the sequence identifiers in a FASTQ/A file.
- FASTQ/A Clipper
Removing sequencing adapters / linkers
- FASTQ/A Reverse-Complement
Producing the Reverse-complement of each sequence in a FASTQ/FASTA file.
- FASTQ/A Barcode splitter
Splitting FASTQ/FASTA files containing multiple samples
- FASTA Formatter
Changes the width of sequence lines in a FASTA file
- FASTA Nucleotide Changer
Converts FASTA sequences from/to RNA/DNA
- FASTQ Quality Filter
Filters sequences based on quality
- FASTQ Quality Trimmer
Trims (cuts) sequences based on quality
- FASTQ Masker
Masks nucleotides with 'N' (or other character) based on quality
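Several of the operations listed above are simple enough to sketch in plain
Python, which may help when deciding which tool to reach for. The functions
below illustrate reverse-complementing, DNA-to-RNA conversion, and 3' quality
trimming. They are not the FASTX-Toolkit code, and the names and the default
threshold are assumptions.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def dna_to_rna(seq):
    """Convert a DNA sequence to RNA by replacing T with U."""
    return seq.replace("T", "U").replace("t", "u")


def quality_trim_3prime(seq, quals, threshold=20):
    """Trim bases from the 3' end until one meets the quality threshold."""
    end = len(seq)
    while end > 0 and quals[end - 1] < threshold:
        end -= 1
    return seq[:end], quals[:end]


print(reverse_complement("ACGTN"))
print(dna_to_rna("ACGT"))
print(quality_trim_3prime("ACGTAC", [30, 30, 30, 10, 30, 5]))
```

Note that the trimming sketch stops at the first base from the 3' end that
passes the threshold, so the low-quality base in the middle of the example
read is kept.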
`pyfasta (0.4.3) `_
Fast, memory-efficient, pythonic (and command-line) access to fasta sequence files.
`Bio++ (core-2.0.1) `_
Bio++ is a set of C++ libraries for Bioinformatics, including sequence
analysis, phylogenetics, molecular evolution and population genetics. Bio++
is fully Object Oriented and is designed to be both easy to use and computer
efficient.
`Installation <_static/bio++_installation.html>`_
`Documentation `_
`PICARD (1.85)