Mothur manual

Revision as of 19:53, 12 January 2009 by Westcott

MOTHUR Manual

Introduction

MOTHUR is a computer program that takes a distance matrix as its input file and assigns sequences to operational taxonomic units (OTUs) at every distance level that can be used to form OTUs, using the nearest, furthest, or average neighbor clustering algorithm. These algorithms are also called single-linkage, complete-linkage, and UPGMA, respectively. Once sequences are assigned to OTUs, the frequency data for each distance level are used to construct rarefaction and collector's curves for the number of species observed, Shannon's and Simpson's diversity indices, and the Chao1, ACE, Jackknife, and Bootstrap richness estimators as a function of sampling effort and the distance used to define an OTU.

MOTHUR also uses non-parametric estimators to estimate similarity between communities based on membership and structure. It first determines the number of individuals in each community that were sampled for each OTU. It then calculates collector's curves for the fraction of OTUs shared between the two communities (with and without correcting for unsampled individuals), the Jaccard and Sorenson indices, and the richness of OTUs shared between the two communities. Standard error values are calculated for the entire sequence collection. MOTHUR is freely available as C++ source code and as a Windows executable.
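The three clustering rules named above differ only in how the distance between two clusters is derived from the pairwise sequence distances. The following Python sketch illustrates that difference; it is not MOTHUR's own code, and the function name `cluster_distance`, the dictionary layout, and the sequence names are assumptions made for this example.

```python
from itertools import product
from statistics import mean

def cluster_distance(dist, cluster_a, cluster_b, method):
    """Distance between two clusters under one of the three linkage rules.

    dist maps frozenset({name1, name2}) to a pairwise distance
    (a hypothetical layout chosen for this sketch).
    """
    pairwise = [dist[frozenset((a, b))] for a, b in product(cluster_a, cluster_b)]
    if method == "nearest":    # single-linkage: closest pair of members
        return min(pairwise)
    if method == "furthest":   # complete-linkage: most distant pair of members
        return max(pairwise)
    if method == "average":    # UPGMA: mean over all pairs of members
        return mean(pairwise)
    raise ValueError(f"unknown method: {method}")

# Hypothetical distances among three sequences.
dist = {
    frozenset(("seqA", "seqB")): 0.01,
    frozenset(("seqA", "seqC")): 0.05,
    frozenset(("seqB", "seqC")): 0.03,
}

for method in ("nearest", "furthest", "average"):
    print(method, cluster_distance(dist, ["seqA", "seqB"], ["seqC"], method))
```

With these values, the cluster {seqA, seqB} would join seqC at a smaller distance under the nearest neighbor rule than under the furthest neighbor rule, with the average neighbor rule falling in between.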

This manual is designed to achieve five goals:

  1.	Describe the differences among the three sequence assignment algorithms
  2.	Show how to use MOTHUR
  3.	Describe output files and equations used
  4.	Validate output by making calculations by hand
  5.	Answer frequently asked questions

If you have any questions, complaints, or praise, please do not hesitate to contact Dr. Patrick D. Schloss at pschloss@microbio.umass.edu



Output Files

The MOTHUR program can create 33 output files. Although this may initially seem like a lot, you need not output them all. For example, each richness estimator and diversity index has its own output file, and MOTHUR creates a file only for the estimators you choose to use. All files are tab-delimited and are easily imported into Excel or your favorite spreadsheet. The user is encouraged to track down the original papers to better understand how the estimators were derived. Most reprints are available online (try www.jstor.org first).

The cluster command produces three output files: *.list, *.rabund, and *.sabund. The collect, rarefaction, summary, collect.shared, rarefaction.shared, and summary.shared commands produce a file for each estimator you choose to run. They are as follows: *.collect, *.rarefaction, *.summary, *.chao, *.ace, *.jack, *.shannon, *.np_shannon, *.simpson, *.bootstrap, *.r_chao, *.r_ace, *.r_jack, *.r_shannon, *.r_npshannon, *.r_simpson, *.r_bootstrap, *.sharedChao, *.sharedAce, *.sharedJabund, *.SharedSorensonAbund, *.SharedJclass, *.SharedSorClass, *.SharedJest, *.SharedSorEst, *.SharedThetaYC, *.SharedThetaN, *.sharedobserved, and *.sharedsummary. The shared command produces a *.shared file. The parselist command produces a *.list file for each group represented. I will explain what each file contains and then how the calculations were derived.

*.rabund, *.list

These files contain the number of sequences (*.rabund) and their identity (*.list) in each OTU as a function of distance. In the *.rabund file, the first column contains the distance used to define an OTU, the second is the number of OTUs, and the remaining columns give the number of sequences in each OTU. The *.list file contains the same information, except that instead of the number of sequences in each OTU, MOTHUR gives the names of the sequences in that OTU, separated by commas.
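To make the column layout concrete, here is a minimal Python sketch that reads one row of each file as described above. The function names and the sample rows are hypothetical. The distance column is kept as a string rather than converted to a number, which is an assumption made here so that a non-numeric label would not cause an error.

```python
def parse_rabund_line(line):
    """Split one *.rabund row into (distance label, list of OTU sizes)."""
    fields = line.rstrip("\n").split("\t")
    label = fields[0]                        # distance used to define an OTU
    n_otus = int(fields[1])                  # number of OTUs at that distance
    sizes = [int(x) for x in fields[2:]]     # sequences in each OTU
    if len(sizes) != n_otus:
        raise ValueError("OTU count does not match the number of columns")
    return label, sizes

def parse_list_line(line):
    """Split one *.list row into (distance label, list of OTUs as name lists)."""
    fields = line.rstrip("\n").split("\t")
    label = fields[0]
    n_otus = int(fields[1])
    otus = [f.split(",") for f in fields[2:]]  # names are comma separated
    if len(otus) != n_otus:
        raise ValueError("OTU count does not match the number of columns")
    return label, otus

# Hypothetical rows for four sequences clustered into three OTUs.
rabund_row = "0.03\t3\t2\t1\t1\n"
list_row = "0.03\t3\tseqA,seqB\tseqC\tseqD\n"

print(parse_rabund_line(rabund_row))
print(parse_list_line(list_row))
```

The length of each name list in the *.list row should equal the corresponding count in the *.rabund row for the same distance.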

*.sabund

This file contains data for constructing a rank-abundance plot of the OTU data for each distance level. The first column contains the distance and the second is the number of OTUs observed at that distance. The successive values in the row are the number of OTUs that were found once, twice, etc.
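The successive values described above can be derived from the OTU sizes in a *.rabund row by counting how many OTUs have each size. The Python sketch below shows one way to do that; the function name `abundance_frequencies` and the input values are assumptions for illustration, and the sketch does not write an actual *.sabund file.

```python
from collections import Counter

def abundance_frequencies(otu_sizes):
    """Return [OTUs found once, OTUs found twice, ...] up to the largest OTU."""
    freq = Counter(otu_sizes)
    largest = max(otu_sizes, default=0)
    # Sizes that never occur are reported as 0 so positions stay aligned.
    return [freq.get(size, 0) for size in range(1, largest + 1)]

# Hypothetical OTU sizes at one distance level: one OTU of three
# sequences and two OTUs of one sequence each.
print(abundance_frequencies([3, 1, 1]))  # [2, 0, 1]
```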

*.collect, *.chao, *.ace, *.jack, *.shannon, *.np_shannon, *.simpson, *.bootstrap, *.r_chao, *.r_ace, *.r_jack, *.r_shannon, *.r_npshannon, *.r_simpson, *.r_bootstrap, *.rarefaction

Data to construct collector’s curves for each comparison are provided in the corresponding file. The first line of each file describes each column’s contents: first the number sampled, then each distance level. A row is produced for each frequency level requested, and each row contains the collector’s curve data.

*.sharedChao, *.sharedAce, *.sharedJabund, *.SharedSorensonAbund, *.SharedJclass, *.SharedSorClass, *.SharedJest, *.SharedSorEst, *.SharedThetaYC, *.SharedThetaN, *.sharedobserved

Data to construct collector’s curves for all of the group comparisons are provided in the corresponding files. The first line of each group comparison describes each column’s contents: first the number sampled, then each distance level and the two groups analysed. A row is produced for each frequency level requested, and each row contains the collector’s curve data for the two groups analysed.

*.summary

Data to construct collector’s curves for each comparison at the final distance level are provided in the *.summary file. The first line of the *.summary file has a description of each column’s contents. Each following row contains the end point of the collector’s curve for the given comparison.

*.sharedsummary

Data to construct collector’s curves for each comparison at the final distance level are provided in the *.sharedsummary file. The first line of the *.sharedsummary file has a description of each column’s contents. Each following row contains the distance level, the groups compared, and the end point of the collector’s curve for the given comparison.

*.shared

This file contains the frequency of sequences from each group found in each OTU. Each row consists of the distance being considered, the group name, the number of OTUs, and the abundance information, separated by tabs. In the abundance information, each successive number represents a different OTU and gives the number of sequences from that group that clustered within that OTU. Note that OTU frequencies can only be compared within a distance definition.
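As an illustration of how the rows of a *.shared file line up across groups, the Python sketch below parses two rows at the same distance and counts the OTUs observed in both groups. The function names, group names, and abundance values are hypothetical, and the count shown is only the raw observed overlap, not one of the corrected estimators.

```python
def parse_shared_line(line):
    """Split one *.shared row into (distance label, group name, OTU abundances)."""
    fields = line.rstrip("\n").split("\t")
    label, group, n_otus = fields[0], fields[1], int(fields[2])
    counts = [int(x) for x in fields[3:]]
    if len(counts) != n_otus:
        raise ValueError("OTU count does not match the number of columns")
    return label, group, counts

def observed_shared_otus(counts_a, counts_b):
    """Number of OTUs with at least one sequence in both groups."""
    return sum(1 for a, b in zip(counts_a, counts_b) if a > 0 and b > 0)

# Hypothetical rows for two groups at the same distance level.
row_a = "0.03\tforest\t4\t3\t0\t2\t1"
row_b = "0.03\tpasture\t4\t1\t2\t0\t1"

label_a, group_a, counts_a = parse_shared_line(row_a)
label_b, group_b, counts_b = parse_shared_line(row_b)

# Abundances are comparable only within one distance definition.
if label_a != label_b:
    raise ValueError("rows come from different distance levels")

print(group_a, group_b, observed_shared_otus(counts_a, counts_b))  # forest pasture 2
```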


Example Calculations

*.collect and *.r_rarefaction

These are the collector's curve and rarefaction curve data for the number of observed OTUs as a function of distance between sequences and the number of sequences sampled. This is merely a count of the number of OTUs observed at any given point in the sampling process.
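The counting process described above can be sketched in Python as follows. The sketch assumes that the sequences are supplied in sampling order with an OTU identifier for each one; the function name `collectors_curve`, its `freq` parameter, and the OTU labels are hypothetical.

```python
def collectors_curve(otu_of_each_sequence, freq=1):
    """Return (number sampled, OTUs observed) pairs in sampling order.

    A point is recorded every `freq` sequences and at the final sequence.
    """
    sequence_otus = list(otu_of_each_sequence)
    seen = set()
    curve = []
    for n, otu in enumerate(sequence_otus, start=1):
        seen.add(otu)  # a new OTU raises the observed count by one
        if n % freq == 0 or n == len(sequence_otus):
            curve.append((n, len(seen)))
    return curve

# Hypothetical sampling order for five sequences in three OTUs.
order = ["otu1", "otu2", "otu1", "otu3", "otu2"]
print(collectors_curve(order))
# [(1, 1), (2, 2), (3, 2), (4, 3), (5, 3)]
```

A collector's curve depends on the order in which the sequences happen to be sampled, whereas a rarefaction curve averages over orderings.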

In theory, the rarefaction curve should match the following expression:

<math>S_n=S_t-\left ( \frac{\sum_{i=1}^{S_t} {N - N_i\choose n} }{{N \choose n} } \right )</math>
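The expression can be checked by hand or with a short script. The Python sketch below evaluates it directly, reading the symbols in the usual way for this rarefaction formula (an interpretation assumed here): S_t is the total number of OTUs, N_i is the number of sequences in the i-th OTU, N is the total number of sequences, and n is the number of sequences subsampled. The function name `rarefied_richness` and the OTU sizes are hypothetical.

```python
from math import comb  # requires Python 3.8 or later

def rarefied_richness(otu_sizes, n):
    """Expected number of OTUs observed in a random subsample of n sequences."""
    total_seqs = sum(otu_sizes)          # N
    total_otus = len(otu_sizes)          # S_t
    if not 0 <= n <= total_seqs:
        raise ValueError("n must lie between 0 and the number of sequences")
    # comb(N - N_i, n) counts subsamples that miss OTU i entirely;
    # it is 0 whenever n exceeds N - N_i.
    missed = sum(comb(total_seqs - size, n) for size in otu_sizes)
    return total_otus - missed / comb(total_seqs, n)

# Hypothetical sample: four sequences in OTUs of sizes 2, 1, and 1.
sizes = [2, 1, 1]
for n in range(0, 5):
    print(n, rarefied_richness(sizes, n))
```

For n = 2 the six possible pairs of sequences contain one OTU once and two OTUs five times, so the expected value is 11/6, which matches what the function returns up to floating-point rounding.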