
Mothur manual

Revision as of 19:52, 12 January 2009 by Westcott (Talk | contribs)


MOTHUR Manual

Introduction

MOTHUR is a computer program that takes a distance matrix as its input file and assigns sequences to operational taxonomic units (OTUs) for every distance level that can be used to form OTUs, using either the nearest, furthest, or average neighbor clustering algorithm. These are also called single-linkage, complete-linkage, and UPGMA, respectively. Once sequences are assigned to OTUs, the frequency data for each distance level are used to construct rarefaction and collector's curves for the number of species observed, Shannon's and Simpson's diversity indices, and the Chao1, ACE, Jackknife, and Bootstrap richness estimators as a function of sampling effort and the distance used to define an OTU. MOTHUR also uses non-parametric estimators to estimate similarity between communities based on membership and structure. MOTHUR determines the number of individuals in each community that were sampled for each OTU. Next it calculates collector's curves for the fraction of shared OTUs between the two communities (with and without correcting for unsampled individuals), the Jaccard and Sorenson indices, and the richness of OTUs shared between the two communities. Standard error values are calculated for the entire sequence collection. MOTHUR is freely available as C++ source code and as a Windows executable.

This manual is designed to achieve five goals:

  1.	Describe the differences among the three sequence assignment algorithms.
  2.	Show how to use MOTHUR
  3.	Describe output files and equations used
  4.	Validate output by making calculations by hand
  5.	Answer frequently asked questions

If you have any questions, complaints, or praise, please do not hesitate to contact Dr. Patrick D. Schloss at pschloss@microbio.umass.edu


How to Run MOTHUR

To compile MOTHUR in Linux, type the following:

>g++ mothur.C -O4 -o mothur

MOTHUR is run from the command line prompt.

Mothur currently has 17 commands: read.phylip(), read.column(), read.list(), read.rabund(), read.sabund(), read.shared(), cluster(), collect(), collect.shared(), rarefaction(), rarefaction.shared(), summary(), summary.shared(), shared(), parselist(), help() and quit().


The read() commands:

The read.phylip or read.column command must be run before you execute the cluster command. The read.list, read.rabund or read.sabund command must be run before you can execute the collect, rarefaction or summary commands. The read.shared command must be run before you can execute the collect.shared, rarefaction.shared, summary.shared, or shared commands.

The read.phylip command is used to read a distance matrix file in phylip format. The read.phylip command parameter options are distfile, namefile, cutoff and precision. The read.phylip command should be in the following format: read.phylip(distfile=yourDistFile, namefile=yourNameFile, cutoff=yourCutoff, precision=yourPrecision). The distfile parameter is required. If you do not provide a cutoff value, 10.00 is assumed. If you do not provide a precision value, 100 is assumed.

The read.column command is used to read a distance matrix file in column format. The read.column command parameter options are distfile, namefile, cutoff and precision. The read.column command should be in the following format: read.column(distfile=yourDistFile, namefile=yourNameFile, cutoff=yourCutoff, precision=yourPrecision). The distfile and namefile parameters are required. If you do not provide a cutoff value, 10.00 is assumed. If you do not provide a precision value, 100 is assumed.

The read.list, read.rabund or read.sabund command must be run before you execute a collect, rarefaction or summary command. Mothur will generate a .list, .rabund and .sabund file upon completion of the cluster command, or you may use your own.

The read.list command parameter options are listfile and orderfile. The read.list command should be in the following format: read.list(listfile=yourListFile, orderfile=yourOrderFile). The listfile parameter is required.

The read.sabund command parameter options are sabundfile and orderfile. The read.sabund command should be in the following format: read.sabund(sabundfile=yourSabundFile, orderfile=yourOrderFile). The sabundfile parameter is required.

The read.rabund command parameter options are rabundfile and orderfile. The read.rabund command should be in the following format: read.rabund(rabundfile=yourRabundFile, orderfile=yourOrderFile). The rabundfile parameter is required.

The read.shared command must be run before you execute a shared, collect.shared, rarefaction.shared or summary.shared command. Mothur will generate a .list file upon completion of the cluster command, or you may use your own.

The read.shared command parameter options are listfile and groupfile. The read.shared command should be in the following format: read.shared(listfile=yourListFile, groupfile=yourGroupFile). The listfile and groupfile parameters are required.


The cluster() command:

The cluster command can only be executed after a successful read.phylip or read.column command. The cluster command parameter options are method, cutoff and precision. No parameters are required. The cluster command should be in the following format: cluster(method=yourMethod, cutoff=yourCutoff, precision=yourPrecision). The acceptable methods are furthest, nearest and average. If you do not provide a method, the default algorithm is furthest neighbor. The cluster() command outputs three files: *.list, *.rabund, and *.sabund.
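
To make the three methods concrete, here is a minimal Python sketch of agglomerative clustering on a small distance matrix. It is an illustration only, not MOTHUR's C++ implementation: the function name cluster_otus, the dictionary layout of the distances, the "start" label for the unmerged state, and the example distances are all assumptions made for this example.

```python
from itertools import combinations


def cluster_otus(dist, names, method="furthest", cutoff=10.0):
    """Hypothetical helper: merge OTUs until the next merge exceeds cutoff.

    dist maps frozenset({a, b}) to the distance between sequences a and b.
    Returns a list of (distance, OTUs) pairs, one per merge.
    """
    linkage = {
        "nearest": min,                           # single-linkage
        "furthest": max,                          # complete-linkage
        "average": lambda ds: sum(ds) / len(ds),  # UPGMA
    }[method]
    otus = [[name] for name in names]
    levels = [("start", [list(o) for o in otus])]
    while len(otus) > 1:
        best = None
        for i, j in combinations(range(len(otus)), 2):
            pairwise = [dist[frozenset((a, b))] for a in otus[i] for b in otus[j]]
            d = linkage(pairwise)
            if best is None or d < best[0]:
                best = (d, i, j)
        d, i, j = best
        if d > cutoff:
            break
        otus[i] = otus[i] + otus[j]  # i < j, so deleting j keeps i valid
        del otus[j]
        levels.append((d, [list(o) for o in otus]))
    return levels


# Made-up distances between four sequences.
names = ["A", "B", "C", "D"]
dist = {
    frozenset(("A", "B")): 0.01,
    frozenset(("A", "C")): 0.04,
    frozenset(("B", "C")): 0.03,
    frozenset(("A", "D")): 0.10,
    frozenset(("B", "D")): 0.09,
    frozenset(("C", "D")): 0.08,
}
levels = cluster_otus(dist, names, method="furthest")
```

With these distances, A, B and C join one OTU at 0.03 under nearest neighbor but only at 0.04 under furthest neighbor, which is why the choice of method changes the number of OTUs reported at a given distance.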


The collect() command:

The collect command generates a collector's curve from the given file. The collect command can only be executed after a successful read.list, read.sabund or read.rabund command, with one exception: it can also be executed after a successful cluster command, in which case it will use the .list file output by cluster. The collect command outputs a file for each estimator you choose to use. The collect command parameters are label, line, freq and single. No parameters are required, but you may not use both the line and label parameters at the same time. The collect command should be in the following format: collect(label=yourLabel, line=yourLines, freq=yourFreq, single=yourEstimators). Example: collect(label=unique-.01-.03, line=0,5,10, freq=10, single=collect-chao-ace-jack). The default value for freq is 100, and the default for single is collect-chao-ace-jack-bootstrap-shannon-npshannon-simpson-rarefraction. The label and line parameters are used to analyze specific lines in your input.
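
As a rough illustration of what a collector's curve for the observed number of OTUs records, the Python sketch below walks through sequences in sampling order and reports the running OTU count every freq sequences. The function name, the input layout (one OTU label per sequence), and the decision to always report the final point are assumptions made for this example, not a description of MOTHUR's internals.

```python
def collectors_curve(otu_of_each_sequence, freq=100):
    """Hypothetical helper: (number sampled, OTUs observed so far) pairs."""
    seen = set()
    curve = []
    total = len(otu_of_each_sequence)
    for n, otu in enumerate(otu_of_each_sequence, start=1):
        seen.add(otu)
        # Report every freq-th sequence, plus the last one (an assumption).
        if n % freq == 0 or n == total:
            curve.append((n, len(seen)))
    return curve


# Seven sequences listed in the order they were sampled, labelled by OTU.
curve = collectors_curve([0, 0, 1, 0, 2, 1, 3], freq=2)
```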


The rarefaction() command:

The rarefaction command generates a rarefaction curve from a given file. The rarefaction command can only be executed after a successful read.list, read.sabund or read.rabund command, with one exception: it can also be executed after a successful cluster command, in which case it will use the .list file output by cluster. The rarefaction command outputs a file for each estimator you choose to use. It is recommended that you use only the rarefaction estimator. The rarefaction command parameters are label, line, iters, freq and rarefaction. No parameters are required, but you may not use both the line and label parameters at the same time. The rarefaction command should be in the following format: rarefaction(label=yourLabel, line=yourLines, iters=yourIters, freq=yourFreq, rarefaction=yourEstimators). Example: rarefaction(label=unique-.01-.03, line=0,5,10, iters=10000, freq=10, rarefaction=rarefaction-rchao-race-rjack-rbootstrap-rshannon-rnpshannon-rsimpson). The default value for iters is 1000, for freq is 100, and for rarefaction is rarefaction, which calculates the rarefaction curve for the observed richness. The label and line parameters are used to analyze specific lines in your input.
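
The role of iters and freq can be pictured with a small randomization sketch in Python: shuffle the sequences iters times, and at every freq-th sequence average the number of OTUs seen so far. This is a generic resampling approach written for illustration; the function name, the seed argument, and the reporting of the final point are assumptions, and MOTHUR's own procedure may differ in detail.

```python
import random


def rarefaction_curve(rabund, iters=1000, freq=100, seed=None):
    """Hypothetical helper: mean OTUs observed at each sampled depth.

    rabund lists the number of sequences in each OTU.
    """
    rng = random.Random(seed)
    # One OTU label per sequence, e.g. [3, 2, 1] -> [0, 0, 0, 1, 1, 2].
    labels = [otu for otu, count in enumerate(rabund) for _ in range(count)]
    total = len(labels)
    points = [n for n in range(1, total + 1) if n % freq == 0 or n == total]
    sums = dict.fromkeys(points, 0)
    for _ in range(iters):
        rng.shuffle(labels)  # a new random sampling order
        seen = set()
        for n, otu in enumerate(labels, start=1):
            seen.add(otu)
            if n in sums:
                sums[n] += len(seen)
    return [(n, sums[n] / iters) for n in points]


rare = rarefaction_curve([3, 2, 1], iters=200, freq=2, seed=1)
```

Whatever the random order, the last point always equals the total number of OTUs, because every sequence has been sampled by then.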


The summary() command:

The summary command can only be executed after a successful read.list, read.sabund or read.rabund command, with one exception: it can also be executed after a successful cluster command, in which case it will use the .list file output by cluster. The summary command outputs a file for each estimator you choose to use. The summary command parameters are label, line and summary. No parameters are required, but you may not use both the line and label parameters at the same time. The summary command should be in the following format: summary(label=yourLabel, line=yourLines, summary=yourEstimators). Example: summary(label=unique-.01-.03, line=0,5,10, summary=collect-chao-ace-jack-bootstrap-shannon-npshannon-simpson-rarefraction). The default value for summary is collect-chao-ace-jack-bootstrap-shannon-npshannon-simpson-rarefraction. The label and line parameters are used to analyze specific lines in your input.


The collect.shared() command:

The collect.shared command generates a collector's curve from the given file representing several groups. The collect.shared command can only be executed after a successful read.shared command. It outputs a file for each estimator you choose to use. The collect.shared command parameters are label, line, freq, jumble and shared. No parameters are required, but you may not use both the line and label parameters at the same time. The collect.shared command should be in the following format: collect.shared(label=yourLabel, line=yourLines, freq=yourFreq, jumble=yourJumble, shared=yourEstimators). Example: collect.shared(label=unique-.01-.03, line=0,5,10, freq=10, jumble=1, shared=sharedChao-sharedAce-sharedJabund). The default value for jumble is 0 (meaning don’t jumble; if it’s set to 1 then it will jumble), for freq is 100, and for shared is sharedChao-sharedAce-sharedJabund-sharedSorensonAbund-sharedJclass-sharedSorClass-sharedJest-sharedSorEst-SharedThetaYC-SharedThetaN. The label and line parameters are used to analyze specific lines in your input.


The rarefaction.shared() command:

The rarefaction.shared command generates a rarefaction curve from a given file representing several groups. The rarefaction.shared command can only be executed after a successful read.shared command. It outputs a file for each estimator you choose to use. The rarefaction.shared command parameters are label, line, iters, jumble and sharedrarefaction. No parameters are required, but you may not use both the line and label parameters at the same time. The rarefaction.shared command should be in the following format: rarefaction.shared(label=yourLabel, line=yourLines, iters=yourIters, jumble=yourJumble, sharedrarefaction=yourEstimators). Example: rarefaction.shared(label=unique-.01-.03, line=0,5,10, iters=10000, jumble=1, sharedrarefaction=sharedobserved). The default value for jumble is 0 (meaning don’t jumble; if it’s set to 1 then it will jumble), for iters is 1000, and for sharedrarefaction is sharedobserved, which calculates the shared rarefaction curve for the observed richness. The label and line parameters are used to analyze specific lines in your input.


The summary.shared() command:

The summary.shared command can only be executed after a successful read.shared command. It outputs a file for each estimator you choose to use. The summary.shared command parameters are label, line, jumble and sharedsummary. No parameters are required, but you may not use both the line and label parameters at the same time. The summary.shared command should be in the following format: summary.shared(label=yourLabel, line=yourLines, jumble=yourJumble, sharedsummary=yourEstimators). Example: summary.shared(label=unique-.01-.03, line=0,5,10, jumble=1, sharedsummary=sharedChao-sharedAce-sharedJabund-sharedSorensonAbund-sharedJclass-sharedSorClass-sharedJest-sharedSorEst-SharedThetaYC-SharedThetaN). The default value for jumble is 0 (meaning don’t jumble; if it’s set to 1 then it will jumble), and the default for sharedsummary is sharedChao-sharedAce-sharedJabund-sharedSorensonAbund-sharedJclass-sharedSorClass-sharedJest-sharedSorEst-SharedThetaYC-SharedThetaN. The label and line parameters are used to analyze specific lines in your input.


The shared() command:

The shared command can only be executed after a successful read.shared command. The shared command parses a .list file and separates it into groups. It outputs a .shared file containing the OTU information for each group. There are no shared command parameters. The shared command should be in the following format: shared().
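
The separation into groups can be sketched in a few lines of Python: for each group, count how many of its sequences fall in each OTU. The function name and the in-memory inputs (a list of OTUs and a dictionary from sequence name to group) are assumptions made for this illustration; the layout of the group file itself is not covered here.

```python
def shared_counts(otus, group_of):
    """Hypothetical helper: per-group sequence counts for every OTU.

    otus is a list of OTUs, each a list of sequence names.
    group_of maps a sequence name to the name of its group.
    """
    groups = sorted(set(group_of.values()))
    return {
        group: [sum(1 for name in otu if group_of[name] == group) for otu in otus]
        for group in groups
    }


# Three OTUs at one distance level, with sequences from groups A and B.
otus = [["a1", "a2", "b1"], ["b2"], ["a3", "b3"]]
group_of = {"a1": "A", "a2": "A", "a3": "A", "b1": "B", "b2": "B", "b3": "B"}
counts = shared_counts(otus, group_of)
```

Each list in the result has one entry per OTU, in the same order for every group, which is what allows the groups to be compared OTU by OTU.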


The parselist() command:

The parselist command is similar to the shared command. It parses a list file and separates it into groups. It outputs a .list file for each group. The parselist command parameter options are listfile and groupfile. The parselist command should be in the following format: parselist(listfile=yourListFile, groupfile=yourGroupFile). The listfile and groupfile parameters are required.


The help() command:

The help command should be in the following format: help(). The help command has no required parameters. If you would like help with a specific command you may enter it as a parameter. Example help(read.phylip).


The quit() command:

The quit command terminates the mothur program. The quit command should be in the following format: quit().

Note: Do not put spaces between parameter labels (e.g. distfile), '=' and parameter values (e.g. yourDistFile).


For dotur and sons veterans:

>dotur amazon.dist

         mothur>read.phylip(distfile=amazon.dist)

>dotur -l amazon.dist

         mothur>read.phylip(distfile=amazon.dist)

>dotur -c f amazon

         mothur>cluster(method=furthest)

>dotur -c n amazon

         mothur>cluster(method=nearest)

>dotur -c a amazon

         mothur>cluster(method=average)

>dotur -r amazon.dist

         mothur>rarefaction(rarefaction=rarefaction) note: after you have read a listfile

>dotur -p 10 amazon.dist

         mothur>cluster(precision=10) note: after you have read a distfile

>dotur -stop 0.10 amazon.dist

         mothur>cluster(cutoff=0.10) note: after you have read a distfile

>dotur -i 10000 amazon.dist

         mothur>rarefaction(iters=10000) note: after you have read a listfile

>./sons -list 70.fn.list -names 70.stool_compare.names

         mothur>read.shared(listfile=70.fn.list, groupfile=70.stool_compare.names)

>./sons -list 70.fn.list -names 70.stool_compare.names -i 10000

         mothur>rarefaction.shared(iters=10000) note: after a read.shared command

>./sons -list 70.fn.list -names 70.stool_compare.names -jumble

         mothur>collect.shared(jumble=1) note: after a read.shared command

Finally, execution in Windows and Linux (and Mac OSX) is essentially the same. In Windows, you cannot merely double click on the icon to get the program to execute. You must use the “Command Prompt” program found by going Start -> Program Files -> Accessories -> Command Prompt. Then you must type in the path of MOTHUR and your distance file to execute the program:

C:\> "Documents and Settings\pds\Desktop\mothur.exe" "Documents and Settings\pds\Desktop\amazon.dist"

Alternatively, you can change the root path to move to the desired directory and execute MOTHUR from there:

C:\PATH\> mothur.exe amazon.dist

Be forewarned that MOTHUR does not seem to run as quickly in Windows as it does in Linux. I would encourage everyone to align their sequences in ARB, which runs on Linux or OSX, and to run MOTHUR on the same operating system.


Output Files

The MOTHUR program can create up to 33 different output files. Although this may initially seem like a lot, you do not have to output them all. For example, each richness estimator and diversity index has its own output file, and MOTHUR will only create a file for the estimators you choose to use. All files are tab-delimited and are easily imported into Excel or your favorite spreadsheet. The user is encouraged to track down the original papers to better understand how the estimators were derived. Most reprints are available online (try www.jstor.org first). MOTHUR produces three output files from the cluster command: *.list, *.rabund, and *.sabund. It produces a file for each estimator you choose to run with the collect, rarefaction, summary, collect.shared, rarefaction.shared or summary.shared commands. They are as follows: *.collect, *.rarefaction, *.summary, *.chao, *.ace, *.jack, *.shannon, *.np_shannon, *.simpson, *.bootstrap, *.r_chao, *.r_ace, *.r_jack, *.r_shannon, *.r_npshannon, *.r_simpson, *.r_bootstrap, *.sharedChao, *.sharedAce, *.sharedJabund, *.SharedSorensonAbund, *.SharedJclass, *.SharedSorClass, *.SharedJest, *.SharedSorEst, *.SharedThetaYC, *.SharedThetaN, *.sharedobserved, and *.sharedsummary. The shared command produces a *.shared file. The parselist command produces a *.list file for each group represented. I will explain what each file contains and then how the calculations were derived.

*.rabund, *.list

These files contain the number of sequences (*.rabund) and their identity (*.list) in each OTU as a function of distance. In the *.rabund file, the first column contains the distance used to define an OTU, the second is the number of OTUs, and the remaining columns give the number of sequences in each OTU. The same information is contained in the *.list file, except that instead of the number of sequences in each OTU, MOTHUR gives the names of the sequences in that OTU, separated by commas.
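
The relationship between the two files can be shown with a short Python sketch that turns one *.list row into the matching *.rabund row. It assumes the tab-delimited layout described above (distance, number of OTUs, then one column per OTU); the function name and the sample row are made up for illustration.

```python
def list_row_to_rabund_row(line):
    """Hypothetical helper: replace each OTU's names with its sequence count."""
    label, n_otus, *otus = line.rstrip("\n").split("\t")
    sizes = [len(otu.split(",")) for otu in otus]
    if int(n_otus) != len(sizes):
        raise ValueError("OTU count does not match the number of OTU columns")
    return "\t".join([label, n_otus, *(str(size) for size in sizes)])


# A made-up row: two OTUs at distance 0.03.
rabund_row = list_row_to_rabund_row("0.03\t2\tseqA,seqB,seqC\tseqD\n")
```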

*.sabund

This file contains data for constructing a rank-abundance plot of the OTU data for each distance level. The first column contains the distance and the second is the number of OTUs observed at that distance. The successive values in the row are the number of OTUs that were found once, twice, etc.
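
The counts of OTUs found once, twice, and so on can be derived from the per-OTU sequence counts. The Python sketch below shows that step only; the function name is hypothetical and the sketch does not attempt to reproduce the leading columns of the file.

```python
from collections import Counter


def abundance_frequencies(rabund):
    """Hypothetical helper: entry k-1 is the number of OTUs with k sequences."""
    if not rabund:
        return []
    freq = Counter(rabund)
    return [freq.get(k, 0) for k in range(1, max(rabund) + 1)]


# Five OTUs holding 3, 1, 1, 2 and 1 sequences.
frequencies = abundance_frequencies([3, 1, 1, 2, 1])
```

Here three OTUs were found once, one was found twice, and one was found three times.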

*.collect, *.chao, *.ace, *.jack, *.shannon, *.np_shannon, *.simpson, *.bootstrap, *.r_chao, *.r_ace, *.r_jack, *.r_shannon, *.r_npshannon, *.r_simpson, *.r_bootstrap, *.rarefaction

Data to construct collector’s curves for each comparison are provided in the corresponding file. The first line of each file has a description of each column’s contents: first the number sampled, then each distance level. A row in the file is produced for each frequency level requested. Each row contains the collector’s curve data.

*.sharedChao, *.sharedAce, *.sharedJabund, *.SharedSorensonAbund, *.SharedJclass, *.SharedSorClass, *.SharedJest, *.SharedSorEst, *.SharedThetaYC, *.SharedThetaN, *.sharedobserved

Data to construct collector’s curves for all of the group comparisons are provided in the corresponding files. The first line of each group comparison has a description of each column’s contents: first the number sampled, then each distance level and the two groups analysed. A row in the file is produced for each frequency level requested. Each row contains the collector’s curve data for the two groups analysed.

*.summary

Data to construct collector’s curves for each comparison at the final distance level are provided in the *.summary file. The first line of the *.summary file has a description of each column’s contents. Each following row contains the end point of the collector’s curve for the given comparison.

*.sharedsummary

Data to construct collector’s curves for each comparison at the final distance level are provided in the *.sharedsummary file. The first line of the *.sharedsummary file has a description of each column’s contents. Each following row contains the distance level, the groups compared and the end point of the collector’s curve for the given comparison.

*.shared

This file contains the frequency of sequences from each group found in each OTU. Each row consists of the distance being considered, the group name, the number of OTUs, and the abundance information, separated by tabs. In the abundance information, each subsequent number represents a different OTU and gives the number of sequences in that group that clustered within that OTU. Note that OTU frequencies can only be compared within a distance definition.
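
As an illustration of how rows in this layout can be compared, the Python sketch below parses two rows for the same distance and counts the OTUs that contain sequences from both groups. The function names, the group names and the sample rows are assumptions made for this example, and the sketch relies on both rows listing the OTUs in the same order.

```python
def parse_shared_row(line):
    """Hypothetical helper: split a row into (distance, group, counts)."""
    label, group, n_otus, *counts = line.rstrip("\n").split("\t")
    counts = [int(c) for c in counts]
    if int(n_otus) != len(counts):
        raise ValueError("OTU count does not match the number of columns")
    return label, group, counts


def count_shared_otus(row_a, row_b):
    """Hypothetical helper: number of OTUs observed in both groups."""
    label_a, _, a = parse_shared_row(row_a)
    label_b, _, b = parse_shared_row(row_b)
    # Frequencies are only comparable within one distance definition.
    if label_a != label_b or len(a) != len(b):
        raise ValueError("rows must describe the same distance and OTUs")
    return sum(1 for x, y in zip(a, b) if x > 0 and y > 0)


# Made-up rows for two groups at distance 0.03.
n_shared = count_shared_otus("0.03\tstool\t3\t2\t0\t1", "0.03\ttissue\t3\t1\t4\t0")
```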


Example Calculations

*.collect and *.r_rarefaction

These are the collector's curve and rarefaction curve data for the number of observed OTUs as a function of distance between sequences and the number of sequences sampled. This is merely a count of the number of OTUs observed at any given point in the sampling process.

By theory, the rarefaction curve should match the following expression:

<math>S_n=S_t-\left ( \frac{\sum_{i=1}^{S_t} {N - N_i\choose n} }{{N \choose n} } \right )</math>
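
The expression can be checked by hand or with a few lines of Python. The sketch below reads the symbols in the usual way for this formula, which is an assumption since they are not defined at this point in the text: S_t is the total number of OTUs, N is the total number of sequences, N_i is the number of sequences in OTU i, and n is the number of sequences sampled. The function name is hypothetical.

```python
from math import comb


def expected_otus(rabund, n):
    """Hypothetical helper: S_n for a sample of n sequences.

    rabund lists N_i, the number of sequences in each OTU.
    """
    total = sum(rabund)  # N
    if not 0 <= n <= total:
        raise ValueError("n must be between 0 and the number of sequences")
    # Each term is the chance that OTU i is missing from the sample.
    missing = sum(comb(total - n_i, n) for n_i in rabund) / comb(total, n)
    return len(rabund) - missing


# Three OTUs holding 3, 2 and 1 sequences (N = 6).
s_2 = expected_otus([3, 2, 1], 2)
```

For this example, sampling n = 2 gives 3 - (3 + 6 + 10)/15 = 26/15, or about 1.73 OTUs. Sampling a single sequence gives exactly 1, and sampling all 6 gives all 3 OTUs.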