Mothur manual

Revision as of 18:42, 12 January 2009 by Westcott (Talk | contribs)

MOTHUR Manual

Introduction

MOTHUR is a computer program that takes a distance matrix as its input file and assigns sequences to operational taxonomic units (OTUs) for every distance level that can be used to form OTUs, using either the nearest, furthest, or average neighbor clustering algorithm. These are also called single-linkage, complete-linkage, and UPGMA, respectively. Once sequences are assigned to OTUs, the frequency data for each distance level are used to construct rarefaction and collector's curves for the number of species observed, Shannon's and Simpson's diversity indices, and the Chao1, ACE, Jackknife, and Bootstrap richness estimators as a function of sampling effort and the distance used to define an OTU.

MOTHUR also uses non-parametric estimators to estimate similarity between communities based on membership and structure. MOTHUR determines the number of individuals in each community that were sampled for each OTU. Next it calculates collector's curves for the fraction of shared OTUs between the two communities (with and without correcting for unsampled individuals), the Jaccard and Sorenson indices, and the richness of OTUs shared between the two communities. Standard error values are calculated for the entire sequence collection. MOTHUR is freely available as C++ source code and as a Windows executable.

This manual is designed to achieve five goals:

  1.	Describe the differences among the three sequence assignment algorithms.
  2.	Show how to use MOTHUR.
  3.	Describe the output files and the equations used.
  4.	Validate output by making calculations by hand.
  5.	Answer frequently asked questions.

If you have any questions, complaints, or praise, please do not hesitate to contact Dr. Patrick D. Schloss at pschloss@microbio.umass.edu


Clustering Algorithms

Previous attempts at assigning sequences to OTUs have relied on gazing at a distance matrix and manually assigning sequences to OTUs, assigning OTUs based on BLAST results against the nt database, and using Forest Rohwer's program FASTGROUP. These methods have two main flaws. First, they are typically only used to obtain OTU data for one distance level. Second, they are somewhat arbitrary and prone to error (see our paper). For example, suppose you have three sequences A, B, and C, where A is 2% different from both B and C, but B and C are 3% different from each other. How do you define an OTU? Hmmm....

MOTHUR is a rapid, consistent, and objective way of assigning sequences to an OTU for all possible distance levels. It uses one of three rules: nearest neighbor, average neighbor, or furthest neighbor. Also, if you are interested in studying OTUs at cutoffs of 3, 5, 10 and 20% difference, you will get these all in one execution of MOTHUR (and everything in between as well!).

Here is how MOTHUR can assign sequences to OTUs...

Nearest neighbor: Each of the sequences within an OTU is at most X% distant from the most similar sequence in the OTU.

Furthest neighbor: All of the sequences within an OTU are at most X% distant from all of the other sequences within the OTU.

Average neighbor: This method is a middle ground between the other two algorithms.

A Cartoon Example

The default is the furthest neighbor for reasons that will be explained below after considering a simplified example.

For the moment, let's assume that instead of being interested in bacterial 16S rRNA sequences, we are interested in finding ways to cluster major cities along the eastern seaboard of the United States. This analysis isn't meant to be precise, but to make several points about how we can assign sequences to OTUs. Perhaps we have a theory that all of the cities within a certain distance of each other share some cultural heritage. How will we cluster these? In the map to the right, there are 15 major cities and state capitals.

Insert picture here


By the nearest neighbor method, you start by picking a city and looking for any city that is within 200 miles of it. So let's pick Concord. Boston is within 200 miles of it, so Boston joins this group. However, Providence only needs to be within 200 miles of its nearest neighbor in this group, so it joins too. You can go on and on down the seaboard until you reach Atlanta, continually finding cities that are within 200 miles of the nearest neighbor in the group. Charleston, Frankfort, and Columbus would form their own OTU, and Detroit and Lansing their own. But if your sampling intensity increased to include Toledo, OH and Blacksburg, VA, all of the US cities shown in the map would be in one group: hardly a regional grouping!

FASTGROUP would take the first city, Concord, determine what cities are within 200 miles of it, and lump all of those cities into the same group. Then it would take the next city not yet in a group and search for all those cities not already in an OTU but within 200 miles of it. This could result in a decent assignment of cities to groups, or it might not. It all depends on which reference cities are picked.

Insert picture here

If, on the other hand, we say that every city in a group is at most 200 miles from any other city in the group, then we are using the furthest neighbor approach. So, we see that Albany, Hartford, Concord, Boston, and Providence are all within the same group, and Harrisburg, New York City, Dover, and Philadelphia are in a separate group, and so on. MOTHUR starts by finding the minimum distance between any two sequences and putting those together in an OTU. As the distance is increased, the original sequences remain in the OTU, but new sequences must meet the requirement that they are within a known distance of all the other sequences in the OTU.

Insert picture here


From this cartoon example, it is hopefully clear why we prefer the furthest neighbor approach to the nearest neighbor approach. In addition, FASTGROUP is a quasi-nearest neighbor/furthest neighbor approach whose results depend on which sequence is picked as the reference sequence. The average neighbor approach would create groupings that are a compromise between the nearest and furthest neighbor approaches. We prefer the furthest neighbor to the average neighbor because it is the more conservative approach and is not sensitive to the number of sequences sampled, as the average neighbor would be.

DNA Distance Example

During the first step, MOTHUR will produce three files: *.list, *.rabund, and *.sabund. The *.list file tells the user which sequences are in each OTU and the *.rabund file tells the user how many sequences are in each OTU. The *.sabund file tells the user how many OTUs contained a certain number of sequences. The format of these files will be addressed later in this manual.

Consider the following input file constructed from DNADIST (obtained from the DNADIST documentation):

  5
  A		0.000000	0.303893 	0.857546 	1.158921 	1.542897
  B 		0.303893	0.000000 	0.339731 	0.913519 	0.619666
  C		0.857546 	0.339731 	0.000000 	1.631729 	1.293707
  D 		1.158921 	0.913519	1.631729 	0.000000 	0.165882
  E 		1.542897 	0.619666	1.293707 	0.165882	 0.000000

First, we'll consider the nearest neighbor algorithm. The smallest non-diagonal distance is 0.165882 between D and E. These two sequences enter into the first OTU at a distance of 0.165882. Since we want to retain the minimum distance, for each remaining sequence we keep the smaller of its distances to D and to E, and we can rewrite the matrix as follows:

  A 		0.000000 	0.303893	0.857546 	1.158921
  B 		0.303893 	0.000000 	0.339731 	0.619666
  C 		0.857546 	0.339731 	0.000000 	1.293707
  D,E 		1.158921 	0.619666 	1.293707 	0.000000

Now we have a doubleton containing D and E and three singletons. Incidentally, if we were interested in a definition where all sequences within an OTU were identical, then there would be five singleton OTUs.

Continuing on, we again look for the smallest distance value and find that between A and B the distance is 0.303893. We repeat the clustering step and find:

  A,B		0.000000 	0.339731 	0.619666
  C 		0.339731 	0.000000 	1.293707
  D,E 		0.619666 	1.293707 	0.000000

So at a distance of 0.303893 there are two doubleton OTUs and a singleton OTU. The next smallest distance is 0.339731 between the OTU containing A and B and the singleton OTU containing C:

  A,B,C 	0.000000 0.619666
  D,E		0.619666 0.000000

Now we have a tripleton and a doubleton OTU when each sequence is at most 0.339731 from its nearest neighbor in the OTU. Finally, there is one OTU at a distance of 0.619666, the shortest distance between a sequence in A,B,C and a sequence in D,E.
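The nearest neighbor walk-through above can be reproduced with a short Python sketch. This is an illustration only and not MOTHUR's C++ implementation; the function name nearest_neighbor_merges and the dictionary layout are assumptions made for this example.

```python
from itertools import combinations

# Pairwise distances copied from the DNADIST example matrix above.
DIST = {
    ("A", "B"): 0.303893, ("A", "C"): 0.857546, ("A", "D"): 1.158921,
    ("A", "E"): 1.542897, ("B", "C"): 0.339731, ("B", "D"): 0.913519,
    ("B", "E"): 0.619666, ("C", "D"): 1.631729, ("C", "E"): 1.293707,
    ("D", "E"): 0.165882,
}


def nearest_neighbor_merges(dist):
    """Return (distance, OTUs) after each merge, using the minimum rule."""
    names = sorted({name for pair in dist for name in pair})
    # Each OTU is a frozenset of sequence names; start with singletons.
    d = {frozenset([frozenset([a]), frozenset([b])]): v
         for (a, b), v in dist.items()}
    otus = [frozenset([n]) for n in names]
    history = []
    while len(otus) > 1:
        # Find the closest pair of OTUs in the current matrix.
        x, y = min(combinations(otus, 2), key=lambda p: d[frozenset(p)])
        level = d[frozenset((x, y))]
        merged = x | y
        rest = [o for o in otus if o not in (x, y)]
        # Nearest neighbor: keep the smaller of the two old distances.
        for o in rest:
            d[frozenset((merged, o))] = min(d[frozenset((x, o))],
                                            d[frozenset((y, o))])
        otus = rest + [merged]
        history.append((level, sorted(",".join(sorted(o)) for o in otus)))
    return history


for level, otus in nearest_neighbor_merges(DIST):
    print(f"{level:.6f}\t{len(otus)}\t" + "\t".join(otus))
```

Replacing min with max in the update loop gives the furthest neighbor rule described later in this section.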

The *.nn.rabund file would look like this:

  Unique 	5	1	1	1	1	1
  0.165882 	4	2	1	1	1	
  0.303893 	3	2	2	1
  0.339731 	2	3	2
  0.619666 	1	5

And the *.nn.list file would look like this:

  Unique 	5	A	B	C	D	E
  0.165882 	4	A	B	C	D,E
  0.303893 	3	A,B	C	D,E
  0.339731 	2	A,B,C	 D,E
  0.619666 	1	A,B,C,D,E

Finally, the *.nn.sabund file would look like this:

  Unique		5	5
  0.165882 	4	3	1
  0.303893 	3	1	2
  0.339731 	2	0	1	1
  0.619666 	1	0	0	0	0	1
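The relationship between the three files can be sketched in Python: each *.rabund and *.sabund row can be derived from the matching *.list row. The layout assumed here (label, number of OTUs, then one entry per OTU or per OTU size) is read off the example tables above, and the function names rabund_row and sabund_row are hypothetical.

```python
from collections import Counter


def rabund_row(list_row):
    """Convert one *.list row into the matching *.rabund row."""
    label, n_otus, *otus = list_row.split()
    # OTU sizes, largest first, as in the example tables.
    sizes = sorted((len(otu.split(",")) for otu in otus), reverse=True)
    return [label, n_otus] + [str(s) for s in sizes]


def sabund_row(list_row):
    """Convert one *.list row into the matching *.sabund row."""
    label, n_otus, *otus = list_row.split()
    counts = Counter(len(otu.split(",")) for otu in otus)
    largest = max(counts)
    # Entry i is the number of OTUs that contain exactly i sequences.
    return [label, n_otus] + [str(counts.get(i, 0))
                              for i in range(1, largest + 1)]


row = "0.303893 3 A,B C D,E"
print("\t".join(rabund_row(row)))  # 0.303893  3  2  2  1
print("\t".join(sabund_row(row)))  # 0.303893  3  1  2
```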

If we repeat the analysis for the furthest neighbor we again look for the smallest distance in the original distance matrix, but this time, we retain the maximum distance between the two sequences we are joining. So for a distance of 0.165882 between D and E, the following matrix would result:

  A		0.000000 	0.303893 	0.857546 	1.542897
  B 		0.303893 	0.000000 	0.339731 	0.913519
  C 		0.857546 	0.339731 	0.000000 	1.631729
  D,E 		1.542897 	0.913519 	1.631729 	0.000000

The next smallest distance is between A and B at 0.303893 leading to the following matrix:

  A,B 		0.000000 	0.857546 	1.542897
  C 		0.857546 	0.000000 	1.631729
  D,E 		1.542897 	1.631729 	0.000000

In the next step, the smallest distance value (0.857546) is between the OTU containing A and B and the singleton OTU with only C in it:

  A,B,C 	0.000000 	1.631729
  D,E 		1.631729 	0.000000

Finally the five sequences join the same OTU with a distance of 1.631729 between them.

The resulting *.fn.rabund file would look like this:

  Unique 	5	1	1	1	1	1
  0.165882 	4	2	1	1	1
  0.303893 	3	2	2	1
  0.857546 	2	3	2
  1.631729 	1	5

The *.fn.list file would look like this:

  Unique 	5	A	B	C	D	E
  0.165882 	4	A	B	C	D,E
  0.303893 	3	A,B	C	D,E
  0.857546 	2	A,B,C	D,E
  1.631729 	1	A,B,C,D,E

And the *.fn.sabund file would look like this:

  Unique 	5	5
  0.165882 	4	3	1
  0.303893 	3	1	2
  0.857546 	2	0	1	1
  1.631729 	1	0	0	0	0	1

The final method is the average neighbor approach. Instead of picking the lowest or highest distance between the two sequences being joined, it averages the distances in the two columns and rows being combined. This is an unweighted average (the "U" in UPGMA): every sequence counts equally, so the average must take into account the number of sequences in each OTU. Again using the smallest distance to start the process, the D and E distances are merged:

  A 		0.000000 	0.303893 	0.857546 	1.350909
  B 		0.303893	0.000000 	0.339731 	0.766593
  C 		0.857546 	0.339731 	0.000000 	1.462718
  D,E 		1.350909 	0.766593 	1.462718 	0.000000

The next smallest distance is 0.303893 between A and B:

  A,B 		0.000000 	0.598639 	1.058751
  C 		0.598639 	0.000000 	1.462718
  D,E  	1.058751 	1.462718 	0.000000

The next smallest distance is 0.598639 between the doubleton OTU containing A and B and the singleton OTU containing C. But remember that since this is an unweighted approach, when we average the A,B and C columns and rows, we must multiply the A,B data by two and the C data by one, and then divide the sum by three:

  A,B,C 	0.000000 	1.193407
  D,E 		1.193407 	0.000000
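The size-weighted average used in this last step can be written out as a small Python sketch. The function name average_neighbor_distance is hypothetical; the numbers are taken from the matrices above.

```python
def average_neighbor_distance(d_x, n_x, d_y, n_y):
    """Distance from the merged OTU (x joined with y) to a third OTU.

    d_x, d_y: distances from x and from y to the third OTU.
    n_x, n_y: number of sequences in x and in y.
    """
    return (n_x * d_x + n_y * d_y) / (n_x + n_y)


# Joining A,B (two sequences) with C (one sequence); the third OTU is D,E.
d = average_neighbor_distance(1.058751, 2, 1.462718, 1)
print(f"{d:.6f}")  # 1.193407
```

With n_x = n_y = 1 the same function reduces to the simple mean used in the first two steps.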

Finally, all five sequences join the same OTU at a distance of 1.193407. We would obtain an *.an.rabund file like this:

  Unique 	5	1	1	1	1	1
  0.165882 	4	2	1	1	1
  0.303893 	3	2	2	1
  0.598639 	2	3	2
  1.193407 	1	5

An *.an.list file like this:

  Unique 	5	A	B	C	D	E
  0.165882 	4	A	B	C	D,E
  0.303893 	3	A,B	C	D,E
  0.598639 	2	A,B,C	D,E
  1.193407 	1	A,B,C,D,E

And an *.an.sabund file like this:

  Unique 	5	5
  0.165882 	4	3	1
  0.303893 	3	1	2
  0.598639 	2	0	1	1
  1.193407 	1	0	0	0	0	1

The difference between the *.fn.rabund, *.an.rabund, and *.nn.rabund files in this test case is the distance at which each OTU is defined. In other applications, the clustering of sequences may also differ among the three methods. But hopefully you can see how in the nearest neighbor method every sequence in the final OTU is at most 0.619666 from at least one other sequence in that OTU, in the furthest neighbor method every sequence in the final OTU is at most 1.631729 from every other member of the OTU, and the average neighbor method falls between the two. If you inspect the original distance matrix, this should make some sense.
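One way to inspect the original matrix is to list every distance between a sequence in A,B,C and a sequence in D,E, then take the minimum, maximum, and mean. The Python sketch below does this; the helper name between is hypothetical.

```python
# Pairwise distances copied from the original DNADIST example matrix.
DIST = {
    ("A", "B"): 0.303893, ("A", "C"): 0.857546, ("A", "D"): 1.158921,
    ("A", "E"): 1.542897, ("B", "C"): 0.339731, ("B", "D"): 0.913519,
    ("B", "E"): 0.619666, ("C", "D"): 1.631729, ("C", "E"): 1.293707,
    ("D", "E"): 0.165882,
}


def between(otu_x, otu_y):
    """All distances between a member of otu_x and a member of otu_y."""
    return [DIST[tuple(sorted((a, b)))] for a in otu_x for b in otu_y]


pairs = between("ABC", "DE")
nearest = min(pairs)                # nearest neighbor joining distance
furthest = max(pairs)               # furthest neighbor joining distance
average = sum(pairs) / len(pairs)   # average neighbor joining distance
print(nearest, furthest, round(average, 4))
```

The three values agree with the final rows of the nearest, furthest, and average neighbor tables, up to rounding in the sixth decimal place for the average.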

You should also note that these examples are presented with 10^6 precision (six decimal places). If you can achieve this with 16S rRNA sequences (or any sequence!) let me know! One of the settings in MOTHUR is the precision parameter, which can be set to 10, 100, 1,000, or 10,000. As I alluded to, you are unlikely to really have precision up to 10,000 in the average gene sequence. Considering that the average sequencing project only obtains about 500 bp from a gene, a more realistic level of precision would be between 100 and 1,000. The default precision is 100, which means that OTUs are reported every 0.01 unless you change the precision parameter (see below). Because of MOTHUR's use of the precision parameter, a distance of 0.03 should not be interpreted as the same as 0.030 or 0.0300. The largest distance that could fall under 0.03 would be 0.0349. If you want the largest distance to be 0.03049, then set the precision to 1,000, and so on.
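The effect of the precision parameter can be illustrated with a minimal Python sketch. It assumes that distances are rounded to the nearest 1/precision, which is consistent with the 0.0349 and 0.03049 figures above but is not taken from MOTHUR's source code; the function name report_label is hypothetical.

```python
def report_label(distance, precision=100):
    """Round a distance to the nearest 1/precision (assumed rounding rule)."""
    return round(distance * precision) / precision


print(report_label(0.0349))         # precision 100: reported at 0.03
print(report_label(0.0349, 1000))   # precision 1,000: reported at 0.035
print(report_label(0.03049, 1000))  # precision 1,000: reported at 0.03
```

Note that Python's round sends exact ties to the nearest even integer, so how a value such as 0.035 is handled at precision 100 is a property of this sketch rather than a statement about MOTHUR.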


How to Run MOTHUR

To compile MOTHUR in Linux, type the following:

>g++ mothur.C -O4 -o mothur

MOTHUR is run from the command line prompt.

Mothur currently has 17 commands: read.phylip(), read.column(), read.list(), read.rabund(), read.sabund(), read.shared(), cluster(), collect(), collect.shared(), rarefaction(), rarefaction.shared(), summary(), summary.shared(), shared(), parselist(), help() and quit().

The read() commands:

The read.phylip or read.column command must be run before you execute the cluster command. The read.list, read.rabund or read.sabund command must be run before you can execute the collect, rarefaction or summary commands. The read.shared command must be run before you can execute the collect.shared, rarefaction.shared, summary.shared, or shared commands.

The read.phylip command is used to read a distance matrix file in phylip format. The read.phylip command parameter options are distfile, namefile, cutoff and precision. The read.phylip command should be in the following format: read.phylip(distfile=yourDistFile, namefile=yourNameFile, cutoff=yourCutoff, precision=yourPrecision). The distfile parameter is required. If you do not provide a cutoff value, 10.00 is assumed. If you do not provide a precision value, 100 is assumed.

The read.column command is used to read a distance matrix file in column format. The read.column command parameter options are distfile, namefile, cutoff and precision. The read.column command should be in the following format: read.column(distfile=yourDistFile, namefile=yourNameFile, cutoff=yourCutoff, precision=yourPrecision). The distfile and namefile parameters are required. If you do not provide a cutoff value, 10.00 is assumed. If you do not provide a precision value, 100 is assumed.

One of the read.list, read.rabund or read.sabund commands must be run before you execute a collect, rarefaction or summary command. Mothur will generate .list, .rabund and .sabund files upon completion of the cluster command, or you may use your own.

The read.list command parameter options are listfile and orderfile. The read.list command should be in the following format: read.list(listfile=yourListFile, orderfile=yourOrderFile). The listfile parameter is required.

The read.sabund command parameter options are sabundfile and orderfile. The read.sabund command should be in the following format: read.sabund(sabundfile=yourSabundFile, orderfile=yourOrderFile). The sabundfile parameter is required.

The read.rabund command parameter options are rabundfile and orderfile. The read.rabund command should be in the following format: read.rabund(rabundfile=yourRabundFile, orderfile=yourOrderFile). The rabundfile parameter is required.

The read.shared command must be run before you execute a shared, collect.shared, rarefaction.shared or summary.shared command. Mothur will generate a .list file upon completion of the cluster command, or you may use your own.

The read.shared command parameter options are listfile and groupfile. The read.shared command should be in the following format: read.shared(listfile=yourListFile, groupfile=yourGroupFile). The listfile and groupfile parameters are required.


The cluster() command:

The cluster command can only be executed after a successful read.phylip or read.column command. The cluster command parameter options are method, cutoff and precision. No parameters are required. The cluster command should be in the following format: cluster(method=yourMethod, cutoff=yourCutoff, precision=yourPrecision). The acceptable methods are furthest, nearest and average. If you do not provide a method, the default algorithm is furthest neighbor. The cluster command outputs three files, *.list, *.rabund, and *.sabund, which are described above.

The collect() command:

The collect command generates a collector's curve from the given file. The collect command can only be executed after a successful read.list, read.sabund or read.rabund command, with one exception: it can also be executed after a successful cluster command, in which case it uses the .list file output by cluster. The collect command outputs a file for each estimator you choose to use. The collect command parameters are label, line, freq and single. No parameters are required, but you may not use both the line and label parameters at the same time. The collect command should be in the following format: collect(label=yourLabel, line=yourLines, freq=yourFreq, single=yourEstimators). Examples: collect(label=unique-.01-.03, freq=10, single=collect-chao-ace-jack) or collect(line=0,5,10, freq=10, single=collect-chao-ace-jack). The default value for freq is 100, and the default for single is collect-chao-ace-jack-bootstrap-shannon-npshannon-simpson-rarefraction. The label and line parameters are used to analyze specific lines in your input.
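The idea behind a collector's curve can be sketched in Python for the simplest case, the number of OTUs observed as sequences are added one at a time. This is a conceptual illustration under stated assumptions, not MOTHUR's implementation: the function name collectors_curve, the in-memory inputs, and the choice to always report the final point are all assumptions.

```python
def collectors_curve(otu_of, order, freq=100):
    """Observed OTU count after every `freq` sequences, and at the end."""
    seen = set()
    curve = []
    for i, seq in enumerate(order, start=1):
        seen.add(otu_of[seq])
        if i % freq == 0 or i == len(order):
            curve.append((i, len(seen)))
    return curve


# OTUs at the 0.303893 level of the DNA distance example: A,B  C  D,E
otu_of = {"A": 0, "B": 0, "C": 1, "D": 2, "E": 2}
print(collectors_curve(otu_of, ["A", "B", "C", "D", "E"], freq=1))
# [(1, 1), (2, 1), (3, 2), (4, 3), (5, 3)]
```

A larger freq simply thins the curve, which is the role the freq parameter plays in the command description above.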

The rarefaction() command:

The rarefaction command generates a rarefaction curve from a given file. The rarefaction command can only be executed after a successful read.list, read.sabund or read.rabund command, with one exception: it can also be executed after a successful cluster command, in which case it uses the .list file output by cluster. The rarefaction command outputs a file for each estimator you choose to use. It is recommended that you use only the rarefaction estimator. The rarefaction command parameters are label, line, iters, freq and rarefaction. No parameters are required, but you may not use both the line and label parameters at the same time. The rarefaction command should be in the following format: rarefaction(label=yourLabel, line=yourLines, iters=yourIters, freq=yourFreq, rarefaction=yourEstimators). Examples: rarefaction(label=unique-.01-.03, iters=10000, freq=10, rarefaction=rarefaction-rchao-race-rjack-rbootstrap-rshannon-rnpshannon-rsimpson) or rarefaction(line=0,5,10, iters=10000, freq=10, rarefaction=rarefaction). The default value for iters is 1000, for freq is 100, and for rarefaction is rarefaction, which calculates the rarefaction curve for the observed richness. The label and line parameters are used to analyze specific lines in your input.
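A rarefaction curve for observed richness can be approximated by repeatedly shuffling the sequences and averaging the resulting collector's curves. The Python sketch below shows that general technique; it is an assumption that this resembles what MOTHUR does internally, and the function name rarefaction_curve and its seed argument are hypothetical.

```python
import random


def rarefaction_curve(otu_sizes, iters=1000, freq=100, seed=0):
    """Mean number of OTUs observed at each sampling depth."""
    rng = random.Random(seed)
    # One entry per sequence, holding the index of its OTU.
    pool = [otu for otu, size in enumerate(otu_sizes) for _ in range(size)]
    points = [n for n in range(1, len(pool) + 1)
              if n % freq == 0 or n == len(pool)]
    totals = {n: 0 for n in points}
    for _ in range(iters):
        rng.shuffle(pool)  # sample all sequences in a random order
        seen = set()
        for i, otu in enumerate(pool, start=1):
            seen.add(otu)
            if i in totals:
                totals[i] += len(seen)
    return [(n, totals[n] / iters) for n in points]


# OTU sizes from the rabund row "0.303893  3  2  2  1" in the example.
for n, mean_otus in rarefaction_curve([2, 2, 1], iters=1000, freq=1):
    print(n, round(mean_otus, 2))
```

The first point is always 1 and the last point always equals the total number of OTUs; only the intermediate points depend on the random shuffles.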

The summary() command:

The summary command can only be executed after a successful read.list, read.sabund or read.rabund command, with one exception: the summary command can also be executed after a successful cluster command, in which case it uses the .list file output by cluster. The summary command outputs a file for each estimator you choose to use. The summary command parameters are label, line and summary. No parameters are required, but you may not use both the line and label parameters at the same time. The summary command should be in the following format: summary(label=yourLabel, line=yourLines, summary=yourEstimators). Examples: summary(label=unique-.01-.03, summary=collect-chao-ace-jack-bootstrap-shannon-npshannon-simpson-rarefraction) or summary(line=0,5,10, summary=collect-chao-ace-jack). The default value for summary is collect-chao-ace-jack-bootstrap-shannon-npshannon-simpson-rarefraction. The label and line parameters are used to analyze specific lines in your input.

The collect.shared() command:

The collect.shared command generates a collector's curve from the given file representing several groups. The collect.shared command can only be executed after a successful read.shared command. It outputs a file for each estimator you choose to use. The collect.shared command parameters are label, line, freq, jumble and shared. No parameters are required, but you may not use both the line and label parameters at the same time. The collect.shared command should be in the following format: collect.shared(label=yourLabel, line=yourLines, freq=yourFreq, jumble=yourJumble, shared=yourEstimators). Examples: collect.shared(label=unique-.01-.03, freq=10, jumble=1, shared=sharedChao-sharedAce-sharedJabund) or collect.shared(line=0,5,10, freq=10, jumble=1, shared=sharedChao-sharedAce-sharedJabund). The default value for jumble is 0 (meaning don’t jumble; if it’s set to 1 then it will jumble), for freq is 100, and for shared is sharedChao-sharedAce-sharedJabund-sharedSorensonAbund-sharedJclass-sharedSorClass-sharedJest-sharedSorEst-SharedThetaYC-SharedThetaN. The label and line parameters are used to analyze specific lines in your input.


The rarefaction.shared() command:

The rarefaction.shared command generates a rarefaction curve from a given file representing several groups. The rarefaction.shared command can only be executed after a successful read.shared command. It outputs a file for each estimator you choose to use. The rarefaction.shared command parameters are label, line, iters, jumble and sharedrarefaction. No parameters are required, but you may not use both the line and label parameters at the same time. The rarefaction.shared command should be in the following format: rarefaction.shared(label=yourLabel, line=yourLines, iters=yourIters, jumble=yourJumble, sharedrarefaction=yourEstimators). Examples: rarefaction.shared(label=unique-.01-.03, iters=10000, jumble=1, sharedrarefaction=sharedobserved) or rarefaction.shared(line=0,5,10, iters=10000, jumble=1, sharedrarefaction=sharedobserved). The default value for jumble is 0 (meaning don’t jumble; if it’s set to 1 then it will jumble), for iters is 1000, and for sharedrarefaction is sharedobserved, which calculates the shared rarefaction curve for the observed richness. The label and line parameters are used to analyze specific lines in your input.

The summary.shared() command:

The summary.shared command can only be executed after a successful read.shared command. It outputs a file for each estimator you choose to use. The summary.shared command parameters are label, line, jumble and sharedsummary. No parameters are required, but you may not use both the line and label parameters at the same time. The summary.shared command should be in the following format: summary.shared(label=yourLabel, line=yourLines, jumble=yourJumble, sharedsummary=yourEstimators). Examples: summary.shared(label=unique-.01-.03, jumble=1, sharedsummary=sharedChao-sharedAce-sharedJabund-sharedSorensonAbund-sharedJclass-sharedSorClass-sharedJest-sharedSorEst-SharedThetaYC-SharedThetaN) or summary.shared(line=0,5,10, jumble=1, sharedsummary=sharedChao-sharedAce-sharedJabund). The default value for jumble is 0 (meaning don’t jumble; if it’s set to 1 then it will jumble), and the default for sharedsummary is sharedChao-sharedAce-sharedJabund-sharedSorensonAbund-sharedJclass-sharedSorClass-sharedJest-sharedSorEst-SharedThetaYC-SharedThetaN. The label and line parameters are used to analyze specific lines in your input.


The shared() command:

The shared command can only be executed after a successful read.shared command. The shared command parses a .list file and separates it into groups. It outputs a .shared file containing the OTU information for each group. There are no shared command parameters. The shared command should be in the following format: shared().
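What separating a list row into groups means can be sketched in Python. The .shared file layout is not described in this section, so the sketch returns a plain dictionary rather than writing a file; the function name shared_table and the group names forest and pasture are hypothetical.

```python
def shared_table(list_row, group_of):
    """Count, for each group, how many of its sequences fall in each OTU."""
    label, n_otus, *otus = list_row.split()
    groups = sorted(set(group_of.values()))
    table = {g: [] for g in groups}
    for otu in otus:
        members = otu.split(",")
        for g in groups:
            table[g].append(sum(1 for m in members if group_of[m] == g))
    return label, table


# Hypothetical group assignments for the five example sequences.
group_of = {"A": "forest", "B": "pasture", "C": "forest",
            "D": "pasture", "E": "pasture"}
label, table = shared_table("0.303893 3 A,B C D,E", group_of)
print(label, table)
# 0.303893 {'forest': [1, 1, 0], 'pasture': [1, 0, 2]}
```

Here the OTU A,B is shared by both groups, while C is found only in forest and D,E only in pasture.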


The parselist() command:

The parselist command is similar to the shared command. It parses a list file and separates it into groups. It outputs a .list file for each group. The parselist command parameter options are listfile and groupfile. The parselist command should be in the following format: parselist(listfile=yourListFile, groupfile=yourGroupFile). The listfile and groupfile parameters are required.


The help() command:

The help command should be in the following format: help(). The help command has no required parameters. If you would like help with a specific command you may enter it as a parameter. Example help(read.phylip).


The quit() command:

The quit command terminates the mothur program. The quit command should be in the following format: quit().

Note: Do not put spaces between parameter labels (e.g. distfile), '=' and parameter values (e.g. yourDistFile).


For dotur and sons veterans:

>dotur amazon.dist

         mothur>read.phylip(distfile=amazon.dist)

>dotur -l amazon.dist

         mothur>read.phylip(distfile=amazon.dist)

>dotur -c f amazon

         mothur>cluster(method=furthest)

>dotur -c n amazon

         mothur>cluster(method=nearest)

>dotur -c a amazon

         mothur>cluster(method=average)

>dotur -r amazon.dist

         mothur>rarefaction(rarefaction=rarefaction) note: after you have read a listfile

>dotur -p 10 amazon.dist

         mothur>cluster(precision=10) note: after you have read a distfile

>dotur -stop 0.10 amazon.dist

         mothur>cluster(cutoff=0.10) note: after you have read a distfile

>dotur -i 10000 amazon.dist

         mothur>rarefaction(iters=10000) note: after you have read a listfile

>./sons -list 70.fn.list -names 70.stool_compare.names

         mothur>read.shared(listfile=70.fn.list, groupfile=70.stool_compare.names)

>./sons -list 70.fn.list -names 70.stool_compare.names -i 10000

         mothur>rarefaction.shared(iters=10000) note: after a read.shared command

>./sons -list 70.fn.list -names 70.stool_compare.names -jumble

         mothur>collect.shared(jumble=1) note: after a read.shared command

Finally, execution in Windows and Linux (and Mac OSX) is essentially the same. In Windows, you cannot merely double click on the icon to get the program to execute. You must use the “Command Prompt” program found by going Start -> Program Files -> Accessories -> Command Prompt. Then you must type in the path of MOTHUR and your distance file to execute the program:

C:\> "Documents and Settings\pds\Desktop\mothur.exe" "Documents and Settings\pds\Desktop\amazon.dist"

Alternatively, you can change the root path to move to the desired directory and execute MOTHUR from there:

C:\PATH\> mothur.exe amazon.dist

Be forewarned that MOTHUR does not seem to run as quickly in Windows as it does in Linux and I would encourage everyone to align their sequences in ARB, which uses Linux or OSX, and to run MOTHUR in the same operating system.