
Mothur manual

Revision as of 18:25, 12 January 2009 by Westcott (Talk | contribs)


MOTHUR Manual

Introduction

MOTHUR is a computer program that takes a distance matrix as its input file and assigns sequences to operational taxonomic units (OTUs) at every distance level that can be used to form OTUs, using the nearest, furthest, or average neighbor clustering algorithm. These are also called single-linkage, complete-linkage, and UPGMA, respectively. Once sequences are assigned to OTUs, the frequency data for each distance level are used to construct rarefaction and collector's curves for the number of species observed, Shannon's and Simpson's diversity indices, and the Chao1, ACE, Jackknife, and Bootstrap richness estimators as a function of sampling effort and the distance used to define an OTU. MOTHUR also uses non-parametric estimators to estimate similarity between communities based on membership and structure. It determines the number of individuals in each community that were sampled for each OTU. Next, it calculates collector's curves for the fraction of shared OTUs between the two communities (with and without correcting for unsampled individuals), the Jaccard and Sorenson indices, and the richness of OTUs shared between the two communities. Standard error values are calculated for the entire sequence collection. MOTHUR is freely available as C++ source code and as a Windows executable.

This manual is designed to achieve five goals:

  1.	Describe the difference between each of the three sequence assignment algorithms.
  2.	Show how to use MOTHUR.
  3.	Describe output files and equations used.
  4.	Validate output by making calculations by hand.
  5.	Answer frequently asked questions.

If you have any questions, complaints, or praise, please do not hesitate to contact Dr. Patrick D. Schloss at pschloss@microbio.umass.edu


Clustering Algorithms

Previous attempts at assigning sequences to OTUs have relied on gazing at a distance matrix and manually assigning sequences to OTUs, assigning OTUs based on BLAST results against the nt database, and using Forest Rohwer's program FASTGROUP. These methods have two main flaws. First, they are typically only used to obtain OTU data for one distance level. Second, they are somewhat arbitrary and prone to error (see our paper). For example, suppose you have three sequences, A, B, and C, where A is 2% different from both B and C, but B and C are 3% different from each other. How do you define an OTU? Hmmm....

MOTHUR is a rapid, consistent, and objective way of assigning sequences to OTUs for all possible distance levels. It uses one of three rules: nearest neighbor, average neighbor, or furthest neighbor. Also, if you are interested in studying OTUs at cutoffs of 3, 5, 10, and 20% difference, you will get all of these in one execution of MOTHUR (and everything in between as well!).

Here is how MOTHUR can assign sequences to OTUs...

Nearest neighbor: Each of the sequences within an OTU is at most X% distant from the most similar sequence in the OTU.

Furthest neighbor: All of the sequences within an OTU are at most X% distant from all of the other sequences within the OTU.

Average neighbor: This method is a middle ground between the other two algorithms.

A Cartoon Example

The default is the furthest neighbor algorithm, for reasons that will be explained below after considering a simplified example.

For the moment, let's assume that instead of being interested in bacterial 16S rRNA sequences, we are interested in finding ways to cluster major cities along the eastern seaboard of the United States. This analysis isn't meant to be precise, but to make several points about how we can assign sequences to OTUs. Perhaps we have a theory that all of the cities within a certain distance of each other share some cultural heritage. How will we cluster these? In the map to the right, there are 15 major cities and state capitals.

Insert picture here


By the nearest neighbor method, you start by picking a city and looking for any city that is within 200 miles of it. So let's pick Concord. Boston is within 200 miles of it, so Boston joins this group. Providence, in turn, is within 200 miles of its nearest neighbor in this group, so it joins as well. You can go on and on down the seaboard until you reach Atlanta, continually finding cities that are within 200 miles of the nearest neighbor in the group. Charleston, Frankfort, and Columbus would form their own OTU and Detroit and Lansing their own. But if your sampling intensity increased to include Toledo, OH and Blacksburg, VA, all of the US cities shown in the map would be in one group, hardly a regional grouping!

FASTGROUP would take the first city, Concord, determine which cities are within 200 miles of it, and lump all of those cities into the same group. Then it would take the next city not already in a group and search for all cities that are not already in an OTU but are within 200 miles of it. This could result in a decent assignment of cities to groups, or it might not. It all depends on which reference cities are picked.
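The dependence on the choice of reference can be sketched with the three-sequence example from the Clustering Algorithms section (A is 2% from B and C; B and C are 3% apart). This is only an illustration of the reference-based grouping described above, not FASTGROUP's actual code; the function name `greedy_groups` and the dictionary layout are assumptions made for this example.

```python
def greedy_groups(order, dist, cutoff):
    """Take each ungrouped item, in the given order, as a reference and
    lump every still-ungrouped item within `cutoff` of it into its group."""
    ungrouped = list(order)
    groups = []
    while ungrouped:
        ref = ungrouped.pop(0)
        group = [ref] + [x for x in ungrouped
                         if dist[frozenset((ref, x))] <= cutoff]
        ungrouped = [x for x in ungrouped if x not in group]
        groups.append(sorted(group))
    return groups


# Pairwise distances from the three-sequence example (as fractions).
dist = {
    frozenset(("A", "B")): 0.02,
    frozenset(("A", "C")): 0.02,
    frozenset(("B", "C")): 0.03,
}

# Same data, same cutoff, different reference order.
print(greedy_groups(["A", "B", "C"], dist, 0.02))  # [['A', 'B', 'C']]
print(greedy_groups(["B", "A", "C"], dist, 0.02))  # [['A', 'B'], ['C']]
```

Starting from A puts all three sequences in one group, while starting from B leaves C on its own, even though nothing about the data has changed.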

Insert picture here

If, on the other hand, we say that every city in a group is at most 200 miles from any other city in the group, then we are using the furthest neighbor approach. So we see that Albany, Hartford, Concord, Boston, and Providence are all within the same group, Harrisburg, New York City, Dover, and Philadelphia are in a separate group, and so on. MOTHUR starts by finding the minimum distance between any two sequences and putting those together in an OTU. As the distance is increased, the original sequences remain in the OTU, but new sequences must meet the requirement that they are within the given distance of all the other sequences in the OTU.

Insert picture here


From this cartoon example, it is hopefully clear why we prefer the furthest neighbor approach to the nearest neighbor approach. In addition, FASTGROUP is a quasi-nearest neighbor/furthest neighbor approach whose results depend on which sequence is picked as the reference sequence. The average neighbor approach would create groupings that are a compromise between the nearest and furthest neighbor approaches. We prefer the furthest neighbor to the average neighbor because it is the more conservative approach and is not sensitive to the number of sequences sampled, as the average neighbor would be.

DNA Distance Example

During the first step, MOTHUR will produce three files: *.list, *.rabund, and *.sabund. The *.list file tells the user which sequences are in each OTU and the *.rabund file tells the user how many sequences are in each OTU. The *.sabund file tells the user how many OTUs contained a certain number of sequences. The format of these files will be addressed later in this manual.

Consider the following input file constructed from DNADIST (obtained from the DNADIST documentation):

5
A  0.000000  0.303893  0.857546  1.158921  1.542897
B  0.303893  0.000000  0.339731  0.913519  0.619666
C  0.857546  0.339731  0.000000  1.631729  1.293707
D  1.158921  0.913519  1.631729  0.000000  0.165882
E  1.542897  0.619666  1.293707  0.165882  0.000000
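For readers who want to follow the calculations programmatically, the matrix above can be read token by token. The sketch below assumes the simple layout shown here (a sequence count followed by one name and that many distances per sequence); `read_square_matrix` is a hypothetical helper written for this example and is not part of MOTHUR or DNADIST.

```python
def read_square_matrix(text):
    """Parse a square distance matrix: the first token is the number of
    sequences, then each row is a name followed by that many distances."""
    tokens = text.split()
    n = int(tokens[0])
    names, rows = [], []
    pos = 1
    for _ in range(n):
        names.append(tokens[pos])
        rows.append([float(t) for t in tokens[pos + 1:pos + 1 + n]])
        pos += 1 + n
    return names, rows


matrix_text = """
5
A 0.000000 0.303893 0.857546 1.158921 1.542897
B 0.303893 0.000000 0.339731 0.913519 0.619666
C 0.857546 0.339731 0.000000 1.631729 1.293707
D 1.158921 0.913519 1.631729 0.000000 0.165882
E 1.542897 0.619666 1.293707 0.165882 0.000000
"""

names, dist = read_square_matrix(matrix_text)

# Smallest off-diagonal distance and the pair of sequences it joins.
smallest, first, second = min(
    (dist[i][j], names[i], names[j])
    for i in range(len(names))
    for j in range(i + 1, len(names))
)
print(smallest, first, second)  # 0.165882 D E
```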

First, we'll consider the nearest neighbor algorithm. The smallest non-diagonal distance is 0.165882, between D and E. These two sequences enter into the first OTU at a distance of 0.165882. Since we want to retain the minimum distance, we keep the smaller of the D and E distances to each remaining sequence and can rewrite the matrix as follows:

A    0.000000  0.303893  0.857546  1.158921
B    0.303893  0.000000  0.339731  0.619666
C    0.857546  0.339731  0.000000  1.293707
D,E  1.158921  0.619666  1.293707  0.000000

Now we have a doubleton containing D and E and three singletons. Incidentally, if we were interested in a definition where all sequences within an OTU were identical, then there would be five singleton OTUs.

Continuing on, we again look for the smallest distance value and find that between A and B the distance is 0.303893. We repeat the clustering step and find:

A,B  0.000000  0.339731  0.619666
C    0.339731  0.000000  1.293707
D,E  0.619666  1.293707  0.000000

So at a distance of 0.303893 there are two doubleton OTUs and a singleton OTU. The next smallest distance is 0.339731 between the OTU containing A and B and the singleton OTU containing C:

A,B,C  0.000000  0.619666
D,E    0.619666  0.000000

Now we have a tripleton and a doubleton OTU when each sequence is at most 0.339731 from its nearest neighbor in the OTU. Finally, there is one OTU when that nearest neighbor distance is allowed to reach 0.619666.

The *.nn.rabund file would look like this:

Unique    5  1  1  1  1  1
0.165882  4  2  1  1  1
0.303893  3  2  2  1
0.339731  2  3  2
0.619666  1  5

And the *.nn.list file would look like this:

Unique    5  A  B  C  D  E
0.165882  4  A  B  C  D,E
0.303893  3  A,B  C  D,E
0.339731  2  A,B,C  D,E
0.619666  1  A,B,C,D,E

Finally, the *.nn.sabund file would look like this:

Unique    5  5
0.165882  4  3  1
0.303893  3  1  2
0.339731  2  0  1  1
0.619666  1  0  0  0  0  1
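The relationship between the three files can be sketched by deriving a rabund line and a sabund line from a list line. This assumes the layout used in this section (distance label, number of OTUs, then the data); file layouts in other MOTHUR releases may differ, and `rabund_line` and `sabund_line` are hypothetical names chosen for this example.

```python
def otu_sizes(list_line):
    """Split a list line into its label, OTU count, and OTU sizes."""
    label, n_otus, *otus = list_line.split()
    return label, n_otus, [len(otu.split(",")) for otu in otus]


def rabund_line(list_line):
    """Number of sequences in each OTU, largest OTU first."""
    label, n_otus, sizes = otu_sizes(list_line)
    sizes = sorted(sizes, reverse=True)
    return " ".join([label, n_otus] + [str(s) for s in sizes])


def sabund_line(list_line):
    """Number of OTUs holding 1, 2, 3, ... sequences."""
    label, n_otus, sizes = otu_sizes(list_line)
    counts = [sizes.count(k) for k in range(1, max(sizes) + 1)]
    return " ".join([label, n_otus] + [str(c) for c in counts])


line = "0.339731 2 A,B,C D,E"
print(rabund_line(line))  # 0.339731 2 3 2
print(sabund_line(line))  # 0.339731 2 0 1 1
```

Reading the sabund output from left to right after the OTU count: no singletons, one doubleton, and one tripleton.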

If we repeat the analysis for the furthest neighbor, we again look for the smallest distance in the original distance matrix, but this time we retain the larger of the distances from the two sequences being joined to each remaining sequence. So for a distance of 0.165882 between D and E, the following matrix would result:

A    0.000000  0.303893  0.857546  1.542897
B    0.303893  0.000000  0.339731  0.913519
C    0.857546  0.339731  0.000000  1.631729
D,E  1.542897  0.913519  1.631729  0.000000

The next smallest distance is between A and B at 0.303893 leading to the following matrix:

A,B  0.000000  0.857546  1.542897
C    0.857546  0.000000  1.631729
D,E  1.542897  1.631729  0.000000

In the next step, the smallest distance value (0.857546) is between the OTU containing A and B and the singleton OTU with only C in it:

A,B,C  0.000000  1.631729
D,E    1.631729  0.000000

Finally the five sequences join the same OTU with a distance of 1.631729 between them.

The resulting *.fn.rabund, *.fn.list, and *.fn.sabund files would look like:

*.fn.rabund:

Unique    5  1  1  1  1  1
0.165882  4  2  1  1  1
0.303893  3  2  2  1
0.857546  2  3  2
1.631729  1  5

*.fn.list:

Unique    5  A  B  C  D  E
0.165882  4  A  B  C  D,E
0.303893  3  A,B  C  D,E
0.857546  2  A,B,C  D,E
1.631729  1  A,B,C,D,E

*.fn.sabund:

Unique    5  5
0.165882  4  3  1
0.303893  3  1  2
0.857546  2  0  1  1
1.631729  1  0  0  0  0  1

The final method is the average neighbor approach. Instead of picking the lowest or highest distance between the two sequences being joined, it averages the distances in the two columns and rows being combined. This is an unweighted average in the UPGMA sense: every sequence counts equally, so the number of sequences in each OTU is taken into account when making the average. Again using the smallest distance to start the process, the D and E distances are merged:

A    0.000000  0.303893  0.857546  1.350909
B    0.303893  0.000000  0.339731  0.766593
C    0.857546  0.339731  0.000000  1.462718
D,E  1.350909  0.766593  1.462718  0.000000

The next smallest distance is 0.303893 between A and B:

A,B  0.000000  0.598639  1.058751
C    0.598639  0.000000  1.462718
D,E  1.058751  1.462718  0.000000

The next smallest distance is 0.598639 between the doubleton OTU containing A and B and the singleton OTU containing C. But remember that since this is an unweighted approach, when we average the A,B and C columns and rows, we must multiply the A,B data by two and the C data by one:

A,B,C  0.000000  1.193407
D,E    1.193407  0.000000
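The weighting in this last merge can be checked by hand. The short sketch below uses only numbers from the matrices in this section; the variable names are chosen for illustration. It also compares the result with the plain mean of the six original distances between {A, B, C} and {D, E}, which is what weighting by OTU size is meant to reproduce.

```python
# Distances to the D,E OTU, taken from the previous average neighbor matrix.
d_ab_de = 1.058751  # OTU A,B holds two sequences
d_c_de = 1.462718   # OTU C holds one sequence

# Weight each distance by the number of sequences in its OTU.
size_weighted = (2 * d_ab_de + 1 * d_c_de) / 3
print(f"{size_weighted:.6f}")  # 1.193407

# Ignoring OTU sizes would give a different, larger value.
plain_mean = (d_ab_de + d_c_de) / 2
print(f"{plain_mean:.4f}")  # 1.2607

# Original distances A-D, A-E, B-D, B-E, C-D, C-E.
pairs = [1.158921, 1.542897, 0.913519, 0.619666, 1.631729, 1.293707]
pair_mean = sum(pairs) / len(pairs)

# The two agree up to the rounding of the intermediate matrices.
print(abs(size_weighted - pair_mean) < 1e-6)  # True
```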

Finally, all five sequences join the same OTU at a distance of 1.193407. We would obtain *.an.rabund, *.an.list, and *.an.sabund files containing data like this:

*.an.rabund:

Unique    5  1  1  1  1  1
0.165882  4  2  1  1  1
0.303893  3  2  2  1
0.598639  2  3  2
1.193407  1  5

*.an.list:

Unique    5  A  B  C  D  E
0.165882  4  A  B  C  D,E
0.303893  3  A,B  C  D,E
0.598639  2  A,B,C  D,E
1.193407  1  A,B,C,D,E

*.an.sabund:

Unique    5  5
0.165882  4  3  1
0.303893  3  1  2
0.598639  2  0  1  1
1.193407  1  0  0  0  0  1

The difference between the *.fn.rabund, *.an.rabund, and *.nn.rabund files in this test case is the distance at which each OTU is defined. In other applications, the clustering of sequences may differ among the three methods. But hopefully you can see that in the nearest neighbor method every sequence in the final OTU is at most 0.619666 from at least one other sequence in that OTU, that in the furthest neighbor method every sequence is at most 1.631729 from every other member of the OTU, and that the average neighbor method falls between the two. If you inspect the original distance matrix, this should make some sense.
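The three walkthroughs can be reproduced with a short script. This is a minimal sketch of the procedure as described in this section, not MOTHUR's implementation; the function name `cluster_levels` and its method labels are assumptions made for the example.

```python
def cluster_levels(names, dist, method):
    """Merge the closest pair of OTUs until one OTU remains.
    Returns a list of (merge distance, OTUs after the merge)."""
    otus = [[n] for n in names]
    d = [row[:] for row in dist]
    levels = []
    while len(otus) > 1:
        # Closest pair of OTUs: smallest off-diagonal entry.
        i, j = min(
            ((a, b) for a in range(len(otus)) for b in range(a + 1, len(otus))),
            key=lambda pair: d[pair[0]][pair[1]],
        )
        level = d[i][j]
        ni, nj = len(otus[i]), len(otus[j])
        keep = [k for k in range(len(otus)) if k not in (i, j)]
        # Distance from the merged OTU to every remaining OTU.
        merged = []
        for k in keep:
            if method == "nearest":
                merged.append(min(d[i][k], d[j][k]))
            elif method == "furthest":
                merged.append(max(d[i][k], d[j][k]))
            elif method == "average":
                merged.append((ni * d[i][k] + nj * d[j][k]) / (ni + nj))
            else:
                raise ValueError(f"unknown method: {method}")
        # Rebuild the OTU list and matrix with the merged OTU placed last.
        otus = [otus[k] for k in keep] + [sorted(otus[i] + otus[j])]
        d = [[d[a][b] for b in keep] + [merged[x]] for x, a in enumerate(keep)]
        d.append(merged + [0.0])
        levels.append((level, sorted(",".join(o) for o in otus)))
    return levels


names = ["A", "B", "C", "D", "E"]
dist = [
    [0.000000, 0.303893, 0.857546, 1.158921, 1.542897],
    [0.303893, 0.000000, 0.339731, 0.913519, 0.619666],
    [0.857546, 0.339731, 0.000000, 1.631729, 1.293707],
    [1.158921, 0.913519, 1.631729, 0.000000, 0.165882],
    [1.542897, 0.619666, 1.293707, 0.165882, 0.000000],
]

for method in ("nearest", "furthest", "average"):
    print(method)
    for level, otus in cluster_levels(names, dist, method):
        print(f"  {level:.4f}", " ".join(otus))
```

The nearest and furthest neighbor levels match the worked examples exactly. The average neighbor levels agree with the worked example to within about one unit in the sixth decimal place, because the script carries unrounded intermediate values while the matrices in the text are rounded at each step.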

You should also note that these examples are presented with 10^6 precision (six decimal places). If you can achieve this with 16S rRNA sequences (or any sequence!) let me know! One of the settings in MOTHUR is the precision parameter, which can be set to 10, 100, 1,000, or 10,000. As I alluded to, you are unlikely to really have precision up to 10,000 in the average gene sequence. Considering that the average sequencing project only obtains about 500 bp from a gene, a more realistic level of precision would be between 100 and 1,000. The default precision is 100, which means that OTUs are reported every 0.01 unless you change the precision parameter (see below). Because of MOTHUR's use of the precision parameter, a distance of 0.03 should not be interpreted as the same as 0.030 or 0.0300. The largest distance that could fall under 0.03 would be 0.0349. If you want the largest distance to be 0.03049, then set the precision to 1,000, and so on.
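The effect of the precision parameter on reported distances can be sketched as follows. The sketch assumes that distances are rounded to the nearest 1/precision, which is what the 0.0349 example above implies; it is not taken from MOTHUR's source, and `distance_label` is a hypothetical name.

```python
def distance_label(distance, precision=100):
    """Round a distance to the nearest 1/precision (halves round up)."""
    return int(distance * precision + 0.5) / precision


# Default precision of 100: labels are reported every 0.01.
print(distance_label(0.0349))  # 0.03
print(distance_label(0.0351))  # 0.04

# Precision of 1,000: labels are reported every 0.001.
print(distance_label(0.03049, 1000))  # 0.03
print(distance_label(0.0306, 1000))   # 0.031
```

Under this assumption, a label of 0.03 at the default precision covers a tenfold wider range of distances than a label of 0.030 at a precision of 1,000.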