We will be offering an R workshop December 18-20, 2019. Learn more.
look for << >> tags
This goal of this tutorial is to demonstrate the standard operating procedure (SOP) that the Schloss lab uses to process their 16S rRNA gene sequences that are generated by 454 pyrosequencing. Throughout the tutorial several alternative approaches will be mentioned along with why we might deviate from the SOP to use these approaches. This SOP is largely the product of a series of manuscripts that we have published and users are advised to consult these for more details and background data. The workflow is being divided into several parts shown here in the table of contents for the tutorial:
- 1 Logistics
- 2 Getting started
- 3 Reducing sequencing error
- 4 Processing improved sequences
- 5 Removing chimeras
- 6 Removing "contaminants"
- 7 Preparing inputs for analysis
- 8 Error analysis
- 9 Analysis
- 10 Phylotype-based analysis
- 11 Phylogeny-based analysis
- 12 Conclusion
The first step of the tutorial is to understand how we typically set up a plate of barcodes for sequencing. First, you can obtain a set of barcodes from multiple soureces. We utilize the "Broad" barcodes that were used for the HMP 16S rRNA gene sequencing effort. You can find these barcodes and primers to sequence the V13, V35, and V69 regions online. Next, it is worth noting that the sequencing is done in the "reverse direction" - e.g. starting at the 3' end of the V5 region, we sequence back towards the 5' end of the V3 region. We picked the V35 region because we liked these primers the best for the types of samples that we generally process (i.e. human and mouse). You may choose to use another region based on the expected biodiversity of your samples. Second, this set of barcodes will allow us to simultaneously sequence 96 samples. A priori, we dedicate two of these barcodes to controls. The first is for resequencing a mock community made up of a defined consortia of 16S rRNA gene sequences where we know the true sequence. This is helpful in assessing sequencing errors and drift in the process. The second is a "generous donor" sample, which is a representative sample (e.g. stool) that we resequence on each plate. This provides a more realistic understanding of how processing drift might be affecting our analysis. Finally, we generally utilize half of a 454 plate for each set of 96 barcodes. A typical plate has two halves (duh) and it is not advisable to put the same barcode-primer combination on these two halves if they are used for different samples as there is leakage across the gasket.
Starting out we need to first determine, what was our question? The Schloss lab is interested in understanding the effect of normal variation in the gut microbiome on host health. To that end we collected fresh feces from mice on a daily basis for 365 days post weaning (we're accepting applications). During the frist 150 days post weaning (dpw), nothing was done to our mice except allow them to eat, get fat, and be merry. We were curious whether the rapid change in weight observed during the first 10 dpw affected the stability microbiome compared to the microbiome observed between days 140 and 150. We will address this question in this tutorial using a combination of OTU, phylotype, and phylogenetic methods.
When we get our sequences back they arrive as an sff file. Often times people will only get back a fasta file or they may get back the fasta and qual files. You really need the sff file or at least the fasta, qual, and flow file data. If your sequence provider can't or won't provide these data find a new sequence provider. It is also critical to know the barcode and primer sequence used for each sample. We have heard reports from many users that their sequence provider has already trimmed and quality trimmed the data for them. Ahemm... don't trust them. If your sequence provider won't give you your primer or barcode sequences, move on and use someone else. For this tutorial you will need several sets of files. To speed up the tutorial we provide some of the downstream files that take awhile to generate (e.g. the output of shhh.seqs):
- Example data from Schloss lab
- SILVA-based bacterial reference alignment
- mothur-formatted version of the RDP training set
In addition, you probably want to get your hands on the following...
- mono - if you are using Mac OS X or linux
- TextWranger / emacs / vi / or some other text editor
- R, Excel, or another program to graph data
- Adobe Illustrator, Safari, or Inkscape
- TreeView or another program to visualize dendrograms
It is generally easiest to use the "current" option for many of the commands since the file names get very long. Because this tutorial is meant to show people how to use mothur at a very nuts and bolts level, we will only selectively use the current option to demonstrate how it works. Generally, we will use the full file names.
As mentioned above we get data in the form of an sff file and will use mothur's implementation of sffinfo to extract the fasta, qual, and flow data from the binary file:
mothur > sffinfo(sff=GQY1XT001.sff, flow=T)
Sometimes it is not possible to get the sff file because your sequence provider is sketchy or because you are pulling data down from some other scientist's study who didn't provide the data. In those cases you may need to skip that step. Regardless, let's see what our sequences look like that are in the fasta file usig the summary.seqs command:
mothur > summary.seqs(fasta=GQY1XT001.fasta) Start End NBases Ambigs Polymer Minimum: 1 39 39 0 2 2.5%-tile: 1 116 116 0 4 25%-tile: 1 454 454 0 4 Median: 1 512 512 0 5 75%-tile: 1 539 539 0 5 97.5%-tile: 1 558 558 1 6 Maximum: 1 1040 1040 28 31 # of Seqs: 763373
Reducing sequencing error
Our SOP will use the shhh.seqs command, which is the mothur implementation of the AmpliconNoise algorithm. Others that don't have the computational capacity or don't have their flow data may want to jump ahead to the trim.seqs section of the tutorial. The rest of the SOP will assume that you are using the SOP approach.
To run shhh.seqs, we need to process our flow data to do several things. First, we need to separate each flowgram according to the barcode and primer combination. We also need to make sure that our sequences are a minimum length and we need to cap the the number of flowgrams to that length. The default of trim.flows is 450 flows. If you are using GSFLX data you will probably want to play around with something in the order of 300-350 flows. We will also allow for 1 mismatch to the barcode and 2 mismatches to the primer. Running trim.flows will make use of the oligos file that is provided with the SOP Data that you pulled down from the wiki. You may want to change the number of processors to suit your hardware:
mothur > trim.flows(flow=GQY1XT001.flow, oligos=GQY1XT001.oligos, pdiffs=2, bdiffs=1, processors=8)
This will create 96 separate trim.flows and scrap.flows files as well as a file called GQY1XT001.flow.files, which holds the names of each of the trim.flows files. Next we will want to run shhh.seqs to de-noise our sequence data. This can be run several ways depending on your hardware or operating system. The first way is to use the precomputed output for this dataset that came with the files you downloaded. Clearly this won't work for your data :). The second option is to use the MPI-compiled version of mothur. To run this you will need to exit mothur and enter the following at the prompt:
mpirun -np 8 mothurMPI "#shhh.seqs(file=GQY1XT001.flow.files)"
Alternatively, we can stay within mothur and run the following:
mothur > shhh.seqs(file=GQY1XT001.flow.files, processors=8)
The first option should be a lot faster than the second because it is fully parallelized, but in reality it really doesn't seem to matter much. It took me about 8 hours (wall time) to run shhh.seqs on the 96 samples in GQY1XT001.flow.files.
The two important files from running shhh.seqs are the shhh.fasta and shhh.names. You will now feed these into trim.seqs to actually remove the barcode and primer sequences, make sure everything is 200 bp long, remove sequences with homoopolymers longer than 8 bp, and get the reverse complement for each sequence. This will also create a new names file, which maps the names of redundant sequences to a unique sequence and it will create a group file that indicates what group or sample each sequence came from:
mothur > trim.seqs(fasta=GQY1XT001.shhh.fasta, name=GQY1XT001.shhh.names, oligos=GQY1XT001.oligos, pdiffs=2, bdiffs=1, maxhomop=8, minlength=200, flip=T, processors=8)
Lets see what the new fasta data look like:
mothur > summary.seqs(fasta=GQY1XT001.shhh.trim.fasta, name=GQY1XT001.shhh.trim.names)
To demonstrate the current option, notice that you can also run this command as follows and get the same output:
mothur > summary.seqs() Start End NBases Ambigs Polymer Minimum: 1 247 247 0 3 2.5%-tile: 1 255 255 0 4 25%-tile: 1 261 261 0 4 Median: 1 266 266 0 4 75%-tile: 1 273 273 0 5 97.5%-tile: 1 292 292 0 6 Maximum: 1 326 326 0 8 # of unique seqs: 178478 total # of seqs: 563194
If you are ever curious what is "current" and what isn't, you can run:
mothur > get.current()
If for some reason you are unable to run shhh.seqs with your data, a good alternative is to use the trim.seqs command using a 50-bp sliding window and to trim the sequence when the average quality score over that window drops below 35. Our results suggest that the sequencing error rates by this method are very good, but not quite as good as by shhh.seqs and that the resulting sequences tend to be a bit shorter.
mothur > trim.seqs(fasta=GQY1XT001.fasta, oligos=GQY1XT001.oligos, qfile=GQY1XT001.qual, maxambig=0, maxhomop=8, flip=T, bdiffs=1, pdiffs=2, qwindowaverage=35, qwindowsize=50, processors=2)
If you are in a bind and do not have access to the quality score data the following will get you through...
mothur > trim.seqs(fasta=GQY1XT001.fasta, oligos=GQY1XT001.oligos, qfile=GQY1XT001.qual, maxambig=0, maxhomop=8, flip=T, bdiffs=1, pdiffs=2, keepfirst=225, processors=2) Basically what we've found is that the sequence quality really nose dives after 250 bp for Titanium data. Please send all complaints to 454, not me. Let's take a look at what the sequences look like from this approach: mothur > summary.seqs(fasta=GQY1XT001.trim.fasta)
Again, you will have fewer sequences and they will be shorter than those obtained by the shhh.seqs approach.
Processing improved sequences
Again, this SOP uses the output from the shhh.seqs approach. If you used the trim.seqs approach, you will need to adjust the file names appropriately. The next step is to simplify the dataset by working with only the unique sequences. We aren't chucking anything here, we're just making your CPU's life a bit easier. We will do this with the unique.seqs command. Because we already have a names file going, we will need to provide that as input with our fasta sequence:
mothur > unique.seqs(fasta=GQY1XT001.shhh.trim.fasta, name=GQY1XT001.shhh.trim.names)
Looking at the fasta data we see...
mothur > summary.seqs(fasta=GQY1XT001.shhh.trim.unique.fasta, name=GQY1XT001.shhh.trim.names) Start End NBases Ambigs Polymer Minimum: 1 247 247 0 3 2.5%-tile: 1 255 255 0 4 25%-tile: 1 261 261 0 4 Median: 1 266 266 0 4 75%-tile: 1 273 273 0 5 97.5%-tile: 1 292 292 0 6 Maximum: 1 326 326 0 8 # of unique seqs: 80534 total # of seqs: 563194
We have basically simplified the dataset by 7-fold. This means that a number of the downstream analyses will be 7 to 50 times faster than if we were working with the unique data. Next, we need to generate an alignment of our data using the align.seqs command by aligning our data to the SILVA-compatible alignment database reference alignment. I prefer the silva reference alignment, for the reasons I articulated in a recent PLoS Computational Biology paper that I published. For now, if your computer has less than 2 GB of RAM you should probably stick with the greengenes reference alignment and tell your PI to order you some more RAM.
mothur > align.seqs(fasta=GQY1XT001.shhh.trim.unique.fasta, reference=silva.bacteria.fasta, processors=8)
Alternatively, you can run it as...
mothur > align.seqs(reference=silva.bacteria.fasta, processors=8)
Taking a look at the newly aligned sequences we see:
mothur > summary.seqs(fasta=GQY1XT001.shhh.trim.unique.align, name=GQY1XT001.shhh.trim.names)
Start End NBases Ambigs Polymer Minimum: 1044 1051 4 0 2 2.5%-tile: 16116 27659 255 0 4 25%-tile: 16312 27659 261 0 4 Median: 16352 27659 266 0 4 75%-tile: 16422 27659 273 0 5 97.5%-tile: 21280 27659 292 0 6 Maximum: 43097 43116 314 0 8 # of unique seqs: 80534 total # of seqs: 563194
We can see that pretty much all of our sequences end at position 27659 of the alignment space (the full alignment is 500000 columns long). We also see that 97.5% of our sequences are at least 255 bp long. To make sure that all of the sequences overlap in the same alignment space we need to remove those sequences that are outside the desired range using the screen.seqs command. There are several ways to run this command that are all perfectly acceptable. The goal is to think about and then optimize the criteria so that you have as many long sequences as possible - these two factors are inversely related to each other:
Option 1 (SOP) This is the classical appraoch. What we noticed from running summary.seqs was that pretty much every sequence ended by position 27659. This makes sense since the we sequenced backwards from the 3' end of the V5 region. We would like to set a minimum length. Based on the results of summary.seqs 250 bp seems reasonable as a minimum length. To do this we run the following command:
mothur > screen.seqs(fasta=GQY1XT001.shhh.trim.unique.align, name=GQY1XT001.shhh.trim.names, group=GQY1XT001.shhh.groups, end=27659, minlength=250, processors=8)
Option 2 We can remove some of the guess work by having mothur fulfil and optimise some criteria. Again, let's set the end position at 27659 and have mothur predict the sequence length that will allow us to keep 85% of our sequences. We can do this as follows:
mothur > screen.seqs(fasta=GQY1XT001.shhh.trim.unique.align, name=GQY1XT001.shhh.trim.names, group=GQY1XT001.shhh.groups, end=27659, optimize=minlength, criteria=85, processors=8)
Option 3 One problem with optimizing the number of sequences based on the minimum length is that sequences (especially in the V1 region) vary in length when they cover the same region. So it makes more sense to optimize the number of sequences by the start position. We can do this as follows:
mothur > screen.seqs(fasta=trim.unique.align, name=trim.names, group=groups, end=27759, optimize=start, criteria=85, processors=2)
Play with this a bit and see what options you think work best for your dataset. Let's see what the output of Option 1 look like:
mothur > summary.seqs(fasta=GQY1XT001.shhh.trim.unique.good.align, name=GQY1XT001.shhh.trim.good.names)
Start End NBases Ambigs Polymer Minimum: 15690 27659 250 0 3 2.5%-tile: 16116 27659 255 0 4 25%-tile: 16312 27659 261 0 4 Median: 16352 27659 266 0 4 75%-tile: 16422 27659 273 0 5 97.5%-tile: 21280 27659 292 0 6 Maximum: 21295 28437 314 0 8 # of unique seqs: 80356 total # of seqs: 562985
We see that all of our sequences start at or after position 27659 and they are all at least 250 bp in length. The sequence that is 314 bp is suspicious, but oh well, let's move on. Next, we need to filter our alignment so that all of our sequences only overlap in the same region and to remove any columns in the alignment that don't contain data. We do this by running the filter.seqs command. Sometimes people run this and find that they don't have anything left; this is typically because of problems in the screen.seqs step above.
mothur > filter.seqs(fasta=GQY1XT001.shhh.trim.unique.good.align, vertical=T, trump=., processors=8)
In this command trump=. will remove any column that has a "." character, which indicates missing data. The vertical=T option will remove any column that contains exclusively gaps. We should see that our alignment is now 555 columns long. That's another pretty significant reduction in data that we have to process, plus it makes our analysis more robust. Because we forced everything to the same alignment space with no overhangs, we probalby have a number of sequences that are now redundant over this region. Let's run unique.seqs again to further simplify the dataset:
mothur > unique.seqs(fasta=GQY1XT001.shhh.trim.unique.good.filter.fasta, name=GQY1XT001.shhh.trim.good.names)
This will reduce the number of sequences from 80356 to 55601. The final step we can take to reduce our sequencing error is to use the pre.cluster command to merge sequence counts that are within 2 bp of a more abundant sequence. As a rule of thumb we use a difference of 1 bp per 100 bp of sequence length:
mothur > pre.cluster(fasta=GQY1XT001.shhh.trim.unique.good.filter.unique.fasta, name=GQY1XT001.shhh.trim.unique.good.filter.names, diffs=2) 55600 22984 32617 55601 22984 32617 Total number of sequences before precluster was 55601. pre.cluster removed 32617 sequences. It took 180 secs to cluster 55601 sequences.
"removed" is a bit strong since we haven't chucked the sequences, only reduced the number of unique sequences and increased the number of duplicate sequences. Now we're down to 22984 unique sequences. Let's see what they look like:
mothur > summary.seqs(fasta=GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.fasta, name=GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.names) Start End NBases Ambigs Polymer Minimum: 1 555 229 0 3 2.5%-tile: 1 555 246 0 4 25%-tile: 1 555 249 0 4 Median: 1 555 250 0 4 75%-tile: 1 555 250 0 4 97.5%-tile: 1 555 250 0 6 Maximum: 2 555 258 0 8 # of unique seqs: 22984 total # of seqs: 562985
To this point we have been concerned with removing sequencing errors. The next thing we need to do is to identify chimeras. Two methods have been developed to identify chimeras in pyrosequencing data. These have been incorporated in the mothur commands chimera.uchime and chimera.slayer. The syntax is mostly the same. Here is how we run it in the lab...
mothur > chimera.uchime(fasta=GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.fasta, name=GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.names) ... Processing sequence: 22800, 14877 chimeras found. Processing sequence: 22900, 14926 chimeras found. Processing sequence: 22984, 14990 chimeras found. It took 1682 secs to check 22984 sequences.
The upside of this approach is that you get to use your more abundant sequecnes as the reference database. The idea is that chimeras should be rarer than their more abundant parent seqeunces. Alternatively, you can use a reference database. This approach should have lower sensitivity and specificity, but be faster because it can be parallelized. Also, it is pretty limited to detecting bacterial chimeras. If you are going to use the database-based approaches, the developers at the Broad institute and of UChime suggest using the Gold sequence database. We provide a silva-compatible alignment of this sequence collection with the silva reference files...
mothur > chimera.uchime(fasta=GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.fasta, reference=silva.gold.ng.fasta, processors=8)
Having chosen the first method, we will now remove the 14990 unique sequences (!) that were identified as chimeric.
mothur > remove.seqs(accnos=GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.uchime.accnos, fasta=GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.fasta, name=GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.names, group=GQY1XT001.shhh.good.groups) Removed 74877 sequences.
This last line tells us that although there were 14990 unique sequences flagged as chimeras, they really represented close to 75000 sequences that are being removed from further analysis. Let's see what we've got left...
mothur > summary.seqs() Start End NBases Ambigs Polymer Minimum: 1 555 229 0 3 2.5%-tile: 1 555 246 0 4 25%-tile: 1 555 249 0 4 Median: 1 555 250 0 4 75%-tile: 1 555 250 0 4 97.5%-tile: 1 555 250 0 6 Maximum: 2 555 258 0 8 # of unique seqs: 7994 total # of seqs: 488108
There is one more step we can take to improve our data quality. Because chloroplasts and mitochindria are geenrally considered to be "former" bacteria and not an active component to a microbial community, the next thing we will do is to remove sequences affiliated with organelles from our dataset. To do this, we first need to classify our sequences using the mothur version of the "Bayesian" classifier. We can do this with the classify.seqs command using the RDP reference files you downloaded above:
mothur > classify.seqs(fasta=GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.pick.fasta, template=trainset6_032010.rdp.fasta, taxonomy=trainset6_032010.rdp.tax, processors=8)
Now let's use the remove.lineage command to remove those sequences that classified as "Chloroplasts" (Mitochondria are not in the RDP training set :( ):
mothur > remove.lineage(fasta=GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.pick.fasta, name=GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.pick.names, group=GQY1XT001.shhh.good.pick.groups, taxonomy=GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.pick.rdp.taxonomy, taxon=Cyanobacteria) mothur > summary.seqs(fasta=GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.pick.pick.fasta, name=GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.pick.pick.names) Start End NBases Ambigs Polymer Minimum: 1 555 229 0 3 2.5%-tile: 1 555 246 0 4 25%-tile: 1 555 249 0 4 Median: 1 555 250 0 4 75%-tile: 1 555 250 0 4 97.5%-tile: 1 555 250 0 6 Maximum: 2 555 258 0 8 # of unique seqs: 7991 total # of seqs: 488087
While not a huge reduction, we see that there were 21 Chloroplast sequences in the database. Later we will come back and assess how good our data actually look, but let's kick that down the road and get going with the analysis. If you have a good reason to think that some sequences are contaminants, you can use a similar approach to remove those.
Preparing inputs for analysis
We pride ourselves on generating files with the longest possible file names. I mean who doesn't love typing out GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.pick.pick.fasta final.fasta? Let's use the mothur system command to make copies of some files so that we don't develop carpel tunnel syndrom:
mothur > system(cp GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.pick.rdp.pick.taxonomy final.taxonomy) mothur > system(cp GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.pick.pick.names final.names) mothur > system(cp GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.pick.pick.fasta final.fasta) mothur > system(cp GQY1XT001.shhh.good.pick.pick.groups final.groups)
Note that "cp" is the copy command for mac and unix users. If your world is run by Bill Gates, you will want to use the "copy" command insted.
To get ready for doing the fun part of analysis, we need to remember that there are three general approaches to analyzing 16S rRNA gene sequences. You can analyze sequences binned into OTUs or phylotypes or you can use the phylogenetic tree-based approaches.
Option 1 To build OTUs within mothur there are several options. The SOP in the Schloss lab is to first generate a distance matrix using the dist.seqs command. We will use a cutoff of 0.25, which means we'll chuck any pairwise distance larger than 0.25:
mothur > dist.seqs(fasta=final.fasta, cutoff=0.25, processors=8)
If the output of this command is in the >4 GBs of memory range you have probably done something incorrectly above or chosen to ignore some of my advice... Next, we want to cluster these sequences into OTUs based on the newly created distance matrix and the names file we've been keeping track of along the way using the cluster command:
mothur > cluster(column=final.dist, name=final.names)
Typically we are interested in cutoffs of 0.03 for further analysis.
Option 2 An alternative option, which is generally an awful choice, is to use the hcluster command. By default hcluster and cluster use the average neighbor method. If for some reason you want to use the furthest or nearest neighbor approach, then hcluster may be a good alternative because it will have very lite memory requirements compared to the cluster option. For average neighbor clustering, however, it has pretty dreadful performance.
mothur > hcluster(column=final.dist, name=final.names, method=average)
The output from Option 2 should be the same as Option 1.
Option 3 Generally, I'll set up the cluster command and let it rip overnight and all will be well in the morning. If I'm in a pinch (e.g. I have to give a talk in an hour and still don't have my slides done), I'll use the cluster.split command. It is wicked fast and the only reason I don't put it in the SOP is because I think that sometimes things that are hard are better. Regardless, if you use a taxlevel of 2-3 (Phylum or Class) the output should be about the same and you won't owe Gaia any carbon credits:
mothur > cluster.split(fasta=final.fasta, taxonomy=final.taxonomy, name=final.names, taxlevel=3, processors=4)
Like I said above, the output from Option 3 should be the same as Option 1.
Now that we have a list file from one of these options, we would like to create a table that indicates the number of times an OTU shows up in each sample. This is called a shared file and can be created using the make.shared file:
mothur > make.shared(list=final.an.list, group=final.groups, label=0.03)
The final step to getting good OTU data is to normalize the number of sequences in each sample. We have observed that the number of spurious OTUs is correlated with the number of sequences. Because of this, we standardize the number of sequences across samples so that they all have the same opportunity to be "wrong". Most of these spurious OTUs are chimeras that we could not detect. To do this we will pick a minimum number of sequences that a sample has and randomly select this number of sequences from each sample. This is not the ideal solution because you could get lucky/unlucky in how you pick sequences (filed under things to worry about in the middle of the night if you've run out of other things to worry about):
mothur > sub.sample(shared=final.an.shared, size=3501)
This removes a few groups that that didn't have enough sequences. Get on the horn and yell at the sequencing center or read a book about the probability of getting 96 evenly represented samples with 5000 sequences each.
The last thing we'd like to do is to get the taxonomy information for each of our OTUs. To do this we will use the classify.otu command to give us the majority consensus taxonomy:
mothur > classify.otu(list=final.an.list, name=final.names, taxonomy=final.taxonomy, label=0.03)
With the completion of that last step, I generally have no reason to assign sequences to phylotypes. We do the following steps to make clinicians and people that think the names mean something happy. The phylotype command goes through the taxonomy file and bins sequences together that have the same taxonomy. Here we do it to the genus level:
mothur > phylotype(taxonomy=final.taxonomy, name=final.names, label=1)
Like above, we want to make a shared file and standardize the number of sequences in each group:
mothur > make.shared(list=final.tx.list, group=final.groups, label=0.03) mothur > sub.sample(shared=final.tx.shared, size=3501)
Finally, just to keep things consistent, we get the taxonomy of each phylotype:
mothur > classify.otu(list=final.tx.list, name=final.names, taxonomy=final.taxonomy, label=1)
There are many ways to construct phylogenetic trees. We have ported clearcut into mothur to generate neighbor joining trees. By default we do not use their heuristic, so these are real neighbor joining trees which you may or may not think are "real". First we need a phylip-formatted distance matrix, which we calculate with dist.seqs:
mothur > dist.seqs(fasta=final.fasta, output=lt, processors=8)
Now we call on clearcut:
mothur > clearcut(phylip=final.dist)
Before get rolling with the error analysis, let's take a gander at the quality of our data to this point. In the folder you pulled down was a called HMP_MOCK.v35.align which is a full alignment of the sequences that were put into the mock community and sequenced. Because our sequences have been filtered, we need to use the filter that was generated when we ran filter.seqs previously to filter this alignment. We do this as follows:
mothur > filter.seqs(fasta=HMP_MOCK.v35.align, hard=GQY1XT001.filter)
Next, we need to get the sequences that correspond to our mock community (the group name is MOCK.GQY1XT001):
mothur > get.groups(fasta=final.fasta, group=final.groups, groups=MOCK.GQY1XT001, name=final.names)
Now we can use the seq.error command to get the error rate for those sequences by comparing our sequences to the references:
mothur > seq.error(fasta=final.pick.fasta, name=final.pick.names, reference=HMP_MOCK.v35.filter.fasta, processors=8) Overall error rate: 0.000103249 Errors Sequences 0 3678 1 7 2 4 3 10 4 8 5 8 6 9 7 8 8 8 9 9 10 9 It took 1 secs to check 98 sequences.
That's right, we've reduced the sequencing error rate from ~0.8% to 0.01%. It should be added that this is probably a conservative estimate of error since some chimeras may have slipped through in calculating the error rate (chimeras are not sequencing errors).
Now let's see how many chimeras seq.error threw out as being chimeric. You might have to open final.pick.error.chimera in a spreadsheet...
mothur > system(grep -c "2$" final.pick.error.chimera) 59
That means that of the remaining 98 unique sequences, at least 59 of them are chimeric. Look for future versions of mothur to report the sensitivity, specificity, and false detection rate for chimera checking.
Next, we want to see how many OTUs and phylotypes there should be and how many we observed in the mock community data. To do this we need to create a file either using command line tools or a spreadsheet. What we want is a file called mock.accnos that contains the unique sequence names contained in column 2 of the file ending "error.summary". This is important because, just because you put a sequence in a tube doesn't mean that it got sequenced. Here's how I did this with shell commands...
cut -f 2 *error.summary | sort | uniq > mock.accnos nano mock.accnos #remove reference from the file
Returning to mothur, let's get these sequecnes from the filtered fasta sequence data and calculate OTUs:
mothur > get.seqs(accnos=mock.accnos, fasta=HMP_MOCK.v35.filter.fasta) mothur > dist.seqs(fasta=HMP_MOCK.v35.filter.pick.fasta, output=lt) mothur > cluster(phylip=HMP_MOCK.v35.filter.pick.phylip.dist)
Looking at the output files, we see that if there were no chimeras and no sequencing errors, there should only be 18 OTUs. If we look at the contents of final.an.subsample.shared and pull out the data for the MOCK.GQY1XT001 sample, there were actually 32 OTUs observed (there were 42 before we ran sub.sample). Considering the large number of remaining chimeras not detected by chimera.uchime this isn't so bad. Regardless, it still sucks.
mothur > classify.seqs(fasta=HMP_MOCK.v35.filter.pick.fasta, template=trainset6_032010.rdp.fasta, taxonomy=trainset6_032010.rdp.tax, processors=8) mothur > phylotype(taxonomy=HMP_MOCK.v35.filter.pick.rdp.taxonomy, label=1)
Here we see that there should have been 17 genera in a world without chimeras or sequencing errors. Yet if we look at the final.tx.subsample.shared file we see there were actually 20 genera in the real world (there were 24 before we used sub.sample).
So now we know how good (or bad) our data are. If you wanted you could start back at the step after the sffinfo command and start messing around and seeing how much this output changes. I've done this for 90 mock communities. What you've been doing is what I came up with. Clearly there is room for improvement in the area of chimera prevention and detection, but I'd say that we've got the actual sequencing error down pretty well. I'll add that the only way we can do this type of analysis is because we dedicated one barcode to a mock community. If we were sequencing on numerous plates we could see whether there was variation in the error rates or in the relative abundance of the OTUs/phylotypes. Finally, if your PI (or sequencing center) freaks out that you "threw away" a bunch of data, feel free to reanalyze these data however you (or he/she) see fit and show them how it changes the output.
Moving on, let's do something more interesting and actually analyze our data. Remember that our initial question had to do with the stability and change in community structure in these samples when comparing early and late samples. Keep in mind that the group names have either a F or M (sex of animal) followed by a number (number of animal) followed by a D and a three digit number (number of days post weaning).
Let's start our analysis by analyzing the alpha diversity of the samples. First we will generate collector's curve of the Chao1 richness estimators and the inverse Simpson diversity index. To do this we will use the collect.single command. Also, because there are 3500 sequences in this sample, let's only spit data out every 100 sequences:
mothur > collect.single(shared=final.an.subsample.shared, calc=chao-invsimpson, freq=100)
This command will generate file ending in *.chao and *.invsimpson for each sample, which can be plotted in your favorite graphing software package. When you look at these plots you will see that the Chao1 curves continue to climb with sampling; however, the inverse Simpson diversity indices are relatively stable. In otherwords, it isn't really possible to compare the richness of these communities based on the Chao1 index, but it is with the inverse Simpson index. As a quick aside, it is important to point out that Chao1 is really a measure of the minimum richneess in a community, not the full richness of the community. One method often used to get around this problem is to look at rarefaction curves describing the number of OTUs observed as a function of sampling effort. We'll do this with the rarefaction.single command:
mothur > rarefaction.single(freq=100)
This will generate files ending in *.rarefaction, which again can be plotted in your favorite graphing software package. Alas, rarefaction is not a measure of richness, but a measure of diversity. If you consider two communities with the same richness, but different evenness then after sampling a large number of individuals their rarefaction curves will asymptote to the same value. Since they have different evennesses the shapes of the curves will differ. Therefore, selecting a number of individuals to cutoff the rarefaction curve isn't allowing a researcher to compare samples based on richness, but their diversity. Another alternative method where the richness estimate is not sensitive to sampling effort is to use parametric estimators of richness using the catchall command. By increasing sampling effort, the confidence interval about the richness estimate will shrink.
mothur > catchall()
mothur > summary.single(calc=nseqs-coverage-sobs-invsimpson)
These data will be outputted to a table in a file called final.an.groups.summary. Interestingly, the sample coverages were all above 97%, indicating that we did a pretty good job of sampling the communities. Plotting the richness or diversity of the samples would show that there was little difference between the different animals or between the early and late time points. You could follow this up with a repeated-measures ANOVA and find that there was no significant difference based on sex or early vs. late.
Beta diversity measurements
Now we'd like to compare the membership and structure of the various samples using an OTU-based approach. Let's start by generating a heatmap of the relative abundance of each OTU across the 24 samples using the heatmap.bin command and log2 scaling the relative abundance values. Because there are so many OTUs, let's just look at the top 50 OTUs:
mothur > heatmap.bin(scale=log2, numotu=50)
This will generate an SVG-formatted file that can be visualized in Safari or manipulated in graphics software such as Adobe Illustrator. Needless to say these heatmaps can be a bit of Rorshock. A legend can be found at the bottom left corner of the heat map.
Now let's calculate the similarity of the membership and structure found in the various samples and visualize those similarities in a heatmap with the Jaccard and thetayc coefficients. We will do this with the heatmap.sim command:
mothur > heatmap.sim(calc=jclass-thetayc)
The output will be in two SVG-formatted files called final.an.0.03jclass.heatmap.sim.svg and final.an.0.10thetayc.heatmap.sim.svg. In all of these heatmaps the red colors indicate communities that are more similar than those with black colors.
When generating Venn diagrams we are limited by the number of samples that we can analyze simultaneously. Let's take a look at the Venn diagrams for the first 4 time points of female 3 using the venn command:
mothur > venn(groups=F003D000-F003D001-F003D002-F003D003)
This generates an two 4-way Venn diagrams. This shows that there were a total of 430 OTUs observed between the 4 time points. Only 54 of those OTUs were shared by all four time points. Roughly 30-35 of the sequences in any time point were unique to that time point. We could look deeper at the shared file to see whether those OTUs were numerically rare or just had a low incidence.
Let's use two non-parametric estimators to see what the predicted minimum number of overlapping OTUs is for the same samples using the summary.shared command:
mothur > summary.shared(calc=sharedchao, groups=F003D000-F003D001-F003D002-F003D003, all=T)
Opening the final.an.sharedmultiple.summary file we see a prediction that over these four time points the core microbiome contained at least 62 OTUs.
mothur > tree.shared(calc=thetayc-jclass)
This command generates two newick-formatted tree files - final.an.thetayc.0.03.tre and final.an.jclass.0.03.tre - that can be visualized in software like TreeView or FigTree. Inspection of the both trees shows that individuals' communities cluster with themselves to the exclusion of the others. We can test to deterine whether the clustering within the tree is statistically significant or not using by choosing from the parsimony, unifrac.unweighted, or unifrac.weighted commands. To run these we will first need to create a design file that indicates which treatment each sample belongs to. There is a file called mouse.sex_time.design in the folder you downloaded that looks vaguely like this:
F003D000 F003Early F003D001 F003Early F003D002 F003Early F003D003 F003Early F003D004 F003Early ... M002D144 M002Late M002D145 M002Late M002D146 M002Late M002D147 M002Late M002D148 M002Late M002D149 M002Late M002D150 M002Late
Using the parsimony command:
mothur > parsimony(tree=final.an.subsample.thetayc.0.03.tre, group=mouse.sex_time.design) Tree# Groups ParsScore ParsSig 1 F003Early-F003Late-F004Early-F004Late-M001Early-M001Late-M002Early-M002Late 35 <0.001
There's clearly a animal/time combination that is significantly different from the others. Let's look at the pairwise comparisons. Specifically, let's focus on the early vs. late comparisons for each mouse:
mothur > parsimony(tree=final.an.subsample.thetayc.0.03.tre, group=mouse.sex_time.design, groups=all) 1 F003Early-F003Late 1 <0.001 1 F004Early-F004Late 7 0.651 1 M001Early-M001Late 2 <0.001 1 M002Early-M002Late 3 <0.001
For three of the four mice there was a significant difference between the clustering of the early and late time points. Recall that this method ignores the branch length. Let's incorporate the branch length using the unifrac commands:
mothur > unifrac.weighted(tree=final.an.subsample.thetayc.0.03.tre, group=mouse.sex_time.design, random=T)
Tree# Groups WScore WSig 1 F003Early-F003Late 0.974851 <0.001 1 F004Early-F004Late 0.957132 <0.001 1 M001Early-M001Late 0.919389 <0.001 1 M002Early-M002Late 0.783338 <0.001
Clearly when we incorporate the branch length from the dendrogram and incorporate the weighting, the early and late time series are significantly different for each mouse.
mothur > unifrac.unweighted(tree=final.an.subsample.thetayc.0.03.tre, group=mouse.sex_time.design, random=T)
Tree# Groups UWScore UWSig 1 F003Early-F003Late-F004Early-F004Late-M001Early-M001Late-M002Early-M002Late 0.706791 <0.001
Again, one of these groups is significantly different from the others. Let's look closer:
mothur > unifrac.unweighted(tree=final.an.subsample.thetayc.0.03.tre, group=mouse.sex_time.design, random=T, groups=all) Tree# Groups UWScore UWSig 1 F003Early-F003Late 0.996003 0.208 1 F004Early-F004Late 0.980961 0.299 1 M001Early-M001Late 0.928243 0.144 1 M002Early-M002Late 0.897145 0.061
Here we see that if you do not weight the branch length, it is not possible to distinguish between the early and late time points for each animal.
Another popular way of visualizing beta-diversity information is through ordination plots. We can calculate distances between samples using the dist.shared command:
mothur > dist.shared(shared=final.an.subsample.shared, calc=thetayc-jclass)
The two resulting distance matrices (i.e. final.an.subsample.thetayc.0.03.lt.dist and final.an.subsample.jclass.0.03.lt.dist) can then be visualized using the pcoa or nmds plots. Principal Coordinates (PCoA) uses an eigenvector-based approach to represent multidimensional data in as few dimesnsions as possible. Our data is highly dimensional (~22 dimensions).
mothur > pcoa(phylip=final.an.subsample.thetayc.0.03.lt.dist) mothur > pcoa(phylip=final.an.subsample.jclass.0.03.lt.dist)
The output of these commands are three files ending in *dist, *pcoa, and *pcoa.loadings. The final.an.subsample.thetayc.0.03.lt.pcoa.loadings file will tell you what fraction of the total variance in the data are represented by each of the axes. In this case the first and second axis represent 15.4 and 9.9% of the variation (25.4% total) for the thetaYC distances. The output to the screen indicates that the R-squared between the original distance matrix and the distance between the points in 2D PCoA space was 0.37, but that if you add a third dimension the R-squared value increases to 0.72.
Alternatively, non-metric multidimensional scaling (NMDS) tries to preserve the distance between samples using a user defined number of dimensions. We can run our data through NMDS with 2 dimensions with the following commands
mothur > nmds(phylip=final.an.subsample.thetayc.0.03.lt.dist) mothur > nmds(phylip=final.an.subsample.jclass.0.03.lt.dist)
Opening the final.an.thetayc.0.03.lt.nmds.stress file we can inspect the stress and R2 values, which describe the quality of the ordination. Each line in this file represents a different iteration and the configuration obtained in the iteration with the lowest stress is reported in the final.an.thetayc.0.03.lt.nmds.axes file. In this file we find that the lowest stress value was 0.17 with an R-squared value of 0.88. You can test what hapens with three dimensions by the following:
mothur > nmds(phylip=final.an.subsample.thetayc.0.03.lt.dist, mindim=3, maxdim=3)
The stress value drops to 0.10 and the R2 value goes up to 0.95. Not bad. In general, you would like a stress value below 0.20. Thus, we can conclude that, NMDS is better than PCoA. We can plot the two dimensions of the NMDS data by plotting the contents of final.an.subsample.thetayc.0.03.lt.nmds.axes. Again, it is clear that individuals cluster separately from each other.
Ultimately, ordination is a data visualization tool. We might ask if the spatial separation that we see between the early and late plots in the NMDS plot is statistically significant. To do this we have two statistical tools at our disposal. The first analysis of molecular variance (amova), tests whether the centers of the clouds representing a group are more separated than the variation among samples of the same treatment. This is done using the distance matrices we created earlier and does not actually use ordination.
mothur > amova(phylip=final.an.subsample.thetayc.0.03.lt.dist, design=mouse.sex_time.design) F003Early-F003Late-F004Early-F004Late-M001Early-M001Late-M002Early-M002Late-not found Among Within Total SS 4.01071 4.52755 8.53826 df 8 80 88 MS 0.501338 0.0565944 Fs: 8.85844 p-value: <0.001* F003Early-F003Late Among Within Total SS 0.51298 0.669699 1.18268 df 1 18 19 MS 0.51298 0.0372055 Fs: 13.7877 p-value: <0.001* F004Early-F004Late Among Within Total SS 0.415592 0.832656 1.24825 df 1 15 16 MS 0.415592 0.0555104 Fs: 7.48675 p-value: <0.001* M001Early-M001Late Among Within Total SS 1.29726 0.88645 2.18371 df 1 17 18 MS 1.29726 0.0521441 Fs: 24.8784 p-value: <0.001* M002Early-M002Late Among Within Total SS 0.78028 0.866065 1.64634 df 1 18 19 MS 0.78028 0.0481147 Fs: 16.2171 p-value: <0.001* Experiment-wise error rate: 0.05 Pair-wise error rate (Bonferroni): 0.00138889
Here we see from the AMOVA that the "cloud" early and late time points has a significantly different centroid for each mouse. Thus, the observed separation in early and late samples is statistically significant. Another test we can perform is to determine whether the variation in the samples from early and late samples is different. This can be done using a distance-based version of Bartlett's test for homogeneity of variance (homova):
mothur > homova(phylip=final.an.subsample.thetayc.0.03.lt.dist, design=mouse.sex_time.design)
HOMOVA BValue P-value SSwithin/(Ni-1)_values F003Early-F003Late 3.27142 <0.001* 0.0582078 0.0162032 F004Early-F004Late 0.100047 0.593 0.0616507 0.0484929 M001Early-M001Late 0.137934 0.288 0.0594009 0.0456936 M002Early-M002Late 0.324434 0.272 0.0574117 0.0388177 Experiment-wise error rate: 0.05 Pair-wise error rate (Bonferroni): 0.00138889
Here, we see that for the thetaYC distances, there is considerable difference in the variation among the early and late samples only for the female mouse #3. Thus, for many of these samples, it appears that the variance in community structure is not significantly different; however, the variance does decrease in a non-significant manner for each mouse suggesting that there is a trend towards a stabilization of the community.
Next, we might ask which OTUs are responsible for shifting the samples along the two axes. We can determine this by measuring the correlation of the relative abundance of each OTU with the two axes in the NMDS dataset. We do this with the corr.axes command:
mothur > corr.axes(axes=final.an.subsample.thetayc.0.03.lt.nmds.axes, shared=final.an.subsample.shared, method=spearman, numaxes=2)
This command generates the final.an.subsample.spearman.corr.axes file. The data for the first five OTUs look like this...
OTU axis1 axis2 length 1 -0.283120 -0.123701 0.308964 2 0.207242 0.250094 0.324802 3 -0.273777 -0.291597 0.399979 4 -0.245211 0.109626 0.268600 5 0.492499 0.188386 0.527299
What these results show is that OTU1 is responsible for moving points in the negative direction along axes 1 and 2 whereas OTU2 moves it along the positive direction on both axes. OTU4 moves it backwards along axis 1 and in a postive direction for axis 2. Recalling that we classified each OTU earlier, we can open final.an.0.03.cons.taxonomy to see that OTU1 was an OTU corresponding to members of the Bulleidia and that OTU2 was an OTU corresponding to members of the Anaerotruncus. Neither of these are particularly strong movers so you might gaze down the list a bit to find correlation values that are a bit stronger. These data can be plotted in what's known as a biplot where lines radiating from the origin (axis1=0, axis2=0) to the correlation values with each axis are mapped on top of the PCoA or NMDS plots. Later, using the metastats command, we will see another method for describing which populations are responsible for differences seen between specific treatments.
An alternative approach to building a biplot would be to provide data indicating metadata about each sample. For example, we may know the weight, height, blood pressure, etc. of the subjects in these samples. For discussion purposes the file mouse.dpw.metadata is provided and looks something like this:
group age F003D000 0 F003D001 1 F003D002 2 F003D003 3 F003D004 4 ...
mothur > corr.axes(axes=final.an.subsample.thetayc.0.03.lt.nmds.axes, metadata=mouse.dpw.metadata, method=spearman, numaxes=2)
Opening the file mouse.dpw.spearman.corr.axes, we see:
Feature axis1 axis2 length dpw 0.664669 0.251276 0.710581
Indicating that as the dpw increases the communities shift to in the positive direction along both axes.
In addition to the use of corr.axes we can also use metastats to determine whether there are any OTUs that are differentially represented between the samples from men and women in this study. Run the following in mothur:
mothur > metastats(shared=final.an.shared, design=mouse.sex_time.design)
Looking at the screen output for the first ten OTUs from the F003Late and F003Early comparison, we see the following:
Comparing F003Late and F003Early... Feature 1 is significant, p = 0.0009990010 Feature 4 is significant, p = 0.0259740260 Feature 5 is significant, p = 0.0000000012 Feature 6 is significant, p = 0.0039960040 Feature 7 is significant, p = 0.0000000000
Looking in the final.an.0.03.F003Late-F003Early.metastats file for these OTUs we see the following...
OTU mean(group1) variance(group1) stderr(group1) mean_of_counts(group1) mean(group2) variance(group2) stderr(group2) mean_of_counts(group1) p-value 1 4.67 16.64 1.36 5.55 43.03 2530.76 15.16 49.90 0.00 2 0.47 0.83 0.30 0.55 1.36 16.48 1.22 1.72 0.86 3 3.51 9.24 1.01 4.33 5.78 16.16 1.21 6.81 0.14 4 2.46 4.27 0.68 3.00 6.56 33.67 1.74 8.27 0.02 5 3.21 4.56 0.71 4.00 0.21 0.24 0.14 0.27 0.00 6 2.36 3.81 0.65 3.00 0.20 0.45 0.20 0.27 0.00 7 0.74 2.99 0.57 0.88 5.08 73.25 2.58 6.36 0.00
These data tell us that OTUs 1, 4, and 7 were over represented in the F003Early (group2) and that OTUs 5 and 6 were over represented in F003Late (group1). You can look up these and other significant OTUs in the final.an.0.03.cons.taxonomy file that we created earlier.
Phylotype-based analysis is the same as OTU-based analysis, but at a different taxonomic scale. We will leave you on your own to replicate the OTU-based analyses described above with the contents of the final.tx.subsample.shared file.
OTU and phylotype-based analyses are taxonomic approaches that depend on a binning procedure. In contrast, phylogeny-based approaches attempt similar types of analyses using a phylogenetic tree as input instead of a shared file. Because of this difference these methods compare the genetic diversity of different communities.
When using phylogenetic methods, alpha diversity is calculated as the total of the unique branch length in the tree. This is done using the phylo.diversity command:
mothur > phylo.diversity(tree=final.phylip.tre, name=current, group=current)
This will generate a file called final.1.phylodiv.summary. Reading the last column in the file you can ascertain the amount of branch length that is contributed by each group. You can also rarefy the data, again using the phylo.diversity command:
mothur > phylo.diversity(rarefy=T)
We would then want to read across the line for 3501 to compare the phylogenetic diversity across these samples on the same basis that we did the OTU-based analysis.
The unifrac-based metrics are used to assess the similarity between two communities membership (unifrac.unweighted) and structure (unifrac.weighted). We will use these metrics and generate PCoA plots to compare our samples. There are two beta-diversity metrics that one can use - unweighted and weighted.
mothur > unifrac.unweighted(distance=lt, processors=2, random=F) mothur > unifrac.weighted(distance=lt, processors=2, random=F)
These commands will distance matrices (final.tre1.unweighted.phylip.dist and final.tre1.weighted.phylip.dist) that can be analyzed using all of the ordination approaches described above.
There are many other ways that one could analyze these data (hey, we've gotta save something for the paper!). I encourage you to go back and change the settings, use different calculators, come up with a hypothesis using the data and test it. If you think of an analysis that you wish mothur would do, please let us know and we'll see about adding it to the package. There is a certain "pipeline" aspect to this analysis; however, it is also very much an art of working with sequences. If you want to basically do everything that was described above, you can use the batch file and use mothur in the batch mode as follows:
$ ./mothur batch
The file "batch" would contain each line you used above on its own line in a text file