NOTE: Although this is an SOP, it is something of a work in progress and continues to be modified as we learn more. If you are using this protocol in a paper, you must cite the Schloss et al. 2011 PLoS ONE paper and cite the date you accessed this page:
Schloss PD, Gevers D, Westcott SL. (2011). Reducing the effects of PCR amplification and sequencing artifacts on 16S rRNA-based studies. PloS ONE. 6:e27310.
This goal of this tutorial is to demonstrate the standard operating procedure (SOP) that the Schloss lab uses to process their 16S rRNA gene sequences that are generated by 454 pyrosequencing. Throughout the tutorial several alternative approaches will be mentioned along with why we might deviate from the SOP to use these approaches. This SOP is largely the product of a series of manuscripts that we have published and users are advised to consult these for more details and background data. The workflow is being divided into several parts shown here in the table of contents for the tutorial:
- 1 Logistics
- 2 Getting started
- 3 Reducing sequencing error
- 4 Processing improved sequences
- 5 Removing chimeras
- 6 Removing "contaminants"
- 7 Error analysis
- 8 Preparing inputs for analysis
- 9 Analysis
- 10 Conclusion
The first step of the tutorial is to understand how we typically set up a plate of barcodes for sequencing. First, you can obtain a set of barcodes from multiple sources. We utilize the "Broad" barcodes that were used for the HMP 16S rRNA gene sequencing effort. You can find these barcodes and primers to sequence the V13, V35, and V69 regions online. Next, it is worth noting that the sequencing is done in the "reverse direction" - e.g. starting at the 3' end of the V5 region, we sequence back towards the 5' end of the V3 region. We picked the V35 region because we liked these primers the best for the types of samples that we generally process (i.e. human and mouse). You may choose to use another region based on the expected biodiversity of your samples. Second, this set of barcodes will allow us to simultaneously sequence 96 samples. A priori, we dedicate two of these barcodes to controls. The first is for resequencing a mock community made up of a defined consortia of 16S rRNA gene sequences where we know the true sequence. This is helpful in assessing sequencing errors and drift in the process. The second is a "generous donor" sample, which is a representative sample (e.g. stool) that we resequence on each plate. This provides a more realistic understanding of how processing drift might be affecting our analysis. Finally, we generally utilize half of a 454 plate for each set of 96 barcodes. A typical plate has two halves (duh) and it is not advisable to put the same barcode-primer combination on these two halves if they are used for different samples as there is leakage across the gasket.
Starting out we need to first determine, what was our question? The Schloss lab is interested in understanding the effect of normal variation in the gut microbiome on host health. To that end we collected fresh feces from mice on a daily basis for 365 days post weaning (we're accepting applications). During the first 150 days post weaning (dpw), nothing was done to our mice except allow them to eat, get fat, and be merry. We were curious whether the rapid change in weight observed during the first 10 dpw affected the stability microbiome compared to the microbiome observed between days 140 and 150. We will address this question in this tutorial using a combination of OTU, phylotype, and phylogenetic methods. To make this tutorial easier to execute, we are providing only part of the data - you are given the flow files for one animal at 10 time points (5 early and 5 late). In addition, to sequencing samples from mice fecal material, we resequenced a mock community composed of genomic DNA from 21 bacterial and archaeal strains. We will use the 10 fecal samples to look at how to analyze microbial communities and the mock community to measure the error rate and its effect on other analyses.
When we get our sequences back they arrive as an sff file. Often times people will only get back a fasta file or they may get back the fasta and qual files. You really need the sff file or at least the fasta, qual, and flow file data. If your sequence provider can't or won't provide these data find a new sequence provider. It is also critical to know the barcode and primer sequence used for each sample. We have heard reports from many users that their sequence provider has already trimmed and quality trimmed the data for them. Ahemm... don't trust them. If your sequence provider won't give you your primer or barcode sequences, move on and use someone else. For this tutorial you will need several sets of files. To speed up the tutorial we provide some of the downstream files that take awhile to generate (e.g. the output of shhh.flows):
- Example data from Schloss lab that will be used with this tutorial. It was extracted from the very large full SFF file
- Lookup data for use with shhh.flows
- SILVA-based bacterial reference alignment
- mothur-formatted version of the RDP training set (v.9)
It is generally easiest to decompress these files and to then move the contents of the Trainset9_032012.pds and the silva.bacteria folders into the Schloss_SOP folder. You will also want to move the contents of the mothur executable folder there as well. If you are a sysadmin wiz (or novice) you can probably figure out how to put mothur in your path, but this will get you what you need for now.
In addition, you probably want to get your hands on the following...
- mono - if you are using Mac OS X or linux
- TextWranger / emacs / vi / or some other text editor
- R, Excel, or another program to graph data
- Adobe Illustrator, Safari, or Inkscape
- TreeView, FigTree, Topiary Explorer or another program to visualize dendrograms
It is generally easiest to use the "current" option for many of the commands since the file names get very long. Because this tutorial is meant to show people how to use mothur at a very nuts and bolts level, we will only selectively use the current option to demonstrate how it works. Generally, we will use the full file names.
Because of the large size of the original sff file (~800 MB) and that the data are not published, the folder you pulled down above does not contain the sff file. Instead we are giving you 21 of the 96 flow files that were outputted from the first several commands here. This tutorial will tell you when to pick up with the data in the folder...
As mentioned above we get data in the form of an sff file and will use mothur's implementation of sffinfo to extract the fasta, qual, and flow data from the binary file:
mothur > sffinfo(sff=GQY1XT001.sff, flow=T)
Sometimes it is not possible to get the sff file because your sequence provider is sketchy or because you are pulling data down from some other scientist's study who didn't provide the data. In those cases you may need to skip that step. Regardless, let's see what our sequences look like that are in the fasta file usig the summary.seqs command:
mothur > summary.seqs(fasta=GQY1XT001.fasta) Start End NBases Ambigs Polymer Minimum: 1 39 39 0 2 2.5%-tile: 1 116 116 0 4 25%-tile: 1 454 454 0 4 Median: 1 512 512 0 5 75%-tile: 1 539 539 0 5 97.5%-tile: 1 558 558 1 6 Maximum: 1 1040 1040 28 31 # of Seqs: 763373
This is the sequence profile for the entire half plate. In the flow files we have provided to you, there are a total of ~69000 sequences.
Reducing sequencing error
Our SOP will use the shhh.flows command, which is the mothur implementation of the PyroNoise component of the AmpliconNoise suite of programs. Others that don't have the computational capacity or don't have their flow data may want to jump ahead to the trim.seqs section of the tutorial. The rest of the SOP will assume that you are using the SOP approach.
To run shhh.flows, we need to process our flow data to do several things. First, we need to separate each flowgram according to the barcode and primer combination. We also need to make sure that our sequences are a minimum length and we need to cap the the number of flowgrams to that length. The default of trim.flows is 450 flows. If you are using GSFLX data you will probably want to play around with something in the order of 300-350 flows. We will also allow for 1 mismatch to the barcode and 2 mismatches to the primer. Running trim.flows will make use of the Oligos_File that is provided with the SOP Data that you pulled down from the wiki. You may want to change the number of processors to suit your hardware:
mothur > trim.flows(flow=GQY1XT001.flow, oligos=GQY1XT001.oligos, pdiffs=2, bdiffs=1, processors=2)
This will create 96 separate trim.flows and scrap.flows files as well as a file called GQY1XT001.flow.files, which holds the names of each of the trim.flows files. The folder you have pulled down has 21 of these flow files along with a file called GQY1XT001.flow.files. Next we will want to run shhh.flows to de-noise our sequence data. If you don't want take the time to run shhh.flows, you can use the processed files that are in the folder you pulled down above. shhh.flows can be run several ways depending on your hardware or operating system. The first way is to use the precomputed output for this dataset that came with the files you downloaded. Clearly this won't work for your own data :). Alternatively, we can stay within mothur and run the following:
mothur > shhh.flows(file=GQY1XT001.flow.files, processors=2)
It took me about 8 hours (wall time) to run shhh.flows on the 96 samples in GQY1XT001.flow.files with 8 processors. Note that if you are using data recently generated on the new 454 platform and chemistry there is a different flow order. If this is your case, you'll want to include order=B in trim.flows and shhh.flows. If you're using IonTorrent (why!?), you'll want to use order=I.
The two important files from running shhh.flows are the shhh.fasta and shhh.names. You will now feed these into trim.seqs to actually remove the barcode and primer sequences, make sure everything is 200 bp long, remove sequences with homoopolymers longer than 8 bp, and get the reverse complement for each sequence. This will also create a new names file, which maps the names of redundant sequences to a unique sequence and it will create a group file that indicates what group or sample each sequence came from:
mothur > trim.seqs(fasta=GQY1XT001.shhh.fasta, name=GQY1XT001.shhh.names, oligos=GQY1XT001.oligos, pdiffs=2, bdiffs=1, maxhomop=8, minlength=200, flip=T, processors=2)
Lets see what the new fasta data look like:
mothur > summary.seqs(fasta=GQY1XT001.shhh.trim.fasta, name=GQY1XT001.shhh.trim.names)
To demonstrate the current option, notice that you can also run this command as follows and get the same output:
mothur > summary.seqs() Start End NBases Ambigs Polymer NumSeqs Minimum: 1 247 247 0 3 1 2.5%-tile: 1 255 255 0 4 1725 25%-tile: 1 261 261 0 4 17244 Median: 1 266 266 0 4 34488 75%-tile: 1 274 274 0 5 51732 97.5%-tile: 1 292 292 0 6 67251 Maximum: 1 312 312 0 8 68975 Mean: 1 268.629 268.629 0 4.36483 # of unique seqs: 21886 total # of seqs: 68975
If you are ever curious what is "current" and what isn't, you can run the get.current command:
mothur > get.current()
Using quality scores
If for some reason you are unable to run shhh.flows with your data, a good alternative is to use the trim.seqs command using a 50-bp sliding window and to trim the sequence when the average quality score over that window drops below 35. Our results suggest that the sequencing error rates by this method are very good, but not quite as good as by shhh.flows and that the resulting sequences tend to be a bit shorter.
mothur > trim.seqs(fasta=GQY1XT001.fasta, oligos=GQY1XT001.oligos, qfile=GQY1XT001.qual, maxambig=0, maxhomop=8, flip=T, bdiffs=1, pdiffs=2, qwindowaverage=35, qwindowsize=50, processors=2)
If you are in a bind and do not have access to the quality score data the following will get you through...
mothur > trim.seqs(fasta=GQY1XT001.fasta, oligos=GQY1XT001.oligos, maxambig=0, maxhomop=8, flip=T, bdiffs=1, pdiffs=2, keepfirst=200, processors=2)
Basically what we've found is that the sequence quality really nose dives after 250 bp for Titanium data. Please send all complaints to 454, not me. Let's take a look at what the sequences look like from this approach:
mothur > summary.seqs(fasta=GQY1XT001.trim.fasta)
Again, you will have fewer sequences and they will be shorter than those obtained by the shhh.flows approach.
Processing improved sequences
Again, this SOP uses the output from the shhh.flows approach. If you used the trim.seqs approach, you will need to adjust the file names appropriately. The next step is to simplify the dataset by working with only the unique sequences. We aren't chucking anything here, we're just making the life of your CPU and RAM a bit easier. We will do this with the unique.seqs command. Because we already have a names file going, we will need to provide that as input with our fasta sequence:
mothur > unique.seqs(fasta=GQY1XT001.shhh.trim.fasta, name=GQY1XT001.shhh.trim.names)
Looking at the fasta data we see...
mothur > summary.seqs(fasta=GQY1XT001.shhh.trim.unique.fasta, name=GQY1XT001.shhh.trim.unique.names) Start End NBases Ambigs Polymer NumSeqs Minimum: 1 247 247 0 3 1 2.5%-tile: 1 255 255 0 4 1725 25%-tile: 1 261 261 0 4 17244 Median: 1 266 266 0 4 34488 75%-tile: 1 274 274 0 5 51732 97.5%-tile: 1 292 292 0 6 67251 Maximum: 1 312 312 0 8 68975 Mean: 1 268.629 268.629 0 4.36483 # of unique seqs: 14255 total # of seqs: 68975
We have basically simplified the dataset by about 5-fold. This means that a number of the downstream analyses will be 5 to 25 times faster than if we were working with the unique data. Next, we need to generate an alignment of our data using the align.seqs command by aligning our data to the SILVA-compatible alignment database reference alignment. I prefer the silva reference alignment, for the reasons I articulated in a recent PLoS Computational Biology paper that I published. For now, if your computer has less than 2 GB of RAM you should probably stick with the greengenes reference alignment and tell your PI to order you some more RAM.
mothur > align.seqs(fasta=GQY1XT001.shhh.trim.unique.fasta, reference=silva.bacteria.fasta, processors=2)
Alternatively, you can run it as...
mothur > align.seqs(reference=silva.bacteria.fasta, processors=2)
I takes about 150 seconds to run on my MacBook Pro. Taking a look at the newly aligned sequences we see:
mothur > summary.seqs(fasta=GQY1XT001.shhh.trim.unique.align, name=GQY1XT001.shhh.trim.unique.names) Start End NBases Ambigs Polymer NumSeqs Minimum: 15695 27656 247 0 3 1 2.5%-tile: 16116 27659 255 0 4 1725 25%-tile: 16312 27659 261 0 4 17244 Median: 16352 27659 266 0 4 34488 75%-tile: 16421 27659 274 0 5 51732 97.5%-tile: 21280 27659 292 0 6 67251 Maximum: 21304 28432 312 0 8 68975 Mean: 17044.5 27660.5 268.629 0 4.36483 # of unique seqs: 14255 total # of seqs: 68975
We can see that pretty much all of our sequences end at position 27659 of the alignment space (the full alignment is 50,000 columns long). We also see that 97.5% of our sequences are at least 255 bp long.
To make sure that all of the sequences overlap in the same alignment space we need to remove those sequences that are outside the desired range using the screen.seqs command. There are several ways to run this command that are perfectly acceptable. The goal is to think about and then optimize the criteria so that you have as many long sequences as possible - these two factors are inversely related to each other. Our preferred approach is to set the start and end positions. These parameters indicate the position by which all sequences must start by and end after. Setting these will allow you to dictate the alignment coordinates of your sequences so that they all overlap. We prefer this to setting a minimum length because sequences (especially in the V1 region) vary in length when they cover the same region. We can do this as follows:
mothur > screen.seqs(fasta=GQY1XT001.shhh.trim.unique.align, name=GQY1XT001.shhh.trim.unique.names, group=GQY1XT001.shhh.groups, end=27659, optimize=start, criteria=95, processors=2)
What this is doing is to require all sequences to end after position 27659 and it will select a start position such that 95% of your sequences start before that position. You can also "hard code" the start value or alter the criteria value to have more short sequences or fewer long sequences. Play with this a bit and see what options you think work best for your dataset.
mothur > summary.seqs(fasta=GQY1XT001.shhh.trim.unique.good.align, name=GQY1XT001.shhh.trim.unique.good.names) Start End NBases Ambigs Polymer NumSeqs Minimum: 15695 27659 249 0 3 1 2.5%-tile: 16116 27659 255 0 4 1694 25%-tile: 16312 27659 262 0 4 16937 Median: 16352 27659 267 0 4 33874 75%-tile: 16421 27659 274 0 5 50810 97.5%-tile: 21280 27659 292 0 6 66053 Maximum: 21280 28432 312 0 8 67746 Mean: 16968.9 27660.6 268.91 0 4.37131 # of unique seqs: 14107 total # of seqs: 67746
We see that all of our sequences start by position 21280 and end after position 27659. Also, all of the sequences are at least 249 bp in length. The sequence that is 312 bp is suspicious, but oh well, let's move on. Next, we need to filter our alignment so that all of our sequences only overlap in the same region and to remove any columns in the alignment that don't contain data. We do this by running the filter.seqs command. Sometimes people run this and find that they don't have anything left; this is typically because of problems in the screen.seqs step above.
mothur > filter.seqs(fasta=GQY1XT001.shhh.trim.unique.good.align, vertical=T, trump=., processors=2)
In this command trump=. will remove any column that has a "." character, which indicates missing data. The vertical=T option will remove any column that contains exclusively gaps. We should see that our alignment is now 449 columns long. That's another pretty significant reduction in data that we have to process, plus it makes our analysis more robust. Because we forced everything to the same alignment space with no overhangs, we probalby have a number of sequences that are now redundant over this region. Let's run unique.seqs again to further simplify the dataset:
mothur > unique.seqs(fasta=GQY1XT001.shhh.trim.unique.good.filter.fasta, name=GQY1XT001.shhh.trim.unique.good.names)
This will reduce the number of sequences from 14107 to 10467. The final step we can take to reduce our sequencing error is to use the pre.cluster command to merge sequence counts that are within 2 bp of a more abundant sequence. As a rule of thumb we use a difference of 1 bp per 100 bp of sequence length:
mothur > pre.cluster(fasta=GQY1XT001.shhh.trim.unique.good.filter.unique.fasta, name=GQY1XT001.shhh.trim.unique.good.filter.names, group=GQY1XT001.shhh.good.groups, diffs=2)
This implementation of the command will split the sequences by group and then within each group it will pre-cluster those sequences that are with 1 or 2 bases of a more abundant sequence. It will then merge all the sequences back into one file that is outputted as GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.fasta and a names file that is GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.names. Let's see what we have left
mothur > summary.seqs(fasta=GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.fasta, name=GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.names) Start End NBases Ambigs Polymer NumSeqs Minimum: 1 449 244 0 3 1 2.5%-tile: 1 449 251 0 4 1694 25%-tile: 1 449 254 0 4 16937 Median: 1 449 255 0 4 33874 75%-tile: 1 449 255 0 4 50810 97.5%-tile: 1 449 255 0 6 66053 Maximum: 2 449 261 0 8 67746 Mean: 1.00006 449 254.103 0 4.3462 # of unique seqs: 6831 total # of seqs: 67746
We still have the same total number of sequences, but we now have 6,831 unique sequences.
To this point we have been concerned with removing sequencing errors. The next thing we need to do is to identify chimeras. mothur has a number of methods for tools for detecting chimeras. Our preferred method is to use chimera.uchime using the sequences as their own reference:
mothur > chimera.uchime(fasta=GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.fasta, name=GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.names, group=GQY1XT001.shhh.good.groups, processors=2) ... It took 38 secs to check 8944 sequences. 4414 chimeras were found.
The above method will first split the sequences by group and then check each sequence within a group using the more abundant sequences as reference sequences. This enables us to parallelize the entire process by putting each group onto a separate processor. The upside of this approach is that you get to use your more abundant sequences as the reference database. The idea is that chimeras should be rarer than their more abundant parent sequences. By default, if chimera.uchime calls a sequence as chimeric in one group, it considers it a chimera in all samples and will flag all for removal.
mothur > remove.seqs(accnos=GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.denovo.uchime.accnos, fasta=GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.fasta, name=GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.names, group=GQY1XT001.shhh.good.groups, dups=T) Removed 10631 sequences from your name file. Removed 4414 sequences from your fasta file. Removed 10631 sequences from your group file.
These lines tell us that although there were 4414 unique sequences flagged as chimeras, they really represented close to 10,631 sequences that were all removed from further analysis. Let's see what we've got left...
mothur > summary.seqs(name=current) Start End NBases Ambigs Polymer NumSeqs Minimum: 1 449 245 0 3 1 2.5%-tile: 1 449 251 0 4 1428 25%-tile: 1 449 253 0 4 14279 Median: 1 449 255 0 4 28558 75%-tile: 1 449 255 0 4 42837 97.5%-tile: 1 449 255 0 6 55688 Maximum: 2 449 261 0 8 57115 Mean: 1.00002 449 254.069 0 4.33676 # of unique seqs: 2417 total # of seqs: 57115
There is one more step we can take to improve our data quality. Because chloroplasts and mitochondria are generally considered to be "former" bacteria and not an active component to a microbial community, the next thing we will do is to remove sequences affiliated with organelles from our dataset. To do this, we first need to classify our sequences using the mothur version of the "Bayesian" classifier. We can do this with the classify.seqs command using the RDP reference files you downloaded above:
mothur > classify.seqs(fasta=GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.pick.fasta, name=GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.pick.names, group=GQY1XT001.shhh.good.pick.groups, template=trainset9_032012.pds.fasta, taxonomy=trainset9_032012.pds.tax, cutoff=80, processors=2)
Now let's use the remove.lineage command to remove those sequences that classified as "Chloroplast", "Mitochondria", or "unknown" (those sequences that could not be classified at the Kingdom level) as well as archaeal and eukaryotic 16S/18S rRNAs, since our primers are not designed to amplify these populations:
mothur > remove.lineage(fasta=GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.pick.fasta, name=GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.pick.names, group=GQY1XT001.shhh.good.pick.groups, taxonomy=GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.pick.pds.wang.taxonomy, taxon=Mitochondria-Chloroplast-Archaea-Eukaryota-unknown) mothur > summary.seqs(fasta=GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.pick.pick.fasta, name=GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.pick.pick.names) Start End NBases Ambigs Polymer NumSeqs Minimum: 1 449 245 0 3 1 2.5%-tile: 1 449 251 0 4 1428 25%-tile: 1 449 253 0 4 14279 Median: 1 449 255 0 4 28557 75%-tile: 1 449 255 0 4 42835 97.5%-tile: 1 449 255 0 6 55686 Maximum: 2 449 261 0 8 57113 Mean: 1.00002 449 254.069 0 4.33675 # of unique seqs: 2415 total # of seqs: 57113
While not a huge reduction, we see that there was 1 Chloroplast and 1 Mitochondrial sequence in the dataset. Later we will come back and assess how good our data actually look, but let's kick that down the road and get going with the analysis. If you have a good reason to think that some sequences are contaminants, you can use a similar approach to remove those.
Before we get rolling with the actual analysis of the data, let's take a gander at the quality of our data to this point. In the folder you pulled down was a called HMP_MOCK.v35.align which is a full alignment of the sequences that were put into the mock community and sequenced. Because our sequences have been filtered, we need to use the filter that was generated when we ran filter.seqs previously to filter this alignment. We do this as follows:
mothur > filter.seqs(fasta=HMP_MOCK.v35.align, hard=GQY1XT001.filter)
Next, we need to get the sequences that correspond to our mock community (the group name is MOCK.GQY1XT001):
mothur > get.groups(fasta=GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.pick.pick.fasta, group=GQY1XT001.shhh.good.pick.pick.groups, name=GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.pick.pick.names, groups=MOCK.GQY1XT001)
Now we can use the seq.error command to get the error rate for those sequences by comparing our sequences to the references:
mothur > seq.error(fasta=GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.pick.pick.pick.fasta, name=GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.pick.pick.pick.names, reference=HMP_MOCK.v35.filter.fasta, processors=2) Overall error rate: 6.5627e-05 Errors Sequences 0 5912 1 7 2 3 3 12 4 5 5 3 6 1 7 0 8 0 9 1 It took 1 secs to check 96 sequences.
That's right, we've reduced the sequencing error rate from ~0.8% to 0.007%. It should be added that this is probably a conservative estimate of error since some chimeras may have slipped through in calculating the error rate (chimeras are not sequencing errors).
Now let's see how many chimeras seq.error threw out as being chimeric. You might have to open final.pick.error.chimera in a spreadsheet...
mothur > system(grep -c "2$" GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.pick.pick.pick.error.chimera) 53
That means that of the remaining 96 unique sequences, at least 56 of them are true chimeras that we were not able to detect. Look for future versions of mothur to report the sensitivity, specificity, and false detection rate for chimera checking.
Next, we want to see how many OTUs and phylotypes there should be and how many we observed in the mock community data. To do this we need to create a file either using command line tools or a spreadsheet. What we want is a file called mock.accnos that contains the unique sequence names contained in column 2 of the file ending "error.summary". This is important because, just because you put a sequence in a tube doesn't mean that it got sequenced. Here's how I did this with shell commands...
system(cut -f 2 *error.summary | sort | uniq > mock.accnos)
Open the mock.accnos file in a text editor and remove the bottom line - "reference" - from the file. Save and close. Returning to mothur, let's get these sequecnes from the filtered fasta sequence data and calculate OTUs:
mothur > get.seqs(accnos=mock.accnos, fasta=HMP_MOCK.v35.filter.fasta) mothur > dist.seqs(fasta=HMP_MOCK.v35.filter.pick.fasta, output=lt) mothur > cluster(phylip=HMP_MOCK.v35.filter.pick.phylip.dist)
Looking at the output files, we see that if there were no chimeras and no sequencing errors, there should only be 18 OTUs at a 0.03 cutoff. Now we'll see how many OTUs there were in the mock community dataset:
mothur > dist.seqs(fasta=GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.pick.pick.pick.fasta, cutoff=0.10) mothur > cluster(column=GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.pick.pick.pick.dist, name=GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.pick.pick.pick.names) mothur > summary.single(calc=sobs, label=0.03) mothur > system(cat GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.pick.pick.pick.an.summary)
This tells us that there were a total of 29 OTUs - 11 of these are the product of undetected chimeras and sequencing errors. Considering the large number of remaining chimeras not detected by chimera.uchime this isn't so bad. Regardless, it still kinda sucks. If we sub-sample to 4405 sequences (see below), we get...
mothur > summary.single(calc=sobs, label=0.03, subsample=4419) mothur > system(cat GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.pick.pick.pick.an.ave-std.summary) label method sobs 0.03 ave 26.227000
This indicates that when we rarefy down to 4419 sequences, we lose 3 spurious OTUs.
So now we know how good (or bad) our data are. If you wanted you could start back at the step after the sffinfo command and start messing around and seeing how much this output changes. I've done this for 90 mock communities. What you've been doing is what I came up with. Clearly there is room for improvement in the area of chimera prevention and detection, but I'd say that we've got the actual sequencing error down pretty well. I'll add that the only way we can do this type of analysis is because we dedicated one barcode to a mock community. If we were sequencing on numerous plates we could see whether there was variation in the error rates or in the relative abundance of the OTUs/phylotypes. Finally, if your PI (or sequencing center) freaks out that you "threw away" a bunch of data, feel free to reanalyze these data however you (or he/she) see fit and show them how it changes the output.
Preparing inputs for analysis
Let's remove the mock community data so we're only using our "real" data.
mothur > remove.groups(fasta=GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.pick.pick.fasta, group=GQY1XT001.shhh.good.pick.pick.groups, name=GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.pick.pick.names, taxonomy=GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.pick.pds.wang.pick.taxonomy, groups=MOCK.GQY1XT001)
We pride ourselves on generating files with the longest possible file names. I mean who doesn't love typing out GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.pick.pick.fasta final.fasta? Let's use the mothur System command to make copies of some files so that we don't develop carpel tunnel syndrome. If you are in mac/linux do this:
mothur > system(cp GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.pick.pick.pick.fasta final.fasta) mothur > system(cp GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.pick.pick.pick.names final.names) mothur > system(cp GQY1XT001.shhh.good.pick.pick.pick.groups final.groups) mothur > system(cp GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.pick.pds.wang.pick.pick.taxonomy final.taxonomy)
Note that "cp" is the copy command for mac and unix users. If your world is run by Bill Gates, you will want to use the "copy" command instead:
mothur > system(copy GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.pick.pick.pick.fasta final.fasta) mothur > system(copy GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.pick.pick.pick.names final.names) mothur > system(copy GQY1XT001.shhh.good.pick.pick.pick.groups final.groups) mothur > system(copy GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.pick.pds.wang.pick.pick.taxonomy final.taxonomy)
To get ready for doing the fun part of analysis, we need to remember that there are three general approaches to analyzing 16S rRNA gene sequences. You can analyze sequences binned into OTUs or phylotypes or you can use the phylogenetic tree-based approaches.
Option 1 To build OTUs within mothur there are several options. The SOP in the Schloss lab is to first generate a distance matrix using the dist.seqs command. We will use a cutoff of 0.15, which means we'll chuck any pairwise distance larger than 0.15:
mothur > dist.seqs(fasta=final.fasta, cutoff=0.15, processors=2)
If the output of this command is in the >100 GBs of memory range you have probably done something incorrectly above or chosen to ignore some of my advice... Next, we want to cluster these sequences into OTUs based on the newly created distance matrix and the names file we've been keeping track of along the way using the cluster command:
mothur > cluster(column=final.dist, name=final.names)
Typically we are interested in cutoffs of 0.03 for further analysis.
Option 2 Generally, I'll set up the cluster command and let it rip overnight and all will be well in the morning. If I'm in a pinch (e.g. I have to give a talk in an hour and still don't have my slides done), I'll use the cluster.split command. It is wicked fast and the only reason I don't put it in the SOP is because I think that sometimes things that are hard are better. Regardless, if you use a taxlevel of 2-3 (Phylum or Class) the output should be about the same and you won't owe Gaia any carbon credits:
mothur > cluster.split(fasta=final.fasta, taxonomy=final.taxonomy, name=final.names, taxlevel=3, processors=4)
Like I said above, the output from Option 2 should be about the same as Option 1. The remainder of this tutorial uses Option 1. Now that we have a list file from one of these options, we would like to create a table that indicates the number of times an OTU shows up in each sample. This is called a shared file and can be created using the make.shared file:
mothur > make.shared(list=final.an.list, group=final.groups, label=0.03)
The final step to getting good OTU data is to normalize the number of sequences in each sample. We have observed that the number of spurious OTUs is correlated with the number of sequences. Because of this, we standardize the number of sequences across samples so that they all have the same opportunity to be "wrong". Most of these spurious OTUs are chimeras that we could not detect. To do this we will pick a minimum number of sequences that a sample has and randomly select this number of sequences from each sample. This is not the ideal solution because you could get lucky/unlucky in how you pick sequences (filed under things to worry about in the middle of the night if you've run out of other things to worry about). First we need to know how many sequences are in each step. You can do this with the count.groups command:
mothur > count.groups()
From the output we see that the sample with the fewest sequences had 4419 sequences in it. That seems like a reasonable size, so let's sub-sample all the samples to this number of seqeunces:
mothur > sub.sample(shared=final.an.shared, size=4419)
Typically, this will remove a few groups that that didn't have enough sequences. Get on the horn and yell at the sequencing center or read a book about the probability of getting 96 evenly represented samples with 5000 sequences each.
The last thing we'd like to do is to get the taxonomy information for each of our OTUs. To do this we will use the classify.otu command to give us the majority consensus taxonomy:
mothur > classify.otu(list=final.an.list, name=final.names, taxonomy=final.taxonomy, label=0.03)
With the completion of that last step, I generally have no reason to assign sequences to phylotypes. We do the following steps to make clinicians and people that think the names mean something happy. The phylotype command goes through the taxonomy file and bins sequences together that have the same taxonomy. Here we do it to the genus level:
mothur > phylotype(taxonomy=final.taxonomy, name=final.names, label=1)
Like above, we want to make a shared file and standardize the number of sequences in each group:
mothur > make.shared(list=final.tx.list, group=final.groups, label=1) mothur > sub.sample(shared=final.tx.shared, size=4419)
Finally, just to keep things consistent, we get the taxonomy of each phylotype:
mothur > classify.otu(list=final.tx.list, name=final.names, taxonomy=final.taxonomy, label=1)
There are many ways to construct phylogenetic trees. We have ported clearcut into mothur to generate neighbor joining trees. By default we do not use their heuristic, so these are real neighbor joining trees which you may or may not think are "real". First we need to subsample the sequences from each group and then construct a phylip-formatted distance matrix, which we calculate with dist.seqs:
mothur > dist.seqs(fasta=final.fasta, output=phylip, processors=2)
Now we call on clearcut:
mothur > clearcut(phylip=final.phylip.dist)
Moving on, let's do something more interesting and actually analyze our data. Remember that our initial question had to do with the stability and change in community structure in these samples when comparing early and late samples. Keep in mind that the group names have either a F or M (sex of animal) followed by a number (number of animal) followed by a D and a three digit number (number of days post weaning).
Let's start our analysis by analyzing the alpha diversity of the samples. First we will generate collector's curve of the Chao1 richness estimators and the inverse Simpson diversity index. To do this we will use the collect.single command:
mothur > collect.single(shared=final.an.shared, calc=chao-invsimpson, freq=100)
This command will generate file ending in *.chao and *.invsimpson for each sample, which can be plotted in your favorite graphing software package. When you look at these plots you will see that the Chao1 curves continue to climb with sampling; however, the inverse Simpson diversity indices are relatively stable. In otherwords, it isn't really possible to compare the richness of these communities based on the Chao1 index, but it is with the inverse Simpson index. As a quick aside, it is important to point out that Chao1 is really a measure of the minimum richneess in a community, not the full richness of the community. One method often used to get around this problem is to look at rarefaction curves describing the number of OTUs observed as a function of sampling effort. We'll do this with the rarefaction.single command:
mothur > rarefaction.single(shared=final.an.shared, calc=sobs, freq=100)
This will generate files ending in *.rarefaction, which again can be plotted in your favorite graphing software package. Alas, rarefaction is not a measure of richness, but a measure of diversity. If you consider two communities with the same richness, but different evenness then after sampling a large number of individuals their rarefaction curves will asymptote to the same value. Since they have different evennesses the shapes of the curves will differ. Therefore, selecting a number of individuals to cutoff the rarefaction curve isn't allowing a researcher to compare samples based on richness, but their diversity. Finally, let's get a table containing the number of sequences, the sample coverage, the number of observed OTUs, and the invsimpson diversity estimate using the summary.single command. To standardize everything, let's randomly select 4419 sequences from each sample 1000 times and calculate the average:
mothur > summary.single(calc=nseqs-coverage-sobs-invsimpson, subsample=4419)
These data will be outputted to a table in a file called final.an.groups.ave-std.summary. Interestingly, the sample coverages were all above 97%, indicating that we did a pretty good job of sampling the communities. Plotting the richness or diversity of the samples would show that there was little difference between the different animals or between the early and late time points. You could follow this up with a repeated-measures ANOVA and find that there was no significant difference based on sex or early vs. late.
Beta diversity measurements
Now we'd like to compare the membership and structure of the various samples using an OTU-based approach. Let's start by generating a heatmap of the relative abundance of each OTU across the 24 samples using the heatmap.bin command and log2 scaling the relative abundance values. Because there are so many OTUs, let's just look at the top 50 OTUs:
mothur > heatmap.bin(shared=final.an.shared, scale=log2, numotu=50)
This will generate an SVG-formatted file that can be visualized in Safari or manipulated in graphics software such as Adobe Illustrator. Needless to say these heatmaps can be a bit of Rorshock. A legend can be found at the bottom left corner of the heat map.
Now let's calculate the similarity of the membership and structure found in the various samples and visualize those similarities in a heatmap with the Jaccard and thetayc coefficients. We will do this with the heatmap.sim command:
mothur > heatmap.sim(calc=jclass-thetayc)
The output will be in two SVG-formatted files called final.an.0.03.jclass.heatmap.sim.svg and final.an.0.03.thetayc.heatmap.sim.svg. In all of these heatmaps the red colors indicate communities that are more similar than those with black colors.
When generating Venn diagrams we are limited by the number of samples that we can analyze simultaneously. Let's take a look at the Venn diagrams for the first 4 time points of female 3 using the venn command:
mothur > venn(groups=F003D000-F003D002-F003D004-F003D006)
This generates an two 4-way Venn diagrams. This shows that there were a total of 360 OTUs observed between the 4 time points. Only 73 of those OTUs were shared by all four time points. We could look deeper at the shared file to see whether those OTUs were numerically rare or just had a low incidence.
Let's use two non-parametric estimators to see what the predicted minimum number of overlapping OTUs is for the same samples using the summary.shared command:
mothur > summary.shared(calc=sharedchao, groups=F003D000-F003D002-F003D004-F003D006, all=T)
Opening the final.an.0.03subsample.0.03.pick.sharedmultiple.summary file we see a prediction that over these four time points the core microbiome contained at least 84 OTUs.
mothur > tree.shared(calc=thetayc-jclass, subsample=4419)
This command generates two newick-formatted tree files - final.an.thetayc.0.03.ave.tre and final.an.jclass.0.03.ave.tre - that are the result of subsampling 4419 sequences 1000 times. The trees can be visualized in software like TreeView or FigTree. Inspection of the both trees shows that individuals' communities cluster with themselves to the exclusion of the others. We can test to deterine whether the clustering within the tree is statistically significant or not using by choosing from the parsimony, unifrac.unweighted, or unifrac.weighted commands. To run these we will first need to create a design file that indicates which treatment each sample belongs to. There is a file called mouse.sex_time.design in the folder you downloaded that looks vaguely like this:
F003D000 F003Early F003D002 F003Early F003D004 F003Early F003D006 F003Early F003D008 F003Early F003D142 F003Late F003D144 F003Late F003D146 F003Late F003D148 F003Late F003D150 F003Late
Using the parsimony command let's look at the pairwise comparisons. Specifically, let's focus on the early vs. late comparisons for each mouse:
mothur > parsimony(tree=final.an.thetayc.0.03.ave.tre, group=mouse.sex_time.design, groups=all) 1 F003Early-F003Late 1 0.009
There was clearly a significant difference between the clustering of the early and late time points. Recall that this method ignores the branch length. Let's incorporate the branch length using the unifrac commands:
mothur > unifrac.weighted(tree=final.an.thetayc.0.03.ave.tre, group=mouse.sex_time.design, random=T)
Tree# Groups WScore WSig 1 F003Early-F003Late 0.861245 <0.001
Clearly when we incorporate the branch length from the dendrogram and incorporate the weighting, the early and late time series are significantly different for each mouse. Let's do the same analysis but without correcting for the weightings:
mothur > unifrac.unweighted(tree=final.an.thetayc.0.03.ave.tre, group=mouse.sex_time.design, random=T, groups=all) Tree# Groups UWScore UWSig 1 F003Early-F003Late 0.965069 0.007
Here we see a similar result to what we observed for the weighted UniFrac.
Another popular way of visualizing beta-diversity information is through ordination plots. We can calculate distances between samples using the dist.shared command:
mothur > dist.shared(shared=final.an.shared, calc=thetayc-jclass, subsample=4419)
The two resulting distance matrices (i.e. final.an.thetayc.0.03.lt.ave.dist and final.an.jclass.0.03.lt.ave.dist) can then be visualized using the pcoa or nmds plots. Principal Coordinates (PCoA) uses an eigenvector-based approach to represent multidimensional data in as few dimesnsions as possible. Our data is highly dimensional (~9 dimensions).
mothur > pcoa(phylip=final.an.thetayc.0.03.lt.ave.dist) mothur > pcoa(phylip=final.an.jclass.0.03.lt.ave.dist)
The output of these commands are three files ending in *dist, *pcoa, and *pcoa.loadings. The final.an.thetayc.0.03.lt.ave.pcoa.loadings file will tell you what fraction of the total variance in the data are represented by each of the axes. In this case the first and second axis represent about 32 and 22% of the variation (54% of the total) for the thetaYC distances. The output to the screen indicates that the R-squared between the original distance matrix and the distance between the points in 2D PCoA space was 0.54, but that if you add a third dimension the R-squared value increases to 0.97. All in all, not bad.
Alternatively, non-metric multidimensional scaling (NMDS) tries to preserve the distance between samples using a user defined number of dimensions. We can run our data through NMDS with 2 dimensions with the following commands
mothur > nmds(phylip=final.an.thetayc.0.03.lt.ave.dist) mothur > nmds(phylip=final.an.jclass.0.03.lt.ave.dist)
Opening the final.an.thetayc.0.03.lt.ave.nmds.stress file we can inspect the stress and R2 values, which describe the quality of the ordination. Each line in this file represents a different iteration and the configuration obtained in the iteration with the lowest stress is reported in the final.an.thetayc.0.03.lt.ave.nmds.axes file. In this file we find that the lowest stress value was 0.13 with an R-squared value of 0.91; that stress level is actually pretty good. You can test what hapens with three dimensions by the following:
mothur > nmds(phylip=final.an.thetayc.0.03.lt.ave.dist, mindim=3, maxdim=3)
The stress value drops to 0.05 and the R2 value goes up to 0.98. Not bad. In general, you would like a stress value below 0.20 and a value below 0.10 is even better. Thus, we can conclude that, NMDS is better than PCoA. We can plot the three dimensions of the NMDS data by plotting the contents of final.an.subsample.pick.thetayc.0.03.lt.nmds.axes. Again, it is clear that the early and late samples cluster separately from each other. Ultimately, ordination is a data visualization tool. We might ask if the spatial separation that we see between the early and late plots in the NMDS plot is statistically significant. To do this we have two statistical tools at our disposal. The first analysis of molecular variance (amova), tests whether the centers of the clouds representing a group are more separated than the variation among samples of the same treatment. This is done using the distance matrices we created earlier and does not actually use ordination.
mothur > amova(phylip=final.an.thetayc.0.03.lt.ave.dist, design=mouse.sex_time.design) F003Early-F003Late Among Within Total SS 0.171226 0.307108 0.478334 df 1 8 9 MS 0.171226 0.0383885 Fs: 4.46036 p-value: 0.009*
Here we see from the AMOVA that the "cloud" early and late time points has a significantly different centroid for this mouse. Thus, the observed separation in early and late samples is statistically significant.
Next, we might ask which OTUs are responsible for shifting the samples along the two axes. We can determine this by measuring the correlation of the relative abundance of each OTU with the two axes in the NMDS dataset. We do this with the corr.axes command:
mothur > corr.axes(axes=final.an.thetayc.0.03.lt.ave.nmds.axes, shared=final.an.0.03.subsample.shared, method=spearman, numaxes=3)
This command generates the final.an.0.03subsample.0.03.pick.spearman.corr.axes file. The data for the first five OTUs look like this...
OTU axis1 p-value axis2 p-value axis3 p-value length Otu001 -0.503030 0.131276 0.103030 0.757252 -0.927273 0.000112 1.059948 Otu002 -0.187879 0.573002 0.151515 0.649436 -0.878788 0.000814 0.911331 Otu003 0.369697 0.267391 -0.806061 0.004862 0.139394 0.675814 0.897686 Otu004 0.890909 0.000542 -0.030303 0.927565 0.321212 0.335228 0.947531 Otu005 0.284848 0.392803 0.321212 0.335228 -0.236364 0.478268 0.490085 Otu006 0.393939 0.237278 0.224242 0.501121 0.018182 0.956501 0.453656 Otu007 0.018182 0.956501 0.684848 0.028883 -0.406061 0.223155 0.796388 Otu008 -0.127273 0.702596 0.515152 0.122236 -0.042424 0.898725 0.532334 Otu009 0.212121 0.524539 -0.369697 0.267391 -0.115152 0.729753 0.441510
What these results show is that OTUs 1 and 2 are responsible for moving points in a negative direction along axis 3, whereas OTU 3 moves it along the negative direction on axis 2. Recalling that we classified each OTU earlier, we can open final.an.0.03.cons.taxonomy to see that these first five OTUs are mainly members of the Porphyromonadaceae. This helps to illustrate the power of OTUs over phylotypes since each of these OTUs is behaving differently. These data can be plotted in what's known as a biplot where lines radiating from the origin (axis1=0, axis2=0, axis3=0) to the correlation values with each axis are mapped on top of the PCoA or NMDS plots. Later, using the metastats command, we will see another method for describing which populations are responsible for differences seen between specific treatments.
An alternative approach to building a biplot would be to provide data indicating metadata about each sample. For example, we may know the weight, height, blood pressure, etc. of the subjects in these samples. For discussion purposes the file mouse.dpw.metadata is provided and looks something like this:
group age F003D000 0 F003D002 2 F003D004 4 ...
mothur > corr.axes(axes=final.an.thetayc.0.03.lt.ave.nmds.axes, metadata=mouse.dpw.metadata, method=spearman, numaxes=3)
Opening the file mouse.dpw.spearman.corr.axes, we see:
Feature axis1 p-value axis2 p-value axis3 p-value length dpw 0.709091 0.021666 -0.624242 0.053718 0.357576 0.283394 1.010123
Indicating that as the dpw increases the communities shift to in the positive direction along axis 1 and negatively along the axis 2.
In addition to the use of corr.axes we can also use metastats to determine whether there are any OTUs that are differentially represented between the samples from men and women in this study. Run the following in mothur:
mothur > metastats(shared=final.an.0.03.subsample.shared, design=mouse.sex_time.design)
Looking in the first 10 OTUs from final.an.0.03subsample.0.03.pick.0.03.F003Late-F003Early.metastats file we see the following...
OTU mean(group1) variance(group1) stderr(group1) mean(group2) variance(group2) stderr(group2) p-value q-value Otu001 0.096447 0.000336 0.008200 0.144105 0.004666 0.030548 0.164000 0.205631 Otu002 0.106811 0.000761 0.012341 0.107807 0.001182 0.015374 0.942650 0.439613 Otu003 0.127540 0.000184 0.006061 0.070152 0.000438 0.009362 0.000514 0.035404 Otu004 0.064449 0.000337 0.008206 0.064223 0.002877 0.023986 0.951729 0.439613 Otu005 0.055171 0.000044 0.002960 0.066124 0.000209 0.006463 0.151736 0.205631 ...
These data tell us that OTU 3 was significantly different between the early and late samples.
Phylotype-based analysis is the same as OTU-based analysis, but at a different taxonomic scale. We will leave you on your own to replicate the OTU-based analyses described above with the phylotype data.
OTU and phylotype-based analyses are taxonomic approaches that depend on a binning procedure. In contrast, phylogeny-based approaches attempt similar types of analyses using a phylogenetic tree as input instead of a shared file. Because of this difference these methods compare the genetic diversity of different communities.
When using phylogenetic methods, alpha diversity is calculated as the total of the unique branch length in the tree. This is done using the phylo.diversity command. Because of differences in sampling depth we will rarefy the output:
mothur > phylo.diversity(tree=final.phylip.tre, name=final.names, group=final.groups, rarefy=T)
This will generate a file called final.phylip.1.phylodiv.rarefaction.
The unifrac-based metrics are used to assess the similarity between two communities membership (unifrac.unweighted) and structure (unifrac.weighted). We will use these metrics and generate PCoA plots to compare our samples. There are two beta-diversity metrics that one can use - unweighted and weighted. We will also have mothur subsample the trees 1000 times and report the average:
mothur > unifrac.unweighted(tree=final.phylip.tre, name=final.names, group=final.groups, distance=lt, processors=2, random=F, subsample=4419) mothur > unifrac.weighted(tree=final.phylip.tre, name=final.names, group=final.groups, distance=lt, processors=2, random=F, subsample=4419)
These commands will distance matrices (final.phylip.tre1.unweighted.ave.dist and final.phylip.tre1.weighted.ave.dist) that can be analyzed using all of the beta diversity approaches described above for the OTU-based analyses.
There are many other ways that one could analyze these data (hey, we've gotta save something for the paper!). I encourage you to go back and change the settings, use different calculators, come up with a hypothesis using the data and test it. If you think of an analysis that you wish mothur would do, please let us know and we'll see about adding it to the package. There is a certain "pipeline" aspect to this analysis; however, it is also very much an art of working with sequences. If you want to basically do everything that was described above, you can use the batch file and use mothur in the batch mode as follows:
$ ./mothur batch
The file "batch" would contain each line you used above on its own line in a text file