We will be offering an R workshop December 18-20, 2019. Learn more.
This goal of this tutorial is to demonstrate the standard operating procedure (SOP) that the Schloss lab uses to process their 16S rRNA gene sequences that are generated using Illumina's MiSeq platform using paired end reads. The approach we take is to use index reads to multiplex a large number of samples (i.e. 384) on a single run. Others have generated similar data but without the index reads and so the index (aka barcode) sequences are found at the beginning of each read. This SOP will highlight the differences in processing between these two approaches. This SOP is largely the product of a series of manuscripts that we have published and users are advised to consult these for more details and background data. The MiSeq-specific steps are described in a manuscript that is in review with Applied and Environmental Microbiology. The workflow is being divided into several parts shown here in the table of contents for the tutorial:
The first step of the tutorial is to understand how we typically set up a plate of barcodes for sequencing. First, you can obtain a set of barcodes from multiple sources. We utilize the "Broad" barcodes that were used for the HMP 16S rRNA gene sequencing effort. You can find these barcodes and primers to sequence the V13, V35, and V69 regions online. Next, it is worth noting that the sequencing is done in the "reverse direction" - e.g. starting at the 3' end of the V5 region, we sequence back towards the 5' end of the V3 region. We picked the V35 region because we liked these primers the best for the types of samples that we generally process (i.e. human and mouse). You may choose to use another region based on the expected biodiversity of your samples. Second, this set of barcodes will allow us to simultaneously sequence 96 samples. A priori, we dedicate two of these barcodes to controls. The first is for resequencing a mock community made up of a defined consortia of 16S rRNA gene sequences where we know the true sequence. This is helpful in assessing sequencing errors and drift in the process. The second is a "generous donor" sample, which is a representative sample (e.g. stool) that we resequence on each plate. This provides a more realistic understanding of how processing drift might be affecting our analysis. Finally, we generally utilize half of a 454 plate for each set of 96 barcodes. A typical plate has two halves (duh) and it is not advisable to put the same barcode-primer combination on these two halves if they are used for different samples as there is leakage across the gasket.
Starting out we need to first determine, what was our question? The Schloss lab is interested in understanding the effect of normal variation in the gut microbiome on host health. To that end we collected fresh feces from mice on a daily basis for 365 days post weaning (we're accepting applications). During the first 150 days post weaning (dpw), nothing was done to our mice except allow them to eat, get fat, and be merry. We were curious whether the rapid change in weight observed during the first 10 dpw affected the stability microbiome compared to the microbiome observed between days 140 and 150. We will address this question in this tutorial using a combination of OTU, phylotype, and phylogenetic methods. To make this tutorial easier to execute, we are providing only part of the data - you are given the flow files for one animal at 10 time points (5 early and 5 late). In addition, to sequencing samples from mice fecal material, we resequenced a mock community composed of genomic DNA from 21 bacterial and archaeal strains. We will use the 10 fecal samples to look at how to analyze microbial communities and the mock community to measure the error rate and its effect on other analyses.
When we get our sequences back they arrive as an sff file. Often times people will only get back a fasta file or they may get back the fasta and qual files. You really need the sff file or at least the fasta, qual, and flow file data. If your sequence provider can't or won't provide these data find a new sequence provider. It is also critical to know the barcode and primer sequence used for each sample. We have heard reports from many users that their sequence provider has already trimmed and quality trimmed the data for them. Ahemm... don't trust them. If your sequence provider won't give you your primer or barcode sequences, move on and use someone else. For this tutorial you will need several sets of files. To speed up the tutorial we provide some of the downstream files that take awhile to generate (e.g. the output of shhh.flows):
- Example data from Schloss lab that will be used with this tutorial. It was extracted from the very large full SFF file
- Lookup data for use with shhh.flows
- SILVA-based bacterial reference alignment
- mothur-formatted version of the RDP training set (v.9)
It is generally easiest to decompress these files and to then move the contents of the Trainset9_032012.pds and the silva.bacteria folders into the Schloss_SOP folder. You will also want to move the contents of the mothur executable folder there as well. If you are a sysadmin wiz (or novice) you can probably figure out how to put mothur in your path, but this will get you what you need for now.
In addition, you probably want to get your hands on the following...
- mono - if you are using Mac OS X or linux
- TextWranger / emacs / vi / or some other text editor
- R, Excel, or another program to graph data
- Adobe Illustrator, Safari, or Inkscape
- TreeView, FigTree, Topiary Explorer or another program to visualize dendrograms
It is generally easiest to use the "current" option for many of the commands since the file names get very long. Because this tutorial is meant to show people how to use mothur at a very nuts and bolts level, we will only selectively use the current option to demonstrate how it works. Generally, we will use the full file names.