We will be offering mothur and R workshops throughout 2019. Learn more.
Marine community analysis
In this tutorial we will perform an analysis of marine microbial communities using the tools available in mothur. Specifically, we will be looking at fourteen sites from the Global Ocean Sampling Expedition (GOS) Expedition which cover a diverse set of marine environments (e.g., coastal, fresh water, estuarine, hypersaline). All the data files you need to complete this tutorial are contained in the GOS.zip (~8MB) archive.
- 1 Getting started
- 2 OTU-based analyses
- 3 Hypothesis testing approaches
Below I describe a pipeline for analyzing the taxonomic composition of samples obtained during the GOS Expedition. The results of each step in this pipeline are provided so you can start exploring this dataset from whichever point in the pipeline you wish.
Obtaining 16S proxy sequences
The aim of the GOS Expedition was to provide a comprehensive genomic survey of the diversity of microbial life in the world's oceans. To achieve this goal, approximately 150 metagenomic marine samples were collected from around the world. This has provided a wealth of information about both the taxonomic and metabolic compositions of marine communities.
However, limitations in environmental sequencing technology can make working with metagenomic samples challenging. In this tutorial, we will investigate one possible pipeline for analyzing the taxonomic composition of environmental samples. Non-coding RNA (ncRNA) sequences for many of the GOS samples sites along with geographic and environmental data about these sites can be obtained from the CAMERA web server. We can identify 16S sequences within this set of ncRNA sequences by:
- Performing a blastall search with default parameters between the ncRNA sequences and the 16S rDNA database compiled by GreenGenes. Other 16S databases could just as easily be used. The results shown here used the GreenGenes database compiled on Jan. 28, 2009.
We can filter these results so we are left with only those sequences we are confident in by:
- Removing any blastall hits with an expectation value greater than 1.0E-5, an alignment length less than 50, or a percent identity less than 70%.
An interesting challenge that results from metagenomic data is the relatively short sequence reads. The mean length in our dataset is ~560 nt. This is significantly shorter than those considered in the Esophageal community analysis where the mean length is ~865 nt. Aligning short sequence reads is problematic since there may not be sufficient overlap between sequences to build an alignment which can reliably be used for phylogenetic inference. We can circumvent this problem by:
- Using the best blast match in the GreenGenes database to each of our filtered 16S GOS sequences as a proxy for the short sequence read. Most sequences in the GreenGenes database are > 1250 nt.
This approach is not without its limitations. Most notably, it is highly sensitive to the quality and diversity of sequences contained within the GreenGenes database. Despite these limitations, this pipeline can reveal many interesting patterns amongst the sample sites contained in the GOS dataset. To keep computational requirements within reason, we will investigate a subset of the GOS dataset. We will investigate the following fourteen sites which cover a range of environment types:
- GS011 - an estuarine site
- GS012 - an estuarine site
- GS013 - a coastal site from the Northern Atlantic
- GS014 - a coastal site from the Northern Atlantic
- GS018 - a coastal site from the Caribbean
- GS019 - an open ocean site from the Caribbean
- GS020 - a fresh water site (Lake Gutan)
- GS022 - an open ocean site from the Eastern Tropics
- GS023 - an open ocean site from the Eastern Tropics
- GS027 - a coastal site from the Galapagos Islands
- GS028 - a coastal site from the Galapagos Islands
- GS033 - a hypersaline site from the Galapagos Islands
- GS121 - an open ocean site from the Indian Ocean
- GS122a - an open ocean site from the Indian Ocean
The proxy sequences for these sites are contained in the file gos_proxy_seq.fasta. The GOS archive also contains the file gos.group which indicates the sample each sequence is from and gos.names which indicates what sequences are identical.
Generating an alignment
We will obtain a multiple sequence alignment (MSA) of our 16S sequences in the same manner as specified for the Esophageal community analysis. Later we will use this MSA to obtain the distance matrix and phylogenetic tree which are required to perform most of the analyses we would like to do in mothur. For this tutorial we will use GreenGenes to align our sequences, but you could certainly use SILVA as an alternative. The output of the GreenGenes NAST alignment is included in the GOS archive (gos_proxy_seq_NAST_job3353.tgz), but you can follow these instructions if you wish to perform the alignment yourself:
- Go to the main GreenGenes webpage and click on the "Align" icon
- Click on the "Choose File" button and select gos_proxy_seq.fasta
- In the field for "Batch size for NAST" enter "50"
- Change the "Minimum Length" field to "600"
- Enter your email address and press the "Process FASTA File" button.
Do not close the webpage until you have receive an e-mail from Todd DeSantis. This email will have a file that resembles the gos_proxy_seq_NAST_job3353.tgz archive in the GOS archive folder. Decompress this file. Feel free to take a look at the excel spreadsheet which provides information about the quality of the alignment. The file gos_proxy_seq_NAST.fasta contains your MSA in a canonical 7,682 nt long format used by GreenGenes.
Creating a distance matrix
We will use GreenGenes to generate a distance matrix. GreenGenes provides a user friendly interface to the PHYLIP program DNAdist. It also conveniently masks out hyper-variable regions from the alignment file which is desirable since our sequences cover a significant range of phylogenetic diversity so these sites are unlikely to contain useful phylogenetic signal. Downstream analysis with mothur and many other programs generally requires a rooted tree so we need to add an outgroup sequence to our sequence file. We will use the 16S gene of Haloquadratum walsbyi (an archaeon) as our outgroup. This sequence should be aligned using the GreenGenes NAST aligner in the same manner as described above. To build the distance matrix:
- Copy-and-paste the aligned Haloquadratum walsbyi sequence (contained in the archive walsbyi_NAST_job31369.tgz in the GOS archive) into the top of your gos_proxy_seq_NAST.fasta sequence file
- Go to the main GreenGenes webpage and click on the "More tools..." link under the "Functions" menu
- Click on the "Create distance matrix" link
- Click on the "Browse" button and select your gos_proxy_seq_NAST.fasta file
- Enter your email address and press the "Calculate Distances" button.
Do not close the webpage until you have receive an e-mail from Todd DeSantis. This email will give you directions on how to download a file that resembles the gos_proxy_seq_NAST_DM_job6080.tgz archive in the GOS archive folder. Decompress this file and rename your distance matrix from distance_matrix_6080 to gos.dst.
Creating a phylogenetic tree
We will use the PHYLIP program Neighbor to build a neighbor-joining tree of our sequences. To generate the rooted tree:
- Copy neighbor.exe into the directory with your data files and execute the program
- When prompted for the input file type gos.dst and press enter
- Type O followed by enter to indicate we wish to specify an outgroup sequence
- Type 1 followed by enter to indicate that the first sequence should be used as an outgroup (this is why it is important to place the outgroup at the top of our sequence file when creating the distance matrix)
- Type Y followed by enter to build the neighbor-joining tree
This will produce two files: outfile and outtree. The former contains useful information about the construction of your tree while the latter is the neighbor-joining tree in Newick format. The outfile indicates that some branches have been assigned small negative values. This suggests that a more robust method is needed to infer the evolutionary history of these sequences, but this tree will suffice for illustrating the use of mothur. Rename outtree to gos.tre. You can inspect the tree using programs such as FigTree or TreeView.
We now have a distance matrix (gos.dst) and phylogenetic tree (gos.tre) for our sequences. In this section we will analyze the distance matrix with mothur in order to determine the number of operational taxonomic units (OTUs) in our dataset. Start mothur and use the read.dist command to load in our distance matrix and indicate that we are interested in OTUs at a distance less than or equal to 0.10:
mothur > read.dist(phylip=gos.dist, cutoff=0.1, name=gos.names)
The name file gos.names is used to indicate that certain sequences are present multiple times in our dataset. Mothur allows duplicate sequences to be specified in this manner in order to reduce memory and processing requirements.
We can now cluster our sequences using the furthest neighbor algorithm:
mothur > cluster()
This will generate three files: gos.fn.sabund, gos.fn.list, and gos.fn.rabund. Our sequences come from fourteen samples. In order to identify the OTUs present in each of our samples we can use the read.otu command which reads in the gos.fn.list file along with a group file indicating which sample each sequence comes from. Our analysis will focus on OTUs defined as clusters of sequences with a similarity of 97%, 95%, or 90% (i.e., a distance of 0.03, 0.05, or 0.10):
mothur > read.otu(list=gos.fn.list, group=gos.groups, label=unique-0.03-0.05-0.10)
This command will generate the file gos.fn.shared which indicates the number of OTUs across all samples and fourteen files indicating the number of OTUs from each sample (i.e., gos.fn.GS011.rabund, gos.fn.GS012.rabund, ..., and gos.fn.GS122a.rabund).
Mothur provides a number of tools for analyzing individual samples. A critical question in any study is whether the sampling effort was sufficient to account for all OTUs present at a sample site. This can be investigated by plotting a rarefaction curve for each site. Here we will focus solely on sample GS020, but these steps would normally be repeated for all samples considered in your study. To create a rarefaction curve we must first load in the list file for the sample:
mothur > read.otu(list=gos.fn.GS020.rabund)
The data required to plot a rarefaction curve can now be obtained using the rarefaction.single command:
mothur > rarefaction.single()
Since our rarefaction curve at a distance of 0.0 is far from being parallel to the x-axis, we would need additional samples before we could be reasonably confident in results calculated from these OTUs. At higher distances, the curves are closer to parallel so we have more confidence in results calculated using OTUs at these distances.
mothur > collect.single(calc=chao)
This will generate the file gos.fn.GS020.chao. This file can be plotted in a manner analogous to a rarefaction curve and used to evaluate how the results of the calculator change with sampling effort.
For a quick summary of applying different calculators to a sample you can execute the following command:
mothur > summary.single()
This command will produce a summary file gos.fn.GS020.summary which gives the results of applying each of the calculators to the sample when all available data is used.
Mothur contains a number of tools for comparing multiple samples. A quick summary of the shared diversity between samples can be obtained using the summary.shared command:
mothur > read.otu(shared=gos.fn.shared, label=unique-0.03-0.05-0.10) mothur > summary.shared()
This will generate the file gos.fn.shared.summary. Within this file the results of applying different calculators to all possible pairs of samples is given. For example, the first line reads:
label comparison sharedsobs sharedchao sharedace jabund sorabund jclass ... unique GS020 GS028 6.000000 11.000000 4.000000 0.199874 0.333158 0.017045 ...
This indicates that at a distance of 0.0 samples GS020 and GS028 share 6 observed OTUs, have a shared richness of 11 OTUs as estimated by the Chao calculator, and so forth for all calculators supported by mothur.
Venn diagrams can be used to visualize the distribution of OTUs amongst three or four samples. We can use the venn command to generate a Venn diagram with mothur:
Since we specified an interest in OTUs at a distance of 0.0, 0.03, 0.05, 0.10, this will generate four output files: gos.fn.unique.venn.sharedsobs.svg, gos.fn.0.03.venn.sharedsobs.svg, gos.fn.0.05.venn.sharedsobs.svg, and gos.fn.0.10.venn.sharedsobs.svg. These Venn diagrams show the distribution of observed OTUs amongst samples GS018, GS019, and GS020. Different calculators can also be used with the venn command in order to visualize the estimated richness shared by these samples.
The results here show that the two Caribbean samples (GS018, GS019) have more in common with each other than they do with the fresh water sample (GS020):
Heatmaps can be used to visualize the OTUs present in each sample. We can use the heatmap.bin command to generate a heatmap with mothur:
Since certain OTUs are likely to be present in large quantities, it is often useful to scale the heatmap to make visualizing less frequent OTUs easier. This is done with the scale parameter as shown. Note that the default scale is log10 and not linear. The label parameter can be used to specify the clustering distance you are interested in. Executing this command produce the single heatmap file gos.fn.0.03.heatmap.svg shown here:
A few noticable outliers can immediately be identified in the heatmap:
- the second column corresponds to the fresh water site (GS020)
- the ninth column corresponds to the hypersaline site (GS033)
- the last column corresponds to our outgroup (I like to keep the outgroup around for sanity checks such as this)
Community trees can be used to visualize the similarity of different samples. To generate a community tree we need to use the tree.shared command:
read.otu(list=gos.fn.list, group=gos.groups, label=0.03) tree.shared(calc=thetayc, groups=GS028-GS020-GS023-GS022-GS027-GS011-GS012-GS013-GS033-GS014-GS018-GS122a-GS019-GS121)
This will produce a newick-formatted tree file called gos.fn.ThetaYC.0.03.tre which indicates how similar our samples are according to the Yue & Clayton theta structural diversity measure. Programs such as FigTree or TreeViewX can be used to visualize the tree. Our samples result in the following community tree:
As we might have expected from the heatmaps, the fresh water sample (GS020) and hypersaline sample (GS033) are highly dissimilar to all other samples. Notice that the following "pairs of sites" cluster together:
- the estuarine sites (GS011, GS012)
- the coastal site from the Caribbean (GS018, GS019)
- the open ocean site from the Indian Ocean (GS121, GS122a)
- the coastal site from the Galapagos Islands (GS027, GS028)
However, there are some exceptions:
- the open ocean site from the Eastern Tropics (GS022, GS023)
- the coastal site from the Northern Atlantic (GS013, GS014)
These results indicate that salinity has a significant influence of microbial community composition which is in agreement with a recent global study performed by Lozupone and Knight (2007). Further work is needed to determine if geographic proximity influnces microbial community or if the similarity we are seeing between our "pairs of sites" is strictly a function of environmental similarity.
Hypothesis testing approaches
Mothur does not currently support hypothesis testing when sequences must be assigned a weight indicating the number of times they occur in a given community. For smaller data sets, this can be overcome by generating a tree where every sequence (whether it is unique or not) is a leaf node in the tree. Unfortunately, for our data set this is computationally prohibitive since some sequences are present thousands of times. Mothur will likely support assigning weights to sequence in a future release. In the meantime, we can use the UniFrac web portal to analyze our data set. To perform this analysis you will need the tree file gos.tre and the provided environment file gos.env which indicate the abundance of each sequence in a given sample. If you are new to the UniFrac web portal, consider going through the tutorial they provide first.