We will be offering an R workshop December 18-20, 2019. Learn more.
Limits of T-RFLP fingerprinting of microbial communities
In this tutorial we will perform an analysis of sampling capacities of archaeal primer sets that are regularly used for fingerprinting of microbial communities using Terminal-Restriction Fragment Length Polymorphism (T-RFLP). In order to approximate the complexity of microbial communities, one can adopt the complete Ribosomal Database Project II (RDP II), the GreenGenes  as a starting point to analyses of bacterial and archaeal 16S rRNA genes, or Functional Gene Database / Repository  for analyses of various functional genes. As an example, we are going to use BEsTRF  to generate primer-pair specific sequence collections of archaeal 16S rRNA genes. Specifically, we will be looking whether statistically significant differences exist between RDP II archaeal subsamples generated by the variants of the widely used primer sets covering the same conserved regions in close proximity of existing primers Ar109f and Ar927r. This approach will generate six in-silico libraries comprising 9412 sequences altogether. As a separate topic, the BEsTRF generated in-silico T-RFLP profiles are going to be analyzed separately (Pc-ORD) and the congruency of the two approaches compared.
Here is the basic outline:
Cominations of all these five primers resulted in six combinations with identical location of forward primer, but varying recognition capabilities:
Combinations with the second forward primer A109FK were prefixed with W - wider detection.
1. Getting started
Making an Alignment
Fortunately comprehensive datasets of 16S rRNA genes can be obtained from RDP II , GreenGenes, SILVA,other or your own as an alternative. As mentioned, FGD/R  can be used to explore also functional genes.
1. Go to Resources subpage 
2. Select database of choice in FASTA format
3. Download to a selected folder and unpack. Beware, the aligned database will unpack to 20 Gb.
Generation of Sequence Subsamles Using BEsTRF 
1. Specify primer pairs of choice, sequence database and restriction endonucleases you intend to use for analyses.
2. Modify BEsTRF parameter file selected from worked examples or prepare your own from the supplied BEsTRF parameter file 
3. Run your analyses. Sequence subsets are now stored in one novel directory as primer-pair specific text files (e.g. A109f_A927r.txt). They contain sequences detected by the selected primer pairs.
Making an Alignment
We will first generate our multiple sequence alignment using the Ribosomal Database Project's Pyrosequencing Pipeline  as the Greengenes NAST aligner was less amenable to align datasets of 1000+ sequences on a routine basis. Under ‘’Data processing’’ step  you will find ALIGNER.
1. provide your email address – you will receive data over email; I prefer to use gmail due to the size of the files sent.
2. type in the code
3. browse and select your file containing sequences, confirm, wait until the page is updated, and confirm again
Wait until your alignment is delivered to your email. Download to preferred location, unpack. We will rename the file to alna0mm.txt to make clear its content.
Making a Distance Matrix
To generate necessary distant matrices we will again use Ribosomal Database Project's Pyrosequencing Pipeline as above. Select ‘’Formats for common programs’’ and choose Mothur: Column Distance Matrix - create a column distance matrix compatible with Mothur.
1. provide your email address – you will receive data over email again.
2. type in the code
3. browse and select your unpacked alignment, confirm, wait until the page is updated to see the RDP’s confirmation message. Wait until your alignment is delivered to your email. Download, unpack.
You have now generated two additonal files containing distance matrix (a0mm.txt.dist) and the mothur name file (a0mm.txt.names). Both files are ordinary text files that can be viewed in text editors, such as Notepad, Wordpad, Word in Windows or Kate or other in Linux.
If we want to be able to generate trees for subsequent analyses using TreeClimber or Unifrac, we should simply generate a second distance matrix suitable for simple tree construction. We will use the second option of Ribosomal Database Project's Pyrosequencing Pipeline - Mothur: Phylip Distance Matrix to create a matrix and sample group file compatible with Mothur's LIBSHUFF function. Thus, it can be used for tree construction using Phylip program Neighbor.exe following these instructions:
1. Download the Phylip package.
2. Copy the executable neighbor.exe into the directory with your data files and run the program.
3. When prompted for the input file type the name of the alignment file (e.g. a0mm.phy). For those less fluent, you can simply use ClustalX  or BioEdit  to open and save the alignment in desired phy format, before starting neighbor.
4. Modify parameters according to your preferences like D to select for Jukes-Cantor or other distance. Default can be used for all other parameters.
5. When done, copy the file containing tree into the mothur folder for subsequent analyses using TreeClimber or UniFrac.
Making a group file
At this point you should have the "group" file ready. This file contains a comparative list of sequence names and environments they belong to in a simple text format as it is needed by many Mothur applications. There are certainly many ways how to get there:
i) arb 
ii) BioEdit or simple text editors (for those who avoid arb):
1. Start BioEdit
2.Click 'New alignment', ‘Open’your either aligned or unaligned fasta or txt sequence file. This most nicely works when all your sequences were meticulously marked at the beginning of your work, resulting thus in a collection of sequences labeled according to environments soil, sediment, air: SOIL1, SOIL2,....SOIL1851; SED1, SED2, ….SED2011; AIR1,AIR2, AIR918,
3.Once your sequences are imported, ‘Edit’. ‘Select all sequences’, the sequence titles become black,’Edit’, ‘Copy sequence titles’ and paste them to text processing or spread sheet program. Save the file as a0mmgroup.txt.
4.open the file in a spread sheet program, duplicate the column with simple copy-paste and find-replace the characters from the second column according to your needs. DO NOT INCLUDE ANY SPACES IN YOUR ENVIRONMENT NAME.
5.all sequences are now organized according to their environments
6.Name your file, save as text. Later, rename the file extension into .group
2.1 Distance matrix and clustering
Now we can use the distance matrix created from RDP’s alignment to determine the number of operational taxonomic units (OTUs) in mothur. Start mothur and load the distance matrix by using the read.dist() command. Note, either of the two distance matrices generated above can be used, just adjust the command according to the distance matrix type. The complete phylip matrix takes much longer to read.
For the first above:
mothur > read.dist(column=a0mm.dist, cutoff=0.2, name=a0mm.name)
For the second (phylip compatible) type:
mothur > read.dist(phylip=a0mm.dist, cutoff=0.2)
Using this composite command, sequences were grouped using the furthest neighbour algorithm according to cut-off of 0.2.
I then clustered my sequences with:
mothur > cluster()
The default (furthest neighbor, fn) clustering of the 9412 archaeal sequences produced three files:
a0mm.fn.rabund (rank abundance file),
a0mm.fn.sabund (species abundance file),
a0mm.fn.list (a list file containing the sequences in each OTU).
I then read the OTU .list of 9412 sequences created using the cluster() command at various distances distance with:
mothur > read.otu(list=a0mm.fn.list, label=unique-0.03-0.05-0.1-0.2, group=a0mm.groups)
The mothur's group option reads a .groups file, a tab delimited text file, that was created as described above using either arb or BioEdit or simple text editos. This file contains two columns, one with 9412 equence names followed by a column describing environmental names (short, medium, long, wshort, wmedium, wlong). This enables mothur to generate a .fn.shared file and 6 separate list files, one for each treatment.
2.2 Single sample analyses - rarefaction, Collector's curves, richness
We want to explore the sampling effort we invested into analysis of each sample, namely, which of the six primer combinations captured archeal diversity most efficiently. As we are working with 16S rRNA we will explore various cut-off values reflecting phylogenetic relatedness. Collector’s curves and rarefaction curves can be calculated in addition to numerous diversity indices. But to do so, first read in the respective list file of particular sample:
mothur > read.otu(list=wmedium.fn.list)
use rarefaction.single() to generate a rarefaction curve named wmedium.fn.rarefaction
mothur >rarefaction.single() mothur > collect.single(calc=chao-pl_shanon)
These commands will generate a number of files. Rarefaction and Collector's curve address the issue of sampling effort we invested to cover the diversity present in microbial communities. Open the files and plot the curves at various OTU definitions approximating species, genus and further higher taxa. Observe at which distance they level off for our six archeal samples.
An example of rarefaction is presented here:
This figure represents the six rarefaction curves from archaeal 16S rRNA sequencs sampled by different variants of the same primers (A109f and A927r)from Greengenes database. As you can see, there are marked differences between the primer sets in the number of detected sequences and also in the number of detected OTU. We will explore whether these differences are significant or not below.
A summary file containing final values of various richness estimators or predictors was created using:
mothur > read.otu(list=wlong.fn.list) mothur > summary.single()
or, you can specify specific estimators that you prefer most
mothur > summary.single(calc=nseqs-sobs-ace-chao)
The first command instructs mothur to read the previously created .list file containing sequences belonging to particular environment(short, medium, long, wshort, wmedium, wlong). The second command generates the richness summary file, containing only the final estimators, for the selected .list file. Thus, these commands need to be run for each sample, in this case, six times.
2.3 Multiple sample comparison
We will use a set of mothur tools to visualize the data, identify significantly different samples and test their overall similarity grouping.
To get the first impression of the microbial community overlap between the various primer sets the number of shared OTU was estimated using the following commands:
mothur > read.otu(list= malna0mm.txt.fn.list, group=a0mm.groups) mothur > read.otu(shared=malna0mm.txt.fn.shared) mothur > summary.shared()
The generated summary file contains information on pairwise comparisons and the number of shared OTU at certain cut-off value that was set previously. It looks like this:
To perform one of the most basic visualisations of our data, we will use one of the most widely used coefficients - the Bray-Curtis similariy in a form of (1-s) dendrograms. This newick-formatted tree file can be simply viewed by TreeView.
mothur > read.dist(phylip=malna0mm.txt.dist, cutoff=0.20) mothur > cluster() mothur > read.otu(list= malna0mm.txt.fn.list, group= malna0mm.txt.groups, label=0.03) mothur > tree.shared(calc=braycurtis)
This is how the tree later looks like.
As you can see, in-silico community sampling using different sets of primers covering the same region resulted in two separate clusters of virtual communities. Keep in mind, a single community was sampled, and what we are testing are the pictures of this single community obtained by different means.
Compare this dendrogram obtained from detected sequences with a dendrogram obtained from terminal restriction fragments at 100% detection (0% cut off value) (BEsTRF generated T-RF matrix was analyzed in PCORD).
As you can see, the distribution of virtual communities into clusters reflected more the choice of forward primer used in T-RFLP. Note, similarities are shown in this dendrogram, however, similar span of dissimilarities was obtained for both approaches.
To calculate all three types of similarity indices (distances here are 1-similarity value) and to explore the OTU data for the lines labeled unique, 0.03, 0.05 and 0.10 type the following commands:
mothur > read.dist(phylip=malna0mm.txt.dist, cutoff=0.20) mothur > cluster() mothur > read.otu(list= malna0mm.txt.fn.list, group= malna0mm.txt.groups, label=unique-0.03-0.05-0.10) mothur > tree.shared(calc= jest-thetayc-braycurtis)
It is worth to mention, that other properly formatted (e.g. phylip style) distance matrices of other data obtained by other programs can be analyzed, properly formatted groups files describing the environmental origin of data need to be supplied as well.
Hypothesis testing - Libshuff
Libshuff allows as to test whether the differences in OTU overlap, clustering according to Bray-Curtis (dis)similarity are in fact significant. The following commands will execute the libshuff:
mothur > read.dist(phylip=alna0mm.txt.dist, group=aa0mm.groups) mothur > libshuff()
This will generate two more output files:
Beware, as we performed multiple comparisons the Bonferroni correction should be invoked. Thus we will consider a comparison significant if its significance score is 0.05 divided by the number of comparisons(30) <0.00167. Here are the results:
Hypothesis testing - UniFrac
a) Unweighted UniFrac
The unique branch length of a tree that can be ascribed to each community serves as a distance measure. As such it can be used to test null hypothesis (Ho), namely communities do not differ significantly. This analysis can be invoked by typing in the command:
mothur > read.tree(tree=a0mm.tree, group=a0mm.groups) mothur > unifrac.unweighted(iters=10000)
The latter command guides mothur to perform a number of randomizations to increase the accuracy of the p-value.
b) Weighted UniFrac
In this instance, Weighted UniFrac compensates for the abundance of sequences. Thus the branch length of the tree node with many closely related sequences from one environment is weighted accordingly in compaison to a single sequence from a different enironment present in the same node.
The only difference from Unweighted UniFrac example above is:
mothur > read.tree(tree=a0mm.tree, group=a0mm.groups) mothur > unifrac.weighted(iters=10000)
The following output is in the .wsummary file:
Note on UniFrac: The information present in published clone libraries can be tested despite the drawback that sequences do not overlap. However, approach not relying on distance matrices for tree generation needs to be adopted. For example, we can use arb. As a drawback, the lack of overlap between sequences unfortunately prevents the use of the complete range of tools available in mothur. See the second example Diversity of cold soil microbial communities.