We will be offering an R workshop December 18-20, 2019. Learn more.

Pyrosequencing of functional genes - basic data processing, comparing alignment strategies, and integrating mothur output with ARB

From mothur
Jump to: navigation, search


Here is an example of some ways to process and analyze data from pyrosequencing of functional gene amplicons. This tutorial walks through some basic steps in the analysis pipeline focusing on common data processing steps, selecting a proper alignment strategy, and integrating the output from mothur into ARB. This tutorial also provides some software tools (perl and bioperl scripts) that may be generically useful to many in the community.

Because multiple sequence alignment is such a critical step in any phylogenetic analysis, the 1st part of this page focuses on a comparison of different strategies to obtain an accurate alignment. To do this, I generated a training dataset by pyrosequencing amplicons produced via cell-suspension PCR from an axenic bacterial colony growing on solid medium. The starting point for pyrosequencing data analysis is a fasta file with loads of sequences, in this case 39,791. All of the data shown here reflect sequences from the beta-subunit of the sulfite-reductase (dsr) gene with a set of PCR primers I designed to target the genus Desulfobulbus in the class delta-Proteobacteria.

After empirically choosing a sound alignment strategy, the 2nd part of this page shows several ways to efficiently get pyrosequencing data (and just as important, metadata) into ARB. ARB has lots of tool to perform simple to complex queries of data to select subsets of sequences to build trees or highlight particular groups, but in my experience, these tools are under-utilized because of the difficulties faced by most users of manually entering data to be associated with their sequences. Here I show an easy way to powerfully integrate mothur and ARB.

Initial Cleaning and Processing of Data

The first thing to do upon receiving a batch of pyrosequencing data is to clean up the sequence names by removing the long header put into the fasta file and then giving the sequences some unique names that have some meaning and can be used for later analyses. This 'fasta_substring_renamer' script just strips out characters based on their position. So, for example, a batch of sequences typically might have headers like this:

>FUAROQY02D9NMW rank=0000723 x=1633.5 y=342.0 length=389

First, run the file through the script keeping characters in positions 0 to 14 to get this:


Next, we run the file through the 'fasta_renamer' to replace 'FUAROQY02' which is common to all seqs with something more meaningful. In the case of our training dataset we now get a fasta file with headers that convey some meaning in a systematic way and are each unique:


Much better, but so far all we have done is rename our sequences. This is actually not a trivial step as adding meaningful (and standardized) names is very helpful for downstream analyses.

In any event, the next step is to do some preliminary screening of the data using a number of automated steps. In brief, this training data set was cleaned by first screening sequences for both forward and reverse primer motifs and also excluding any sequences with ambiguous base calls. This left 23714 out of 391791 sequences. Next, because these represent a protein-coding gene, I could do a further quality screen by translating each sequence in all three forward frames to meet the following criteria:

  • Presence of a highly conserved 3 AA motif >120 AA from start of sequence
  • <1 unknown AA ('X)
  • No stop codons

19202 sequences passed this screen and are now ready for Mothur. We'll call the file 'dsr_full_training_set.fas'.

Comparison of alignment strategies with common alignment algorithms and mothur

For 16S data, alignments can be done using a high-quality representative database like Silva or GreenGenes, but for functional genes, you will have to do some in-house aligning even if you have a good seed database to start with because you will almost surely recover a number of new sequence types that can't be recruited properly against a pre-existing database. So, how to align your many thousand pyrosequences?? Multiple-sequence alignments quickly become computationally intractable (Table 1), so some alternative will have to be found.

Table 1. Comparison of running times (hh:mm:ss) of popular multiple-sequence alignment (MSA) programs. Each program was run with default parameters using simple benchmark datasets comprising nucleotide sequences generated from in silico random point mutations (including indels) of a single 380 bp template sequence. Sequence variants were randomly distributed between 0 and 10% divergence from the template sequence. Blank cells indicate processing failure; tests were conducted on a 2Ghz Pentium CPU w/ 3GB RAM.

Since Muscle, kalign and MAFFT were the only programs left standing in our benchmark tests, we will compare them with our training data set. First, let's use mothur to reduce the data set to a more manageable size.

mothur's 'unique.seqs' command can reduce our training data set to unique sequences, resulting in a set of 5064 sequences:

mother > unique.seqs(fasta=dsr_full_training_set.fas)

Unfortunately, even with this reduced data set, Muscle runs out of memory when run with the default parameters, so we are forced to use the fastest (and least accurate) parameters:

~%muscle -in infile.fas -out outfile.fas -maxiters 1 -diags

I also used the 'maxmb 2000' flag to maximize the memory allocation and set '-gapopen -500 -gapextend -500' based on previous optimizations. We can just eyeball this alignment and see that it looks pretty good, but does contain a lot of gaps; the alignment stretches to 450 bp despite a mean (and median) length of 380 bp in the unaligned file. Remember, these are sequences from a clonal isolate, so it is likely that Muscle has sacrificed positional homology and introduced too many gaps into the alignment.

I have tried two approaches to solve the problem of Muscle and large datasets, neither very successful.

  • The 1st was to use a colleague's computer with 16 GB of RAM. This took 66 hrs to align a similar dataset and the alignment was terrible.
  • The 2nd was to invoke Muscle via a script which splits a large file into a user-defined number of sequences (500 was best) and aligns these to each other, then to a reference sequence using the -profile command, then aligns the next batch of 500, appends these, and so on. This greatly improved processing time (ca. 1 hr vs. 66 hrs), and the alignment was better, but still obviously suboptimal, particularly the alignment of one batch of sequences to another.

MAFFT does a bit better and can align all 5064 sequences with default parameters in about 2.5 minutes:

~%mafft infile.fas > outfile.fas

The alignment from Kalign looked absolutely terrible and so I discarded it from further consideration.

Next, let's compare the Muscle and MAFTT alignments a little more rigorously to be sure we have the best alignment strategy to apply to a real dataset.

Assessment of alignment quality

We can assess the accuracy of our multiple-sequence alignment as follows. First, we need to know what the 'true' alignment of each sequence should be. To calculate this, I implemented the dpAlign pairwise alignment algorithm to perform a high-quality global pairwise comparison of each of the 5064 sequences to the Sanger reference sequence of the isolate that was pyrosequenced. I then made another script which took the Muscle and MAFFT alignments in turn and compared each sequence to the reference sequence as aligned in that particular alignment to produce a file that looked something like this:

Seq_ID		Pairwise_ident		Muscle_ident		MAFFT_ident
33_3CO2XW	0.26			0.25			0.23
33_3CO1DJ	0.53			0.5			0.46

Because this training data set represents pyrosequences from a clonal isolate, we can take the pairwise identity value as a measure of sequencing errors, and the absolute difference between the pairwise identity and the alignment identities as a measure of the alignment accuracy. For these two alignment algorithms, it looks like Muscle did a slightly better job with our 5064 sequences (Figure 1), but still, we would like to improve on its performance.

Figure 1. Comparison of Muscle and MAFFT alignments. Dashed line shows mean value.

Generating an accurate seed alignment from scratch combining Muscle and mothur

Let's continue with this scenario in which a user might want to make their own seed alignment from scratch (e.g. no other well-aligned set of sequences are available for your sequences), but now see if we can improve on our alignment by using mothur. Because accurate MSA algorithms are inherently computationally expensive, one solution is to use an iterative approach to reduce our data to a representative set of sequences that can be more accurately aligned to be used as a template alignment for the whole database using mothur.

So, we will first take our imperfect Muscle alignment of the 5064 unique sequences (a known unknown if you like) and proceed with the typical mothur steps at this point, creating a distance matrix, reading the matrix with the original name file, and then clustering:

mothur > dist.seqs(fasta=muscle_outfile.fas, cutoff=0.10)
mothur > read.dist(column=muscle_outfile.dist, name=dsr_full_training_set.names)
mothur > cluster()

mothur has to think about that for a few minutes, but soon returns its three output files: '*.list', '*.sabund'. and '*.rabund'. The 1st two columns of the '*.list' file look like this:

unique	5064
0	3887
0.01	438
0.02	39
0.03	8
0.04	3
0.05	1

Now we will use the handy 'get.oturep' command to get a list of representative sequences for each OTU:

mothur > get.oturep(fasta=muscle_outfile.fas,

We'll choose the 0.01 cutoff as both a conservative OTU definition and a feasible number (438) of sequences to work with. So, next, we will align these 438 sequences with Muscle using more accurate parameters and then use that as a seed alignment for the mothur aligner. So, in muscle:

~%muscle -in infile -out outfile -gapopen -500 -gapextend -500

This takes literally just 1 minute, and now our alignment of these representative sequences looks to be quite accurate and a more reasonable 386 bp (versus 450 bp above). Now we are ready to use this as a seed alignment for the mothur aligner. We'll name it 'dsr_seed_aln.fasta' and align the deconvoluted training set file against it:

mothur > align.seqs(candidate=dsr_full_training_set.unique.fasta, template=dsr_seed_aln.fasta, 
ksize=9, align=needleman, gapopen=-1)

That actually crashed mothur, producing the error 'Standard Error: basic_string::substr has occurred in the AlignCommand class Function driver.' After some fiddling around, appending either a '.' or a '-' character to the ends of the alignment solved the problem. Ok, now we get our output:

Reading in the dsr_seed_aln.fas template sequences...  DONE.
Generating the dsr_seed_aln.9mer database...   DONE.

It took 46 seconds to align 5064 sequences

So it looks like we have now successfully used our seed alignment as a template to align the whole dataset. How does the mothur alignment compare to that produced by Muscle and MAFFT shown above? Favorably (Figure 2).

Figure 2. Comparison of mothur, Muscle, and MAFFT alignments. Dashed line shows mean value.

So, bottom line: the mothur aligner was faster and better than Muscle or MAFFT implemented naively, with a couple important caveats. I also tried the mothur aligner with a template of just the Sanger reference sequence by itself gapped according to an accurate protein alignment. It performed terribly, mostly because sequences with high sequencing error rates aligned poorly. So, for now at least, the take home message is that the mothur aligner works quite well, but the template alignment should contain a set of properly aligned sequences that are representative of the experimental dataset.

Translating sequences and protein alignments

An obvious question is, why not just translate the sequences, align the amino acids and either deal with that data directly or transfer that alignment back to the nucleotides? I have played around with this quite a bit with both the training data set here and real data from environmental samples, and so far have concluded that unless very rigorous data screening and quality checks are performed this can be a very risky approach because of the potential for frameshift errors (insertions or deletions due to pyrosequencing errors). If you choose to go this route, I would recommend screening for multiple known motifs at both the 5' and 3' ends of your sequences; each gene and each set of taxa will need custom-tailored approaches.

In any event, once you have properly cleaned and aligned your massive pyrosequencing dataset, you will probably want to know about our final topic:

Getting a set of representative sequences and metadata into ARB

Since the basic implementations of mothur to cluster sequences and compare samples are covered pretty well elsewhere in this tutorial section, the final thing I will do here is offer my own version of an 'iPhone App' for mothur - a quick and easy way (and free!) to get any data associated with your sequences into an ARB database.

Although ARB can handle large amounts of sequence data, in my experience the aligner and PT-server in particular are poorly suited to large numbers of sequences. Certainly building robust trees with thousands of sequences is difficult and probably not worth doing anyways since they can't be properly visualized without grouping the data. As a solution, it works much better to define sequences representative of your OTU classifications and build a data table containing metadata of interest to you to import into ARB. ARB is quite well-built as a database and has powerful querying and selection tools to help sorting through data, but for most users the bottleneck comes in entering the metadata to be associated with your sequences that can be queried. Here is an automated solution:

The first step is to define representative sequences to start building a data table. The 'get.oturep' command in mothur is quite good for this, or the 'mothur_conversion' script will pull a user-defined number of representative sequences out of the '*.list' file. The script will give you a table something like this:

Seq_ID		OTU_cutoff	OTU	OTU_membership
S1_ABC		0.03		1	27
S1_XYZ		0.03		2	9

Next, the 'split_fasta_file_based_on_a_list' script will take the sequence IDs from that text file as input and pull them out of the original fasta file to create a new fasta file containing just the representative sequences. The 'get.oturep' command in mothur also automatically generates fasta files for particular OTU cutoffs and is definitely more convenient when using mothur, but I have linked this script as it is also quite useful generally to split a fasta file based on a list of sequence ID's.

The next step may be the most important. Once you have built a data table like the one above with representive sequences (adding environmental data, patient data, etc., etc.) for each sample as appropriate, the 'ARB_format_conversion' script will reformat any data table (i.e. spreadsheet) into a form that can be imported into ARB and will simultaneously generate the appropriate custom-import filter for ARB. All the required info on formatting the data table, etc. is given when you run the script. In a nutshell:

  • Each column in your spreadsheet must have a column heading. This is what the field will be called in your ARB database.
  • The first column must contain the sequence names as a unique identifier. ARB doesn't like duplicates. When prompted in ARB, choose the option 'Use found names' since we already went through the trouble above to produce meaningful and unique names for each sequence. This way you will avoid the nonsensical unique name field imposed by ARB.

Basically, that's it. You can have as many fields as you like and easily build the table as a spreadsheet, instead of manually in ARB which would of course render you insane and create lots of errors even if you tried. The sequence data itself can either be directly in your speadsheet, or you can put dummy data in (e.g. 'accg') and then merge your new ARB database with an existing database as long as the name fields match up properly.

The script will give you two files - 1) your data table reformatted ready to be imported into ARB using your 2) new custom import filter. The import filter needs to be placed in the 'import' directory of your ARB installation which is usually at 'usr/arb/lib/import'. ARB can now be used visualize any of the metadata you add in a tree and to perform simple but powerful queries to identify for example, only singleton OTUs, OTUs in the top decile of rank abundance, sequences from northerly latitudes, sick patients, etc.

Hopefully this tutorial will assist in the processing and analysis of the large volumes of data generated by pyrosequencing and integrating these data into ARB. Feel free to contact me with any questions at b.b.c.oakley at warwick.ac.uk

last edit: Boakley 07:03, 23 June 2009 (EDT)