Chimera.bellerophon

(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)

Use Bellerophon approach to create a sorted priority score list of potentially chimeric sequences.

Algorithm

1. If filter==true, apply a 50% soft filter and generate a filter.align file [this may require pulling filter methods into their own classes]; otherwise no filtering (the Bellerophon server does 50% filtering)
2. Read sequences into ram as a vector of Sequence objects.
3. Find the average midpoint of all sequences in the alignment.
4. Define “left” as positions [1-midpoint] and right as [midpoint-end]
5. Generate a “Preference” structure with string (sequence name) and float (preference score) entries
6. Generate “vector<Preference> preferences(n)” where n corresponds to the number of sequences and the sequence name is stored with each accession.
7. Calculate (use preference instead of col):
$col\left[i\right]=\sum_j^{N}\left|dm^{left}\left[i\right]\left[j\right]-dm^{right}\left[i\right]\left[j\right]\right|$
Where i is the sequence you are on and j is all the other sequences. If correction=T, then dm=sqrt(distance); if correction=F, then don’t transform the distances. The distance calculator should be “eachgap”. This step should be parallelized.
8. Sum across all preferences[i] to get dme
9. Recalculate each preferences[i] value as:
${preference}\left[i\right]=\frac{dme}{dme-2 * col\left[i\right]}$
10. Sort the preferences values from high to low.
11. Output the sorted list to *.chimera as well as the accession id for the closest sequence on the left and right
12. Output to the screen:
• average number of letters on either side of midpoint
• number of sequences with a preference score above 1.000
• min, 2.5 percentile, 25 percentile, median, 75 percentile, 97.5 percentile and max preference scores (see output of summary.seqs)
• sequence with a preference score above the 95 percentile are reported as chimeric.

Default settings

The only required parameter is fasta. The default settings for chimera.bellerophon are filter=F, window=1/4 length of seq, increment=25, correction=T, processors=1.

mothur > chimera.bellerophon(fasta=ex.align)

The output to the screen looks like:

Reading sequences from ex.align...Done.
Processing sliding window: 10
Processing sliding window: 20
Processing sliding window: 30
Processing sliding window: 40
Processing sliding window: 50
Processing sliding window: 60
Processing sliding window: 70
Processing sliding window: 80
Processing sliding window: 90
Processing sliding window: 100
Processing sliding window: 110
Processing sliding window: 120
Processing sliding window: 130
Processing sliding window: 140
Processing sliding window: 150
Processing sliding window: 153
gi|11093939|MNB2|AF293011 is a suspected chimera at breakpoint 2195
It's score is 1.31537 with suspected left parent gi|11093938|MNC2|AF293010 and right parent gi|11093938|MNC2|AF293010
Sequence with preference score above 1.31537: 1
Minimum:	0.635111
2.5%-tile:	0.635111
25%-tile:	0.866302
Median: 	1.00136
75%-tile:	1.13515
97.5%-tile:	1.31537
Maximum:	1.31537

Opening ex.bellerophon.chimera you would see:

Name	Score	Left	Right
gi|11093939|MNB2|AF293011	1.31537	gi|11093938|MNC2|AF293010	gi|11093938|MNC2|AF293010
gi|11093930|MNH4|AF293002	1.29437	gi|11093925|MNG7|AF292997	gi|11093927|MND8|AF292999
gi|11093937|MNF2|AF293009	1.23542	gi|11093929|MNC12|AF293001	gi|11093927|MND8|AF292999
gi|11093926|MNH2|AF292998	1.21045	gi|11093924|MNF4|AF292996	gi|11093925|MNG7|AF292997
...

Options

filter

By default the filter parameter is set to false, but if you set it to true a 50% soft filter will be applied.

mothur > chimera.seqs(fasta=ex.align, filter=t)

window

By default the window parameter is set to 1/4 length of seq, but if you set it up to half the sequence length.

mothur > chimera.seqs(fasta=ex.align, window=200)

increment

The increment parameter determines how far the 2 windows "slide" each iteration. By default the increment parameter is 25, but if you may set it up to sequence length minus twice the window.

mothur > chimera.seqs(fasta=ex.align, method=bellerophon, increment=100)

correction

By default the correction parameter is set to true, meaning the the square root of the distances is used instead of the distance value.

mothur > chimera.seqs(fasta=ex.align, method=bellerophon, correction=f)