Screen.seqs

From mothur

Jump to: navigation, search

The screen.seqs command enables you to keep sequences that fulfill certain user defined criteria. Furthermore, it enables you to cull those sequences not meeting the criteria from a names, group, or align.report file. As an example, we will use the example from the Sogin data analysis example.


Contents

Default settings

First, we can get a summary of the characteristics of the unaligned or aligned sequences:

mothur > summary.seqs(fasta=sogin.unique.align)

		Start	End	NBases	Ambigs	Polymer
Minimum:	109	222	40	0	2
2.5%-tile:	4648	4875	57	0	3
25%-tile:	4655	4928	60	0	3
Median: 	4655	4928	62	0	4
75%-tile:	4655	4928	65	0	4
97.5%-tile:	4655	4939	73	0	6
Maximum:	6793	6850	100	0	31
# of Seqs:	21907

You will be able to screen for each of these characteristics using the start, end, maxambig, maxhomop, minlength, and maxlength options. Any of these options can be combined in a single command. Problematic sequence names can also be removed from the name and group files with the name, and group options.


Options

start & end

You may have noticed that when you make an alignment there are some sequences that do not align in the same region as most of the sequences that you are analyzing. For example, in the Sogin dataset, there was at least one sequence whose alignment started at position 109 and a sequence that ended by position 222 while most sequences started by 4655 and ended by position 4928. The following command will remove sequences that start after position 4655:

mothur > screen.seqs(fasta=sogin.unique.align, start=4655)

To remove the sequences that end before position 4928, the following command will work:

mothur > screen.seqs(fasta=sogin.unique.align, end=4928)

Putting it all together the following command will remove those sequences that don't start by position 4655 and do end before position 4928:

mothur > screen.seqs(fasta=sogin.unique.align, start=4655, end=4928)
mothur > summary.seqs(fasta=sogin.unique.good.align)

		Start	End	NBases	Ambigs	Polymer
Minimum:	4615	4928	50	0	2
2.5%-tile:	4652	4928	57	0	3
25%-tile:	4655	4928	60	0	3
Median: 	4655	4928	62	0	4
75%-tile:	4655	4928	65	0	4
97.5%-tile:	4655	4938	72	0	6
Maximum:	4655	5014	100	0	31
# of Seqs:	20670

minlength & maxlength

In some studies an investigator may get a bunch of 5' sequences and some full length sequences, in others an investigator might want to make sure that pyrosequences are within some window for length as a quality control measure. As an example, the Sogin data set had 16S rRNA V6 pyrosequences ranging in length between 40 and 100 bp. To remove sequences shorter than 50 bp, the minlength option should be used:

mothur > screen.seqs(fasta=sogin.unique.align, minlength=50)

Alternatively, to remove sequences longer than 70 bp, the maxlength option should be used:

mothur > screen.seqs(fasta=sogin.unique.align, maxlength=70)

Putting it together to keep only those sequences that are between 50 and 70 bp:

mothur > screen.seqs(fasta=sogin.unique.align, minlength=50, maxlength=70)
mothur > summary.seqs(fasta=sogin.unique.good.align)

		Start	End	NBases	Ambigs	Polymer
Minimum:	109	222	50	0	2
2.5%-tile:	4648	4875	57	0	3
25%-tile:	4655	4928	60	0	3
Median: 	4655	4928	62	0	4
75%-tile:	4655	4928	64	0	4
97.5%-tile:	4655	4939	69	0	6
Maximum:	6785	6850	70	0	9
# of Seqs:	20160


maxambig

As part of their initial screening procedure, Sogin and colleagues removed any sequences with any ambiguous bases. You can do this with the maxambig option:

mothur > screen.seqs(fasta=sogin.unique.align, maxambig=0)


maxhomop

While we don't necessarily know the longest acceptable homopolymer for a 16S rRNA gene, the max length of 31 is clearly a sequencing artifact. If you are interested in removing sequences with excessively long homopolymers, then you should use the maxhomop option:

mothur > screen.seqs(fasta=sogin.unique.align, maxhomop=8)
mothur > summary.seqs(fasta=sogin.unique.good.align)

		Start	End	NBases	Ambigs	Polymer
Minimum:	109	222	40	0	2
2.5%-tile:	4648	4875	57	0	3
25%-tile:	4655	4928	60	0	3
Median: 	4655	4928	62	0	4
75%-tile:	4655	4928	65	0	4
97.5%-tile:	4655	4939	73	0	6
Maximum:	6793	6850	100	0	8
# of Seqs:	21896


group, name, alignreport

If you apply screen.seqs as we have in the previous sections and then attempt to run any downstream commands that need a name, group, align.report file, the files will be incompatible since those files will still contain information for sequences that you culled. To get around this you can use the group and name option:

mothur > screen.seqs(fasta=sogin.unique.align, start=4655, end=4928, maxhomop=6, name=sogin.names, group=sogin.groups, alignreport=sogin.unique.align.report)
mothur > summary.seqs(fasta=sogin.unique.good.align)

		Start	End	NBases	Ambigs	Polymer
Minimum:	4615	4928	50	0	2
2.5%-tile:	4652	4928	57	0	3
25%-tile:	4655	4928	60	0	3
Median: 	4655	4928	62	0	4
75%-tile:	4655	4928	65	0	4
97.5%-tile:	4655	4938	72	0	6
Maximum:	4655	5014	100	0	6
# of Seqs:	20530


Looking at the sogin.good.names files and sogin.good.unique.good.align.report files we see that there are now 20,530 entries representing a total of 216,243 sequences, which agrees with sogin.unique.good.align. Next, looking at sogin.good.groups we see that there are a total of 216,243 sequences, which also agrees.

processors

The processors parameter allows you to run the command with multiple processors. By default processors is 1, and use of multiple processors is not available for windows users. If you are using the mpi-enabled version, processors is set to the number of processes you have running.

mothur > screen.seqs(fasta=sogin.unique.align, start=4655, end=4928, maxhomop=6, name=sogin.names, group=sogin.groups, alignreport=sogin.unique.align.report, processors=2)
Personal tools