Screen.seqs
From mothur
The screen.seqs command enables you to keep sequences that fulfill certain user defined criteria. Furthermore, it enables you to cull those sequences not meeting the criteria from a names, group, or align.report file. As an example, we will use the example from the Sogin data analysis example.
Contents |
Default settings
First, we can get a summary of the characteristics of the unaligned or aligned sequences:
mothur > summary.seqs(fasta=sogin.unique.align) Start End NBases Ambigs Polymer Minimum: 109 222 40 0 2 2.5%-tile: 4648 4875 57 0 3 25%-tile: 4655 4928 60 0 3 Median: 4655 4928 62 0 4 75%-tile: 4655 4928 65 0 4 97.5%-tile: 4655 4939 73 0 6 Maximum: 6793 6850 100 0 31 # of Seqs: 21907
You will be able to screen for each of these characteristics using the start, end, maxambig, maxhomop, minlength, and maxlength options. Any of these options can be combined in a single command. Problematic sequence names can also be removed from the name and group files with the name, and group options.
Options
start & end
You may have noticed that when you make an alignment there are some sequences that do not align in the same region as most of the sequences that you are analyzing. For example, in the Sogin dataset, there was at least one sequence whose alignment started at position 109 and a sequence that ended by position 222 while most sequences started by 4655 and ended by position 4928. The following command will remove sequences that start after position 4655:
mothur > screen.seqs(fasta=sogin.unique.align, start=4655)
To remove the sequences that end before position 4928, the following command will work:
mothur > screen.seqs(fasta=sogin.unique.align, end=4928)
Putting it all together the following command will remove those sequences that don't start by position 4655 and do end before position 4928:
mothur > screen.seqs(fasta=sogin.unique.align, start=4655, end=4928) mothur > summary.seqs(fasta=sogin.unique.good.align) Start End NBases Ambigs Polymer Minimum: 4615 4928 50 0 2 2.5%-tile: 4652 4928 57 0 3 25%-tile: 4655 4928 60 0 3 Median: 4655 4928 62 0 4 75%-tile: 4655 4928 65 0 4 97.5%-tile: 4655 4938 72 0 6 Maximum: 4655 5014 100 0 31 # of Seqs: 20670
minlength & maxlength
In some studies an investigator may get a bunch of 5' sequences and some full length sequences, in others an investigator might want to make sure that pyrosequences are within some window for length as a quality control measure. As an example, the Sogin data set had 16S rRNA V6 pyrosequences ranging in length between 40 and 100 bp. To remove sequences shorter than 50 bp, the minlength option should be used:
mothur > screen.seqs(fasta=sogin.unique.align, minlength=50)
Alternatively, to remove sequences longer than 70 bp, the maxlength option should be used:
mothur > screen.seqs(fasta=sogin.unique.align, maxlength=70)
Putting it together to keep only those sequences that are between 50 and 70 bp:
mothur > screen.seqs(fasta=sogin.unique.align, minlength=50, maxlength=70) mothur > summary.seqs(fasta=sogin.unique.good.align) Start End NBases Ambigs Polymer Minimum: 109 222 50 0 2 2.5%-tile: 4648 4875 57 0 3 25%-tile: 4655 4928 60 0 3 Median: 4655 4928 62 0 4 75%-tile: 4655 4928 64 0 4 97.5%-tile: 4655 4939 69 0 6 Maximum: 6785 6850 70 0 9 # of Seqs: 20160
maxambig
As part of their initial screening procedure, Sogin and colleagues removed any sequences with any ambiguous bases. You can do this with the maxambig option:
mothur > screen.seqs(fasta=sogin.unique.align, maxambig=0)
maxhomop
While we don't necessarily know the longest acceptable homopolymer for a 16S rRNA gene, the max length of 31 is clearly a sequencing artifact. If you are interested in removing sequences with excessively long homopolymers, then you should use the maxhomop option:
mothur > screen.seqs(fasta=sogin.unique.align, maxhomop=8) mothur > summary.seqs(fasta=sogin.unique.good.align) Start End NBases Ambigs Polymer Minimum: 109 222 40 0 2 2.5%-tile: 4648 4875 57 0 3 25%-tile: 4655 4928 60 0 3 Median: 4655 4928 62 0 4 75%-tile: 4655 4928 65 0 4 97.5%-tile: 4655 4939 73 0 6 Maximum: 6793 6850 100 0 8 # of Seqs: 21896
group, name, alignreport
If you apply screen.seqs as we have in the previous sections and then attempt to run any downstream commands that need a name, group, align.report file, the files will be incompatible since those files will still contain information for sequences that you culled. To get around this you can use the group and name option:
mothur > screen.seqs(fasta=sogin.unique.align, start=4655, end=4928, maxhomop=6, name=sogin.names, group=sogin.groups, alignreport=sogin.unique.align.report) mothur > summary.seqs(fasta=sogin.unique.good.align) Start End NBases Ambigs Polymer Minimum: 4615 4928 50 0 2 2.5%-tile: 4652 4928 57 0 3 25%-tile: 4655 4928 60 0 3 Median: 4655 4928 62 0 4 75%-tile: 4655 4928 65 0 4 97.5%-tile: 4655 4938 72 0 6 Maximum: 4655 5014 100 0 6 # of Seqs: 20530
Looking at the sogin.good.names files and sogin.good.unique.good.align.report files we see that there are now 20,530 entries representing a total of 216,243 sequences, which agrees with sogin.unique.good.align. Next, looking at sogin.good.groups we see that there are a total of 216,243 sequences, which also agrees.
processors
The processors parameter allows you to run the command with multiple processors. By default processors is 1, and use of multiple processors is not available for windows users. If you are using the mpi-enabled version, processors is set to the number of processes you have running.
mothur > screen.seqs(fasta=sogin.unique.align, start=4655, end=4928, maxhomop=6, name=sogin.names, group=sogin.groups, alignreport=sogin.unique.align.report, processors=2)