Summary.seqs
The summary.seqs command will summarize the quality of sequences in an unaligned or aligned fasta-formatted sequence file. The only parameter is fasta and it is required:
mothur > summary.seqs(fasta=amazon.fasta)

              Start   End     NBases  Ambigs  Polymer
Minimum:      1       422     422     0       4
2.5%-tile:    1       436     436     0       4
25%-tile:     1       507     507     1       5
Median:       1       530     530     3       5
75%-tile:     1       961     961     6       6
97.5%-tile:   1       973     973     15      8
Maximum:      1       978     978     20      9
# of Seqs:    98
For this unaligned fasta file, we see that all of the sequences started at position 1 (they're unaligned) and had between 422 and 978 bases in them. The median length was 530 bases. We can also see that at least 75% of the sequences had one or more ambiguous bases and that at least one sequence had 20. The final column indicates the length of the longest homopolymer in each sequence: for 95% of the sequences, the longest homopolymer was between 4 and 8 bases long.
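To make the rows of the table concrete, here is a minimal Python sketch of how such a percentile summary could be derived from a list of per-sequence values (for example, the NBases column). The function name `percentile_table` and the index rule (sort, then take the value at position `int(n * fraction)`) are assumptions made for illustration; mothur's own calculation may differ in detail.

```python
def percentile_table(values):
    """Summarize a list of numbers the way the table above is laid out.

    Assumption: each percentile is read off the sorted list at index
    int(n * fraction). This is an illustrative rule, not necessarily
    the one mothur uses.
    """
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    n = len(ordered)

    def at(fraction):
        # Clamp the index so it never runs past the last element.
        return ordered[min(n - 1, int(n * fraction))]

    return {
        "Minimum": ordered[0],
        "2.5%-tile": at(0.025),
        "25%-tile": at(0.25),
        "Median": at(0.5),
        "75%-tile": at(0.75),
        "97.5%-tile": at(0.975),
        "Maximum": ordered[-1],
    }


if __name__ == "__main__":
    # Hypothetical sequence lengths, not taken from amazon.fasta.
    for label, value in percentile_table([422, 436, 507, 530, 961, 973, 978]).items():
        print(f"{label}\t{value}")
```

Reading the 2.5%-tile and 97.5%-tile rows together gives the range that covers the central 95% of the sequences, which is how the homopolymer statement above is obtained.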
An output file (e.g. amazon.fasta.summary) lists each of these values for each sequence. For example:
seqname   start   end   nbases   ambigs   polymer
U68589    1       943   943      10       5
U68590    1       497   497      0        6
U68591    1       930   930      1        4
...
So we see that sequence U68589 was 943 bases long, had 10 ambiguous bases in it and the longest homopolymer in the sequence was 5 bases long.
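The per-sequence columns can be illustrated with a short Python sketch. It rests on several assumptions that are not stated on this page: that `.` and `-` are the gap characters, that any character other than A, C, G, T or U counts as ambiguous, and that the homopolymer is measured on the sequence with gaps removed. The function name `summarize_sequence` is hypothetical, and the code is not mothur's implementation.

```python
GAP_CHARS = {".", "-"}          # assumption: these mark alignment gaps
UNAMBIGUOUS = set("ACGTU")      # assumption: everything else is ambiguous


def summarize_sequence(seq):
    """Return start, end, nbases, ambigs and polymer for one sequence."""
    # 1-based alignment positions that hold a base rather than a gap.
    positions = [i for i, ch in enumerate(seq, start=1) if ch not in GAP_CHARS]
    if not positions:
        return {"start": 0, "end": 0, "nbases": 0, "ambigs": 0, "polymer": 0}

    bases = [ch.upper() for ch in seq if ch not in GAP_CHARS]
    ambigs = sum(1 for b in bases if b not in UNAMBIGUOUS)

    # Longest run of one repeated character in the gap-free sequence.
    longest = run = 1
    for prev, cur in zip(bases, bases[1:]):
        run = run + 1 if cur == prev else 1
        longest = max(longest, run)

    return {
        "start": positions[0],
        "end": positions[-1],
        "nbases": len(bases),
        "ambigs": ambigs,
        "polymer": longest,
    }


if __name__ == "__main__":
    print(summarize_sequence("ACGTTTTN"))       # unaligned example
    print(summarize_sequence("..AAAT-NGG.."))   # aligned example with gaps
```

For an unaligned sequence the start is always 1 and the end equals the number of bases, which matches the pattern in the example output above.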
If we instead analyze an aligned sequence file, such as the greengenes alignment database, we get:
mothur > summary.seqs(fasta=core_set_aligned.imputed.fasta)

              Start   End     NBases  Ambigs  Polymer
Minimum:      69      6849    1423    0       4
2.5%-tile:    97      6849    1469    0       5
25%-tile:     109     6849    1507    0       5
Median:       109     6849    1524    0       6
75%-tile:     109     6849    1538    0       6
97.5%-tile:   109     6857    1563    3       8
Maximum:      109     6885    1609    30      11
Now we see that all of the sequences are at least 1,423 bases long, that very few have any ambiguous bases, and that most sequences start at position 109 and end at position 6,849. These data can be useful when using the screen.seqs command to remove sequences that don't overlap or that have features indicating poor quality.
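Before running screen.seqs, it can help to see which sequences a given set of cutoffs would affect. The following Python sketch reads lines in the layout of the summary file shown earlier and lists the sequences that fall outside some thresholds. It assumes the file is whitespace-delimited with a header row beginning with seqname; the function names and threshold values are hypothetical, and the sketch does not reproduce the behavior of screen.seqs itself.

```python
def read_summary(lines):
    """Parse summary lines into a list of dicts keyed by the header names."""
    rows = []
    header = None
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if header is None:
            header = fields          # first non-empty line names the columns
            continue
        if len(fields) != len(header):
            continue                 # skip truncated or malformed lines
        row = {"seqname": fields[0]}
        for name, value in zip(header[1:], fields[1:]):
            row[name] = int(value)
        rows.append(row)
    return rows


def flag_sequences(rows, max_start, min_end, max_ambigs):
    """Names of sequences that start too late, end too early,
    or carry more ambiguous bases than allowed."""
    return [
        r["seqname"]
        for r in rows
        if r["start"] > max_start or r["end"] < min_end or r["ambigs"] > max_ambigs
    ]


if __name__ == "__main__":
    # Assumes amazon.fasta.summary exists in the working directory.
    with open("amazon.fasta.summary") as handle:
        rows = read_summary(handle)
    # Illustrative thresholds only; choose values from your own summary table.
    print(flag_sequences(rows, max_start=1, min_end=500, max_ambigs=5))
```

Choosing the cutoffs from the percentile table (for instance, the start and end positions shared by most sequences) keeps the bulk of the data while dropping the outliers.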