Sorry for the delayed response @Nicholas_Bokulich - was traveling to and from a conference. It was a great opportunity to talk a bit about the new tools in QIIME2 with other users, and now I’m back to thinking about my reference database construction. I really appreciate your comments above and have already employed the exclude-seqs argument with a bat-reference database (to filter out bat sequences). However, I’m now thinking about the arthropod-only database. This comes from BOLD, but I’ve been applying a few other filtering criteria for BOLD.
Perhaps this is a question QIIME already has an answer to: I’m wondering how to preserve as much taxonomic information as possible for a sequence that appears in redundant records. For instance, if I run qiime vsearch dereplicate-sequences (info here), it’s not clear to me how to retain the record (sequence + taxonomic info in header) with the most complete taxonomic information. Nor is it clear what would happen if two identical sequences carried equally complete but distinct taxonomies.
A post back in April on the developers forum here suggested that the --p-derep-prefix option can resolve issues with pseudo-replication of sequence variants, but it’s my understanding that this doesn’t address anything to do with taxonomic information.
Consider the information in the following two files:
refseqs.fasta
>seq1
AATTCCGG
>seq2
AATTCCGG
>seq3
TAGTAGTA
>seq4
TAGTAGTA
refseqs.txt
seq1 k__Animalia;p__Arthropoda;c__Insecta;o__Lepidoptera;f__;g__;s__
seq2 k__Animalia;p__Arthropoda;c__Insecta;o__Lepidoptera;f__Oecophoridae;g__Chezala;s__
seq3 k__Animalia;p__Arthropoda;c__Insecta;o__Coleoptera;f__Carabidae;g__Pterostichus;s__Pterostichus tristis
seq4 k__Animalia;p__Arthropoda;c__Insecta;o__Coleoptera;f__Carabidae;g__Heteropaussus;s__Heteropaussus hastatus
We can see from refseqs.fasta that there are two pairs of identical sequences (seq1 and seq2, as well as seq3 and seq4). If I were to dereplicate these data, in a perfect world, the outcome would be:
- seq1 and seq2 are collapsed into a single representative sequence. However, because seq2 contains more complete taxonomic information and no other identical sequence contains any conflicting information at equivalent levels, I’d preserve the full taxonomic identity of seq2.
  - My concern is that if the program simply retains the first taxonomic entry, then it will lose the potential information contained in redundant sequences.
- seq3 and seq4 contain identical sequences, and both contain full taxonomic records through to the species level. However, they disagree at both the species and genus ranks. In this case, I’d like the dereplicated record to be truncated to the rank at which they share a least common ancestor - the Family in this case.
Thus the desired output for a dereplicated fasta would be:
>derep-seq1
AATTCCGG
>derep-seq2
TAGTAGTA
and the associated dereplicated taxonomic file would be:
derep-seq1 k__Animalia;p__Arthropoda;c__Insecta;o__Lepidoptera;f__Oecophoridae;g__Chezala;s__
derep-seq2 k__Animalia;p__Arthropoda;c__Insecta;o__Coleoptera;f__Carabidae;g__;s__
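To make the behaviour I’m after concrete, here is a minimal Python sketch of the two rules applied to the example files. It is only an illustration, not something I know QIIME 2 or vsearch to provide: the function names (parse_fasta, parse_lineage, consensus_lineage, dereplicate) and the derep-seqN naming are my own assumptions, and it treats a rank as empty when nothing follows the prefix (e.g. f__).

```python
RANKS = ["k", "p", "c", "o", "f", "g", "s"]

FASTA = """>seq1
AATTCCGG
>seq2
AATTCCGG
>seq3
TAGTAGTA
>seq4
TAGTAGTA
"""

TAXONOMY = """seq1 k__Animalia;p__Arthropoda;c__Insecta;o__Lepidoptera;f__;g__;s__
seq2 k__Animalia;p__Arthropoda;c__Insecta;o__Lepidoptera;f__Oecophoridae;g__Chezala;s__
seq3 k__Animalia;p__Arthropoda;c__Insecta;o__Coleoptera;f__Carabidae;g__Pterostichus;s__Pterostichus tristis
seq4 k__Animalia;p__Arthropoda;c__Insecta;o__Coleoptera;f__Carabidae;g__Heteropaussus;s__Heteropaussus hastatus
"""


def parse_fasta(text):
    """Return a list of (id, sequence) tuples from FASTA-formatted text."""
    records, seq_id, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                records.append((seq_id, "".join(chunks)))
            seq_id, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if seq_id is not None:
        records.append((seq_id, "".join(chunks)))
    return records


def parse_lineage(lineage):
    """Turn 'k__A;p__B;...' into a list of names ordered as RANKS ('' if empty)."""
    names = {}
    for field in lineage.split(";"):
        prefix, _, name = field.strip().partition("__")
        names[prefix] = name.strip()
    return [names.get(rank, "") for rank in RANKS]


def consensus_lineage(lineages):
    """Fill gaps from more complete records; blank everything from the first conflict down."""
    merged, conflict = [], False
    for i, rank in enumerate(RANKS):
        observed = {lineage[i] for lineage in lineages if lineage[i]}
        if len(observed) > 1:
            conflict = True  # records disagree here, so stop at the parent rank
        name = observed.pop() if (len(observed) == 1 and not conflict) else ""
        merged.append(f"{rank}__{name}")
    return ";".join(merged)


def dereplicate(fasta_text, taxonomy_text):
    """Group identical sequences and give each group a consensus taxonomy."""
    taxonomy = {}
    for line in taxonomy_text.splitlines():
        if line.strip():
            # Split on the first whitespace only: species names contain spaces.
            seq_id, lineage = line.split(None, 1)
            taxonomy[seq_id] = parse_lineage(lineage)

    groups = {}  # sequence -> lineages of every record sharing that sequence
    for seq_id, seq in parse_fasta(fasta_text):
        groups.setdefault(seq.upper(), []).append(taxonomy[seq_id])

    derep_seqs, derep_tax = {}, {}
    for n, (seq, lineages) in enumerate(groups.items(), start=1):
        new_id = f"derep-seq{n}"  # hypothetical naming scheme
        derep_seqs[new_id] = seq
        derep_tax[new_id] = consensus_lineage(lineages)
    return derep_seqs, derep_tax


if __name__ == "__main__":
    seqs, tax = dereplicate(FASTA, TAXONOMY)
    for new_id in seqs:
        print(f">{new_id}\n{seqs[new_id]}")
    for new_id, lineage in tax.items():
        print(f"{new_id} {lineage}")
```

Note that in this sketch a disagreement at one rank also blanks every rank below it, so the genus conflict between seq3 and seq4 drops the species as well. It only matches exact full-length duplicates; prefix matches of the kind --p-derep-prefix deals with are not handled.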
Hopefully that makes sense. I’m sure you smart microbiologists have been thinking about these things already. Unfortunately the QIIME docs I’ve come across are sparse on considerations when building your own database - I can see why, after starting to have to deal with the web of complications I’m hitting now!
Thanks for your help,
Devon