Reference Alignments: Best Way to Download Many Sequences from NCBI GenBank

Nicholas_Bokulich · June 8, 2021, 6:32am

Welcome to the forum @alexkrohn !

Sounds like you are on the right track with RESCRIPt!

I'd go for the latter. Looping 400 times would be slow, and merging the results would be slow too. Your provenance graph would also look really ugly!

Probably not faster.

merge-seqs is correct, but merge will merge feature tables, not FeatureData[Taxonomy] artifacts. There is a separate merge-taxa action (one in RESCRIPt with more complex functionality, one in q2-taxa that has simpler functionality).

Depending on your use case, this might be the best approach overall (and RESCRIPt can be used to download the entire query), minus the filtering part (if you must filter, I recommend the restricted query approach above). For most applications you would want a complete database, not only the 400 taxa of interest, so that you can get more reliable estimates of confidence for taxonomic classification, clustering, etc, and avoid issues with misclassification. Also I realize that you are focusing on animals, but it helps to include in the database any non-targets that are amplified by the same primers.
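If you do go the restricted query route, the query string for a list of species could be assembled along these lines. This is only a rough sketch: `build_organism_query` and `marker_clause` are hypothetical names made up for illustration, and it assumes the standard Entrez `[Organism]` field tag with terms joined by `OR`. The resulting string is what you would hand to RESCRIPt's `get-ncbi-data` as its query.

```python
def build_organism_query(species_names, marker_clause=None):
    """Join species names into one Entrez-style query string.

    species_names: iterable of binomial names, e.g. "Rana draytonii".
    marker_clause: optional extra search term (hypothetical parameter),
                   ANDed onto the organism list if given.
    """
    # Quote each name and tag it with the Entrez [Organism] field.
    terms = [
        f'"{name.strip()}"[Organism]'
        for name in species_names
        if name.strip()  # skip blank entries
    ]
    if not terms:
        raise ValueError("no species names supplied")

    # One big OR group, so any of the listed organisms matches.
    query = "(" + " OR ".join(terms) + ")"
    if marker_clause:
        query = f"{query} AND ({marker_clause})"
    return query


if __name__ == "__main__":
    print(build_organism_query(["Rana draytonii", "Bufo boreas"]))
```

Building one query this way keeps everything in a single download, so there is nothing to loop over or merge afterwards.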

On the other hand, I also realize that for eDNA surveys it can be useful to take geographic range information into account. Still, building a really restricted database always makes me nervous, since it makes the bold assumption that only those 400 species exist in your geographic range. I have been curious about using geographic species distribution data with q2-clawback to improve taxonomic classification with COI and other eDNA markers... the question is where/how to get species frequency information for those markers. You could put together this frequency information artificially: give all species you expect to find a 1 and all other species in the database a 0, then convert to relative frequency and pass that table to q2-clawback for training your classifier. The disadvantage is that this would be an arduous process. The advantage is that you would shed the assumption that only those 400 species can be observed. Instead, you would be saying that these 400 species are much, much more likely to be observed, while keeping an open mind to the strange things that happen in the world. In other words, your classifier would not be prone to misclassification or other issues.
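As a rough sketch of the artificial weighting idea (1 for expected species, 0 for everything else, then converted to relative frequency), something like the following could work. The function name and inputs are hypothetical, and importing the result into QIIME 2 as a table for q2-clawback is not shown here.

```python
def expected_species_weights(all_species, expected_species):
    """Return a dict mapping each database species to a relative frequency.

    Species in expected_species get a raw weight of 1, all others get 0;
    the raw weights are then divided by their sum so they total 1.
    """
    expected = set(expected_species)

    # Raw 1/0 weights for every species in the database.
    raw = {sp: 1.0 if sp in expected else 0.0 for sp in all_species}

    total = sum(raw.values())
    if total == 0:
        raise ValueError("none of the expected species are in the database")

    # Convert raw weights to relative frequencies.
    return {sp: weight / total for sp, weight in raw.items()}


if __name__ == "__main__":
    database = ["Species A", "Species B", "Species C", "Species D"]
    local = ["Species A", "Species C"]
    for species, freq in expected_species_weights(database, local).items():
        print(species, freq)
```

With two expected species out of four, each expected species ends up with a relative frequency of 0.5 and the rest with 0.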

To save you some time, here are some tutorials using RESCRIPt for SILVA, NCBI, and for COI genes. The SILVA and COI tutorials also link to downloads for the pre-compiled sequences so that you can save yourself a few days in compiling (and in the case of COI, aligning) the entire databases.

Good luck!