Reference Alignments: Best Way to Download Many Sequences from NCBI GenBank

alexkrohn · June 7, 2021, 9:40pm

TL;DR
What's the best way to create a custom reference database that downloads reference sequences from GenBank for over 400 specified taxa, and four genes of interest, then format them for use in Qiime2? How can the plan below be improved?

Background:
I'm conducting an eDNA survey where we want to determine the presence of ~400 animal taxa in many different watersheds. We have primers for 16s, 12s, COI, and cytb that are known to work for the taxa we're interested in, and know that at least one of those genes has been sequenced in each taxon and is on GenBank. I'm new to metabarcoding and Qiime2, so I wanted to make sure my plan was a good idea before going too far down a rabbit hole.

Plan
I figured I would analyze the genes separately -- e.g. one round of Qiime2 for each gene, and one reference database for each gene.

To create the custom reference database for each gene, I wanted to create a script (probably in R at first, but eventually in Snakemake) that would loop/apply RESCRIPt over a list of our target taxa. E.g. for COI sequences from Anas fulvigula, the script would run RESCRIPt with --p-query 'Anas fulvigula[Organism] AND COI[Gene Name]', output the Feature[Taxonomy] and Feature[Sequence] files, then move on to the next organism. When the script finishes creating all the .qza files, it would concatenate them together to create one giant reference database for all sequences of COI in the taxa of interest. After that I would continue filtering, culling, de-replicating etc. to then train the classifier.

Questions

I see myself using a for loop/apply function here and know there's probably a better way. Would it be better/faster to just run one giant search for each taxon separated by OR in RESCRIPt? Or faster to just use entrez, then format the sequence/taxonomy data to import into Qiime2? Or something else entirely?
It seems like merge and merge-seqs will be the best tools to concatenate the Feature[Sequence] and Feature[Taxonomy] files together. Is that right?
Or, am I missing something entirely, and it might just be best to download all animal sequences for the gene of interest (e.g. one for COI, one for 16S etc.), then filter, and align accordingly? Seems pretty maniacal , but certainly simpler!

Thanks for your help!

Nicholas_Bokulich · June 8, 2021, 6:32am

Welcome to the forum @alexkrohn !

Sounds like you are on the right track with RESCRIPt!

I'd go for the latter. Would be slow to loop 400 times, also slow to merge. Also your provenenance graph would look really ugly!

Probably not faster.

merge-seqs is correct, but merge will merge feature tables, not FeatureData[Taxonomy] artifacts. There is a separate merge-taxa action (one in RESCRIPt with more complex functionality, one in q2-taxa that has simpler functionality).

Depending on your use case, this might be the best approach overall (and RESCRIPt can be used to download the entire query), minus the filtering part (if you must filter, I recommend the restricted query approach above). For most applications you would want a complete database, not only the 400 taxa of interest, so that you can get more reliable estimates of confidence for taxonomic classification, clustering, etc, and avoid issues with misclassification. Also I realize that you are focusing on animals, but it helps to include in the database any non-targets that are amplified by the same primers.

On the other hand, I also realize that for eDNA surveys it can be useful to take geographic range information into account, but building a really restricted database always makes me nervous since it is making the bold assumption that only those 400 species exist in your geographic range. I have been curious about using geographic species distribution data with q2-clawback to improve taxonomic classification with COI and other eDNA markers... the question is where/how to get species frequency information for those markers. You could put together this frequency information artificially, so that you give all species you expect to find a 1, and all other species in the database a 0, then convert to relative frequency and pass that table to q2-clawback for training your classifier. The disadvantage is that this would be an arduous process, but the advantage is that you would shed this assumption that only those 400 species can be observed... rather you would say that these 400 species are much much more likely to be observed but you are keeping an open mind to the strange things that happen in the world so in other words: your classifier would not be prone to misclassification or other issues.

To save you some time, here are some tutorials using RESCRIPt for SILVA, NCBI, and for COI genes. The SILVA and COI tutorials also link to downloads for the pre-compiled sequences so that you can save yourself a few days in compiling (and in the case of COI, aligning) the entire databases.

Good luck!

alexkrohn · June 8, 2021, 2:36pm

Thanks for the quick and detailed reply, Nicholas!

Interesting! For our purposes, we're really most concerned with presence-only data. Rather than truly understanding everything that's present at a site, we really want to know for sure whether a site contains any of the ~400 species of conservation concern.

Even though we would likely just ignore any of the bacteria, viruses, plants, etc. in our final report, your point about using all the taxa to improve classification, clustering and avoid misclassification is important. Maybe it would be worth it to run two classifiers, one with all of the sequences for a particular gene, and one just based on the 400 species of interest. I bet the former will produce a lot of extraneous information (bacteria, plants, way out-of-range misclassifications etc.), while the latter may produce more false positives and/or taxa misclassified to close relatives.

By filtering, I meant using the steps outlined in the tutorials (removing degenerate nucleotides, homopolymers, duplicate seqs, etc.) rather than filtering by taxa. I assume you still recommend doing that kind of filtering, right?

Perfect! These were the tutorials I had already started to work from. Out of curiosity, given that NCBI really has the greatest diversity of animal sequences, is it worth it to use other databases? While NCBI might misclassify some bacterial taxa given that SILVA (or Greengenes) might have more sequences, I think it would be best for capturing closely related animal sequences. Do you agree?

If you do think I should search each database, do you suggest I run separate searches then merge them together using the merge-taxa action, or just create entirely new/different classifiers then compare the efficacy of the classifiers?

Thanks again for your help!

alexkrohn · June 8, 2021, 2:36pm

Two more questions! For very large downloads (e.g. all 16S sequences on GenBank), is there any way to preview or estimate the size of the resulting files? For example, it seems there are ~41 million 16S sequences on GenBank. I would filter those sequences by length and using our primers to avoid getting whole genomes, but even at ~1000 bp, that's still >> 50 GB of data. Is there any better way to estimate the sizes?

These pre-compiled databases would be quite useful. I don't see the links to them in the tutorials, though. Maybe they got erased during an update? Do you have a direct link?

Thanks again!

Nicholas_Bokulich · June 8, 2021, 2:56pm

Hi @alexkrohn ,

Sounds like we are on the same page!

But my points are still a concern — it has nothing to do with abundance, only that you do not want to mis-classify or overclassify by using an incomplete database (i.e., one that does not capture the full range of potential diversity at a site, including non-targets that are amplified by the primers). My point about q2-clawback is less about estimating abundance in your samples, but rather using existing observation data to adjust the taxonomy classification by leveraging data on the likelihood of observing a specific species.

To see a great example of what can go wrong with a restricted database, see this paper:

It's worth the comparison, but see the paper above — that pretty much tells you the worst-case scenario. In a real dataset these effects might be more subtle... e.g., misclassification of a near neighbor as the wrong species because the near neighbor is not represented in your database.

Oh got it right! I misunderstood. Yes, filter sequences based on sequence quality by all means! Since you mentioned 400 species I thought you meant that you only wanted a database with those

Depends on the genetic marker. See the COI RESCRIPt tutorial and linked pre-print for some insight from @devonorourke on comparing NCBI vs. BOLD (I believe BOLD came out a little bit better).

I do not know enough about the animal SSU/LSU sequences in SILVA — @SoilRotifer would have better insight there.

Different marker genes/genetic regions should have separate databases/classifiers. I would not merge all genes together without a really good reason. I also would not merge different databases (SILVA, NCBI) because the taxonomies have different structures/nomenclature and your hair will turn grey

So I guess I am not 100% clear — above it sounded like you would merge different QUERIES. I would just do 1 big query with RESCRIPt (or maybe a few batches of multiple queries so you can break it up into reasonable chunks and don't wait 3 days for many GBs to download). But I would not merge anything other than queries of the same marker gene/region.

enter your query directly into NCBI Genbank, and see how many results are found. (Obviously at that point you could choose to "download all" but you would (a) not have the formatted taxonomy, (b) not have QIIME 2 provenance to record the query that you used for all eternity).

I recommend starting with the RefSeqs databases for any marker genes that you can... e.g., 16S and various other common marker genes have associated RefSeqs databases that each have their own Project IDs that can be used to download the latest database. See the tutorial above for the 16S example. The RefSeqs databases contain only the marker gene of interest (no need to filter out whole genomes) and are curated... grabbing all 16S sequences would contain lots of misannotated/unknown sequences.

Oh I see, thanks! I think the links did get deleted since locations/versions were updated a few times since writing.

The COI is available here (see links in README): GitHub - devonorourke/tidybug: 🐞 methods and data describing filtering COI datasets and database construction 🐛
The SILVA is available here: Data resources — QIIME 2 2021.4.0 documentation

The NCBI 16S RefSeqs is not posted anywhere but takes little time to run since it is a small database.

Good luck and thanks for the interesting discussion!

alexkrohn · June 8, 2021, 3:40pm

This is a great example. Thanks for the link.

You are correct, I would do 1 big query with RESCRIPt, but that would only be from one database. It seems like you're suggesting others (SILVA, BOLD etc.) might be better, or at least give different results. In that case I'd have to create separate classifiers for each database (one for GenBank, one for SILVA, one for BOLD etc.), then compare their efficacy.

Many of the taxa we're interested in have sequences for the genes published (on GenBank), but don't have anything in RefSeq. For example, the fish Etheostoma blennioides has two partial 16S sequences (link), but no 16S RefSeqs (link). The fish Etheostoma caeruleum has 5 partial 16S sequences in GenBank, but only their entire mt genome is present in RefSeq ([link](https://("Etheostoma caeruleum"[Organism] OR Etheostoma caeruleum[All Fields]) AND 16s[All Fields] )). It seems like I would not be able to detect either of these species if I only restricted my reference database to RefSeq samples. Is that true?

Maybe a better solution would be to download the 16S RefSeq, then merge that large database with all of the 16S sequences from a query that includes all 400 of the species of interest. That way I'd have a diversity of samples from the RefSeq, plus enough sequences from the sepecies of interest to know I'd detect them if they were present. Does that make sense?

Nicholas_Bokulich · June 8, 2021, 3:56pm

This has been done to some extent for different databases/marker genes, so I am not recommending that you test all... only that for some marker genes one database might be better than another. NCBI should be good enough and/or best for all markers (depending on the marker).

Right! Good point.

Yes I think that makes sense... maybe grab RefSeqs for each marker gene, find the species not represented, and query NCBI Genbank for the missing species with RESCRIPt, then merge.

Good luck!

alexkrohn · June 10, 2021, 1:32pm

This is very helpful. Thanks again!

alexkrohn · June 13, 2021, 10:56pm

One more question, @Nicholas_Bokulich. The NCBI RefSeq queries (either searched separately with 16S[Gene Name] AND refeq[filter] or using 33175[BioProject] OR 33317[BioProject] as in the tutorial) only contain ~20,000 sequences, most of which are bacteria. If I'm really trying to include a broad diversity of life in my reference database, shouldn't I also include fungi, and other eukaryotes (in addition to the ~400 targeted species that I'll add later)? Or do you think just the ~20,000 reference bacterial sequences, plus the 400 targeted species' sequences is sufficient diversity to avoid false positives?

Thanks again.

Nicholas_Bokulich · June 14, 2021, 6:09am

Hi @alexkrohn ,
It's sort of a complex question and depends on the PCR primers that you use, e.g., whether you will get amplification of SSU from eukaryotes... many/most 16S primers do not amplify the 18S rRNA gene in euks, and I believe the RefSeqs 16S is specifically for bacteria + archaea, so the query result makes sense.

But you are absolutely correct, if you expect amplification of euks you should also include these in your database. The SILVA database includes 16S + 18S by default so would be a good one to start with here (unless if there is an 18S RefSeqs database that you could merge with the 16S... I am not sure).

Good luck!

alexkrohn · June 14, 2021, 2:25pm

For 16S, we're using two primer sets from this paper that specifically target vertebrates. Neither of them specify amplifying 18S in addition to the 16s region.

Even SILVA doesn't seem to have any multicellular eukaryotes. It seems like, if the primers really are vertebrate specific, it would make sense to include as many vertebrate 16s sequences as I could find. I could also include the RefSeq 16s dataset from GenBank to cover my bases (bases -- get it?) in case the primers aren't as specific as I had hoped.

Does that seem wise to you?

Nicholas_Bokulich · June 14, 2021, 6:36pm

in that case you should use an 18S reference, not a bacterial 16S, though it might be wise to combine as you propose.

The SSU rRNA subunit is 16 svedberg units in bacteria and 18 in eukaryotes, so those primers would more properly be called 18S primers.

But based on that paper I am not sure that those primers are vertebrate-specific, so it might be worth using a 16S + 18S SSU rRNA gene database just in case you get some non-target hits (e.g., to bacteria). Or just BLAST the primers against RefSeqs 16S to make sure there is no predicted amplification before proceeding with a pure 18S database...

I never thought to look for that! Looks like there are, based on a quick look, but sounds like it would not cover the vertebrates that you want to find, based on your previous description. In any case, mixing SILVA with NCBI would be a mess, since the taxonomies are a bit different, so it would be best to stick with one or the other.

Sounds like an exciting project!

Good luck!

alexkrohn · June 15, 2021, 8:44pm

Thanks again!

I decided to run two comparisons as we talked about: one using a reference database of all vertebrates + all refseq, and then one with just the sequences from the ~400 species of interest. I'm interested, as in the paper you sent me, just to see how many false positives I get with the restricted database.

However, it doesn't seem possible to run a giant query using rescript for 400 taxa. (I tried in the form of:

qiime rescript get-ncbi-data --p-query '("SPECIES1"[Organism] OR "Species2[Organism]" OR "Species3[Organism] ...... ) AND (16S[Title] OR 16S rRNA[Title])' --o-sequences 16s-400spp-refseqs-unfiltered.qza --o-taxonomy 16s-400spp-refseqs-taxonomy-unfiltered.qza --p-n-jobs 5

Where SPECIES1, 2, 3 etc. is the species' binomial, and the .... indicates 397 more.)

Qiime2 gave the error that they could not construct a URL with my query, which I assumed was because of the length.

So, that brings me back to my thoughts of a for loop in bash running rescript for each species, then merging using Qiime. I feel like there should be a better way! Any idea what I'm missing?

Nicholas_Bokulich · June 16, 2021, 3:39pm

Hi @alexkrohn ,

Sounds like it — this might be on the end of NCBI, I believe there are query length restrictions and we have seen other users report errors when using very complex queries.

some alternatives to think about:

split your query into chunks of species names, maybe 20 at a time, to allow many small queries. This will be faster than a giant loop.
probably a better solution: request specific accession numbers by inputing a TSV of accession numbers instead of using a query. Since you have 400 species these accession numbers could be a list of reference sequences for those species.
if you know that those species are all present in genbank, and you are happy with having additional species in your query, you could instead grab all NCBI sequences from the clade that encomasses all of your target species (chordata? hopefully deeper than phylum level).

sorry none of these are a great solution... you have a really specific need here! ...and generally if you want 400 specific taxa, no more no less, 400 targeted requests (via a loop with rescript) might just be what needs to happen at the end of the day to get exactly what you need.