expected sequences in custom database and reads, but not matching in taxonomy table

jrdupuis · December 8, 2022, 5:31pm

Hello! Hoping someone can help us with a problem we’re having.

We have a known set of plant species in a positive control library (a mock community kind of thing) for the trnL gene (tRNA Leucine). The libraries contain mostly agricultural species, so plenty of reference sequences are available on genbank. We created databases from reference sequences downloaded from NCBI using RSCRIPT with a few different keyword options, and we get the same following problem with all of them (keywords “tRNA-Leu”, trnL[Title] OR chloroplast [Title] OR complete genome [Title]”, etc.).

Our data contains the sequences we’re looking for, as when we look at the feature frequency table, there are clearly some well-represented species in the data:

And if we pull the sequences for those well-represented features and blast them, they come back as what we would expect. In this case, the top feature here comes back as quinoa (Chenopodium quinoa), which was one part of the positive control samples:

Blast results of some of those well-represented features thrown into excel (in general the top blast hits are shown and they were 99-100% matches):

And if we look in the downloaded database from NCBI (both the taxonomy and the sequences themselves) that we used for classifier training, we can find quinoa sequences that are the right region of trnL and would give a 100% match to the feature above (just by comparing the sequences by eye). However, in the final qiime results (taxonomy table), these features match to other taxa, e.g. (not anywhere close to quinoa, taxonomically):

We’ve tried lowering the thresholds for minimum length matched and other filters in the classifier training process that might eliminate those matches (e.g. qiime feature-classifier extract-reads \ --p-trunc-len And --p-min-length ), and trying various database search approaches, but in all cases, we can find the expected sequences/species in our downloaded database as well as in the results from DADA2 (rep-seqstrnL.qza) and in the sequence table. BUT, for some reason the final qiime matches are hitting on other random sequences (sometimes with much lower confidence levels than the previous screen shot example, down into the 0.7 level). Has anyone had similar issues before with a custom database? Any thoughts would be greatly appreciated, and please let us know if we can provide any other info.

We’re using QIIME2 2021.2 (conda install), and have posted representative commands below.

#Downloading and filtering the reference sequences
qiime rescript get-ncbi-data \
 --p-query 'txid35493[ORGN] AND (tRNA-Leu[gene])' \
 --output-dir NCBIdata_tRNA_Leu
qiime rescript cull-seqs \
 --i-sequences ./NCBIdata_tRNA_Leu/sequences.qza \
 --p-num-degenerates 5 \
 --p-homopolymer-length 12 \
 --o-clean-sequences NCBIdata_tRNA_Leu_clean.qza
qiime rescript filter-seqs-length \
 --i-sequences NCBIdata_tRNA_Leu_clean.qza \
 --p-global-min 250 \
 --p-global-max 1000 \
 --o-filtered-seqs NCBIdata_tRNA_filtd_seqs.qza \
 --o-discarded-seqs NCBIdata_tRNA_discarded_seqs.qza
qiime rescript dereplicate --verbose \
  --i-sequences NCBIdata_tRNA_filtd_seqs.qza \
  --i-taxa ./NCBIdata_tRNA_Leu/taxonomy.qza \
  --p-mode 'super' \
  --p-derep-prefix \
  --p-rank-handles 'silva' \
  --o-dereplicated-sequences NCBIdata_tRNA_Leu_derep_seqs.qza \
  --o-dereplicated-taxa NCBIdata_tRNA_Leu_derep_taxa.qza
#Classifier training and testing
qiime feature-classifier extract-reads \
  --i-sequences NCBIdata_tRNA_Leu_derep_seqs.qza \
  --p-f-primer CGAAATCGGTAGACGCTACG \
  --p-r-primer CCATTGAGTCTCTGCACCTATC \
  --p-trunc-len 190 \
  --p-min-length 135 \
  --p-max-length 400 \
  --o-reads ref-seqstrnL_test.qza
qiime feature-classifier fit-classifier-naive-bayes \
 --i-reference-reads ref-seqstrnL_test.qza \
 --i-reference-taxonomy NCBIdata_tRNA_Leu_derep_taxa.qza \
 --o-classifier classifier4.qza
qiime feature-classifier classify-sklearn \
  --i-classifier classifier4.qza \
  --i-reads rep-seqstrnL.qza \
  --o-classification taxonomy_tRNA_4_result.qza

SoilRotifer · December 9, 2022, 5:45pm

Hi @jrdupuis, welcome to :qiime2:!

Have you tried this approach to make your database? I explicitly use the trnL marker gene as an example (though slightly different primer pair). Some general thoughts and other highlights from the tutorial:

You'll notice the --p-query is far more explicit. This is to make sure I am including the groups I would like to keep, and excluding spurious reference data.
Also, relying solely on PCR / Sequencing primer searches can be problematic.
Taxonomic miss-annotation is a common problem in reference databases and can affect the resulting consensus taxonomy, and may need to be corrected. For this reason we provide the edit-taxonomy command within RESCRIPt, to help mitigate these issues.

Perhaps run through the tutorial and let us know if this helps?

-Cheers!
-Mike

SoilRotifer · December 9, 2022, 8:26pm

@jrdupuis I forgot to ask, how many of these sequences are a 99-100% match to a Chenopodium quinoa sequence while simultaneously being 99-100% match something else? Can you paste some examples in your reply, the screen-shots are too blurry for me to discern any information.

Thanks!

jrdupuis · December 21, 2022, 2:52pm

Hi Mike,

Thanks for the suggestions! Your help is greatly appreciated. And sorry for the delay in responding. My student and I have been playing around with the extract-seq-segments protocol and have had great success! We must have had some issues with only one primer matching certain sequences in the database, but now we’re getting very sensible results.

Thanks again for the assistance!
Cheers, Julian

SoilRotifer · December 21, 2022, 3:40pm

That's fantastic @jrdupuis! Thank you for updating us on your efforts!

system · January 21, 2023, 9:41pm

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.