Hello! Hoping someone can help us with a problem we’re having.
We have a known set of plant species in a positive control library (a mock community kind of thing) for the trnL gene (tRNA Leucine). The libraries contain mostly agricultural species, so plenty of reference sequences are available on genbank. We created databases from reference sequences downloaded from NCBI using RSCRIPT with a few different keyword options, and we get the same following problem with all of them (keywords “tRNA-Leu”, trnL[Title] OR chloroplast [Title] OR complete genome [Title]”, etc.).
Our data contains the sequences we’re looking for, as when we look at the feature frequency table, there are clearly some well-represented species in the data:
And if we pull the sequences for those well-represented features and blast them, they come back as what we would expect. In this case, the top feature here comes back as quinoa (Chenopodium quinoa), which was one part of the positive control samples:
Blast results of some of those well-represented features thrown into excel (in general the top blast hits are shown and they were 99-100% matches):
And if we look in the downloaded database from NCBI (both the taxonomy and the sequences themselves) that we used for classifier training, we can find quinoa sequences that are the right region of trnL and would give a 100% match to the feature above (just by comparing the sequences by eye). However, in the final qiime results (taxonomy table), these features match to other taxa, e.g. (not anywhere close to quinoa, taxonomically):
We’ve tried lowering the thresholds for minimum length matched and other filters in the classifier training process that might eliminate those matches (e.g. qiime feature-classifier extract-reads \ --p-trunc-len And --p-min-length ), and trying various database search approaches, but in all cases, we can find the expected sequences/species in our downloaded database as well as in the results from DADA2 (rep-seqstrnL.qza) and in the sequence table. BUT, for some reason the final qiime matches are hitting on other random sequences (sometimes with much lower confidence levels than the previous screen shot example, down into the 0.7 level). Has anyone had similar issues before with a custom database? Any thoughts would be greatly appreciated, and please let us know if we can provide any other info.
We’re using QIIME2 2021.2 (conda install), and have posted representative commands below.
#Downloading and filtering the reference sequences
qiime rescript get-ncbi-data \
--p-query 'txid35493[ORGN] AND (tRNA-Leu[gene])' \
--output-dir NCBIdata_tRNA_Leu
qiime rescript cull-seqs \
--i-sequences ./NCBIdata_tRNA_Leu/sequences.qza \
--p-num-degenerates 5 \
--p-homopolymer-length 12 \
--o-clean-sequences NCBIdata_tRNA_Leu_clean.qza
qiime rescript filter-seqs-length \
--i-sequences NCBIdata_tRNA_Leu_clean.qza \
--p-global-min 250 \
--p-global-max 1000 \
--o-filtered-seqs NCBIdata_tRNA_filtd_seqs.qza \
--o-discarded-seqs NCBIdata_tRNA_discarded_seqs.qza
qiime rescript dereplicate --verbose \
--i-sequences NCBIdata_tRNA_filtd_seqs.qza \
--i-taxa ./NCBIdata_tRNA_Leu/taxonomy.qza \
--p-mode 'super' \
--p-derep-prefix \
--p-rank-handles 'silva' \
--o-dereplicated-sequences NCBIdata_tRNA_Leu_derep_seqs.qza \
--o-dereplicated-taxa NCBIdata_tRNA_Leu_derep_taxa.qza
#Classifier training and testing
qiime feature-classifier extract-reads \
--i-sequences NCBIdata_tRNA_Leu_derep_seqs.qza \
--p-f-primer CGAAATCGGTAGACGCTACG \
--p-r-primer CCATTGAGTCTCTGCACCTATC \
--p-trunc-len 190 \
--p-min-length 135 \
--p-max-length 400 \
--o-reads ref-seqstrnL_test.qza
qiime feature-classifier fit-classifier-naive-bayes \
--i-reference-reads ref-seqstrnL_test.qza \
--i-reference-taxonomy NCBIdata_tRNA_Leu_derep_taxa.qza \
--o-classifier classifier4.qza
qiime feature-classifier classify-sklearn \
--i-classifier classifier4.qza \
--i-reads rep-seqstrnL.qza \
--o-classification taxonomy_tRNA_4_result.qza