Troubleshooting Few Taxon Detections in Seemingly Good Data

alexkrohn · April 28, 2023, 4:10pm

I'm wondering what I can do to figure out what went wrong in my Qiime2 analysis. Briefly, I am using cytb primers (amplicon length 235 bp) on environmental DNA to try to detect fish species.

I used RESCRIPt to create a reference database and classifier of all cytb sequences up to 20 kb in length from all vertebrates on GenBank.

I imported the data as PE. The data seemed pretty high quality, as the quality of both R1 and R2 is above 30 all the way to 150 bp. I trimmed the reads to 145, used dada2 to denoise/remove chimeras etc., then used feature-classifier classify-sklearn to classify the sequences against the reference taxonomy.
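
As a quick sanity check on the truncation step, the expected read overlap can be worked out from the numbers above. This is only a sketch: it assumes 235 bp is the full amplicon length being merged, and the 12-nt minimum overlap is an assumed threshold that should be checked against the DADA2 version in use.

```python
def expected_overlap(amplicon_len: int, trunc_f: int, trunc_r: int) -> int:
    """Bases shared by the forward and reverse reads after truncation."""
    return trunc_f + trunc_r - amplicon_len


# Values from the post: 235 bp amplicon, both reads truncated to 145.
overlap = expected_overlap(235, 145, 145)

# Assumed merge threshold, not taken from the post; check your DADA2 version.
MIN_OVERLAP = 12

print(f"expected overlap: {overlap} bp")
print("enough to merge" if overlap >= MIN_OVERLAP else "too short to merge")
```

With these numbers the overlap is 55 bp, so read merging alone is unlikely to explain the missing taxa.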

In samples from other locations, using the same primers and the same reference database, I've detected numerous fish species. Looking at the PE denoising stats here, it doesn't seem like my data are that bad.

Still, I'm not able to identify taxa below vertebrates (see feature table here, and raw sequences here).

I'm curious how I can go about troubleshooting where the error occurred. It's surprising to get decent looking sequences that don't match to any taxa.

It seems that the problems occurred in the classification step. Maybe the classifier doesn't work well for the taxa from these locations, even though it worked well for samples from other locations? Or maybe the amplification was off such that I amplified fragments that aren't cytb and that's why they're not matching the database?

Thanks for your help troubleshooting these results

Nicholas_Bokulich · April 28, 2023, 6:47pm

Hi @alexkrohn ,

It does not look like you uploaded your taxonomic classifications so I cannot assess these.

But thanks for sharing the output of tabulate-seqs, this is really helpful.

I BLASTed just a few of these against the NCBI nt database and indeed it looks like there are not good hits... I am seeing low coverage, partial alignment, and in some cases no hit at all... anything but fish (but maybe I grabbed too few).

So it seems like even if the sequence quality scores are high, the quality of the sequences themselves may not be (e.g., non-target DNA or chimeras?).

To troubleshoot more thoroughly, you could try qiime feature-classifier blast to test this locally against your database and see an alignment report.

Let us know what you find. Good luck!

alexkrohn · May 1, 2023, 2:59pm

Hi @Nicholas_Bokulich. Thanks for the quick reply, even on a Friday!

Here are the taxonomic classifications from qiime feature-classifier classify-sklearn, based on a classifier that I made from GenBank sequences for cytb sequences from my primer set from all vertebrates. As you can see, none get further than phylum Chordata.

I also BLASTed a few of the results in the same way that you did and found the same thing: poor matches, partial alignments, etc. I don't think this is your selection bias.

Were you suggesting that I retry my classifications with qiime feature-classifier classify-consensus-blast? Or is there another BLAST alignment function that I don't know about?

Thanks again!

Nicholas_Bokulich · May 1, 2023, 8:23pm

Hi @alexkrohn ,

No, alas, the issue is not with the classification method, it seems to be with your sequences themselves. Searching against the GenBank nt database as we have both done indicates that the query sequences themselves look quite wonky...

My suggestion to use qiime feature-classifier blast was to generate a BLAST alignment report for all query sequences against your custom database. This will really just confirm what I think we both have already seen with our random searches vs. GenBank nt: that the query sequences have poor coverage and %id matches (or no hits at all). This would at least be more quantitatively informative than the classify-sklearn output, though, as it will tell you how much coverage and %id you have, and because the search will be performed on both read orientations.
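
To make "how much coverage and %id" concrete, here is a minimal sketch of summarizing a BLAST tabular report per query. The column order is an assumption (it matches a blastn call with `-outfmt "6 qseqid sseqid pident length qlen qstart qend evalue bitscore"`), the thresholds are illustrative rather than recommended, and the example rows are made up.

```python
import csv
import io
from dataclasses import dataclass

# Assumed column order of the tabular report (see the lead-in above).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "qlen",
           "qstart", "qend", "evalue", "bitscore"]


@dataclass
class BestHit:
    query: str
    subject: str
    pident: float     # percent identity of the alignment
    query_cov: float  # percent of the query covered by the alignment


def best_hits(handle):
    """Return the highest-bitscore hit per query from BLAST tabular text."""
    best, scores = {}, {}
    for row in csv.DictReader(handle, fieldnames=COLUMNS, delimiter="\t"):
        query, score = row["qseqid"], float(row["bitscore"])
        if query in scores and scores[query] >= score:
            continue  # an equal or better hit was already recorded
        span = abs(int(row["qend"]) - int(row["qstart"])) + 1
        coverage = 100.0 * span / int(row["qlen"])
        scores[query] = score
        best[query] = BestHit(query, row["sseqid"], float(row["pident"]), coverage)
    return best


def flag_poor(hits, min_pident=97.0, min_cov=90.0):
    """Queries whose best hit falls below either (illustrative) threshold."""
    return sorted(q for q, h in hits.items()
                  if h.pident < min_pident or h.query_cov < min_cov)


# Made-up rows for illustration only; these are not real results.
example = (
    "asv1\tref_A\t99.1\t230\t235\t1\t230\t1e-100\t400\n"
    "asv2\tref_B\t82.0\t60\t235\t10\t69\t1e-5\t60\n"
)
hits = best_hits(io.StringIO(example))
print(flag_poor(hits))
```

Queries with no hit at all do not appear in a tabular report, so they would need to be found by comparing against the full list of query IDs.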

So I hate to be the bearer of bad news, but the signs so far suggest that the input sequences may just be bad quality (not in terms of quality scores, but because they do not look like any fish DNA sequences currently known to mankind and are probably chimeras/contaminants). This might be good news if you are doing eDNA sequencing on space dust, but otherwise this is an outcome most unexpected.

alexkrohn · May 2, 2023, 1:41pm

Got it. Yeah, that's what I had figured too. Thanks, @Nicholas_Bokulich.

Digging a bit deeper, I'm curious if you had ideas about what might cause such chimeras or contaminations. Most of these same samples did amplify and identify fish DNA using 12S primers, but obviously didn't with cytb. That makes me think this is more likely a primer or lab work issue rather than a sample issue. Do you have any more insights?

I also can't seem to find documentation about qiime feature-classifier blast. Is this from a newer qiime version? Unfortunately I'm stuck back at 2022.2 because I'm running qiime on an older machine without AVX capabilities.

Thanks again,

Alex

Nicholas_Bokulich · May 2, 2023, 3:14pm

Yeah that's my thought too — primer issues specifically, but it's just a hunch.

Yep, this was introduced later in 2022. You can also just run blastn directly (outside of QIIME 2) for the same effect.
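
For the standalone route, a rough sketch of assembling the blastn call from Python follows. The file and database names are hypothetical: it assumes the representative sequences were exported to FASTA (e.g., `dna-sequences.fasta` from `qiime tools export`) and that a nucleotide database called `cytb_refs` was already built from the reference FASTA with `makeblastdb -dbtype nucl`.

```python
import shlex
import shutil
import subprocess


def blastn_command(query_fasta, db_prefix, out_tsv, max_hits=5):
    """Build a blastn call that writes a tabular alignment report."""
    return [
        "blastn",
        "-query", query_fasta,
        "-db", db_prefix,
        "-strand", "both",  # search both orientations of each query
        "-max_target_seqs", str(max_hits),
        "-outfmt", "6 qseqid sseqid pident length qlen qstart qend evalue bitscore",
        "-out", out_tsv,
    ]


def run_blastn(cmd):
    """Run the command, but only if blastn is actually on the PATH."""
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError("blastn not found; install BLAST+ first")
    subprocess.run(cmd, check=True)


# Hypothetical file names; substitute your own paths.
cmd = blastn_command("dna-sequences.fasta", "cytb_refs", "blast_report.tsv")
print(shlex.join(cmd))  # copy-pasteable shell command
```

Printing the command first makes it easy to inspect or paste into a terminal; `run_blastn(cmd)` can be called once the input files and database exist.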
