Feature-classifier V3-V4

Classifier works without extract-reads, but collapses to one phylum after trimming (BEExact, V3–V4)

Hello, I am analyzing some sequences of the V3-V4 16S region. I trained a classifier, and it worked great without trimming, but the moment I trimmed the references to my primer sites, everything collapsed into basically one phylum. The SLURM jobs ran fine, with no error messages… just awful classifications.
The setup (nothing exotic)

  • References: BEExact full-length 16S (BEEx_FL-refs_sequences.qza + taxonomy).

  • Target region: the usual V3–V4 primers:

    • F: CCTACGGGNGGCWGCAG

    • R: GACTACHVGGGTATCTAATCC

  • Trim command (QIIME 2 extract-reads):

    qiime feature-classifier extract-reads \
      --i-sequences BEEx_FL-refs_sequences.qza \
      --p-f-primer CCTACGGGNGGCWGCAG \
      --p-r-primer GACTACHVGGGTATCTAATCC \
      --p-min-length 100 \
      --p-max-length 500 \
      --o-reads BEEx-V3V4-refs_sequences.qza
    
    

    After trimming, references only dropped from 20,099 → 20,072. So no, I didn’t accidentally destroy the database.
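For anyone who wants to spot-check a handful of reference sequences by hand, here is a minimal Python sketch of what primer-based trimming does conceptually. It is not the QIIME 2 implementation: it requires an exact match to the degenerate (IUPAC) primers, so it does not replicate any mismatch tolerance the real command may apply. The function names (`primer_regex`, `extract_amplicon`) are made up for illustration.

```python
import re

# IUPAC nucleotide codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
# Complement table that also covers the degenerate codes.
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Reverse-complement a sequence, including IUPAC degenerate codes."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def primer_regex(primer):
    """Turn a degenerate primer into a regular expression (exact match only)."""
    return "".join(
        IUPAC[base] if len(IUPAC[base]) == 1 else "[" + IUPAC[base] + "]"
        for base in primer.upper()
    )


def extract_amplicon(seq, f_primer, r_primer):
    """Return the region between the two primers (primers excluded), or None."""
    seq = seq.upper()
    fwd = re.search(primer_regex(f_primer), seq)
    if fwd is None:
        return None
    # The reverse primer anneals to the opposite strand, so search for its
    # reverse complement downstream of the forward primer.
    rest = seq[fwd.end():]
    rev = re.search(primer_regex(reverse_complement(r_primer)), rest)
    if rev is None:
        return None
    return rest[:rev.start()]


F_PRIMER = "CCTACGGGNGGCWGCAG"
R_PRIMER = "GACTACHVGGGTATCTAATCC"
```

If the extracted regions for a few references look nothing like the sequences being classified (very different length, or no shared stretches), that points at a region or orientation mismatch rather than at the training step.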

Training on those trimmed refs gave me barplots where almost everything became the same group, whereas if I don't trim and train on the full-length refs, I get diverse taxonomy. Any idea what I am doing wrong here?

Hi @irenedecarlos, I assume you are following the protocol from the BEExact GitHub page on how to make an amplicon specific classifier?

For the qiime feature-classifier classify-sklearn command, what did you set for --p-confidence? The default is 0.7, but I noticed that their tutorial sets it to 0.5. I am not sure I'd advise using this setting, as you run the risk of erroneous classifications. But I'm not as experienced with this database as others might be, so others might have more insight.
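To illustrate why the threshold matters, here is a rough Python sketch. It assumes the classifier reports the deepest rank whose confidence still meets the threshold, which is a simplification and not the actual classify-sklearn code. The function name `truncate_taxonomy`, the lineage, and the confidence values are all invented for illustration.

```python
def truncate_taxonomy(ranks, confidences, threshold=0.7):
    """Keep ranks from the top down until confidence drops below threshold."""
    kept = []
    for rank, confidence in zip(ranks, confidences):
        if confidence < threshold:
            break
        kept.append(rank)
    return "; ".join(kept) if kept else "Unassigned"


# Hypothetical lineage and per-rank confidences (not real output).
ranks = ["d__Bacteria", "p__Firmicutes", "c__Bacilli", "o__Lactobacillales"]
confidences = [0.99, 0.95, 0.62, 0.55]

print(truncate_taxonomy(ranks, confidences, threshold=0.7))
print(truncate_taxonomy(ranks, confidences, threshold=0.5))
```

With these toy numbers, the stricter threshold stops at the phylum, while 0.5 keeps all four ranks. A lower threshold gives deeper labels but accepts less certain ones.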

Hi @SoilRotifer, I also saw that the BEExact tutorial uses 0.5, but I kept the default one :).


Thanks @irenedecarlos,

Well, if you followed their tutorial and used the default --p-confidence value, I suppose you can try their recommendation of using 0.5.

I would also sanity-check your data by trying to classify your reads using Greengenes, SILVA, or RDP. If they all return poor classifications, that might be a sign that something is wrong with the data. Even if the quality is good... perhaps too many off-target reads or host DNA? :man_shrugging:

Hi @irenedecarlos,

It is also quite possible that this is a mixed-orientation read issue. That is, the naive Bayes classifier requires that reads match the sequence orientation of the reference database. Fortunately, the latest version of RESCRIPt has a couple of tools for this:

qiime rescript orient-reads ...
^^ Just provide any reference database FASTA formatted artifact as your reference, along with your imported paired-end FASTQ artifact, then your reads should hopefully be re-oriented properly. Then you can proceed with DADA2, ... classification, and see if they improve.

and
qiime rescript orient-seqs ...
^^ If you have an already merged FASTA file artifact you can reorient these. Then retry classification. Though I prefer to do this with FASTQs, as I worry about potential denoising issues.
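As a quick way to see whether orientation is even a problem before re-running anything, here is a small Python sketch that guesses a read's orientation by counting k-mers shared with a reference in each direction. This is only an illustration of the idea, not how RESCRIPt does it. The function names and the choice of k=8 are assumptions.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse-complement a plain ACGT sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def kmers(seq, k=8):
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def guess_orientation(read, reference, k=8):
    """Compare shared k-mer counts for the read and its reverse complement."""
    ref_kmers = kmers(reference.upper(), k)
    forward_hits = len(kmers(read.upper(), k) & ref_kmers)
    reverse_hits = len(kmers(reverse_complement(read), k) & ref_kmers)
    if forward_hits == reverse_hits:
        return "unknown"
    return "forward" if forward_hits > reverse_hits else "reverse"
```

Running this over a sample of reads against one or two reference sequences from the same region gives a rough tally. A substantial share of "reverse" calls would suggest that reorienting is worth doing.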

Hi @SoilRotifer, thanks a lot for the input. I have tried other classifiers (SILVA) and they work well. I also trained the BEExact classifier without trimming to V3-V4, and that worked as well. It's only when I trim to the V3-V4 region… I have tried reorienting the reads as you proposed, but I still have the same issue. I have gotten in contact with the creator of BEExact to see if he can help! Thanks for the replies :).

Hi @irenedecarlos,
One other thing to try: you can run qiime feature-table tabulate-seqs on the BEEx-V3V4-refs_sequences.qza file that you created, optionally providing the corresponding FeatureData[Taxonomy] artifact as well. That would let you actually look at the sequences post-trimming and make sure they look like what you expect. You could also try searching that file with some of your sequences to see if they hit. If they don't, that might provide some insight into what's going on.

Is there any chance that the region you're trimming to doesn't match what was sequenced (e.g., the sequencing was actually V2, but you're trimming to V3-V4)? (I don't mean to suggest you're making a silly mistake - it's just that this result would be what I would expect if the trimming was incorrect, so just want to throw that idea out there so you can confirm.)
