I am using qiime2 2024.2 on Windows Subsystem for Linux (WSL2, Ubuntu 22.04, 32 GB RAM, Windows 11) for 16S amplicon sequencing (V4) of formalin-fixed paraffin-embedded (FFPE) tissue on NextSeq and MiniSeq.
I have successfully analyzed my paired-end fastq files using DADA2 and silva-138-99-nb-classifier.qza from qiime2 Data resources.
The script is derived from the "Importing data", “Moving Pictures”, and “Atacama soil microbiome” tutorials.
I additionally trained a Naive Bayes classifier (using 'Training feature classifiers with q2-feature-classifier' tutorial) with a database of about 2380 human pathogenic bacteria and some mitochondrial and chloroplast sequences, resulting in a 16S_V4_classifier.qza of only 2849 KB.
This also works well for single runs (~20 samples): almost all sequences are classified properly.
However, if I include ALL my fastq files (104 files, 4.74 GB) and try to analyze them all together with DADA2 and the self-trained V4 classifier, most sequences are unassigned or just k_Bacteria, and only a few are classified (to genus or species level). No error messages.
If I instead use Deblur with the self-trained V4 classifier, or DADA2 with the silva classifier, no problems occur.
What can be the reason for the odd behavior when using DADA2 and self-trained classifier on all samples together, and how can I overcome that?
I attach
the self-trained 16S_V4_classifier.qza,
and the scripts I used for
Training my feature classifier, and
analysis (using both silva and self-trained classifiers)
as well as
taxa-bar-plots_V4.qzv (failed)
taxa-bar-plots_V4b1.qzv (successful, part of sequences).
I took a look at the data provenance and it looks like you are doing everything right. Because the database works (sometimes), the issue must be after database creation.
In the classify_sklearn step, I see read_orientation: "auto"
This means the direction of reads is inferred by the program.
Do you know if these 16S V4 reads are in mixed orientation?
We have seen taxonomy annotation 'fall off a cliff' when the program gets the read direction wrong, so that's where I would start: try running classify-sklearn with --p-read-orientation same
We also have a new method for reorienting input reads, which may be helpful here.
Or it's something totally different.
Please keep us posted and let us know what you find!
The sequences are paired-end, but within a fastq file all in the same orientation.
Adding '--p-read-orientation same' to the qiime feature-classifier classify-sklearn step solved the problem.
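For anyone landing here later, the fix can be sketched as the call below. The file names are the ones mentioned in this thread; the wrapper function classify_same_orientation and the QIIME variable are hypothetical conveniences (not part of QIIME 2) that only make the command easy to reuse and to dry-run.

```shell
# Hypothetical wrapper: classify reads while forcing the "same" orientation,
# instead of letting classify-sklearn infer it ("auto").
# $1 = classifier artifact, $2 = representative sequences, $3 = output taxonomy
# QIIME is an assumed override for the executable name (defaults to "qiime").
classify_same_orientation() {
  "${QIIME:-qiime}" feature-classifier classify-sklearn \
    --i-classifier "$1" \
    --i-reads "$2" \
    --p-read-orientation same \
    --o-classification "$3"
}

# Example call with the artifacts from this thread:
# classify_same_orientation 16S_V4_classifier.qza rep-seqs.qza taxonomy_V4.qza
```

Forcing "same" only makes sense when the reads really are all in the orientation of the reference sequences; with mixed-orientation reads the reverse-oriented ones would be expected to classify poorly.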
However, I noticed that some of the (few) unassigned sequences, and some of those assigned only to 'k_Bacteria', are in fact reverse-complemented bacterial sequences: they match a reference 100% once reverse complemented (see the sequences starting with CTAATCC in the attached [Deblur] file rep-seqs_V4.qzv). This is, however, a minor (but interesting) issue.
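A quick way to check a suspicious sequence by hand is to reverse complement it and search with the result. This is a minimal sketch, assuming plain A/C/G/T input (no IUPAC ambiguity codes) and that the rev and tr utilities are available; the function name revcomp is my own.

```shell
# Reverse complement a DNA string: reverse it, then swap A<->T and C<->G.
# Assumes only A/C/G/T (upper or lower case); other characters pass through unchanged.
revcomp() {
  printf '%s\n' "$1" | rev | tr 'ACGTacgt' 'TGCAtgca'
}

# The prefix mentioned above, reverse complemented:
revcomp CTAATCC   # prints GGATTAG
```

Pasting the reverse-complemented full sequence into a search against the reference database should then confirm whether it is a known taxon in the opposite orientation.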
qiime feature-table tabulate-seqs \
  --i-data rep-seqs.qza \
  --i-taxonomy taxonomy_V4.qza \
  --o-visualization rep-seqs_V4.qzv
(I found no way to include the frequency, i.e. the number of reads, as well.)