16S analysis using DADA2 and a self-trained classifier fails only with a large number of samples

clearwing · May 9, 2024, 3:29pm

Hi qiimers,

I am using qiime2 2024.2 on Windows Subsystem for Linux (WSL2, Ubuntu 22.04, 32 GB RAM, Windows 11) for 16S amplicon sequencing (V4) of formalin-fixed paraffin-embedded (FFPE) tissue on NextSeq and MiniSeq.

I have successfully analyzed my paired-end fastq files using DADA2 and silva-138-99-nb-classifier.qza from qiime2 Data resources.

The script is derived from the “Importing data”, “Moving Pictures”, and “Atacama soil microbiome” tutorials.

I additionally trained a Naive Bayes classifier (following the 'Training feature classifiers with q2-feature-classifier' tutorial) on a database of about 2380 human pathogenic bacteria plus some mitochondrial and chloroplast sequences, resulting in a 16S_V4_classifier.qza of only 2849 KB.

This also works well for single runs (~20 samples): almost all sequences are classified properly.

However, if I include ALL my fastq files (104 files, 4.74 GB) and try to analyze them all together with DADA2 and the self-trained V4 classifier, most sequences are unassigned or just k_Bacteria, and only a few are classified (to genus or species level). No error messages.

If I instead use Deblur with the self-trained V4 classifier, or DADA2 with the silva classifier, no problems occur.

What could be causing this odd behavior when using DADA2 and the self-trained classifier on all samples together, and how can I overcome it?

I attach:

- the self-trained 16S_V4_classifier.qza
- the scripts I used for training my feature classifier and for the analysis (using both the silva and the self-trained classifiers)
- taxa-bar-plots_V4.qzv (failed)
- taxa-bar-plots_V4b1.qzv (successful, part of the sequences)

16S_V4_classifier.qza (2.8 MB)
training_feature_classifier.txt (900 Bytes)
qiime2_16S_paired-end.txt (2.4 KB)
taxa-bar-plots_V4.qzv (352.2 KB)
taxa-bar-plots_V4b1.qzv (373.2 KB)

colinbrislawn · May 10, 2024, 8:50pm

Hello Franz,

Welcome to the forums!

This is a great first post.

I took a look at the data provenance and it looks like you are doing everything right. Because the database works (sometimes), the issue must be after database creation.

In the classify_sklearn step, I see read_orientation: "auto".
This means the direction of the reads is inferred by the program.

Do you know if these 16S V4 reads are in mixed orientation?
We have seen taxonomy annotation 'fall off a cliff' when the program gets the read direction wrong, so that's where I would start.
classify-sklearn --p-read-orientation same
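
Spelled out in full, that call might look something like the sketch below. The file names are simply the ones mentioned elsewhere in this thread and are assumptions about your setup, so adjust them to your own artifacts.

```shell
# Sketch only: artifact names are assumed from this thread, not taken from the attached script.
# --p-read-orientation same skips the automatic orientation guess and
# classifies the reads as they are, without reverse complementing them.
qiime feature-classifier classify-sklearn \
  --i-classifier 16S_V4_classifier.qza \
  --i-reads rep-seqs.qza \
  --p-read-orientation same \
  --o-classification taxonomy_V4.qza
```

The other accepted values for this parameter are auto (the default seen in the provenance) and reverse-complement.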

We also have a new method for reorienting input reads, which may be helpful here.

Or it's something totally different.

Please keep us posted and let us know what you find!

clearwing · May 12, 2024, 10:05am

Thanks a lot @colinbrislawn.

That was really helpful.

The sequences are paired-end, but within each fastq file they are all in the same orientation.
Adding '--p-read-orientation same' to the qiime feature-classifier classify-sklearn step solved the problem.

However, I have since noticed that some of the (few) unassigned sequences, and some of those classified only as 'k_Bacteria', are in fact reverse complements of bacterial sequences and match 100% once reverse complemented (these are the sequences starting with CTAATCC in the attached [Deblur] file rep-seqs_V4.qzv). This is, however, a minor (but interesting) issue.

qiime feature-table tabulate-seqs \
  --i-data rep-seqs.qza \
  --i-taxonomy taxonomy_V4.qza \
  --o-visualization rep-seqs_V4.qzv
(I found no way to include the Frequency (number of reads) as well.)
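
For anyone who wants to check such a sequence by hand, a reverse complement can be produced on the command line before searching it against a reference. This is a minimal sketch using the standard Unix tools rev and tr; the helper name revcomp is made up for illustration and only handles the plain bases A, C, G and T (no ambiguity codes).

```shell
# Hypothetical helper: print the reverse complement of a DNA string.
# rev reverses the characters; tr swaps each base for its complement.
revcomp() {
  printf '%s\n' "$1" | rev | tr 'ACGTacgt' 'TGCAtgca'
}

# Example with the 7-mer mentioned above
revcomp CTAATCC
```

The output can then be pasted into a search against the reference database to confirm the 100% match.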

rep-seqs_V4.qzv (352.9 KB)

colinbrislawn · May 12, 2024, 4:52pm

Okay cool!

If you have follow-up questions, you can reply to this thread, or open a new thread for new questions!

system · June 12, 2024, 10:52pm

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.