Losing Archaea when making classifier following RESCRIPt workflow (scikit-learn issue???) qiime2-2023.5

Rach · August 21, 2023, 1:23am

Hi,
hoping someone can help me work out what's happening. I'm new to this and completely stumped. I ran the RESCRIPt process to get a V1-V3 classifier (SILVA 138.1) and noticed that I lost everything except Bacteria. I am looking at cattle samples, and they most definitely have Archaea in their gut.

I reran the RESCRIPt workflow and inspected the .qzv after each step to find where I lost the taxa.
After the 'qiime rescript dereplicate' step, I can see Archaea in the .qzv. Once I run the 'qiime feature-classifier fit-classifier-naive-bayes' and 'qiime feature-classifier classify-sklearn' steps, the next .qzv has nothing but Bacteria.
I'm not exactly sure what I'm doing wrong. It may be as simple as a missed flag or a misunderstanding of the workflow. Hoping someone can help. I don't know how to debug this other than by checking the .qzv.

Below are the script steps and excerpts of what is seen in the qzv

Dereplicate steps:
qiime rescript dereplicate \
  --i-sequences silva-138.1-ssu-nr99-seqs-filt.qza \
  --i-taxa silva-138.1-ssu-nr99-tax.qza \
  --p-mode 'uniq' \
  --p-threads 12 \
  --o-dereplicated-sequences silva-138.1-ssu-nr99-seqs-derep-uniq.qza \
  --o-dereplicated-taxa silva-138.1-ssu-nr99-tax-derep-uniq.qza \
  --verbose

qiime metadata tabulate \
  --m-input-file silva-138.1-ssu-nr99-tax-derep-uniq.qza \
  --o-visualization silva-138.1-ssu-nr99-tax-derep-uniq.qzv

Archaea present at the end of this step

Then I run the next two steps and lose all Archaea in the classifier before I even make the amplicon region-specific classifier.

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads silva-138.1-ssu-nr99-seqs-derep-uniq.qza \
  --i-reference-taxonomy silva-138.1-ssu-nr99-tax-derep-uniq.qza \
  --o-classifier silva-138.1-ssu-nr99-classifier-full.qza \
  --verbose

qiime feature-classifier classify-sklearn \
  --i-classifier silva-138.1-ssu-nr99-classifier-full.qza \
  --i-reads 04dada2_output/rep_seqs.qza \
  --o-classification testtaxonomy.qza \
  --verbose

qiime metadata tabulate \
  --m-input-file testtaxonomy.qza \
  --o-visualization testtaxonomy.qzv

I've tried to work it out myself but I don't know enough about how the data is being processed to identify exactly what is going wrong (or what I did wrong).

Thanks for your help.
Rachele

Not sure what other info is needed - below is my system and qiime info:
Running on Windows 11
Processor Intel(R) Core(TM) i7-10875H CPU @ 2.30GHz 2.30 GHz
Installed RAM 32.0 GB (31.8 GB usable)
WSL2

Distributor ID: Ubuntu
Description: Ubuntu 20.04.6 LTS

Conda version: conda 23.7.2

(qiime2-2023.5):~$ qiime info
System versions
Python version: 3.8.16
QIIME 2 release: 2023.5
QIIME 2 version: 2023.5.1
q2cli version: 2023.5.1

Installed plugins
alignment: 2023.5.0
composition: 2023.5.0
cutadapt: 2023.5.1
dada2: 2023.5.0
deblur: 2023.5.0
demux: 2023.5.0
diversity: 2023.5.1
diversity-lib: 2023.5.0
emperor: 2023.5.0
feature-classifier: 2023.5.0
feature-table: 2023.5.0
fragment-insertion: 2023.5.0
gneiss: 2023.5.0
longitudinal: 2023.5.0
metadata: 2023.5.0
phylogeny: 2023.5.0
quality-control: 2023.5.0
quality-filter: 2023.5.0
sample-classifier: 2023.5.0
taxa: 2023.5.0
types: 2023.5.0
vsearch: 2023.5.0

Nicholas_Bokulich · August 21, 2023, 6:34am

Hi @Rach ,

It does not look like this is an issue with the classifier per se... from what you show it looks like the issue is with the query sequences. Do you know for a fact that there are Archaeal sequences in the query? Even if Archaea are present in the samples, they could be missing from the query sequences, e.g., if you used primers that amplify Archaea poorly or not at all.

Could you try the following and let me know what you find?

1. Classify your query sequences with the pre-trained full-length classifier from the QIIME 2 website. Are Archaea detected?
2. If not, you could filter the reference sequences to keep only Archaea sequences and then attempt to classify these with the classifier that you trained. Are they classified as Archaea?
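
One way to answer "Are Archaea detected?" without scrolling through a .qzv is to export the taxonomy artifact and count the domain-level labels. The following is only a minimal sketch. It assumes a SILVA-style Taxon column in which the domain is the first semicolon-separated field (e.g. d__Archaea). count_domains is a hypothetical helper name, and testtaxonomy.qza is the file name from the first post.

```shell
# Count how many features were assigned to each domain in an exported
# QIIME 2 taxonomy TSV (columns: Feature ID, Taxon, Confidence).
# Assumption: the domain is the first ';'-separated field of the Taxon column.
count_domains() {
    awk -F'\t' 'NR > 1 && $1 !~ /^#/ {
        split($2, ranks, ";")
        domain = ranks[1]
        gsub(/^ +| +$/, "", domain)   # trim stray spaces around the label
        counts[domain]++
    }
    END { for (d in counts) print counts[d] "\t" d }' "$1" | sort -k2
}

# Usage (requires an activated QIIME 2 environment):
#   qiime tools export --input-path testtaxonomy.qza --output-path exported-taxonomy
#   count_domains exported-taxonomy/taxonomy.tsv
```

Each output line is a count followed by the domain label, so a result with no d__Archaea line means that no query sequence was assigned to Archaea.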

Please let me know what you find!

Rach · August 22, 2023, 6:09am

Hi Nicholas, thanks for your help.
I have been assured that Archaea do exist in these samples and that the V1-V3 primers have been used successfully in prior experiments.

I have tried a pre-trained full-length classifier and, once again, all I get is Bacteria after running classify-sklearn.
I downloaded the pre-trained classifier from the QIIME2 Docs page for the 2023.5 release.

Silva 138 99% OTUs full-length sequences (MD5: b8609f23e9b17bd4a1321a8971303310)

Was this the correct one? It took a while to run, as I needed to reduce reads-per-batch to a manageable size or the process died.

Can I get some clarification on what you mean by filtering the reference sequences to keep only Archaea sequences (point 2)? Do you mean I should add flags to the 'qiime rescript filter-seqs-length-by-taxon' command to keep only Archaea, like the example below? If not, can you let me know exactly what I should be trying?

qiime rescript filter-seqs-length-by-taxon \
  --i-sequences silva-138.1-ssu-nr99-seqs-cleaned.qza \
  --i-taxonomy silva-138.1-ssu-nr99-tax.qza \
  --p-labels Archaea Bacteria Eukaryota \
  --p-min-lens 900 9999 9999 \
  --o-filtered-seqs silva-138.1-ssu-nr99-seqs-filt-Arc.qza \
  --o-discarded-seqs silva-138.1-ssu-nr99-seqs-discard-Arc.qza

Sorry about all the questions - still learning the process and the language.

Cheers
Rachele

Nicholas_Bokulich · August 22, 2023, 6:18am

Hi @Rach ,

Thanks for checking.

Hm... so at this point it does not sound like you are doing anything wrong when training your own classifier. Right now it sounds like the issue originates with the query sequences. As a first step, would you mind making a barplot (with qiime taxa barplot) and sharing the results?
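
For reference, the barplot could be generated roughly as follows. This is a sketch, not a command tested on this dataset. 04dada2_output/table.qza and sample-metadata.tsv are assumed file names (only rep_seqs.qza and testtaxonomy.qza appear in the thread), and make_barplot is a hypothetical wrapper.

```shell
# Wrap `qiime taxa barplot` so that the inputs are explicit.
# Arguments: feature table, taxonomy, sample metadata, output .qzv
make_barplot() {
    qiime taxa barplot \
        --i-table "$1" \
        --i-taxonomy "$2" \
        --m-metadata-file "$3" \
        --o-visualization "$4"
}

# Usage (file names other than testtaxonomy.qza are assumptions):
#   make_barplot 04dada2_output/table.qza testtaxonomy.qza sample-metadata.tsv taxa-barplot.qzv
```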

Yes, that's the right pre-trained classifier! When you train a primer-specific classifier, the process should become more efficient and require less memory.

To keep only the Archaea reference sequences, you can use qiime taxa filter-seqs --p-include Archaea
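
Spelled out, the check from point 2 might look like the sketch below: filter the reference sequences to Archaea, then classify those sequences with the classifier that was already trained on the full reference (not with a new Archaea-only classifier). check_archaea_refs and the output prefix are hypothetical names. The input file names in the usage line are taken from the first post.

```shell
# Keep only Archaeal reference sequences, then classify them with the
# previously trained classifier to see whether they come back as Archaea.
# Arguments: reference seqs, reference taxonomy, trained classifier, output prefix
check_archaea_refs() {
    qiime taxa filter-seqs \
        --i-sequences "$1" \
        --i-taxonomy "$2" \
        --p-include Archaea \
        --o-filtered-sequences "${4}-seqs.qza" &&
    qiime feature-classifier classify-sklearn \
        --i-classifier "$3" \
        --i-reads "${4}-seqs.qza" \
        --o-classification "${4}-taxonomy.qza"
}

# Usage ("archaea-only" is an arbitrary output prefix):
#   check_archaea_refs silva-138.1-ssu-nr99-seqs-derep-uniq.qza \
#       silva-138.1-ssu-nr99-tax-derep-uniq.qza \
#       silva-138.1-ssu-nr99-classifier-full.qza archaea-only
```

If the Archaeal references are classified as Archaea here, the classifier itself can recognise Archaea, which points back to the query sequences.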

Rach · August 28, 2023, 6:18am

Hi Nicholas,
thank you for your help. I checked everything you suggested and was still having issues. In fact, the Archaea-specific classifier forced Bacteria to be classified as Archaea (confirmed with BLAST).

I created another amplicon-specific classifier and tested it on another dataset that I know has Archaea, and it worked. Therefore, the only explanation is that Archaea were not present in the samples that I was assured contained them.
Thank you for spending so much time on this. Apologies for the amount of time this took.

Cheers
Rachele

Nicholas_Bokulich · August 28, 2023, 6:21am

Hi @Rach ,

Yes, that would be expected if you have only Archaea in the reference database.
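
A rough way to see why, assuming the classifier behaves like a standard naive Bayes model whose posterior is normalised over the set of taxa $\mathcal{C}$ seen in training (the notation is illustrative and not taken from the q2-feature-classifier documentation):

```latex
P(c \mid x) = \frac{P(c)\,\prod_{k} P(x_k \mid c)}
                   {\sum_{c' \in \mathcal{C}} P(c')\,\prod_{k} P(x_k \mid c')},
\qquad
\sum_{c \in \mathcal{C}} P(c \mid x) = 1 .
```

If every taxon in $\mathcal{C}$ is Archaeal, all of the posterior mass for any query $x$, Bacterial or not, lands on Archaeal taxa, so the domain-level call is Archaea however poorly the query matches.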

I agree, sounds like that is the case.

Thanks for all the troubleshooting and very glad to hear that you worked it out in the end!