Difference in classification with complete sample set and a subsample

Hi,
While running an 18S analysis with qiime2-amplicon-2023.9 on a full run of 384 samples, I got very few taxonomy hits from the SILVA 138 full-length classifier. However, when a subset of 8 samples from the same run was analyzed separately, the hits were more diverse and more closely resembled those from an independent analysis. I checked both runs up to DADA2 denoising: for the same 8 samples, whether analyzed together with the rest of the 384 or on their own, the number of reads remaining after filtering and chimera removal is exactly the same.
Where could the issue be coming from?

Thank You for the help

Hi,

This is a continuation of my previous post. After digging further into the ASVs and their sequences from both the complete-run analysis and the subset analysis, I realized that for ASVs shared between the two analyses, the classifier classified them well and with fairly high confidence in the subset but failed to classify them in the complete run.
I am using the pre-trained SILVA 138.1 full-length classifier from the QIIME 2 website for both runs. I am also attaching a CSV file that should give a better idea of what I have described.

Thank you
Comparison_of_all_and_subset.csv (2.0 MB)

Hello @Mudit_Bhatia,

Would you mind posting the classification command that you used?

Thank you, Colin, for the response.

For both classifications, the following command was used; only the reads file was changed.

Hello @Mudit_Bhatia,

Would you mind sharing (attaching) the two resulting classification artifacts if they're not too large?

Hi Colin,

Please find the files attached.

Thank You
taxonomy_complete.qza (960.9 KB)
taxonomy_subset.qza (182.7 KB)

Hello @Mudit_Bhatia,

This can happen either when the sequences to be classified are in mixed orientation or when there's a nonsense sequence among them that causes the orientation detector to assume the wrong direction. Do you know whether your sequencing pipeline would have resulted in mixed orientation reads? If so, use qiime rescript orient-seqs to orient them consistently. If not, I would try rerunning classify-sklearn once with the --p-read-orientation parameter set to 'same' and once with it set to 'reverse-complement' and then check those taxonomies.
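
The two fixed-orientation reruns suggested above could be scripted roughly as follows. This is only a minimal sketch: the file names (`silva-classifier.qza`, `rep-seqs.qza`, `taxonomy-*.qza`) and the helper `classify_cmd` are placeholders invented for illustration, not taken from this thread. As written it just prints the commands; uncomment the `subprocess.run` line to execute them in an activated QIIME 2 environment.

```python
import shlex
import subprocess  # only needed if the run line below is uncommented

ALLOWED_ORIENTATIONS = {"auto", "same", "reverse-complement"}


def classify_cmd(classifier, reads, orientation, output):
    """Build a classify-sklearn command with an explicit read orientation."""
    if orientation not in ALLOWED_ORIENTATIONS:
        raise ValueError(f"unexpected orientation: {orientation!r}")
    return [
        "qiime", "feature-classifier", "classify-sklearn",
        "--i-classifier", classifier,
        "--i-reads", reads,
        "--p-read-orientation", orientation,
        "--o-classification", output,
    ]


# Hypothetical file names; substitute your own artifacts.
commands = [
    classify_cmd("silva-classifier.qza", "rep-seqs.qza",
                 orientation, f"taxonomy-{orientation}.qza")
    for orientation in ("same", "reverse-complement")
]

for cmd in commands:
    print(shlex.join(cmd))
    # subprocess.run(cmd, check=True)  # uncomment to actually run QIIME 2
```

Comparing the two resulting taxonomies should show which orientation matches the reference.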

Thank you, Colin, for the response.

I do not think the sequencing pipeline is producing reads with mixed orientation. If that was the case, would the ASVs generated be exactly the same? In the first file I shared, the ASVs common to both analyses (complete and subset) are exactly the same, yet they are being classified differently. In the subset they are assigned to eukaryotes with high confidence, but in the complete analysis they have much lower confidence and are either assigned to bacteria or left unassigned.
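
A comparison of this kind can be sketched in a few lines of Python. This assumes each taxonomy artifact has been exported to a tab-separated file with `Feature ID`, `Taxon`, and `Confidence` columns, and that the same ASV carries the same feature ID in both runs. The rows below are made up for illustration and are not the contents of the attached files; `load_taxonomy` and `differing_shared` are hypothetical helper names.

```python
import csv
import io


def load_taxonomy(handle):
    """Map feature ID -> (taxon, confidence) from a taxonomy TSV."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {
        row["Feature ID"]: (row["Taxon"], float(row["Confidence"]))
        for row in reader
    }


def differing_shared(complete, subset):
    """Return shared feature IDs whose taxon differs between the two runs."""
    shared = sorted(complete.keys() & subset.keys())
    return [
        (fid, complete[fid], subset[fid])
        for fid in shared
        if complete[fid][0] != subset[fid][0]
    ]


# Made-up rows standing in for two exported taxonomy tables.
complete_tsv = (
    "Feature ID\tTaxon\tConfidence\n"
    "asv1\tUnassigned\t0.41\n"
    "asv2\td__Bacteria\t0.72\n"
    "asv3\td__Eukaryota\t0.99\n"
)
subset_tsv = (
    "Feature ID\tTaxon\tConfidence\n"
    "asv1\td__Eukaryota\t0.98\n"
    "asv2\td__Eukaryota\t0.97\n"
)

complete = load_taxonomy(io.StringIO(complete_tsv))
subset = load_taxonomy(io.StringIO(subset_tsv))

for fid, in_complete, in_subset in differing_shared(complete, subset):
    print(fid, "complete:", in_complete, "subset:", in_subset)
```

With real files, replace the `io.StringIO(...)` objects with `open(path, newline="")` handles.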

I will try to run the analysis with reverse-complement and get back.

Thank You for your time and consideration.

Hello @Mudit_Bhatia,

If that was the case, would the ASVs generated be exactly the same?

I'm not sure, but it seems possible. I think it is more likely that there is a nonsense sequence in the full set that confused the orientation detector and tanked everything.

Thank you, Colin, for the response.

I ran the complete analysis of the run in subsets, and most of the subsets gave fairly good classifications. I am trying to stick with the SILVA 138.1 classifier available on the QIIME 2 website because I have been having issues training the feature classifier myself, something I am discussing here

I have tried both the RESCRIPt protocol and the protocol for QIIME 2 versions before 2024.2. Is there any other approach I can try without having to train the feature classifier?

Hello @Mudit_Bhatia,

Just to be clear, did you try running the complete set with the same classifier but using the --p-read-orientation parameter?

Thank you, @colinvwood, for the response. I missed that point earlier, but I have now run it: with --p-read-orientation 'same' the classification works well, whereas with --p-read-orientation 'reverse-complement' it does not classify properly.

That solves my issue, but if you do not mind, could you please explain why the default setting, --p-read-orientation 'auto', did not do the job?

Thank you for your support

Hello @Mudit_Bhatia,

The hypothesis, as suggested by @Nicholas_Bokulich, was that a nonsense/garbage read was picked up by the orientation checker and caused the direction of the reads to be misinterpreted. Hardcoding the orientation prevented this from happening. In your subset the nonsense read either wasn't present or wasn't picked up by the orientation checker, so you didn't have the problem there.
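
To make that hypothesis concrete, here is a toy model of orientation auto-detection. It is not the q2-feature-classifier implementation: the k-mer overlap "confidence", the tie-breaking rule, the sample size `n`, and every sequence below are assumptions chosen only to illustrate how an uninformative read in the inspected sample can tip the decision.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def kmers(seq, k=4):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def confidence(read, ref_kmers, k=4):
    """Toy score: fraction of the read's k-mers found in the reference."""
    read_kmers = kmers(read, k)
    if not read_kmers:
        return 0.0
    return len(read_kmers & ref_kmers) / len(read_kmers)


def detect_orientation(reads, ref_kmers, n=100):
    """Score the first n reads in both directions and pick the better one.

    Assumption of this sketch: ties fall through to 'reverse-complement'.
    """
    sample = reads[:n]
    same = sum(confidence(r, ref_kmers) for r in sample)
    reverse = sum(confidence(reverse_complement(r), ref_kmers) for r in sample)
    return "same" if same > reverse else "reverse-complement"


# Made-up reference and reads; the good reads are in the 'same' orientation.
reference = "ACGTTGCAAGCTTAGGCTAACCGTATCG"
ref_kmers = kmers(reference)
good_reads = [reference[0:16], reference[6:24], reference[10:28]]
garbage = "GGGGGGGGGGGG"  # matches the reference in neither direction

print(detect_orientation(good_reads, ref_kmers))                  # same
print(detect_orientation([garbage] + good_reads, ref_kmers, n=1))  # reverse-complement
print(detect_orientation([garbage] + good_reads, ref_kmers))      # same
```

In this toy, the garbage read only causes trouble when it dominates the sample that the detector inspects, which is why passing the orientation explicitly sidesteps the problem.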

Thank You for the support and explanation!
