Missing taxa using Unite classifier for ITS2

I have soil samples, ITS2, V3-V4. I used only the forward reads for analysis. I used the following code to train the UNITE classifier. I found that major phyla, comprising about 20-25% of all phyla, are missing from the taxa bar plot.

I would be grateful if you could help me.

# Import the UNITE sequences
qiime tools import \
   --type 'FeatureData[Sequence]' \
   --input-path sh_qiime_release_2024/sh_refs_qiime_ver10_dynamic_all_04.04.2024.fasta \
   --output-path unite-seqs.qza


# Import the UNITE taxonomy
qiime tools import \
   --type 'FeatureData[Taxonomy]' \
   --input-format HeaderlessTSVTaxonomyFormat \
   --input-path sh_qiime_release_2024/sh_taxonomy_qiime_ver10_dynamic_all_04.04.2024.txt \
   --output-path unite-taxonomy.qza


# Train the classifier
qiime feature-classifier fit-classifier-naive-bayes \
   --i-reference-reads unite-seqs.qza \
   --i-reference-taxonomy unite-taxonomy.qza \
   --o-classifier unite-classifier.qza

# Classify your sequences
qiime feature-classifier classify-sklearn \
   --i-classifier unite-classifier.qza \
   --i-reads rep-seqs.qza \
   --o-classification taxonomy.qza

I have also attached the taxa bar plot.
taxa-bar-plots.qzv (1.8 MB)

Thank you!

Hi @umanand, your output looks fine to me.

Can you provide more information? What is it you are expecting? What is the purpose of your study?

Normally, quite a bit of quality control is performed even after taxonomy assignment. That is, removing anything that does not have at least a phylum-level assignment, unassigned sequences, etc.
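
As a rough sketch of that filtering idea, here is one way to list the features that lack a phylum label. It assumes the taxonomy has been exported to a tab-separated file with Feature ID, Taxon and Confidence columns. The demo rows, the helper name `no_phylum_ids`, and the treatment of `p__unidentified` as "no phylum" are assumptions for illustration, not taken from the data in this thread.

```shell
# Sketch: list features whose taxonomy string has no usable phylum label.
# The rows below are invented demo data; in practice the file would be the
# TSV exported from a taxonomy artifact.
demo_taxonomy="$(mktemp)"
printf 'Feature ID\tTaxon\tConfidence\n' > "$demo_taxonomy"
printf 'asv1\tk__Fungi;p__Ascomycota;c__Sordariomycetes\t0.99\n' >> "$demo_taxonomy"
printf 'asv2\tk__Fungi\t0.85\n' >> "$demo_taxonomy"
printf 'asv3\tUnassigned\t0.40\n' >> "$demo_taxonomy"
printf 'asv4\tk__Fungi;p__Glomeromycota\t0.91\n' >> "$demo_taxonomy"

# Hypothetical helper: print the IDs of features with no "p__" rank in the
# taxon string, or with a phylum labelled "unidentified" (an assumption
# about how unresolved phyla are written in the reference taxonomy).
no_phylum_ids() {
  awk -F'\t' 'NR > 1 && ($2 !~ /p__/ || $2 ~ /p__unidentified/) { print $1 }' "$1"
}

no_phylum_ids "$demo_taxonomy"
```

Within QIIME 2 itself, the same kind of filter is usually applied to the feature table with `qiime taxa filter-table` and its `--p-include p__` option, rather than by hand on an exported file.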

A good place to start is the QIIME 2 tutorials, I'd recommend this


Hello Mike,
My objective is to find out how crop rotation and nitrogen fertilizer rate affect soil microbial communities. I am looking at the fungal community composition, especially arbuscular mycorrhizal fungi (Glomeromycota). What I notice is that even at level 2, the information for the top four phyla is missing; it shows kingdom Fungi and nothing more than that. I was wondering why that is: is there a mistake in my code, or was I not able to train the classifier?

Thank you!
Urmila


Likely due to DNA extraction or the quality of the data. Again, your file looks quite typical. Are you sure that the primers you used are good at targeting arbuscular mycorrhizal fungi? Usually, one picks the primer set that best targets the organisms of interest.

Have you performed any data filtering and analysis yet? I'd not worry too much about taxonomy until you perform some alpha and beta diversity analysis with your ASVs. That will address your question about how nitrogen fertilization affects your fungal communities. The analysis is performed at the ASV level. If you see changes based on your treatment, then the taxonomy won't matter much anyway, unless you have other questions specific to taxonomy.

I noticed that your truncation length for DADA2 denoise-single is 285 bases. That is quite long. How is the quality in that part of the sequence? If the values are below 25 or 20 I'd try for a shorter truncation length. Otherwise there is too much noise for the denoiser to disambiguate true base changes from PCR / sequencing error and you'll obtain spurious ASVs (assuming low quality in that region).


I used the following primers, as specified by the sequencing facility:
ITS2_Primer.

Please find the attached file of demux.qzv.
demux-single-end.qzv (291.8 KB).

I followed the following steps for denoising

Denoise the sequences using DADA2:
qiime dada2 denoise-single \
   --i-demultiplexed-seqs demux-single-end.qza \
   --p-trim-left 0 \
   --p-trunc-len 285 \
   --o-table table.qza \
   --o-representative-sequences rep-seqs.qza \
   --o-denoising-stats denoising-stats.qza

Trim primers and adapters using cutadapt:
qiime cutadapt trim-single \
  --i-demultiplexed-sequences demux-single-end.qza \
  --p-front TCGATGAAGAACGCAGCG \
  --p-error-rate 0.1 \
  --o-trimmed-sequences trimmed-seqs.qza

4. Extract the ITS2 region using q2-itsxpress:
qiime itsxpress trim-single \
    --i-per-sample-sequences trimmed-seqs.qza \
    --p-region ITS2 \
    --p-taxa F \
    --o-trimmed trimmed-itsxpress.qza

Thanks!
Urmila

Actually, those are nice quality scores. When possible I like to set my truncation to the position just before the 'bottom of the box' goes below 30. If I were to do this with your data, I'd be super strict and set the truncation length to ~245 (if using the forward read only). But many are okay with 20-25, which I think is okay for merging paired-ends on occasion. Or, to be more lenient, you could set the truncation length to ~278, where the 'bottom of the box' is 25.
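
To make the 'bottom of the box' rule concrete, here is a minimal sketch that finds the last position before the 25th-percentile quality first drops below a chosen threshold. The two-column input (position, 25th-percentile score), the helper name `trunc_position`, and the demo values are all hypothetical; they are not the scores from the demux file in this thread.

```shell
# Sketch: pick a truncation position from per-position lower-quartile scores.
# Assumed input: tab-separated "position<TAB>q25", one row per position,
# sorted by position. The values below are invented demo data.
demo_quality="$(mktemp)"
printf '1\t36\n2\t36\n3\t35\n4\t31\n5\t29\n6\t30\n7\t24\n' > "$demo_quality"

# Usage: trunc_position FILE MIN_Q25
# Prints the last position seen before q25 first falls below MIN_Q25
# (0 if even the first position is below it).
trunc_position() {
  awk -F'\t' -v min="$2" '
    BEGIN { last = 0 }
    $2 < min { print last; found = 1; exit }
    { last = $1 }
    END { if (!found) print last }
  ' "$1"
}

trunc_position "$demo_quality" 30   # strict cutoff
trunc_position "$demo_quality" 25   # more lenient cutoff
```

Running it with two thresholds side by side is a quick way to see how far apart the strict and lenient choices are before re-running the denoiser.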

There are many opinions on this, so you'll just have to play around and see what works. Often you are stuck with the data you have. Again, I suggest you do some preliminary analysis first. In fact, you can compare whether or not the truncation settings affect your data interpretation.


Hello Mike,
Thank you for explaining so well. I was actually taking the forward sequence only, but I think I should try taking both sequences. When I took both F and R sequences for 16S, the percentage of merged non-chimeric sequences was only 36%. Do you think this percentage is enough for further analysis, or do you suggest taking only the forward sequence?
I will be grateful for your suggestions.

Thank you!

Are you losing data at the merge or at the chimera filtering? If at merging, then I suggest playing around with the truncation parameters, and/or trying deblur for denoising. If you are losing many reads due to chimera checking, I'd suggest setting DADA2's parameter --p-min-fold-parent-over-abundance 8 as per:


Hi Mike,
Yes, I lost most during merging.
I am wondering if we could use AMF-specific primers with the UNITE classifier instead of the primers used by the genomic center?

Thanks,
Urmila

In this case, I'd recommend only using the forward read. Otherwise, you'll bias your interpretation of the data by only retaining ITS reads that you can merge. Basically, I often follow the advice from this article: Parsing ecological signal from noise in next generation amplicon sequencing.