Taxonomic assignment with UNITE database: most reads classified as "s__unidentified"

Hi everyone,

I am using QIIME 2 for the first time to analyze ITS data (Illumina MiSeq v3, 300 bp paired-end reads) for fungal endophyte diversity in plant roots.
I received several folders from the sequencing facility containing sequences at different stages of pre-processing. I started the pipeline by importing the primer-clipped sequences -> trimming with q2-itsxpress -> DADA2 to identify sequence variants (truncation length set to 0 because the data quality was good).

For classification, I downloaded the latest UNITE database (https://files.plutof.ut.ee/public/orig/98/AE/98AE96C6593FC9C52D1C46B96C2D9064291F4DBA625EF189FEC1CCAFCF4A1691.gz).

At the end, I got the "taxa-bar-plots.qzv" file and, to my surprise, there was very little diversity in my samples. I have seen the same pattern in a different data set that I analyzed. My question is: is there a chance that I am doing something wrong that leads to this classification?

I know that I have some negative controls with contamination, but even so, I should be able to get a more refined taxonomic assignment, right?

Thanks a lot,
Danilo

Off the top of my head, I'm not sure. I would recommend sharing the exact commands that you ran here, as well as the actual .qzv you mentioned, if possible. That will make it much easier for somebody to spot-check your analysis. :slightly_smiling_face:

Hi @andrewsanchez ,

Thank you for your reply. I tried a different database and got different results (see attached taxa-bar-plots.qzv), but still with many taxa unidentified. What I did differently this time was to change the UNITE database to a more recent one (sh_qiime_release_s_04.02.2020.tar.gz).
But I'm also not sure which version of UNITE I should use: the one that "Includes singletons set as RefS (in dynamic files)" or the one that "Includes global and 97% singletons", or whether that makes no difference.

Below you can see the commands I used:

qiime tools import \
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path manifest3 \
  --output-path paired-end-demux.qza \
  --input-format PairedEndFastqManifestPhred33V2

qiime itsxpress trim-pair-output-unmerged \
  --i-per-sample-sequences sequences.qza \
  --p-region ITS1 \
  --p-taxa F \
  --o-trimmed trimmed.qza

qiime itsxpress trim-pair-output-unmerged \
  --i-per-sample-sequences sequences.qza \
  --p-region ITS1 \
  --p-taxa F \
  --p-cluster-id 1.0 \
  --p-threads 2 \
  --o-trimmed trimmed_exact.qza

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs trimmed.qza \
  --p-trunc-len-r 0 \
  --p-trunc-len-f 0 \
  --output-dir dada2out

qiime feature-table summarize \
  --i-table dada2out/table.qza \
  --o-visualization tableviz.qzv

Downloading the UNITE database (release date 2020.02.20 - Includes global and 97% singletons)
wget https://files.plutof.ut.ee/public/orig/01/38/0138B5D5EA2C77B8C2E5B910202FD3E60A9244FC31084E08DAD63E213A03BBFB.gz

qiime tools import \
  --type 'FeatureData[Sequence]' \
  --input-path developer/sh_refs_qiime_ver8_dynamic_s_04.02.2020_dev_uppercase.fasta \
  --output-path unite.qza

qiime tools import \
  --type 'FeatureData[Taxonomy]' \
  --input-path developer/sh_refs_qiime_ver8_dynamic_s_04.02.2020_s_04.02.2020_dev.txt \
  --output-path unite-ver8-99-tax-04.02.2020.qza \
  --input-format HeaderlessTSVTaxonomyFormat

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads unite-ver8-99-seqs-04.02.2020.qza \
  --i-reference-taxonomy unite-ver8-99-tax-04.02.2020.qza \
  --o-classifier unite-ver8-99-classifier-04.02.2020.qza

qiime feature-classifier classify-sklearn \
  --i-classifier unite-ver8-99-classifier-04.02.2020.qza \
  --i-reads dada2out/representative_sequences.qza \
  --o-classification taxonomy.qza

qiime taxa barplot \
  --i-table dada2out/table.qza  \
  --i-taxonomy taxonomy.qza \
  --m-metadata-file metadata3 \
  --o-visualization taxa-bar-plots.qzv

Thank you for your help!
taxa-bar-plots.qzv (377.5 KB)

Hi @Danilo_Reis ,
Two hypotheses for the low diversity: either (1) too many reads were lost during QC, or (2) your reads are hitting junk reads in the UNITE database, e.g., abnormally short seqs.
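
To gauge hypothesis (2) before filtering anything, a quick look at the reference lengths can help. Below is a minimal sketch in plain shell/awk, assuming you have the UNITE reference as a FASTA file; the function name `short_refs` and the cutoff you pass in are hypothetical choices for illustration, not part of QIIME 2 or RESCRIPt.

```shell
# Sketch: list reference sequences shorter than a minimum length.
# Assumes a plain FASTA file (sequences may wrap over several lines).
# Usage: short_refs FASTA MIN_LEN   -> prints "id<TAB>length" per short record
short_refs() {
  awk -v min="$2" '
    # print the record collected so far if it is below the cutoff
    function report() { if (id != "" && len < min) printf "%s\t%d\n", id, len }
    /^>/ { report(); id = substr($1, 2); len = 0; next }  # new record header
    { len += length($0) }                                 # accumulate sequence length
    END { report() }                                      # do not forget the last record
  ' "$1"
}
```

For example, `short_refs refs.fasta 100 | wc -l` would count the records under 100 bp (both the file name and the 100 bp cutoff are placeholders).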

Troubleshooting/solutions:

  1. Look at your DADA2 stats and feature table summaries, keeping an eye on if/where reads are lost. If you are losing many reads during merging, analyze single-end reads instead of paired-end.
  2. Use RESCRIPt to filter out abnormally short/long sequences, and maybe q2-taxa to remove any unidentified sequences from UNITE, if desired. See the tutorial:
    Processing, filtering, and evaluating the SILVA database (and other reference sequence data) with RESCRIPt
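
For step 1, here is a rough sketch of how the read loss could be tabulated once the DADA2 denoising stats have been exported to a TSV file. It assumes the file has a header row with columns named "input", "filtered", "merged" and "non-chimeric" plus a "#q2:types" row; that layout is an assumption, so check your own file. The function name `dada2_retention` is made up for this example.

```shell
# Sketch: report the share of input reads surviving each DADA2 step.
# Assumes a tab-separated stats file whose header contains the columns
# "input", "filtered", "merged" and "non-chimeric" (assumed layout).
# Usage: dada2_retention stats.tsv
dada2_retention() {
  awk -F'\t' '
    NR == 1 {                          # locate columns by name, not position
      for (i = 1; i <= NF; i++) col[$i] = i
      print "sample\tfiltered%\tmerged%\tnon-chimeric%"
      next
    }
    /^#/ { next }                      # skip directive rows such as "#q2:types"
    $(col["input"]) > 0 {              # samples with zero input reads are skipped
      printf "%s\t%.1f\t%.1f\t%.1f\n", $1,
        100 * $(col["filtered"])     / $(col["input"]),
        100 * $(col["merged"])       / $(col["input"]),
        100 * $(col["non-chimeric"]) / $(col["input"])
    }
  ' "$1"
}
```

A large drop between the filtered and merged percentages would point to the merging problem described in step 1.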

Note: I changed the title to be more descriptive. Thanks!

Hope that helps!

Hi @Nicholas_Bokulich ,
Thank you so much for your reply. I think I figured out the problem. I tried two different UNITE databases: one including only fungal sequences and another with all eukaryote sequences.
Those sequences that were previously not assigned to any fungal phyla are actually plant sequences :sob:
I'm looking at root-associated fungal communities in 10 different plant species and this happened in some of them, especially those that are known to be less colonized by fungi.
I'll filter out the plant sequences and work only with the fungal ones. I just don't know whether this low number of fungal reads is enough to compare my samples. What do you think?
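
In case it helps others, this is roughly how the split between kingdoms can be checked from an exported taxonomy table. It is only a sketch: it assumes a tab-separated file with one header row and the taxon string in the second column (e.g. "k__Fungi;p__..."), and the function name `kingdom_counts` is just something I made up. It counts features, not reads. For the actual filtering, q2-taxa has a `filter-table` action with a `--p-include` option that should be able to keep only `k__Fungi`.

```shell
# Sketch: count features per kingdom in an exported taxonomy table.
# Assumes tab-separated columns (feature ID, taxon, confidence), one header
# row, and UNITE-style taxon strings such as "k__Fungi;p__Ascomycota;...".
# Usage: kingdom_counts taxonomy.tsv
kingdom_counts() {
  awk -F'\t' '
    NR == 1 || /^#/ { next }             # skip header and directive rows
    {
      split($2, ranks, ";")              # first rank is the kingdom
      gsub(/^ +| +$/, "", ranks[1])      # tolerate stray spaces around it
      n[ranks[1]]++
    }
    END { for (k in n) printf "%s\t%d\n", k, n[k] }
  ' "$1" | sort
}
```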

cheers,
Danilo

Not sure — filter and pray :pray:

You can run something like alpha rarefaction after filtering to see whether the reads you have are sufficient to saturate species diversity.
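
To illustrate the idea behind a rarefaction curve, here is a toy sketch in shell/awk, not the QIIME 2 implementation. It assumes a hypothetical input file with one read count per line (one line per feature, for a single sample), subsamples reads without replacement to a given depth, and prints how many distinct features were seen. The function name `rarefied_richness` and the input format are made up for illustration. In QIIME 2 itself, `qiime diversity alpha-rarefaction` does this across all samples and several iterations.

```shell
# Sketch: observed richness after subsampling one sample to a given depth.
# Input (hypothetical format): one read count per line, one line per feature.
# Usage: rarefied_richness FILE DEPTH [SEED]
rarefied_richness() {
  awk -v depth="$2" -v seed="${3:-1}" '
    { for (i = 0; i < $1; i++) reads[++total] = NR }  # one array entry per read
    END {
      srand(seed)
      if (depth > total) depth = total                # cannot exceed sample size
      for (i = 1; i <= depth; i++) {                  # partial Fisher-Yates shuffle
        j = i + int(rand() * (total - i + 1))         # pick from the unsampled reads
        tmp = reads[i]; reads[i] = reads[j]; reads[j] = tmp
        if (!(reads[i] in seen)) { seen[reads[i]] = 1; richness++ }
      }
      print richness + 0
    }
  ' "$1"
}
```

Calling it at increasing depths (e.g. in a `for` loop) and watching whether the richness levels off gives a rough picture of saturation. It keeps one array entry per read, so it is only meant for small examples.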

Good luck!
