A large number of bacteria_ unassigned reads in the final feature table

Hi everyone,
Wishing all a very happy 2020.

I have a problem with my QIIME 2 analysis. I have analysed my data in several different ways using QIIME 2 commands. In my first analysis, I used cutadapt to trim my raw sequences and then ran DADA2 denoising with the following parameters:

qiime cutadapt trim-paired --i-demultiplexed-sequences bac-paired-end-demux.qza --p-cores 28 --p-front-f CCTACGGGNBGCASCAG --p-anywhere-f AGATCGGAAGAG TGGAATTCTCGG GATCGTCGGACT CTGTCTCTTATA CGCCTTGGCCGT --p-front-r GACTACNVGGGTATCTAATCC --p-anywhere-f AGATCGGAAGAG TGGAATTCTCGG GATCGTCGGACT CTGTCTCTTATA CGCCTTGGCCGT --o-trimmed-sequences trim-bac-paired-end-demux.qza --verbose

qiime dada2 denoise-paired --i-demultiplexed-seqs trim-bac-paired-end-demux.qza --p-trunc-len-r 0 --p-trunc-len-f 0 --p-max-ee-f 4.0 --p-max-ee-r 4.0 --p-trunc-q 2 --p-n-threads 0 --o-table trim-5-paired-table.qza --o-representative-sequences trim-5-rep-seqs-paired.qza --o-denoising-stats trim-5-denoising-stats-paired.qza --p-chimera-method consensus --verbose

I am attaching the denoising stats results along with the table obtained:
trim-5-denoising-stats-paired.qzv (1.2 MB) trim-5-paired-table.qzv (739.0 KB)

After open-reference OTU picking against the Greengenes database, chimera removal with the qiime vsearch uchime-denovo command, taxonomic classification with qiime feature-classifier classify-sklearn, and filtering of mitochondrial and chloroplast sequences with qiime taxa filter-table and qiime taxa filter-seqs, we obtained the following final table:
table-nc-wobl-no-mitochondria-no-chloroplast.qzv (2.5 MB)

The final feature table with taxonomy information was produced with the qiime tools export, biom add-metadata and biom convert commands. I am attaching the resulting feature-table-with-taxonomy.tsv file.
feature-table-with-taxonomy.tsv (686.9 KB)

But when I checked the OTU information in the feature table, a large share of the OTUs were assigned only to the bacteria_ level. They even make up the top 10 bacterial taxa in most of my samples.

I don’t know how to solve this problem. When I copied the sequences of these OTUs and compared them against the Ezbiocloud database, they were classified down to species level, with above 99% similarity to the type strain sequences deposited in the database.

However, the sequence orientation was reported as reverse in the result.

I am wondering whether the sequence orientation has something to do with OTU picking and taxonomic assignment. Are there any steps to check the orientation of the raw forward and reverse reads and to fix this issue?

Sorry for such a long query. Looking forward to your suggestions, as I am stuck and confused at this stage.

Hello Femi,

Thank you for your detailed question. Several things can influence taxonomy classification, so let’s discuss them.

  1. Change the Algorithm:
    You tried classify-sklearn with some success, but maybe classify-consensus-vsearch would work better for your data set.
  2. Change the Database:
    Given that you used open-ref OTU picking against Greengenes, you probably used it for taxonomy classification too. Maybe SILVA would work better. (It’s much newer!) Or, try using Ezbiocloud as a custom database.
  3. Keep it simple:
    Adding extra steps, like a second round of chimera checking and OTU clustering after DADA2 denoising, can be helpful… or not. What if you tried running the taxonomy classifier directly on the DADA2 features?
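
To compare the outcome of these options, it can help to count how many features stop at each taxonomic rank. Below is a minimal sketch, assuming semicolon-separated taxonomy strings with optional rank prefixes such as `k__` or `D_0__`; the function names and the example strings are hypothetical and not taken from your data.

```python
from collections import Counter

# Rank names in order; taxonomy strings are assumed to follow this order.
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]


def deepest_rank(taxon):
    """Return the deepest rank that carries a name, or 'unassigned'."""
    depth = 0
    for level in taxon.split(";"):
        level = level.strip()
        # Drop a rank prefix such as 'k__' or 'D_0__' if present.
        name = level.split("__", 1)[1] if "__" in level else level
        if not name or name.lower() == "unassigned":
            break
        depth += 1
    if depth == 0:
        return "unassigned"
    return RANKS[min(depth, len(RANKS)) - 1]


def rank_summary(taxa):
    """Count how many taxonomy strings end at each rank."""
    return Counter(deepest_rank(t) for t in taxa)


# Hypothetical taxonomy strings, for illustration only.
example = [
    "k__Bacteria; p__; c__; o__; f__; g__; s__",
    "k__Bacteria; p__Firmicutes; c__Bacilli",
    "Unassigned",
]
print(rank_summary(example))
```

Running the same summary on the taxonomy from each classifier and database gives a quick side-by-side view of which combination resolves more features below the kingdom level.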

This is a tricky problem with no one answer. Let us know what you try next!
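
On the orientation question: one rough way to check whether a representative sequence is reverse-complemented relative to a reference is to count shared k-mers in both orientations. The sketch below is only an illustration under stated assumptions: the function names, the k-mer size of 8 and the toy sequences are made up for the example, and a real check would use an actual reference sequence (for instance a type strain 16S sequence). If the reads do turn out to be flipped, the `--p-read-orientation` option of classify-sklearn is worth a look.

```python
# IUPAC-aware complement table (covers degenerate bases such as N, B, V, S).
COMPLEMENT = str.maketrans("ACGTRYKMSWBDHVN", "TGCAYRMKSWVHDBN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def kmers(seq, k=8):
    """Return the set of all k-mers in a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def guess_orientation(query, reference, k=8):
    """Compare shared k-mers for the query as given and reverse-complemented."""
    ref_kmers = kmers(reference, k)
    same = len(kmers(query, k) & ref_kmers)
    flipped = len(kmers(reverse_complement(query), k) & ref_kmers)
    if same == flipped:
        return "undetermined"
    return "same" if same > flipped else "reverse-complement"


# Toy sequences, for illustration only (not real 16S data).
reference = "ACGTTGCAAGGCTTACCGATAGGCTAACGTTAGCCATGGATCCGATTACGA"
query = reverse_complement(reference[5:40])
print(guess_orientation(query, reference))
```

This is a heuristic, not an alignment: it only says which orientation shares more exact k-mers with the reference, which is usually enough to spot a whole batch of flipped reads.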


P.S. Wait… let me ask about this:

What is your sequencing platform?
EDIT: What is the region you sequenced?

Thanks, Colin, for your reply. My sequencing platform was Illumina HiSeq, and the region we sequenced was V3-V4. I have also tried SILVA for the same analysis and got much the same result, i.e. too many reads assigned only to bacteria_.
But as you suggested, I will try classify-consensus-vsearch, and I will also run the taxonomy classifier on the DADA2 features.
If you could suggest how to set up Ezbiocloud as a custom database, it would be a great help.
Thank you once again.

Femi Thomas

The V3-V4 region is important here! Some of the classifiers were trained only on the V4 region, and so will not work (at all!) on other regions. Try using vsearch and a full-length database.

Or, try building your own classifier, which will work great for SILVA with V3-V4 or with the Ezbiocloud database.

And of course, let us know if you have any questions.

Thanks once again, Colin. I have trained both the SILVA and the Ezbio databases on the V3-V4 region and then used them with classify-consensus-vsearch. With the Ezbio database the result is good (the feature table is attached):
feature-table-ezbio-with-taxonomy-femi.tsv (880.2 KB)
But there are a lot of unassigned OTUs. I guess this is because Ezbio only contains type strain sequences, if I am not wrong. So now I am wondering how to deal with the unassigned OTUs in the result. Your suggestion of running classify-consensus-vsearch directly on the DADA2 denoising results was useful: I got more OTUs than in the initial results. Now let me try the same with the SILVA database and see how it works.
Do you have any suggestions regarding the unassigned OTUs when using the Ezbio database?

Use SILVA instead ;)

No, seriously, using both is a good idea. You run two different feature classifiers: one on a standard database like SILVA, and one on a specialized database like Ezbio. For 90% of your analysis, graphs, etc., you can use the standard database, then discuss the alternative taxonomy for specific ASVs based on Ezbio.

Best of both worlds!
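
One way to put this into practice is to keep the standard taxonomy as the primary label and attach the specialized one only where it resolves further. The following is a rough sketch, assuming two exported tab-separated taxonomy tables with `Feature ID` and `Taxon` columns; the function names, feature IDs and taxonomy strings are hypothetical and stand in for your own files.

```python
import csv
import io


def read_taxonomy(handle):
    """Map feature ID -> taxon string from a tab-separated table."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["Feature ID"]: row["Taxon"] for row in reader}


def depth(taxon):
    """Number of leading ranks that carry a name (0 for 'Unassigned')."""
    count = 0
    for level in taxon.split(";"):
        level = level.strip()
        # Drop a rank prefix such as 'k__' or 'D_0__' if present.
        name = level.split("__", 1)[1] if "__" in level else level
        if not name or name.lower() == "unassigned":
            break
        count += 1
    return count


def combine(primary, secondary):
    """Keep the primary taxon; attach the secondary one where it is deeper."""
    combined = {}
    for fid, taxon in primary.items():
        alt = secondary.get(fid, "Unassigned")
        combined[fid] = {
            "taxon": taxon,
            "alternative": alt if depth(alt) > depth(taxon) else None,
        }
    return combined


# Hypothetical tables standing in for exported taxonomy files.
silva = io.StringIO(
    "Feature ID\tTaxon\tConsensus\n"
    "asv1\tD_0__Bacteria\t1.0\n"
    "asv2\tD_0__Bacteria;D_1__Firmicutes\t1.0\n"
)
ezbio = io.StringIO(
    "Feature ID\tTaxon\tConsensus\n"
    "asv1\tBacteria;Firmicutes;Bacilli\t1.0\n"
    "asv2\tUnassigned\t1.0\n"
)
result = combine(read_taxonomy(silva), read_taxonomy(ezbio))
for fid, labels in result.items():
    print(fid, labels["taxon"], labels["alternative"])
```

With real files, the `io.StringIO` objects would be replaced by open file handles. Features left unassigned by the specialized database simply keep their standard label.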