16S taxonomic analysis

AndreaC · September 22, 2021, 8:09am

Dear all,

I am trying to replicate the results elaborated by another research group on a taxonomic analysis of the 16S, but I keep getting completely different percentages than theirs.

Version 132 of Silvadb, Archive , was used, using the files / sequences corresponding to 99% identity, using the q2-feature-classifier.

For the analysis, I performed the following commands as read on the tutorial "Training feature classifiers with q2-feature-classifier" Tutorials — QIIME 2 2021.8.0 documentation

qiime tools import \
  --type 'FeatureData[Sequence]' \
  --input-path path/silva_132_99_16S.fna \
  --output-path path/silva_132_99_16S.qza

qiime tools import \
  --type 'FeatureData[Taxonomy]' \
  --input-format HeaderlessTSVTaxonomyFormat \
  --input-path path/taxonomy_7_levels.txt \
  --output-path path/ref-taxonomy.qza
  
  qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads path/silva_132_99_16S.qza \
  --i-reference-taxonomy path/ref-taxonomy.qza \
  --o-classifier path/16Sclassifier.qza

After getting the classifier.qza, I did the following as per tutorial "moving pictures" “Moving Pictures” tutorial — QIIME 2 2021.8.0 documentation

qiime tools import \ 
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path path/se_33_manifest \
  --output-path path/paired-end-demux.qza \
  --input-format PairedEndFastqManifestPhred33
  
  qiime demux summarize \
  --i-data path/paired-end-demux.qza \
  --o-visualization path/demux.qzv
  
  qiime tools view path/demux.qzv
  
  qiime dada2 denoise-single \
  --i-demultiplexed-seqs path/paired-end-demux.qza \
  --p-trim-left 0 \
  --p-trunc-len 200 \
  --o-representative-sequences path/rep-seqs-dada2.qza \
  --o-table path/table-dada2.qza \
  --o-denoising-stats path/stats-dada2.qza
  
qiime metadata tabulate \
  --m-input-file path/stats-dada2.qza \
  --o-visualization path/stats-dada2.qzv
  
  qiime tools view path/stats-dada2.qzv
 
  mv path/rep-seqs-dada2.qza path/rep-seqs.qza
  mv path/table-dada2.qza path/table.qza
  
  qiime feature-classifier classify-sklearn \
  --i-classifier path/16Sclassifier.qza \
  --i-reads path/rep-seqs.qza \
  --o-classification path/taxonomy.qza
   
 qiime metadata tabulate \
  --m-input-file path/taxonomy.qza \
  --o-visualization path/taxonomy.qzv
  
   qiime tools view path/taxonomy.qzv
  
  
  qiime taxa barplot \
  --i-table path/table.qza \
  --i-taxonomy path/taxonomy.qza \
  --o-visualization path/taxa-bar-plots.qzv
  
qiime tools view path/taxa-bar-plots.qzv

Since as reported in a previous topic SILVA 16S database , I managed to get the percentages identical to those provided using kraken2 and the specific silva16S database, I deduce that the data provided are correct, and I am continuing to make something wrong in using qiime2.

What am I still doing wrong?

Thanks

AC

thermokarst · September 24, 2021, 11:52pm

Hi @AndreaC!

This is a really broad, open-ended question, that I'm not sure we can specifically answer - at least not with the current summary provided.

High-level, the workflow you've shared looks reasonable to me, and I don't see any immediate flags. I think if you're observing discrepancies with your colleagues you should take a step back and figure out all of the individual steps in the two workflows, and ensure that you're applying the same kinds of steps at the same points.

If you have any more specific questions we will be glad to take a shot at answering them. :qiime2:

AndreaC · September 27, 2021, 3:14pm

Thanks for the reply.

However, I cannot understand how it is possible to have such big differences using different databases, and totally different results by selecting part of the same database. Either depending on the database used you can get totally different results randomly, or there is an error in my procedure for creating the classifier.qza (I think it is MUCH more probable).

At this point my specific question is: what could I have done wrong in creating the classifier.qza?

The reference article for the primers used to amplify the V3 region of 16S is this (Probio_Uni, Probio_Rev): Assessing the Fecal Microbiota: An Optimized Ion Torrent 16S rRNA Gene-Based Analysis Protocol

where is the hitch?!

Thanks

A.C.

thermokarst · September 28, 2021, 5:47pm

I'm not sure I understand the question. Wouldn't you expect different databases to produce different results?

That question has been discussed extensively elsewhere on this forum, I encourage you to to have a look around. In particular, you might want to have a look at the rescript plugin. If you have any additional or new questions, please open a new topic or topics.

:qiime2:

AndreaC · September 29, 2021, 2:21pm

I'm not sure I understand the question. Wouldn't you expect different databases to produce different results?

Using different databases, I expect that there is a deviation from the value of an acceptable percentage (1-5%), not that a microbial class goes from 20% to 60%!

Are such large differences justified by a different database? For this I suspect that I am making mistakes in creating the database

I tried to contact the supplier of the analyzes, their procedure and their parameters remain top secret (not being able to replicate the results is unscientific! )

If the above procedure is correct for you, can I consider it as valid as theirs, and are the different results justified by a different database even if totally different?

following your tutorials I don't seem to have made mistakes, are there any parameters or checks that I missed that could have compromised the results?

Thanks for the support

A.C.

thermokarst · September 29, 2021, 2:44pm

To me that sounds like it has less to do with the database used, and more to do with the upstream processing steps appliedto your actual dataset, like QA/QC etc.

Please see this resource, courtesy of @jwdebelius:

https://www.nature.com/articles/nbt.4229

Please see @SoilRotifer's response to your question from a few weeks ago:

AndreaC · September 29, 2021, 2:58pm

To me that sounds like it has less to do with the database used, and more to do with the upstream processing steps applied, like QA/QC etc.

I have held the database accountable as I have kept QA / QC constant in my tests .... I will also check this possibility.

Thanks for your help, I will study the valuable information you have provided me.