Many K_bacteria; ; ; classifications even in deep levels

MiriamGorostidi · September 23, 2020, 11:41am

Hello!!

I have some doubts about the taxonomic results I am getting. At the first taxonomic level, the analysis seems to have worked correctly (the unassigned percentage is not too big, approx. 10%).

However, at deeper levels, even though the unassigned percentage stays the same, there is an assignment called K_bacteria that makes up almost 80-90% of the assignments and gives no further information (I attach some pictures so the problem can be better understood).

The version of QIIME 2 I am using is QIIME 2 Core 2020.6, installed in Oracle VM VirtualBox.
The analysis I have performed is based on the "Moving Pictures" tutorial, using DADA2.
The samples are feces from patients and controls.

I don't know whether more information (exact commands or more pictures) is needed to make it easier to help me.

Thank you very much in advance!

Best regards,

Miriam

timanix · September 23, 2020, 11:52am

Hi!
I think you should provide additional information so that the QIIME 2 team members can help you resolve your issue.
Did you train the classifier yourself, or did you download a pre-trained one?
Are you sure the classifier you used was trained on the same rRNA region as your amplicons?

MiriamGorostidi · September 23, 2020, 12:35pm

Hello @timanix!

Yes! You are right! I'm sorry, but I only started using QIIME 2 a few days ago and I'm pretty lost.

The classifier I used was Greengenes 13_8 99% OTUs, and I downloaded it with the following command, as described in the "Moving Pictures" tutorial.

wget \
  -O "gg-13-8-99-515-806-nb-classifier.qza" \
  "https://data.qiime2.org/2020.6/common/gg-13-8-99-515-806-nb-classifier.qza"

For the amplicons, I used the Ion 16S Metagenomics Kit from ThermoFisher, which amplifies 7 of the hypervariable regions using two primers (Primer 1: V2, V4, V8 and Primer 2: V3, V6-7, V9).
Now I am a bit confused, because I don't actually know which region the classifier works on.

Really thankful for your answer

timanix · September 23, 2020, 2:13pm

This classifier is for the V4 region (515F/806R primers), and you have amplicons from seven hypervariable regions.

That is definitely a reason why you got so many K_bacteria; assignments, but now I am curious how to analyse such data properly.

llenzi · September 23, 2020, 3:24pm

Hi @MiriamGorostidi

I am not really familiar with the kit you used, but do you know the expected amplicon length? You mentioned you followed the Moving Pictures tutorial.
How many sequences passed the denoising step?

One way to analyse these data may be to use closed-reference clustering instead of the de novo approach you used (there may be other ways, but I cannot think of any that does not involve extracting the sequences for each amplicon). Please refer to the "Clustering sequences into OTUs using q2-vsearch" tutorial in the QIIME 2 2020.8.0 documentation.

Hope it helps

MiriamGorostidi · September 24, 2020, 8:34am

Hello @llenzi

The following link provides more information about the Ion 16S Metagenomics Kit. According to that flyer, the kit seems to work with a length of 400 bp. However, the rep-seqs.qzv that I get from DADA2 shows that the reads have a length of 175.

As for the number of sequences that passed the denoising step, I hope the following picture answers it:

I guess that having only 60% of the sequences pass the filter is not a good sign.

Finally, does the de novo clustering step appear in the Moving Pictures tutorial? I checked whether I had performed that step, and I had not. At which point in the process should I apply clustering (either de novo or closed-reference)? After DADA2?

Thank you for your help

MiriamGorostidi · September 24, 2020, 8:34am

Yes @timanix, I guess the difference between the regions could be the problem… Hope someone can help us

llenzi · September 24, 2020, 9:07am

Hi @MiriamGorostidi

I forgot that the Moving Pictures tutorial denoises single-end reads (with either DADA2 or Deblur).
So you are working with R1 only, which is why all sequences are 175 bp; I guess that is your truncation length. Do you have paired-end reads? If so, and you want to merge them, please have a look at the Atacama soils tutorial: https://docs.qiime2.org/2020.8/tutorials/atacama-soils/

What length are your reads? If you have 2x250 bp with an amplicon of about 400 bp, you should get enough overlap to merge them.
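The overlap in that example can be estimated with simple arithmetic. This is only a rough sketch, assuming untruncated 250 bp reads and an amplicon of exactly 400 bp (the figures quoted above); the variable names are illustrative.

```shell
read_len=250       # length of each read in a 2x250 bp run
amplicon_len=400   # approximate amplicon length

# Forward and reverse reads together cover 2 * read_len bases;
# whatever exceeds the amplicon length is the region they share.
overlap=$(( 2 * read_len - amplicon_len ))
echo "Expected overlap: ${overlap} bp"
```

With these numbers the estimate is 100 bp. Any truncation applied to the reads before merging reduces the overlap by the number of bases removed.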

The percentage of sequences passing the filters is a bit on the low side, but it may be enough to work with, depending on the complexity of your data.

Sorry for the confusion about de novo versus closed-reference! By "de novo" I meant that you assigned taxonomy to the denoised sequences, as opposed to clustering your sequences by aligning them to your reference database (with a specified minimum similarity threshold), as described in the vsearch tutorial. That would be my last resort if all other options fail!

As for the classifier you used, it was trained on the V4 region only, and that would certainly explain the result you are seeing! You could try training your own classifier, I suppose on the full-length gene sequences (that is, skipping the 'qiime feature-classifier extract-reads' step).
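A minimal sketch of what training on full-length sequences could look like, assuming the reference sequences and taxonomy have already been imported as QIIME 2 artifacts. The function names and every file name passed to them are placeholders, not part of the original suggestion.

```shell
# Train a Naive Bayes classifier on full-length reference sequences
# (no 'extract-reads' step), then classify the denoised sequences.
train_full_length_classifier() {
  # $1: reference sequences (.qza), $2: reference taxonomy (.qza),
  # $3: output classifier (.qza)
  qiime feature-classifier fit-classifier-naive-bayes \
    --i-reference-reads "$1" \
    --i-reference-taxonomy "$2" \
    --o-classifier "$3"
}

classify_rep_seqs() {
  # $1: trained classifier, $2: representative sequences, $3: output taxonomy
  qiime feature-classifier classify-sklearn \
    --i-classifier "$1" \
    --i-reads "$2" \
    --o-classification "$3"
}
```

For example, `train_full_length_classifier ref-seqs.qza ref-taxonomy.qza full-length-classifier.qza` followed by `classify_rep_seqs full-length-classifier.qza rep-seqs.qza taxonomy.qza`. The classifier should be trained with the same QIIME 2 release that is later used for classification.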

Hope it helps

MiriamGorostidi · September 24, 2020, 10:12am

Hi @llenzi

Yes, exactly: the --p-trunc-len I chose is 175.
At first we thought the reads were paired-end, but when I tried to convert the .bam file I was given into two .fastq files, it was impossible. So I started reading some reviews and found that, while Illumina offers single-end or paired-end reads, Ion Torrent has no paired-end option. So I guess the reads are single-end.

I'm sorry, but I don't know where to find the length of the sequences. Is it in this table?

I will try the vsearch tutorial (where can I find it?) and training my own classifier (I don't really know how I'm supposed to do this either), and then tell you what happens.
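For raw reads in FASTQ format, the lengths can also be tallied directly from the file. A minimal sketch, assuming standard four-line FASTQ records; the function name and the three demo reads are made up for illustration.

```shell
# Print "length count" pairs for FASTQ records read from standard input.
read_length_histogram() {
  # In four-line FASTQ records, the sequence is line 2 of every record.
  awk 'NR % 4 == 2 { n[length($0)]++ } END { for (l in n) print l, n[l] }' |
    sort -n
}

# Demo on three invented reads: two of 8 bases and one of 4 bases.
demo=$(printf '@r1\nACGTACGT\n+\nIIIIIIII\n@r2\nACGT\n+\nIIII\n@r3\nACGTACGT\n+\nIIIIIIII\n' |
  read_length_histogram)
echo "$demo"
```

On real data the same function can be fed a decompressed file, for example `gzip -dc reads.fastq.gz | read_length_histogram`, where `reads.fastq.gz` is a placeholder name.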

Thank you so much!!

Miriam

llenzi · September 24, 2020, 10:42am

Hi @MiriamGorostidi,

If the sequences are from Ion Torrent, you are right that the reads are single-end, but the error profile is different from that of Illumina reads, so you should use dada2 denoise-pyro (denoise-pyro: Denoise and dereplicate single-end pyrosequences, QIIME 2 2020.8.0 documentation).
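A minimal sketch of such a call, wrapped in a shell function so that the inputs are explicit. The function name, the output file names and the `--p-trim-left 0` value are assumptions for illustration, and suitable trimming values depend on your own quality plots.

```shell
denoise_ion_torrent() {
  # $1: demultiplexed single-end reads (.qza)
  # $2: truncation length (175 was the value used earlier in this thread)
  qiime dada2 denoise-pyro \
    --i-demultiplexed-seqs "$1" \
    --p-trim-left 0 \
    --p-trunc-len "$2" \
    --o-table table.qza \
    --o-representative-sequences rep-seqs.qza \
    --o-denoising-stats denoising-stats.qza
}
```

For example, `denoise_ion_torrent demux.qza 175`, where `demux.qza` stands for your own demultiplexed artifact.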

For training your database, I suggest using the brand-new RESCRIPt plugin: Processing, filtering, and evaluating the SILVA database (and other reference sequence data) with RESCRIPt

As an alternative, if you want to try the clustering approach, please look at:
https://docs.qiime2.org/2020.8/tutorials/otu-clustering/
in the 'Closed-reference clustering' part. For this you don't need to train a classifier.
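A minimal sketch of the closed-reference step, assuming the DADA2 table and representative sequences plus an imported reference sequence artifact are available. The function name, the output file names and the example threshold are placeholders.

```shell
cluster_closed_reference() {
  # $1: feature table (.qza), $2: representative sequences (.qza),
  # $3: reference sequences (.qza), $4: identity threshold (e.g. 0.97)
  qiime vsearch cluster-features-closed-reference \
    --i-table "$1" \
    --i-sequences "$2" \
    --i-reference-sequences "$3" \
    --p-perc-identity "$4" \
    --o-clustered-table table-cr.qza \
    --o-clustered-sequences rep-seqs-cr.qza \
    --o-unmatched-sequences unmatched-cr.qza
}
```

For example, `cluster_closed_reference table.qza rep-seqs.qza ref-seqs.qza 0.97`. Sequences that do not match the reference at the chosen threshold are written to the unmatched output and are not counted in the clustered table.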

Hope it helps

MiriamGorostidi · September 24, 2020, 10:48am

Oh!! Thank you so much, Luca, for answering so quickly and for providing so much help and such clear information.

I will post updates here about everything!!

Best regards

system · October 25, 2020, 4:48pm

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.