How to train the classifier for the V3-V4 region with 99% identity using full-length sequences from the new release of GreenGenes-2022?

Rashmi_Ira · October 18, 2023, 6:37am

Hello All,

I want to perform functional annotation using the PICRUSt2 plugin.

First of all, I want to train my classifier for the V3-V4 region with 99% identity on the latest release, Greengenes-2022.

I tried to check the files in Index of /greengenes_release/2022.10 (the new GG-2022 release), but I am confused about which file I should use.
I am also having trouble using my QIIME2 outputs as input files for PICRUSt2. Please help me with this so that I can start my further data processing and analysis.

Any help will be appreciated.

Thanks and Regards,

Rashmi Ira

wasade · October 18, 2023, 8:22pm

Hi @Rashmi_Ira,

I think what you would want to do is use q2-feature-classifier to extract-reads, based on your primers, from the Greengenes2 backbone sequences, and then train a Naive Bayes classifier on the result.

Best,
Daniel

Rashmi_Ira · October 19, 2023, 9:49am

Hi @wasade

Thanks for your response.
Okay, I will check it out.

Can you please guide me with a few commands, and tell me which file I should actually start with for read extraction?

Thanks in advance.

Best Regards,
Rashmi Ira

buzic · October 23, 2023, 2:31pm

Hi,

Going by the readme file, it looks like you'd need the following files:

2022.10.backbone.full-length.fna.qza
2022.10.backbone.tax.qza

The following method should help point you in the right direction. First take the sequence files and trim them based on your primers (obviously I've just added a random sequence here!). You can also add truncation and min/max lengths of sequences based on your experimental design, for example:

qiime feature-classifier extract-reads \
  --i-sequences 2022.10.backbone.full-length.fna.qza \
  --p-f-primer GTGGTGGTGGTGGTGGTG \
  --p-r-primer GGACTGGACTGGACTGGA \
  --p-min-length 100 \
  --p-max-length 600 \
  --o-reads gg_12_10_ref_primer_region_seqs.qza

Then use your newly trimmed sequence file along with the backbone taxonomy to train your classifier:

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads gg_12_10_ref_primer_region_seqs.qza \
  --i-reference-taxonomy 2022.10.backbone.tax.qza \
  --o-classifier gg_12_10_primer_region-classifier.qza
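
If it helps, the two steps above can be chained in a small wrapper. This is only a sketch: the function name, the DRY_RUN switch, and the output file names are made up for illustration, and the primers in the example call are the commonly used 341F/805R V3-V4 pair, which is an assumption on my part, so substitute the primers from your own protocol. With DRY_RUN=1 it only prints the commands, so you can check them before running anything.

```shell
#!/usr/bin/env bash
# Sketch: chain extract-reads and fit-classifier-naive-bayes for one primer pair.
# train_region_classifier, DRY_RUN, and the output names are hypothetical.

train_region_classifier() {
  local seqs=$1 tax=$2 fwd=$3 rev=$4 prefix=$5
  local run=()
  # With DRY_RUN=1 the commands are printed instead of executed.
  [[ ${DRY_RUN:-0} == 1 ]] && run=(echo)

  # Step 1: cut the reference sequences down to the amplified region.
  "${run[@]}" qiime feature-classifier extract-reads \
    --i-sequences "$seqs" \
    --p-f-primer "$fwd" \
    --p-r-primer "$rev" \
    --p-min-length 100 \
    --p-max-length 600 \
    --o-reads "${prefix}-seqs.qza" || return 1

  # Step 2: train the Naive Bayes classifier on the extracted region.
  "${run[@]}" qiime feature-classifier fit-classifier-naive-bayes \
    --i-reference-reads "${prefix}-seqs.qza" \
    --i-reference-taxonomy "$tax" \
    --o-classifier "${prefix}-classifier.qza"
}

# Example call (dry run). The primers are assumed 341F/805R; replace with yours.
DRY_RUN=1 train_region_classifier \
  2022.10.backbone.full-length.fna.qza \
  2022.10.backbone.tax.qza \
  CCTACGGGNGGCWGCAG \
  GACTACHVGGGTATCTAATCC \
  gg2-2022.10-v3v4
```

Dropping DRY_RUN=1 from the call runs the two commands for real, which needs an activated QIIME 2 environment. The min/max lengths are the same placeholder values as in the commands above and should be adjusted to your amplicon.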

I hope that helps. There are lots of walkthroughs and helpful documents in the QIIME 2 forum and docs, for example here

Rashmi_Ira · October 23, 2023, 4:06pm

Hi @buzic

Thank you so much for your response. I will follow the same as suggested.

Best Regards,
Rashmi Ira

stephhhhanniee · June 14, 2024, 10:49pm

Hi Victoria,

Thank you so much for this example! I used this for my samples, but I was wondering how you know the % identity. Does this code default to 99% identity?

I read elsewhere in the forum that you can specify the % identity, but that was using different code.

Any clarification would be greatly appreciated. Thank you!

Stephanie

colinbrislawn · June 15, 2024, 2:25pm

You don't! These two commands use the same features from the input database. Perhaps the input database was clustered at a set identity, or not!

Also, the act of selecting an internal region will make previous calculations of identity invalid.
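
A toy illustration of that last point, in plain bash. The sequences and the "amplicon" coordinates are made up, and identity here is a naive position-by-position comparison of equal-length strings, not what a real clustering tool computes. It only shows that the identity between two references can go up or down once you keep just an internal region.

```shell
#!/usr/bin/env bash
# Toy illustration: percent identity changes once an internal region is extracted.
# All sequences and coordinates below are made up for the example.

# Percent identity (integer, rounded down) of two aligned, equal-length sequences.
identity() {
  local a=$1 b=$2 matches=0 i
  for (( i = 0; i < ${#a}; i++ )); do
    if [[ ${a:i:1} == "${b:i:1}" ]]; then
      matches=$(( matches + 1 ))
    fi
  done
  echo $(( 100 * matches / ${#a} ))
}

seq_a=AAAAACCCCCGGGGGTTTTT
seq_b=TAAAACCCCCGGGGGTTTTA   # differs from seq_a only in the flanks
seq_c=AAAAACCTCCGGAGGTTTTT   # differs from seq_a only inside the region

start=5 len=10               # hypothetical "amplicon" coordinates

echo "a vs b: full=$(identity "$seq_a" "$seq_b")% region=$(identity "${seq_a:start:len}" "${seq_b:start:len}")%"
echo "a vs c: full=$(identity "$seq_a" "$seq_c")% region=$(identity "${seq_a:start:len}" "${seq_c:start:len}")%"
```

This prints full=90% region=100% for a vs b and full=90% region=80% for a vs c. Both pairs are equally similar at full length, but after extraction one pair becomes identical and the other becomes more different, so an identity threshold applied to the full-length database no longer describes the extracted reads.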

"Does this code default to 99% identity?"

No.

There are tools designed to assist with database curation, including RESCRIPt:
https://library.qiime2.org/plugins/rescript/27/

stephhhhanniee · June 17, 2024, 1:31am

Hi Colin,

Thanks so much for the explanation! So the "act of selecting an internal region" is like specifying the primers for the V3-V4 region, for example?

Also, after I commented, I was looking at my taxonomy output file, and the "Confidence" column ranges from 0.72 to 0.99. Is that in some way connected to the % identity? I ran a different command (qiime greengenes2 non-v4-16s), which uses 99% identity, and there my "Confidence" column was all 1.0, so I was wondering whether that's connected to the % identity.

Thanks again!

Stephanie

colinbrislawn · June 17, 2024, 1:05pm

That's right. And that's what these commands do:

qiime feature-classifier extract-reads ...
and
qiime rescript extract-seq-segments

That column is the confidence (think 'confidence interval') of a query sequence's annotation, not the database sequence's identity.

Different methods report confidence differently, so it depends on the program. It's never database pre-cluster distance, though.
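
If you want to look at that column yourself, here is a small sketch. It assumes the taxonomy artifact has been exported (for example with qiime tools export) to a tab-separated taxonomy.tsv with the columns Feature ID, Taxon, and Confidence. The function name and the rows in the toy file are made up for illustration.

```shell
#!/usr/bin/env bash
# Summarise the per-query Confidence column of an exported taxonomy.tsv.
# Assumed layout: tab-separated, header line, Confidence in column 3.

confidence_range() {
  awk -F'\t' '
    NR > 1 {
      c = $3 + 0
      if (min == "" || c < min) min = c
      if (c > max) max = c
      n++
    }
    END { printf "min=%s max=%s n=%d\n", min, max, n }
  ' "$1"
}

# Toy file with made-up rows, standing in for a real export.
tsv=$(mktemp)
printf 'Feature ID\tTaxon\tConfidence\n' > "$tsv"
printf 'asv1\td__Bacteria\t0.99\n' >> "$tsv"
printf 'asv2\td__Bacteria\t0.72\n' >> "$tsv"
printf 'asv3\td__Bacteria\t0.85\n' >> "$tsv"

confidence_range "$tsv"
rm -f "$tsv"
```

On the toy file this prints min=0.72 max=0.99 n=3. Each value belongs to one query sequence's assignment. As far as I know, classify-sklearn's --p-confidence threshold defaults to 0.7, so a minimum just above 0.7 would be consistent with that setting rather than with anything about how the reference database was clustered.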

stephhhhanniee · June 20, 2024, 7:20pm

Oh gotcha, thank you again for your help, I appreciate it!