Problem training classifier with Greengenes database

aredwan · February 10, 2020, 6:33pm

I have downloaded the most recent FASTA and OTU taxonomy files to train my own classifier. But after the very first step, I am getting a result where my “Feature ID” column consists of numbers only and no letters. I can see the letters are misplaced in the “Taxon” column instead. I looked at the solution to the spacing issue in one of the past threads and followed the code below:

qiime tools export \
  --input-path 99_taxonomy.qza \
  --output-path 99_taxonomy-with-spaces
qiime metadata tabulate \
  --m-input-file 99_taxonomy-with-spaces/taxonomy.tsv \
  --o-visualization 99_taxonomy-as-metadata.qzv
qiime tools export \
  --input-path 99_taxonomy-as-metadata.qzv \
  --output-path 99_taxonomy-as-metadata
qiime tools import \
  --type 'FeatureData[Taxonomy]' \
  --input-path 99_taxonomy-as-metadata/metadata.tsv \
  --output-path 99_taxonomy-without-spaces.qza

But it is still the same. I am using QIIME 2 version 2019.4. Could that also be a reason why the classifier training is not working? If someone has recently trained a classifier on the Greengenes database or an NCBI open-source database, please share the code with me. I really need to fix this urgently, as I am under a time constraint. I have not been able to fix this even after trying multiple solutions, maybe because I am not sure what the actual problem is.

Mehrbod_Estaki · February 10, 2020, 8:56pm

Hi @aredwan,
If time is not on your side, I would recommend just using a pre-trained classifier from the data resources page. There you can find Greengenes or SILVA pre-trained classifiers for the full-length sequences as well as for the V4 region specifically. There are also a few other pre-trained classifiers in the Community Contributions section for other regions (e.g. V3-V4), made available by other users; you can try searching for those as well.

aredwan · February 11, 2020, 5:15pm

Thanks a lot. But if I use a pre-trained one, it will not perfectly represent my data. So, if I want to train my own classifier, do you have any idea how I can get rid of the problem I am facing?

Mehrbod_Estaki · February 11, 2020, 11:15pm

Hi @aredwan,

Which 16S region did you sequence? Which primer set did you use? If you used the same primers as the ones used for the readily available classifiers, then those would be perfect for your data. If you use the full-length classifier, it would still perform very well with any 16S region regardless of your primers, perhaps just a little worse than a classifier trained on the extracted region. My suggestion was, and is, a trade-off between a little accuracy and time.

If you are training a classifier using the Greengenes database, this tutorial would be perfect for you, step by step (except you would import the 99_otus/taxonomy files instead). I have used it to train Greengenes-based classifiers many times without any of the spacing issues you mention, so I can't tell what the issue is here without actually seeing your workflow or artifacts.

Can you provide the link to the earlier thread you mentioned?

aredwan · February 12, 2020, 7:48pm

Hello, thanks a lot for your time and valuable input. These are the primers I used for sequencing the 16S V3-V4 region (not the same as those used for the readily available classifiers): forward: GTGCCAGCMGCCGCGGTAA; reverse: GGACTACHVGGGTWTCTAAT.

This is the code I used for denoising the sequences:
# Sequence denoising (paired), forward/reverse sequence truncation
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs primer_trimmed_demux.qza \
  --p-trim-left-f 0 \
  --p-trim-left-r 0 \
  --p-trunc-len-f 270 \
  --p-trunc-len-r 250 \
  --p-chimera-method consensus \
  --o-table dada2_table.qza \
  --o-representative-sequences dada2_rep_seqs.qza \
  --o-denoising-stats dada2_denoising_stats.qza

I have followed the exact steps in the tutorial you linked. These are the files I got after importing the taxonomy file in .txt format: 99_otu_taxonomy.zip (2.3 MB), 99_taxonomy.qza (2.5 MB)

The first two steps are very easy, but the 99_taxonomy.qza file I get still has the "Feature ID" issue I shared with the image in my first post. Below is the code I used to generate the 99_taxonomy.qza file. I think the problem is in this first step: when I look at the .qzv visualization of this file (image attached in the first post), you can clearly see that the "Feature ID" column does not look like what it is supposed to. Please help me solve this problem, and let me know if you need any other info from me. Thanks in advance.

qiime tools import \
  --type 'FeatureData[Sequence]' \
  --input-path 99_otus.fasta \
  --output-path 99_otus.qza

qiime tools import \
  --type 'FeatureData[Taxonomy]' \
  --input-format HeaderlessTSVTaxonomyFormat \
  --input-path 99_otus_taxonomy.txt \
  --output-path 99_taxonomy.qza

Mehrbod_Estaki · February 12, 2020, 8:34pm

Hi @aredwan,

Those primers you listed are actually the common EMP V4 primers and not V3-V4, and the pre-trained classifiers from the V4 data resource page I linked above were trained using those exact ones. So you can easily just use the pre-trained classifier instead of training your own.
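As a quick sanity check on the primer point, here is a minimal sketch that compares a primer pair against the original EMP 515F/806R sequences. The `EMP_V4_PRIMERS` table and the `identify_primers` helper are names made up for this illustration; they are not part of QIIME 2, and the table only lists the original (unmodified) EMP pair.

```python
# Original EMP 16S V4 primer pair (515F / 806R); illustrative lookup table.
EMP_V4_PRIMERS = {
    "515F": "GTGCCAGCMGCCGCGGTAA",
    "806R": "GGACTACHVGGGTWTCTAAT",
}


def identify_primers(forward, reverse):
    """Return the EMP V4 primer names matching a forward/reverse pair.

    Hypothetical helper: the comparison is an exact string match after
    stripping whitespace and upper-casing. Unknown primers give None.
    """
    by_sequence = {seq: name for name, seq in EMP_V4_PRIMERS.items()}
    fwd_name = by_sequence.get(forward.strip().upper())
    rev_name = by_sequence.get(reverse.strip().upper())
    return fwd_name, rev_name


# The pair posted earlier in this thread.
print(identify_primers("GTGCCAGCMGCCGCGGTAA", "GGACTACHVGGGTWTCTAAT"))
```

If both names come back, the primers are the same pair the V4 pre-trained classifiers were built around, so a custom classifier would likely add little.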
The issue with Feature ID values not being the hashed UUID might need some looking into.

Just to clarify here, there is no rule for the values in the first column to be numbers, letters, or a combination such as hashed IDs; the values will simply reflect whatever your features were called in your previous step. In your case, DADA2.

I'm not sure what exactly you are referring to here, as I don't see this anywhere in the image you posted. If you are referring to something like the o__YLA114;, those are actually taxon names in the Greengenes database. So, no issues there.
Where I do see something odd is that the Feature ID is showing up as just numbers even though you say you ran DADA2 without the --p-no-hashed-feature-ids parameter. This is likely stemming from a previous step that may have replaced those feature-ids with something else. Can you think of any step that may have caused this to happen? If you could share this taxonomy.qza and your rep-seqs.qza file I could look into it a bit further through the provenance.
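To make the distinction between ID styles concrete, here is a small sketch. It assumes that DADA2's default hashed feature IDs are MD5 hex digests of each sequence (32 hexadecimal characters), while reference OTU IDs such as those in Greengenes are plain integers. The function names are hypothetical, and the classification is only a rough heuristic based on the shape of the ID.

```python
import hashlib
import re


def hash_sequence(sequence):
    """MD5 hex digest of a sequence (assumed to mirror DADA2's hashed IDs)."""
    return hashlib.md5(sequence.encode("utf-8")).hexdigest()


def describe_feature_id(feature_id):
    """Roughly classify a feature ID by its shape (illustrative heuristic)."""
    if re.fullmatch(r"[0-9a-f]{32}", feature_id):
        return "md5-like hash"
    if feature_id.isdigit():
        return "numeric"
    return "other"


# A made-up sequence and a made-up numeric ID, for illustration only.
print(describe_feature_id(hash_sequence("ACGTACGT")))
print(describe_feature_id("228054"))
```

Running `describe_feature_id` over the first column of an exported taxonomy.tsv and of the rep-seqs IDs would show quickly whether the two files use the same ID style.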

aredwan · February 13, 2020, 5:02pm

Hello,

I cannot thank you enough for all your valuable thoughts. I am glad to know that for my primer set I can actually use a pre-trained classifier. I am a PhD student at Texas Tech University. On the high-performance computer where I run QIIME 2, the latest available version is 2019.04. So when I tried to use the latest Greengenes 13_8 classifier from the QIIME 2 data resources, it showed an error saying that it cannot be run. In that case, if you could kindly share one of your trained Greengenes classifiers, I would be very grateful.

And here are the data you asked for. Please let me know what might be the reason for the odd "Feature ID" values: when I try to run the last step, taxonomic assignment with the classifier, an error occurs due to an ID mismatch between my raw data source and the taxonomy.qza file. Attachments: dada2_rep_seqs.qza (20.4 KB), taxonomy.qza (855.0 KB)

I think with this chat I am getting closer. Thanks, and please let me know if you need any more info.

Mehrbod_Estaki · February 14, 2020, 12:11am

Hi @aredwan,
Thanks for providing these files.
Can you describe where you obtained your initial taxonomy file named otu_id_to_greengenes.txt? The file you download from the Greengenes database should be called something like gg_13_5_taxonomy, so I'm wondering if you have somehow imported the wrong file.
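One way to check whether a taxonomy file belongs with a reference FASTA before importing is to compare their IDs. Below is a minimal sketch, assuming a headerless taxonomy file with exactly two tab-separated columns (ID, then taxonomy string) and FASTA headers whose first word is the ID. The helper names are hypothetical and not part of QIIME 2.

```python
def fasta_ids(lines):
    """Collect sequence IDs: the first word after '>' on each header line."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            header = line[1:].strip()
            if header:
                ids.add(header.split()[0])
    return ids


def taxonomy_ids(lines):
    """Collect IDs from a headerless two-column, tab-separated taxonomy file.

    Raises ValueError if a non-empty line does not have exactly two columns.
    """
    ids = set()
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(
                f"line {number}: expected 2 tab-separated columns, "
                f"got {len(fields)}"
            )
        ids.add(fields[0].strip())
    return ids


def compare_ids(fasta_lines, taxonomy_lines):
    """Count IDs shared between the two files and IDs unique to each."""
    seqs = fasta_ids(fasta_lines)
    taxa = taxonomy_ids(taxonomy_lines)
    return {
        "shared": len(seqs & taxa),
        "fasta_only": len(seqs - taxa),
        "taxonomy_only": len(taxa - seqs),
    }


# Example with the file names used earlier in this thread:
# with open("99_otus.fasta") as fasta, open("99_otus_taxonomy.txt") as tax:
#     print(compare_ids(fasta, tax))
```

For a matching pair of reference files, `fasta_only` and `taxonomy_only` should both be zero or close to it; large values suggest the taxonomy file does not belong to that FASTA.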

The most recent version of QIIME 2 is currently 2019.10, with an even newer version scheduled to arrive soon (in the next couple of weeks, I believe). I would recommend you speak with the admin of your supercomputer about updating to the newest version.

Was this error like the one described here? The pre-trained classifier you downloaded from the 2019.10 version linked above was trained using a newer version of scikit-learn. If you can't update your QIIME 2 version, simply download the same classifier from an older version of QIIME 2, in your case from the 2019.4 data resources page.

aredwan · February 15, 2020, 5:04pm

Hello,

I downloaded otu_id_to_greengenes.txt from the Greengenes website. I suspected that this was the wrong file and that this is where it all started. In that case, isn't gg_13_5_taxonomy older than the 13_8 files? Can you share the right file or its source, so that I can use it?

I already asked my admin 5 days ago to update to the latest QIIME 2 version, so that is in progress.

As for your last suggestion regarding the Greengenes classifier, I will go that route and let you know if the problem is fixed.

Thanks a lot!

Mehrbod_Estaki · February 15, 2020, 10:30pm

Hi @aredwan,
Downloading that older classifier will be your fastest option, while updating to a newer QIIME 2 will be your best option.

Download the Greengenes files from the data resources page I linked above and look in the taxonomy folder. The file should be named as I mentioned.
Good luck!

aredwan · February 17, 2020, 4:44pm

Hello @Mehrbod_Estaki, I have used the Greengenes classifier from the QIIME 2 2019.04 release, which seems to be working. Thanks a lot. This is the file I got: dada2_99otus_taxonomy.qza (38.0 KB)
I could also generate the bar plot without a problem. But when I tried to generate alpha and beta diversity metrics from the phylogenetic tree using this code:

qiime diversity core-metrics-phylogenetic \
  --i-phylogeny rooted_tree.qza \
  --i-table table.qza \
  --p-sampling-depth 750 \
  --m-metadata-file metadata.tsv \
  --output-dir core-metrics-results

it is showing this error: Plugin error from diversity:
"None of the sample identifiers match between the metadata and the coordinates. Verify that you are using metadata and coordinates corresponding to the same dataset."

Is this still related to my classifier? If not, what might be the reason? I think I am at the finish line, and special thanks to you for that.

Mehrbod_Estaki · February 17, 2020, 8:05pm

Hi @aredwan,
Glad you got your classifier issue resolved!
The new error you are seeing is unrelated to this. As the error suggests this happens when the sample-ids in your feature-table do not match the sample-ids in your metadata file. Double check those to make sure they are the same. You can also search that error message on the forum as this has been reported (with solutions) a few times already there. If you are still having issues afterwards, please start a new post and we'll help you troubleshoot there. Thanks
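For the double check, a minimal sketch along these lines may help. It assumes a tab-separated metadata file whose first non-empty row is the header and whose first column holds the sample IDs; rows starting with `#` after the header (such as a `#q2:types` row) are skipped. The table's sample IDs have to be supplied separately, for example copied from a `qiime feature-table summarize` visualization. Both function names are hypothetical.

```python
def metadata_sample_ids(lines):
    """Read sample IDs from the first column of a metadata TSV.

    Simplified reader: the first non-empty row is treated as the header,
    and later rows whose first cell starts with '#' are skipped.
    """
    ids = []
    header_seen = False
    for line in lines:
        first_cell = line.rstrip("\n").split("\t")[0].strip()
        if not first_cell:
            continue
        if not header_seen:
            header_seen = True
            continue
        if first_cell.startswith("#"):
            continue
        ids.append(first_cell)
    return ids


def compare_sample_ids(metadata_ids, table_ids):
    """Report which sample IDs are shared and which appear on one side only."""
    metadata_ids, table_ids = set(metadata_ids), set(table_ids)
    return {
        "shared": sorted(metadata_ids & table_ids),
        "metadata_only": sorted(metadata_ids - table_ids),
        "table_only": sorted(table_ids - metadata_ids),
    }


# Example with the metadata file name used earlier in this thread;
# the table IDs below are placeholders.
# with open("metadata.tsv") as handle:
#     print(compare_sample_ids(metadata_sample_ids(handle), ["S1", "S2"]))
```

An empty `shared` list corresponds to the error above; the two `_only` lists usually make the cause visible, such as differences in case, prefixes, or separators between the two sets of IDs.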

system · March 20, 2020, 2:05am

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.