Post analysis taxonomy results

Phuc_Hu_nh_Van · August 28, 2020, 5:54am

Hello everyone. I have used QIIME 2 to classify 16S data against the Greengenes database. The result looks like this:

k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Oceanospirillales; f__Oceanospirillaceae; g__Marinomonas 0.2%
k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Oceanospirillales; f__Oceanospirillaceae; g__ 0.05%
k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Oceanospirillales; f__; g__ 0.01%

Should I drop the 0.01% and 0.05% entries when reporting genus-level classification, given that only one genus, Marinomonas, has been identified?
In published articles, I don't see results presented like f__; g__. Almost only named genera appear in the graphs, and they add up to 100% overall. I look forward to hearing how you have handled similar results.

jwdebelius · August 28, 2020, 3:07pm

Hi @Phuc_Hu_nh_Van,

You're seeing two things here. First, the Greengenes database lacks lower-level annotations (most frequently genus and species) for a lot of organisms. This is partially due to limitations in our biological understanding and partially due to the information available to the classifier when the database was built. So, your second string,

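k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Oceanospirillales; f__Oceanospirillaceae; g__

says the classifier you ran could find a genus, but the database doesn't have a genus name. (In Greengenes, nameless genera are denoted g__.)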

In your other example,

k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Oceanospirillales; f__; g__

the sequence couldn't be classified past family level, meaning the genus isn't there.

When I present this kind of data, I typically present the first as "unspecified f. Oceanospirillales" and the second as "unclassified f. Oceanospirillales", but different people have different naming conventions. I'd keep all three sequences because there's something informative in all three of them. I tend to drop things that can't be classified or specified at higher levels - my cutoff is typically phylum - because there's a high probability that those are spurious.
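
As a rough illustration of the naming convention and phylum cutoff described above, here is a minimal Python sketch. It is not part of QIIME 2; the function names (`parse_ranks`, `genus_label`, `keep_feature`) are hypothetical, and it assumes that a rank prefix that is present but empty (e.g. `g__`) means "unspecified", while a rank that is missing from the string altogether means "unclassified".

```python
def parse_ranks(taxonomy):
    """Split a Greengenes-style string into a {prefix: name} dict."""
    ranks = {}
    for field in taxonomy.split(";"):
        field = field.strip()
        if not field:
            continue
        prefix, _, name = field.partition("__")
        ranks[prefix] = name  # name is "" for an empty rank such as "g__"
    return ranks


def genus_label(taxonomy):
    """Return a display label for the genus level (hypothetical convention)."""
    ranks = parse_ranks(taxonomy)
    if ranks.get("g"):
        return ranks["g"]
    # Fall back to the deepest rank above genus that carries a name.
    for prefix in ["f", "o", "c", "p", "k"]:
        if ranks.get(prefix):
            # Assumption: "g__" present but empty -> unspecified,
            # genus rank absent from the string -> unclassified.
            status = "unspecified" if "g" in ranks else "unclassified"
            return f"{status} {prefix}. {ranks[prefix]}"
    return "unassigned"


def keep_feature(taxonomy):
    """Phylum cutoff: keep only features with a named phylum."""
    return bool(parse_ranks(taxonomy).get("p"))
```

With this sketch, a string ending in `f__Oceanospirillaceae; g__` is labelled "unspecified f. Oceanospirillaceae", and a string that stops at `f__Oceanospirillaceae` is labelled "unclassified f. Oceanospirillaceae". Other conventions are just as valid.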

There are two options to improve your classification and your ability to name things. One option is to switch to SILVA. This is a larger database, and it's newer, so some of the things that are unnamed in Greengenes may be named in SILVA. You could also try the clawback plugin (tutorial here), which uses environment-specific (bespoke) classifiers.

I'm sure there are also several good threads that I'm not thinking of at the moment. You might check the "best of" tag and try searching for more information.

Best,
Justine

devonorourke · August 28, 2020, 5:40pm

Would it be ridiculous to do the following?

classify with GreenGenes
classify with SILVA
merge taxa with qiime rescript merge-taxa, using --p-mode 'super'

I'm sure @SoilRotifer and @Nicholas_Bokulich might have some thoughts on the value (or lack thereof!) of such an approach, but my thinking was that by merging the taxonomies from two classifiers, you retain the classification that carries the most complete information.
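
To make the idea concrete, here is a minimal Python sketch of merging two taxonomy strings rank by rank, preferring the more specific label when the two do not contradict each other and stopping at the last shared rank when they do. This is a simplified illustration only, not RESCRIPt's implementation of merge-taxa; `merge_pair` is a hypothetical name, and the sketch assumes both taxonomies already use the same rank prefixes in the same order (which is not true of Greengenes and SILVA out of the box).

```python
def split_ranks(taxonomy):
    """Split 'k__A; p__B' into ['k__A', 'p__B'], dropping empty fields."""
    return [r.strip() for r in taxonomy.split(";") if r.strip()]


def is_blank(rank):
    """True if the rank is missing or has a prefix but no name (e.g. 'g__')."""
    return rank is None or rank.endswith("__")


def merge_pair(tax_a, tax_b):
    """Merge two taxonomy strings (illustrative, assumes aligned ranks)."""
    a, b = split_ranks(tax_a), split_ranks(tax_b)
    merged = []
    for i in range(max(len(a), len(b))):
        rank_a = a[i] if i < len(a) else None
        rank_b = b[i] if i < len(b) else None
        if is_blank(rank_a) and is_blank(rank_b):
            break  # neither source has a name at this rank
        if is_blank(rank_a):
            merged.append(rank_b)  # only b is informative here
        elif is_blank(rank_b):
            merged.append(rank_a)  # only a is informative here
        elif rank_a == rank_b:
            merged.append(rank_a)  # both agree
        else:
            break  # contradiction: stop at the last rank both agree on
    return "; ".join(merged)
```

For example, merging `k__Bacteria; p__Proteobacteria; g__` with `k__Bacteria; p__Proteobacteria; g__Marinomonas` keeps the named genus, while two strings that disagree at phylum collapse back to `k__Bacteria`. That second case is where outdated names in one database could quietly cost you resolution.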

+1 for clawback if it works for your experiment.

Good luck!

Nicholas_Bokulich · August 28, 2020, 6:02pm

Good call @devonorourke... an ensemble approach like this is precisely one of the reasons merge-taxa was added to RESCRIPt (especially the "super" mode). There is some precedent in the literature for different types of ensemble classification, but it should be done carefully, because it can also break things if not used carefully (e.g., Greengenes is several years old and contains some outdated taxonomic names, so it can have nomenclature disagreements with SILVA).

devonorourke · August 28, 2020, 6:35pm

Great - thanks for the great feedback.
I'm using these tools at the moment to combine BOLD and NCBI datasets, though it takes a bit more work because I have to:

first combine the taxa,
combine the sequence data,
then, dereplicate the combined sequence data

Is there perhaps a single command that takes care of all this in RESCRIPt (or elsewhere in the QIIMEverse)?

Nicholas_Bokulich · September 2, 2020, 5:03am

No, just three separate actions.
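
For readers who want to see the logic of those three steps spelled out, here is a toy Python sketch on plain dictionaries. It is not how RESCRIPt does it, and the real dereplication action has its own options that this does not reproduce. The function name `combine_and_dereplicate` is hypothetical, and the sketch assumes sequence IDs are unique across the two datasets.

```python
def combine_and_dereplicate(seqs_a, tax_a, seqs_b, tax_b):
    """Combine two reference sets and drop exact duplicates.

    seqs_* map sequence ID -> sequence string;
    tax_*  map sequence ID -> taxonomy string.
    Assumes IDs do not collide between the two datasets.
    """
    # Step 1: combine the taxa.
    taxa = {**tax_a, **tax_b}
    # Step 2: combine the sequence data.
    seqs = {**seqs_a, **seqs_b}

    # Step 3: dereplicate, keeping the first ID seen for each
    # unique (sequence, taxonomy) pair.
    seen = set()
    kept_seqs, kept_taxa = {}, {}
    for seq_id, seq in seqs.items():
        key = (seq, taxa[seq_id])
        if key in seen:
            continue
        seen.add(key)
        kept_seqs[seq_id] = seq
        kept_taxa[seq_id] = taxa[seq_id]
    return kept_seqs, kept_taxa
```

Identical sequences with different taxonomy strings are both kept here, since they may reflect a real disagreement between the sources rather than redundancy.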