Assistance with Genus-Level Classification and Filtering ASV/OTUs

fatihfarabigazali · December 16, 2024, 10:01pm

Hello,

I was wondering how to obtain a complete genus-level classification. In my case, the green-colored Bacilli have a significant presence in the bar plots, but I have very limited information because the classification is stuck at level 3.

Additionally, I have too many ASVs/OTUs in my bar plots, and I’d like to filter out those with lower abundances. Could you please advise on how to accomplish this?

Here’s what I’ve already done to filter my data:

qiime feature-table filter-samples \
  --i-table table.qza \
  --p-min-frequency 11000 \
  --o-filtered-table filtered-table.qza

qiime feature-table filter-features \
  --i-table filtered-table.qza \
  --p-min-frequency 10 \
  --o-filtered-table feature-frequency-filtered-table.qza

qiime feature-table filter-features \
  --i-table feature-frequency-filtered-table.qza \
  --p-min-samples 2 \
  --o-filtered-table sample-contingency-filtered-table.qza

qiime feature-table filter-samples \
  --i-table sample-contingency-filtered-table.qza \
  --p-min-features 10 \
  --o-filtered-table feature-contingency-filtered-table.qza

I also removed mitochondria, chloroplasts, archaea, and eukaryota from the table and sequences. However, is there a way to filter out ASVs/OTUs representing less than 0.1% abundance so I can focus on the main contributors?

The same colors cause confusion as well. I assume they are the same type of bacteria; however, due to differences in levels, they are given different names despite having the same color.

Thank you for your help!

Best regards,

jwdebelius · December 17, 2024, 4:13pm

Hi @fatihfarabigazali,

It sounds like you have 3 issues that need to be addressed, and I'm going to work through them kind of piecemeal, if that's okay.

The stalled genus-level classification likely has to do with your classifier and environment. What type of samples are you studying? Which database did you use? Did you get a pretrained classifier, and if so which one? Which primer pair did you use? Can/should you weight your classification by the environment?

So, the level of effort for these solutions varies and not all are tested because I tend toward solution #3.

But, here are some options.

1. Hide any taxonomy shown after the first 12. That's going to get you to the most abundant taxa (they're ordered by abundance) and you'll have white space representing an "other" category. It's not foolproof and it doesn't specifically meet your needs, but it's a good way to solve the problem quickly.
2. You could potentially export your data into Excel, determine what you want to collapse, and then use the "group" function to cast everything else into a different taxonomic category. This would likely have some challenges. I don't know if you could use RESCRIPt to modify the taxonomy so you only retain the most abundant genera, but it might be an option too.
3. I often go outside of QIIME 2 to do my stacked barplots, mostly because I like a lot of control and I find it easier to have that control using other plotting approaches. So, you could export your relative abundance data to a CSV, open that in Excel (or R or Python), and then manipulate it there. I tend to filter to the top N features with a relative abundance of at least Y%. I usually set my Y% at 1.00% because that's about all my eyes can differentiate in a stacked barplot. There are other visualizations I can use if I want to see the distribution of a lower abundance clade. The number of features is typically set by my colormap, where I can keep at most the number of available colors minus 1. (So for your current colormap, you could have up to 11 groups plus the 12th color for your "other" category.) IIRC, the human brain can perceive about 8-12 distinct colors in a plot, assuming semi-normal color vision. (Obviously, this changes for humans with colorblindness and other visual challenges, but on average, assume about 8.) Most modern data management programs will have a plotting function to make you a nice stacked barplot that you can then manipulate to your heart's content.
4. I've not used it, but I have heard amazing things about the microshades package in R. Their colormaps are gorgeous and their approach to color grouping is quite ingenious. It's on my list of QIIME 2 plugins to construct in my copious negative free time.
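For anyone trying the export-and-collapse route, here is a minimal sketch in plain Python (standard library only). It assumes the counts have already been exported and read into a nested dict of sample to taxon to count. The function name `collapse_low_abundance`, its parameters, and the choice to rank taxa by mean relative abundance across samples are assumptions made for illustration; they are not part of QIIME 2 or of any package mentioned in this thread.

```python
from collections import defaultdict


def collapse_low_abundance(counts, max_groups=11, min_rel_abundance=0.01,
                           other_label="Other"):
    """Collapse minor taxa into a single "Other" group.

    counts: {sample_id: {taxon: count}} (hypothetical input layout).
    Returns {sample_id: {taxon_or_other: relative_abundance}}.
    """
    # Convert raw counts to per-sample relative abundance.
    rel = {}
    for sample, taxa in counts.items():
        total = sum(taxa.values())
        rel[sample] = {t: (c / total if total else 0.0) for t, c in taxa.items()}

    # Assumption: rank taxa by their mean relative abundance across samples.
    mean_abund = defaultdict(float)
    for taxa in rel.values():
        for taxon, r in taxa.items():
            mean_abund[taxon] += r / len(rel)

    # Keep taxa at or above the threshold, most abundant first,
    # capped at the number of colors left after reserving one for "Other".
    ranked = sorted(mean_abund, key=lambda t: (-mean_abund[t], t))
    keep = [t for t in ranked if mean_abund[t] >= min_rel_abundance][:max_groups]
    keep_set = set(keep)

    collapsed = {}
    for sample, taxa in rel.items():
        row = {t: taxa.get(t, 0.0) for t in keep}
        row[other_label] = sum(r for t, r in taxa.items() if t not in keep_set)
        collapsed[sample] = row
    return collapsed


# Made-up counts, only to show the call.
example = {
    "S1": {"g__Bacillus": 900, "g__Lactobacillus": 95, "g__Rare": 5},
    "S2": {"g__Bacillus": 500, "g__Lactobacillus": 498, "g__Rare": 2},
}
result = collapse_low_abundance(example, max_groups=11, min_rel_abundance=0.01)
```

With a 12-color map, `max_groups=11` leaves the twelfth color for the collapsed group. Setting `min_rel_abundance=0.001` would correspond to the 0.1% cutoff asked about earlier in the thread.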

Best,
Justine

fatihfarabigazali · December 19, 2024, 4:06am

Hello Justine,

Thank you so much for your response.

I have used the 2024.09.backbone.v4.nb.qza (Naive Bayes classifier trained on the V4 region) and am running my code on the command line (Linux server). I am working with plant-based samples. Regarding your question, "Can/should you weight your classification by the environment?"—does this refer to using weighted UniFrac files? Could you please elaborate on this a little more?

Regarding your comment, "the level of effort for these solutions varies and not all are tested because I tend toward solution #3"—thank you for the recommendations. I am planning to try both solutions #2 and #3.

I appreciate your time and help!

Best regards,

jwdebelius · December 19, 2024, 4:33pm

Hi @fatihfarabigazali,

If you're using the Greengenes V4 classifier, your samples should be amplified with 515F-806R primers. I'm not actually sure if 150nt is advised; you may want to double check the provenance. If you didn’t use 515F-806R primers, I’d recommend using a different classifier. (You could either use a full-length classifier or train your own.)

I would also be aware of chloroplast in your sample, and how many you expect.

One way to improve environment classification can be to use a bespoke or environmentally weighted classifier. The classifier takes into account expected taxonomic distributions by environment to help improve the classification. For example, I might expect a different Lactobacillus species in my yogurt sample than in a vaginal sample, and so the classifier could help with that identification. I’m not sure if there are bespoke environmental classifiers for GG2, so it may not be the right option.
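As a rough illustration of why environment weights can change a call, here is a toy Bayes-rule calculation in plain Python. It is only a sketch of the general idea (prior times likelihood, then normalize) and is not the q2-feature-classifier implementation. The species names, likelihoods, and weights are all made up for the example.

```python
def posterior(likelihoods, priors):
    """Combine per-taxon likelihoods with prior weights via Bayes' rule."""
    unnormalized = {taxon: likelihoods[taxon] * priors[taxon] for taxon in likelihoods}
    total = sum(unnormalized.values())
    return {taxon: value / total for taxon, value in unnormalized.items()}


# Hypothetical: the sequence evidence slightly favors species_A.
likelihoods = {"species_A": 0.6, "species_B": 0.4}

# Uniform weights: every taxon is equally expected beforehand.
uniform = posterior(likelihoods, {"species_A": 0.5, "species_B": 0.5})

# Hypothetical environment weights: species_B is far more common in this habitat.
weighted = posterior(likelihoods, {"species_A": 0.2, "species_B": 0.8})

best_uniform = max(uniform, key=uniform.get)
best_weighted = max(weighted, key=weighted.get)
```

In this toy case the preferred label flips from `species_A` to `species_B` once the weights are applied, which is also why weights from a poorly matched environment can hurt rather than help.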

Best,
Justine

fatihfarabigazali · December 23, 2024, 8:24pm

Hello Justine,

Thank you so much for your response.

I have been trying different databases and alternative options. I ended up using RESCRIPt with SILVA 138.2. The weights (I used the average one) did not work for me, as my environment (processed and packed food samples) is completely different from what is described on the website.

Currently, I am trying to perform further analysis in qiime2R and have encountered an issue with the following error message:

> tse <- qza_to_tse(features="table.qza", taxonomy="taxonomy.qza", tree="rooted-tree.qza", metadata="sample-metadata.tsv")
Error in `rownames<-`(`*tmp*`, value = new_rownames) : 
  invalid rownames length

I believe the issue occurs because I am using the filtered-table.qza file, which does not match the length of the taxonomy.qza file. Therefore, I need to filter the taxonomy.qza file to match the features in table.qza. Is there any code to accomplish this, or am I thinking about this incorrectly?

(p.s.: when I used unfiltered table.qza file, it worked. However, I assume I should use the filtered-table.qza for further downstream analysis)

Thank you for your help!

Best regards,

colinbrislawn · December 25, 2024, 6:02am

I think you are correct, the phyloseq error is due to the filter, especially because it works with no filter.

An alternative method is to import the full data, then do the filtering in R on the phyloseq object. This way phyloseq will make sure everything matches. (You can also do this in QIIME 2, but I don't have an example on hand!)
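Before re-importing, it can help to check which feature IDs actually differ between the exported table, taxonomy, and tree. The sketch below is plain Python and assumes the IDs have already been read from exported files into lists or dicts. The function names `compare_feature_ids` and `subset_taxonomy` are hypothetical helpers, not part of QIIME 2, qiime2R, or phyloseq, and a mismatch found this way is only one possible cause of the error above.

```python
def compare_feature_ids(table_ids, taxonomy_ids, tree_tip_ids=None):
    """Report feature IDs that are not shared between the inputs."""
    table_ids = set(table_ids)
    taxonomy_ids = set(taxonomy_ids)
    report = {
        "missing_from_taxonomy": sorted(table_ids - taxonomy_ids),
        "extra_in_taxonomy": sorted(taxonomy_ids - table_ids),
    }
    if tree_tip_ids is not None:
        tree_tip_ids = set(tree_tip_ids)
        report["missing_from_tree"] = sorted(table_ids - tree_tip_ids)
        report["extra_in_tree"] = sorted(tree_tip_ids - table_ids)
    return report


def subset_taxonomy(taxonomy, table_ids):
    """Keep only taxonomy rows whose feature ID is present in the table.

    taxonomy: {feature_id: taxon_string} (hypothetical input layout).
    """
    keep = set(table_ids)
    return {fid: taxon for fid, taxon in taxonomy.items() if fid in keep}


# Made-up IDs, only to show the call.
table_ids = ["asv1", "asv2"]
taxonomy = {
    "asv1": "d__Bacteria; p__Firmicutes",
    "asv2": "d__Bacteria; p__Proteobacteria",
    "asv3": "d__Bacteria; p__Actinobacteriota",
}
report = compare_feature_ids(table_ids, taxonomy, tree_tip_ids=["asv1", "asv2", "asv3"])
matched_taxonomy = subset_taxonomy(taxonomy, table_ids)
```

If every list in `report` is empty, the inputs share the same feature IDs and the problem is more likely somewhere else.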

fatihfarabigazali · December 25, 2024, 7:53pm

Hello Colin,

Thank you for your response. I was able to filter the taxonomy.qza file and obtain the matching results. However, I am still getting the same error.

It might be related to qza_to_tse itself. Do you think it could be something else?

colinbrislawn · December 26, 2024, 4:44pm

Thanks for posting that full screenshot! I've noticed that you are using different tree files as inputs to those functions.

rooted-tree.qza
rooted_tree.qza

What happens with matching inputs?

fatihfarabigazali · December 26, 2024, 6:58pm

Colin,

Thank you so much for your feedback. I have fixed the typo issue, but I am still getting the same error message.

jbisanz · December 26, 2024, 11:59pm

If you would be comfortable sending me your artifacts I could take a look.

fatihfarabigazali · December 28, 2024, 6:29pm

Hello Jordan,

Thank you so much for offering your help! The only issue I am encountering is with the qza_to_tse function, which does not work in my case. Aside from this, I am able to generate plots and graphs without any problems. However, I am unsure where exactly the qza_to_tse function is required.

To address the issue with the qza_to_tse function, I have tried several approaches, including:

Filtering only eukaryotes, archaea, etc. or
Using the following QIIME 2 command:

qiime feature-table filter-features \
  --i-table table.qza \
  --p-min-frequency 10 \
  --o-filtered-table filtered-sequences/feature-frequency-filtered-table.qza

However, none of these attempts have resolved the issue, and I continue to encounter the following error:

Error in `rownames<-`(`*tmp*`, value = new_rownames) : 
  invalid rownames length

Colin mentioned above that the phyloseq error is likely due to the filtering process, especially since the function works when no filter is applied. He suggested importing the full data and performing the filtering in R on the phyloseq object, as this ensures that everything matches correctly. He also noted that filtering can be done in QIIME 2, but he didn’t have an example on hand.

This suggests that the qza_to_tse function accepts unfiltered data, so I need to filter it within R instead. Could you guide me on how to filter specific groups, such as eukaryotes, mitochondria, etc., and/or remove all features with a total abundance (summed across all samples) of less than 10 in R using your code?
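The filtering logic described here can be sketched independently of any package. The example below is plain Python rather than R, and it assumes the table and taxonomy have been exported and loaded as dicts. The function name `filter_features`, the input layout, and the case-insensitive substring match on the taxonomy string are assumptions of this sketch, not the behavior of qiime2R or phyloseq.

```python
def filter_features(table, taxonomy, min_total=10,
                    exclude=("mitochondria", "chloroplast", "archaea", "eukaryota")):
    """Drop low-frequency features and features from unwanted groups.

    table:    {feature_id: {sample_id: count}} (hypothetical layout)
    taxonomy: {feature_id: taxon_string}
    """
    kept = {}
    for feature_id, sample_counts in table.items():
        # Remove features whose count summed across all samples is below min_total.
        if sum(sample_counts.values()) < min_total:
            continue
        # Assumption: case-insensitive substring match against the full lineage.
        # Features with no taxonomy entry are kept.
        lineage = taxonomy.get(feature_id, "").lower()
        if any(term in lineage for term in exclude):
            continue
        kept[feature_id] = sample_counts
    return kept


# Made-up data, only to show the call.
table = {
    "asv1": {"S1": 40, "S2": 12},
    "asv2": {"S1": 3, "S2": 4},
    "asv3": {"S1": 500, "S2": 800},
}
taxonomy = {
    "asv1": "d__Bacteria; p__Firmicutes; c__Bacilli",
    "asv2": "d__Bacteria; p__Proteobacteria",
    "asv3": "d__Bacteria; p__Cyanobacteria; c__Cyanobacteriia; o__Chloroplast",
}
filtered = filter_features(table, taxonomy, min_total=10)
```

Because the table and the taxonomy are filtered by the same set of feature IDs, the two stay in sync, which is the property Colin's suggestion relies on.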

When I generate taxa-bar plots, I notice that there are "Remainder" parts. Is there a way to rename "Remainder" as "Others"? Additionally, is the threshold for "Remainder" set to 1% in this case? I would also like to show only genus names in the plots. Currently, when I select the genus level, the entire taxonomic pathway for ASVs/OTUs is displayed. Could you advise me on how to modify the code to display only genus names?
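One way to get genus-only labels is to shorten the lineage strings before plotting. The sketch below is plain Python and assumes semicolon-separated lineages with rank prefixes such as `g__`, as in the SILVA- and Greengenes-style taxonomies discussed above. The function name `genus_label` and the "Unclassified" fallback wording are illustrative choices, not something qiime2R provides.

```python
def genus_label(lineage, unassigned="Unassigned"):
    """Return a short label for a lineage such as 'd__Bacteria; ...; g__Bacillus'."""
    ranks = [part.strip() for part in lineage.split(";") if part.strip()]

    # Prefer the genus entry when it carries a name.
    for rank in ranks:
        if rank.startswith("g__") and len(rank) > 3:
            return rank[3:]

    # Otherwise fall back to the deepest rank that has a name, so that
    # features stuck at class level stay distinguishable in the legend.
    for rank in reversed(ranks):
        prefix, separator, name = rank.partition("__")
        if separator and name:
            return f"Unclassified {name}"
    return unassigned


labels = [
    genus_label("d__Bacteria; p__Firmicutes; c__Bacilli; o__Bacillales; f__Bacillaceae; g__Bacillus"),
    genus_label("d__Bacteria; p__Firmicutes; c__Bacilli"),
    genus_label("Unassigned"),
]
```

The fallback keeps features that stop at a higher rank visibly separate from true genus calls instead of giving them an empty label.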

Thank you so much for your time and assistance!

Best regards,

colinbrislawn · December 28, 2024, 8:49pm

If you don't feel comfortable posting your qza file publicly, you can click on someone's name and send them the files in a message.

fatihfarabigazali · December 28, 2024, 9:34pm

I have shared the filtered versions with you and Jordan since I was having issues with them. Thank you very much!

fatihfarabigazali · December 31, 2024, 3:44am

Hello Colin and Jordan,

I wanted to check if you’ve had a chance to review the files.

Thank you!

lizgehret · January 2, 2025, 5:39pm

Hi @fatihfarabigazali,

Just a friendly reminder that Colin and Jordan may be slower to respond due to the winter holidays/new year. Please be patient; they will get back to you as soon as they are able.

colinbrislawn · January 3, 2025, 2:08am

(Because I've never used TreeSummarizedExperiment and Jordan is the developer of qiime2R, I'm waiting patiently for him to take a look.)

I'm sorry I'm not more help here!

fatihfarabigazali · January 5, 2025, 4:17am

Thank you Lizgehret and Colin,

Sounds good. I will wait for him to respond.

system · February 13, 2025, 11:57pm

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.