Curation recommendations for UNITE database based on sequence length

Joeee · November 22, 2024, 5:23pm

Not sure if this is the best place to post, but @SoilRotifer, in your tutorial on using Qiime2 with the SILVA database you use the command below to filter by sequence length. Would these values also be appropriate for use with the UNITE database? Thanks!

qiime rescript filter-seqs-length-by-taxon \
    --i-sequences silva-138.1-ssu-nr99-seqs-cleaned.qza \
    --i-taxonomy silva-138.1-ssu-nr99-tax.qza \
    --p-labels Archaea Bacteria Eukaryota \
    --p-min-lens 900 1200 1400 \
    --o-filtered-seqs silva-138.1-ssu-nr99-seqs-filt.qza \
    --o-discarded-seqs silva-138.1-ssu-nr99-seqs-discard.qza

SoilRotifer · November 22, 2024, 8:29pm

Hi @Joeee,

That is a great question! Just to clarify, as mentioned in that tutorial, feel free to change up the workflow outlined there. Keep in mind this tip from the tutorial:

Feel free to modify any of the steps in this tutorial, or their order, to best suit your needs!

For example:

Note / Tip : Depending on your goals, it may also be reasonable to use the raw imported sequences, or output from either cull-seqs or filter-seqs-length[-by-taxon] as input into the above extract-reads command.

Remember, that tutorial is just showing what you can do. I rarely run filter-seqs-length-by-taxon when I make my own SILVA classifier, unless I really only want near full-length sequences. That is, if you are making an amplicon-region-specific classifier, you can skip that length filtering step, as it may drop shorter sequences that actually contain quality data across your amplicon region of interest.

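Unless you know why you need to filter by length based on taxonomy, I'd likely skip it. Or, if you simply want to apply one length filter to everything, it will be faster to use rescript filter-seqs-length. Also, as you are using UNITE, if you still want to filter by taxonomy, you'll want to change the taxonomy labels (Archaea, Bacteria and Eukaryota in the SILVA command) to what you need, and pick minimum lengths that suit ITS, since the SILVA values were chosen for SSU rRNA genes.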
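
As a rough sketch of those two options for UNITE: the file names, the `Fungi` label and the length cut-off below are placeholders I've assumed for illustration, not recommended values, and `run` is just a hypothetical helper that prints each command and only executes it when QIIME 2 and the input artifact are actually present.

```shell
# Placeholder file names (assumed); point these at your own UNITE artifacts.
seqs="unite-seqs.qza"
tax="unite-tax.qza"
min_len=100  # illustrative value only, not a recommended threshold

# Hypothetical helper: print the command, and run it only when QIIME 2
# is installed and the input artifact exists.
run() {
  echo "$@"
  if command -v qiime >/dev/null 2>&1 && [ -f "$seqs" ]; then
    "$@"
  fi
}

# Option 1: one global minimum length applied to every sequence.
run qiime rescript filter-seqs-length \
  --i-sequences "$seqs" \
  --p-global-min "$min_len" \
  --o-filtered-seqs unite-seqs-filt.qza \
  --o-discarded-seqs unite-seqs-discard.qza

# Option 2: per-taxon minimum length, using a label that occurs in the
# UNITE taxonomy strings instead of the SILVA labels.
run qiime rescript filter-seqs-length-by-taxon \
  --i-sequences "$seqs" \
  --i-taxonomy "$tax" \
  --p-labels Fungi \
  --p-min-lens "$min_len" \
  --o-filtered-seqs unite-seqs-filt-by-taxon.qza \
  --o-discarded-seqs unite-seqs-discard-by-taxon.qza
```

Checking the discarded-seqs output after a test run is a quick way to see whether a given cut-off is throwing away sequences you would rather keep.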

Hopefully this helps.

Joeee · November 27, 2024, 4:07pm

Thanks! I did runs for both the amplicons (which is what I'm really interested in) and for all the data, but I might still have to figure out whether some of my script settings make sense. I did at least manage to get some visualisations from them, which indicates that my scripts are functional.

My fungal data looks awfully suspicious, however, so I'll have to redo some stuff following your advice.

Nicholas_Bokulich · November 27, 2024, 7:28pm

Hi @Joeee,

How so? Because you have many samples with very high amplification of plant DNA? Are you looking at plant-associated microbiomes? Soil? Or diets of herbivores? This is a common issue with ITS — most ITS primer sets also amplify plant ITS (and animal ITS etc... all euks!). The ITS primer sets that do not amplify plants tend to have lower coverage...

And actually this is also an issue with most 16S primers: chloroplasts and mitochondria have 16S genes! Notice in your bacterial plot that you have lots of reads annotated at phylum level as Cyanobacteria. Given the many plant reads in your ITS data, I bet these are chloroplast.
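
If you want to drop those reads before plotting, a minimal sketch with `qiime taxa filter-table` could look like the following. The artifact names are placeholders I've assumed, `run` is a hypothetical helper that prints each command and only executes it when QIIME 2 and the input table exist, and the search terms assume your taxonomy strings contain "mitochondria", "chloroplast" and "Fungi".

```shell
# Placeholder artifact names (assumed); replace with your own files.
table="table.qza"
taxonomy="taxonomy.qza"

# Hypothetical helper: print the command, and run it only when QIIME 2
# is installed and the input table exists.
run() {
  echo "$@"
  if command -v qiime >/dev/null 2>&1 && [ -f "$table" ]; then
    "$@"
  fi
}

# 16S: remove features annotated as mitochondria or chloroplast.
run qiime taxa filter-table \
  --i-table "$table" \
  --i-taxonomy "$taxonomy" \
  --p-exclude mitochondria,chloroplast \
  --o-filtered-table table-no-organelles.qza

# ITS: keep only features whose annotation contains "Fungi"
# (this also drops unassigned features, which may or may not be wanted).
run qiime taxa filter-table \
  --i-table "$table" \
  --i-taxonomy "$taxonomy" \
  --p-include Fungi \
  --o-filtered-table table-fungi-only.qza
```

In practice the 16S and ITS tables and taxonomies would be separate artifacts; the same names are reused here only to keep the sketch short.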

Joeee · November 28, 2024, 3:47pm

On a second look, I think you're almost surely correct about this. I just realised I'm also able to drag the graphs left to right to see a similar Viridiplantae dominance in my 16S data. This is my second ever microbiome analysis, but I'm now remembering I had similar results a few years ago.

I'm slightly ahead of schedule as I just wanted to get things working, so I may still choose to readjust my filters based on @SoilRotifer's comment. I was trying to make the graph in R, but Qiime2's built-in functions have saved me a lot of time.

P.S. My study is investigating the plant-soil microbiome. The treatments are combinations of fresh/autoclaved soil (high or low nutrient) and fresh/autoclaved/no treatments of frass. For example, we may have a treatment of autoclaved high-nutrient soil with a fresh frass additive.

Joeee · November 28, 2024, 6:26pm

Another oddity I noticed is that I seem to have much higher species richness values (ASVs) than some OTU figures I was previously given. Would different filtering be likely to lead to results almost twice as high, or could it be down to using a different reference database? I notice this pattern only shows in the soil samples and not at all in the root or leaf samples.

Nicholas_Bokulich · November 29, 2024, 9:21am

Hi @Joeee,

It is tough to say. You are comparing ASVs vs. OTUs, different filtering, presumably different sampling depths for rarefying, different databases... There are so many variables that differ here that it is not surprising you see such wide variation.