Trimming 341F reads to match 515F reads

Sam_Degregori · June 21, 2023, 11:46pm

Hi QIIME team,

I am running a large meta-analysis of microbiome data. Some of the datasets used the 341F primer and others the EMP 515F primer, and I would like to cut at 515F to "standardize" them. I have seen a couple of posts (Q2-cutadapt of primer sequences flag verification? & Reads processing with different primers) discussing the pros and cons of either fragment insertion or trimming prior to denoising.

Personally, I would love to trim my dataset in one go on one large file. There are many, many studies in this dataset and we are not 100% sure which are V3-V4 and which are V4, so it would be great just to pull out the same 515-805 region to combat that uncertainty.

So my question is: is it possible to cut in one go? And if so, what DNA sequence would you use as the marker? You can't use the 515F primer because that would find all the 515F reads but not any of the 341F reads, correct? So would you have to use the actual 16S DNA sequence at the 515F region to tell QIIME 2 where to cut?

I know there are two other posts very similar to this, but I did not see their actual cutadapt code or what DNA sequence they used to find and cut their reads.
Any info on this matter would be greatly appreciated!
Cheers,
Sam

gregcaporaso · June 22, 2023, 6:18pm

Hi @Sam_Degregori,
Can you link to the other posts you're referring to? I'd like to see those discussions.

My initial thought is that the approach you're suggesting probably isn't the right way to go about combining these data sets. If you trim the 341F reads to where the 515F primer ends (to match the 515F sequencing data), I suspect you'll be left with very short forward reads (too short to join paired ends) from the lowest-quality part of those reads (the ends), so this won't do a lot to help you merge the data sets.
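To put rough numbers on this, here is a minimal Python sketch. The E. coli coordinates for the primer sites (341F at roughly 341-357, 515F at roughly 515-533) and the function name `bases_left_after_trim` are assumptions for illustration, not values from this thread, and real positions shift a little between taxa.

```python
# Approximate E. coli 16S coordinates (assumed for illustration; actual
# positions shift slightly between taxa because of indels).
PRIMER_341F_START = 341  # first base of the 341F primer site
PRIMER_515F_END = 533    # last base of the 515F primer site


def bases_left_after_trim(read_length, read_start=PRIMER_341F_START,
                          trim_through=PRIMER_515F_END):
    """Return how many bases of a forward read lie past `trim_through`.

    Assumes the read begins at `read_start` (the 341F primer is still on
    the read) and extends `read_length` bases toward the 3' end.
    """
    read_end = read_start + read_length - 1
    return max(0, read_end - trim_through)


for length in (150, 250, 300):
    print(f"{length} bp read -> {bases_left_after_trim(length)} bp past the 515F site")
```

Under these assumptions a 250 bp read keeps about 57 bp and a 300 bp read about 107 bp, all of it from the 3' end where quality is lowest.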

Two other ways that you could approach this:

1. If all of the data sets used the 806R primer, you could work with the reverse reads only, in which case you could directly compare these across studies. I don't think your 341F/806R reads will merge anyway, so single-end is probably the only way to approach this analysis, and that's not too terrible of a thing (merged paired-end reads are generally better, but the overall patterns should be consistent with single-end reads from the same samples).
2. After generating a feature table for each data set independently, you could assign taxonomy and collapse at the genus level, and then combine the genus tables.

And finally, just a reminder that however you approach the analysis, you should expect to see some differences across the two data sets due to primer bias.

Sam_Degregori · June 22, 2023, 9:15pm

Hi @gregcaporaso,

Thanks for the insight. I updated my question so you can see which posts I am referring to and so others can as well.

So one issue is that, due to data storage limits (over 200 16S studies) and for standardization, we only downloaded forward reads, since some studies only upload single-end data. We also use Deblur, which has worked way better for us.

If I understand you correctly... you are saying that if I cut the 341F reads at 515F, there are only a few bp left, because the reads are likely only 250-300 bp long depending on the sequencing kit they went with? For some reason I thought forward reads go all the way to the reverse read point as well. But now I see how starting at 806R would be much better.

When we plot the data we do get very large primer bias, which makes sense, but my hope was that cutting the reads at some overlapping part would remove this bias. Is that pretty much futile?

Based on the info you gave me and the fact that using reverse reads might limit our number of studies to only paired-end data, our best option might be to just live with the data we have and subset the two primers as separate datasets.

Thanks for the insight!

gregcaporaso · June 23, 2023, 6:03pm

Hi @Sam_Degregori,

Yes, exactly.

That won't deal with the primer bias issue - the problem is that different primer pairs amplify the 16S from different taxa with different efficiency (e.g., due to different numbers of primer mismatches to targets), so it can't be corrected at the bioinformatics stage.

Yes, that might be the case. An alternative though is to start that way, assign taxonomy to all data sets independently, and then merge tables at the genus level. In that case you're using the genus-level taxonomy assignments to map the different ASVs into a "common feature space". You'll still have the primer bias issue of course, but you may be able to still gain some insight from the merged studies.
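As a minimal sketch of what mapping into a "common feature space" means, the Python below collapses two toy ASV tables to genus and then merges them. The table layout (nested dicts), the lineages, the sample IDs, and all function names are made up for illustration. In QIIME 2 itself, the `taxa collapse` and `feature-table merge` actions cover these steps.

```python
from collections import defaultdict


def genus_label(taxonomy, level=6):
    """Truncate a ';'-separated lineage to `level` ranks (6 = genus).

    Lineages with fewer ranks are returned as-is (unassigned at genus).
    """
    ranks = [r.strip() for r in taxonomy.split(";")]
    return "; ".join(ranks[:level])


def collapse_to_genus(table, taxonomy):
    """Sum ASV counts per genus.

    table:    {sample_id: {asv_id: count}}
    taxonomy: {asv_id: "d__...; p__...; ...; g__...; s__..."}
    """
    collapsed = {}
    for sample, counts in table.items():
        per_genus = defaultdict(int)
        for asv, count in counts.items():
            per_genus[genus_label(taxonomy[asv])] += count
        collapsed[sample] = dict(per_genus)
    return collapsed


def merge_tables(*tables):
    """Combine genus tables from different studies; sample IDs must be unique."""
    merged = {}
    for table in tables:
        overlap = merged.keys() & table.keys()
        if overlap:
            raise ValueError(f"duplicate sample IDs: {sorted(overlap)}")
        merged.update(table)
    return merged


# Toy data: one V3-V4 study and one V4 study with unrelated ASV IDs.
LACTO = ("d__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; "
         "f__Lactobacillaceae; g__Lactobacillus")
BACTE = ("d__Bacteria; p__Bacteroidota; c__Bacteroidia; o__Bacteroidales; "
         "f__Bacteroidaceae; g__Bacteroides")

v3v4_table = {"studyA_s1": {"asv1": 10, "asv2": 5}}
v3v4_tax = {"asv1": LACTO + "; s__a", "asv2": LACTO + "; s__b"}
v4_table = {"studyB_s1": {"asv9": 7, "asv8": 3}}
v4_tax = {"asv9": LACTO + "; s__a", "asv8": BACTE + "; s__c"}

merged = merge_tables(collapse_to_genus(v3v4_table, v3v4_tax),
                      collapse_to_genus(v4_table, v4_tax))
print(merged)
```

The ASV IDs from the two studies never need to match; only the genus labels do. This is also why the primer bias survives the merge: the counts behind each label still come from different amplification efficiencies.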

Sam_Degregori · July 18, 2023, 3:01pm

@gregcaporaso we went ahead and subsetted everything out by primer and ran some standard analyses and the variation is much easier to understand now without the technical factors. Thanks for the insight on all this.

From a meta-analysis standpoint it's a bit unfortunate that 341F and 515F became equally popular in gut microbiome research, but I guess we can't complain about having more data.

gregcaporaso · July 20, 2023, 10:40pm

@Sam_Degregori, glad to hear it's making more sense now! You're welcome, happy to help.

Best,
Greg

wasade · October 2, 2023, 10:50pm

Hi @Sam_Degregori,

Thank you for flagging this thread on Slack. I thought I'd respond here in case the response is of interest to others.

Similar to the suggestion @gregcaporaso made regarding collapsing taxonomy, it is also possible to collapse the phylogeny. I don't believe the two approaches have been systematically compared though. A benefit of the phylogenetic approach is that taxonomic labels are not consistent in meaning, and depending on the taxonomy, distinct clades may be represented by the same label.

I don't believe a phylogenetic collapse is currently implemented within the suite of QIIME 2 packages. A small program to do this, using a .biom table and a Newick tree, is attached. You can specify the width of the clades to collapse, or a percentage of the maximum tip-tip distance in the tree. It outputs a collapsed feature table and phylogeny.
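For readers who want a feel for the idea before opening the attachment, here is a minimal Python sketch of one way a width-based collapse could work. It is not the attached script. The tree representation (a dict of child lists rather than Newick), the node names, and the function names are all assumptions for illustration, and it only maps tips to clades and sums the table, without writing a collapsed phylogeny.

```python
def clade_stats(tree, node, stats):
    """Fill stats[node] = (height, width, tips) for `node` and its descendants.

    height: longest path from the node down to a tip
    width:  largest tip-to-tip distance within the clade
    Recursive, so only suitable for small trees in this sketch.
    """
    children = tree.get(node, [])
    if not children:
        stats[node] = (0.0, 0.0, [node])
        return
    depths, width, tips = [], 0.0, []
    for child, length in children:
        clade_stats(tree, child, stats)
        c_height, c_width, c_tips = stats[child]
        depths.append(c_height + length)
        width = max(width, c_width)
        tips.extend(c_tips)
    depths.sort(reverse=True)
    if len(depths) > 1:
        # Longest tip-to-tip path passing through this node.
        width = max(width, depths[0] + depths[1])
    stats[node] = (depths[0], width, tips)


def collapse_clades(tree, root, max_width):
    """Map each tip to the highest ancestor whose clade width <= max_width.

    Assumes max_width >= 0, so every tip can at least map to itself.
    """
    stats = {}
    clade_stats(tree, root, stats)
    tip_to_clade, stack = {}, [root]
    while stack:
        node = stack.pop()
        _, width, tips = stats[node]
        if width <= max_width:
            for tip in tips:
                tip_to_clade[tip] = node
        else:
            stack.extend(child for child, _ in tree[node])
    return tip_to_clade


def collapse_table(table, tip_to_clade):
    """Sum counts per clade; table is {sample_id: {tip_id: count}}."""
    collapsed = {}
    for sample, counts in table.items():
        row = {}
        for tip, count in counts.items():
            clade = tip_to_clade[tip]
            row[clade] = row.get(clade, 0) + count
        collapsed[sample] = row
    return collapsed


# Toy tree: node -> [(child, branch_length), ...]; tips have no entry.
toy_tree = {"root": [("n1", 0.5), ("C", 1.0)],
            "n1": [("A", 0.1), ("B", 0.1)]}
toy_table = {"s1": {"A": 3, "B": 4, "C": 2}}

mapping = collapse_clades(toy_tree, "root", max_width=0.25)
print(collapse_table(toy_table, mapping))
```

In the toy tree, tips A and B sit 0.2 apart and fall under `n1`, while C is too distant to join them at a width of 0.25. A percentage-based threshold could be derived by multiplying the root's width by the chosen fraction.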

As for the different primer data, to use this method, the data would need to be mapped to a common reference (e.g., closed reference). With Greengenes2, this would be the non-v4-16s action, applied to both data sets.

Note: I would be eager to work with someone to formally represent the phylogenetic collapse method in a QIIME 2 plugin if someone is interested. Regrettably, it is not something I can do near term. The attached script has inline unit tests, will work with a standard QIIME 2 environment, and would be a great introductory item to QIIME 2 plugin development.

All the best,
Daniel

collapse_to_phylogeny.py (5.1 KB)