Ready to wear weights with non-V4 sequences

jwdebelius · May 3, 2021, 5:15pm

Hi all,

I'm trying to build a bespoke classifier using V3-V4 primers. Given a short timeline and a philosophy that it's usually better to get someone else to do complex computation for you, I was hoping to use some of the weights from the Ready to Wear Repository.

Will the full-length weights work if I trim to another overlapping region, or do I need to train my own weights?

Thanks,
Justine

Mehrbod_Estaki · May 3, 2021, 11:40pm

Hey @jwdebelius ,

I think you'll want weights specific to your region.

And on that note (though not directly answering your question), a while back I had a chat with @BenKaehler about adding some weights to readytowear, including V3-V4. Here is a summary of that conversation in case it helps or you find yourself in a contributing mood (and also as a public shaming of myself for not getting my portion done):

- Adding V3-V4 weights: We discussed that since we can't grab denoised/processed V3-V4 data from Qiita, we would have to manually find public V3-V4 samples and denoise them with DADA2 before they can be used in clawback. Ben mentioned that a minimum of ~120 samples per environment would be needed to see positive gains. They shouldn't be biased samples from a specific disease either. A large heterogeneous mix would probably be ideal.
  - There was a mention of automating clawback to work with SRA queries as well, though I'm guessing that needs some external momentum to get going.
- The existing V4 primers used for readytowear are based on the old EMP primers. It might be worth adding additional weights with the updated EMP primers (though I doubt this will have a big impact overall).
  - Those primers were used because the pre-cooked classifiers on the QIIME 2 resource page also use the old EMP primers. @Nicholas_Bokulich and @SoilRotifer, any thoughts on updating or adding the new EMP primers into the resource page (using rescript)?
- Update to include the newest GTDB (2 new releases exist now since the 89 release; latest: Release 06-RS202 as of April 27, 2021).

Since the EMP website seems to be down at the moment, these are the old and new V4 primers I mentioned above (copied from the EMP website):

Updated sequences: 515F (Parada)–806R (Apprill), forward-barcoded:
FWD: GTGYCAGCMGCCGCGGTAA;
REV: GGACTACNVGGGTWTCTAAT

Original sequences: 515F (Caporaso)–806R (Caporaso), reverse-barcoded:
FWD: GTGCCAGCMGCCGCGGTAA;
REV: GGACTACHVGGGTWTCTAAT
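
For anyone comparing the two primer sets, here is a minimal Python sketch (not part of the EMP protocol or of any QIIME 2 plugin) that uses the standard IUPAC degenerate codes to show where the sequences differ and how many concrete oligos each primer represents. The helper names `differing_positions` and `degeneracy` are made up for illustration.

```python
# Standard IUPAC nucleotide codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Primer sequences exactly as quoted in this post: (updated, original).
PRIMERS = {
    "FWD": ("GTGYCAGCMGCCGCGGTAA", "GTGCCAGCMGCCGCGGTAA"),
    "REV": ("GGACTACNVGGGTWTCTAAT", "GGACTACHVGGGTWTCTAAT"),
}


def differing_positions(a, b):
    """Return (1-based position, code in a, code in b) wherever a and b differ."""
    if len(a) != len(b):
        raise ValueError("primers must have the same length")
    return [(i + 1, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]


def degeneracy(primer):
    """Number of concrete sequences covered by a degenerate primer."""
    count = 1
    for code in primer:
        count *= len(IUPAC[code])
    return count


if __name__ == "__main__":
    for name, (updated, original) in PRIMERS.items():
        print(name, "differences:", differing_positions(updated, original))
        print(name, "degeneracy (updated, original):",
              degeneracy(updated), degeneracy(original))
```

Each pair differs at a single degenerate base: position 4 in the forward primer (Y instead of C) and position 8 in the reverse primer (N instead of H).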

jwdebelius · May 4, 2021, 12:28am

Thanks @Mehrbod_Estaki!

I think then the specific answer to my question is to use the full-length classifier, because training specific weights seems like more than I want to do for a one-off project. (See aforementioned laziness.) I wondered whether one solution for other regions, which might at least represent some kind of average or midpoint, would be to filter the weights at a minimum, so that you could have a different set or subset of weights based on what was amplified.

Best,
Justine

BenKaehler · May 4, 2021, 5:47pm

Hi @jwdebelius,

Thanks very much @Mehrbod_Estaki for remembering that conversation. I note that the Caporaso primers are still used for the pretrained classifiers. Should we recommend that they update those?

@jwdebelius, there are two options I can think of that won’t take months of development. In my experience they will probably give fairly similar results.

1. Just use full length sequences and full length readytowear weights. That is, don’t trim anything.
2. Trim your reference db using your V3V4 primers then use it to build new weights for your habitat of choice.

I know you said you were too lazy for the second option, but it is probably just as easy as downloading weights from readytowear. You can probably do it with a single call to clawback assemble-weights-from-Qiita. You would have already done most of the steps in the tutorial.
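
To make the trimming step in the second option concrete, here is a rough Python sketch of what primer-based trimming of a reference sequence means. It is an illustration only, not how QIIME 2 or clawback implement it: it requires exact matches to the degenerate primers, whereas real tools tolerate mismatches. The V3V4 primer sequences are not given in this thread, so the demo uses the updated V4 primers quoted earlier and a made-up toy sequence. The function names are hypothetical.

```python
import re

# Standard IUPAC nucleotide codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
# Complement table that also handles degenerate codes (e.g. R <-> Y, H <-> D).
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def primer_to_regex(primer):
    """Turn a degenerate primer into a regex of character classes."""
    return "".join(f"[{IUPAC[code]}]" for code in primer)


def reverse_complement(seq):
    """Reverse complement of a (possibly degenerate) DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def extract_amplicon(seq, fwd_primer, rev_primer):
    """Return the region between the primers (primers removed), or None.

    Assumes seq is in the same orientation as the forward primer and
    that both primers match exactly.
    """
    fwd = re.search(primer_to_regex(fwd_primer), seq)
    if fwd is None:
        return None
    rest = seq[fwd.end():]
    rev = re.search(primer_to_regex(reverse_complement(rev_primer)), rest)
    if rev is None:
        return None
    return rest[:rev.start()]


if __name__ == "__main__":
    # Made-up toy "reference": flank + forward primer site + insert
    # + reverse primer site (reverse-complemented) + flank.
    toy = ("AAAA" + "GTGCCAGCAGCCGCGGTAA" + "TTTTGGGGCCCC"
           + "ATTAGATACCCTGGTAGTCC" + "AAAA")
    print(extract_amplicon(toy, "GTGYCAGCMGCCGCGGTAA", "GGACTACNVGGGTWTCTAAT"))
```

Swapping in your own V3V4 primer pair is the only change needed for the idea to carry over. For real data you would use the QIIME 2 tooling rather than a script like this.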

It’s up to you, though, and it probably won’t make much difference.

I hope that helps.

Ben