Training a classifier without primer information

Sky23 · September 18, 2020, 2:02pm

Hi qiime2 users,
I have raw 16S rRNA V3-V4 sequences from a company, with the primers still in them.
The company did not provide the primer sequences because they are proprietary.
They told me to cut 16 nt from the forward reads and 24 nt from the reverse reads, at the 5' end, to remove the primers.
I would like to use the SILVA 132 database for classification.
How can I train a classifier without knowing the primer sequences? Is there any way?
Or should I use the full-length SILVA sequences for classification?
Any suggestions will be appreciated.
Thank you
Regards
Sudhir

SoilRotifer · September 18, 2020, 5:36pm

Hi @Sky23,

You are fine using the full-length SILVA references for classification. New SILVA 138 classifiers, and the files used to make them, are available on the Data Resources page. But if you'd like to make an amplicon-specific classifier, you can do a few things:

Extract positions from the curated SILVA alignment:
- You can try aligning the first 16 bp of the forward read and the first 24 bp of the reverse read (reverse complement the reverse read) to a subset of the curated SILVA alignment, then approximate and extract the region based on the alignment positions, as I outline here (see step 7). Note: many of these steps were ported over to RESCRIPt, except for the positional alignment extraction step I'm referring you to. Hopefully, we'll have a native QIIME 2 way of doing this soon.
- This is probably a simpler approach: align a subset of your ESVs (i.e. the output from DADA2 / deblur) to a subset of the curated, aligned SILVA reference database, note the positions, and extract those from the alignment, as I outline in the link above. Obviously, use the trimming options of DADA2 / deblur to trim the primers off first.

Probably the easiest: use a popular set of V3-V4 primers (which, in all likelihood, are quite similar to the proprietary primers that were used) to extract the amplicon region:
- 341F: 5'-CCTACGGGNGGCWGCAG-3'
- 805R: 5'-GACTACHVGGGTATCTAATCC-3'
- If you want to see how closely these extracted reference sequences match up to your ESVs, you can align a subset of both the reference sequences and the ESVs to see how much longer or shorter one set is than the other, then modify accordingly.
- Caveat: there may be differences in primer bias when using the above primers to extract the sequences from the reference data set, as compared to the primers actually used to sequence your data.
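
As a rough illustration of the primer-based extraction idea above, here is a minimal Python sketch that pulls out the region between the 341F and 805R sites from a single unaligned reference sequence. It is not how QIIME 2 or RESCRIPt implement this: the helper names (`reverse_complement`, `primer_to_regex`, `extract_amplicon`) are made up for the example, and it assumes exact matches to the degenerate primers, whereas real tools typically tolerate some mismatches.

```python
import re

# IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

# Complement table that also handles degenerate codes.
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")

PRIMER_341F = "CCTACGGGNGGCWGCAG"
PRIMER_805R = "GACTACHVGGGTATCTAATCC"


def reverse_complement(seq):
    """Reverse complement a DNA sequence, keeping IUPAC codes valid."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def primer_to_regex(primer):
    """Turn a degenerate primer into a regex pattern string."""
    return "".join(IUPAC[base] for base in primer.upper())


def extract_amplicon(reference, fwd_primer, rev_primer):
    """Return the region between the two primer sites (primers excluded).

    Hypothetical helper: exact IUPAC matching only, first hit only.
    Returns None if either primer site is not found.
    """
    reference = reference.upper()
    fwd = re.search(primer_to_regex(fwd_primer), reference)
    if fwd is None:
        return None
    # The reverse primer binds the opposite strand, so search for its
    # reverse complement downstream of the forward primer site.
    rev_pattern = re.compile(primer_to_regex(reverse_complement(rev_primer)))
    rev = rev_pattern.search(reference, fwd.end())
    if rev is None:
        return None
    return reference[fwd.end():rev.start()]


# Toy reference (not real 16S): flank + 341F site + insert + 805R site (RC) + flank.
toy_reference = (
    "AAAA" + "CCTACGGGAGGCAGCAG" + "TTTTGGGGTTTT"
    + "GGATTAGATACCCTAGTAGTC" + "CCCC"
)
toy_amplicon = extract_amplicon(toy_reference, PRIMER_341F, PRIMER_805R)
```

Sequences where either primer site is missing return `None`, which is one place the primer-bias caveat shows up: references that do not match the generic primers would simply drop out of the extracted set.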

I hope this helps!
Mike

Sky23 · September 19, 2020, 11:17am

Hi Mike,
Thank you for your detailed explanation.
I will probably go with the new SILVA 138 classifiers from the Data Resources page, since that might be easier for me.
You were right that the proprietary primers are very close to 341F/805R, except for a few bases at the end.
First 16 bp of forward reads from my fastq: 5'-GCCTACGGGAGGCTGC-3'
First 24 bp of reverse reads from my fastq: 5'-CGACTACTAGGGTATCTAATCCTG-3'
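
One way to check this kind of similarity is to slide each generic primer along the read start and test for an IUPAC-aware match. The sketch below is only an illustration: `find_primer_offset` and its `min_overlap` default are made-up names and values, and it looks for exact matches over the overlapping part only.

```python
# IUPAC nucleotide codes mapped to the bases they stand for.
IUPAC_SETS = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}


def find_primer_offset(read_start, primer, min_overlap=10):
    """Find where a degenerate primer lines up with the start of a read.

    Returns (offset, overlap): the first offset into read_start at which
    every overlapping base is compatible with the primer, and how many
    bases overlap. Returns None if no such offset exists.
    """
    for offset in range(len(read_start)):
        overlap = min(len(primer), len(read_start) - offset)
        if overlap < min_overlap:
            break  # too little sequence left for a meaningful comparison
        window = read_start[offset:offset + overlap]
        if all(base in IUPAC_SETS[code] for base, code in zip(window, primer)):
            return offset, overlap
    return None


fwd_hit = find_primer_offset("GCCTACGGGAGGCTGC", "CCTACGGGNGGCWGCAG")
rev_hit = find_primer_offset("CGACTACTAGGGTATCTAATCCTG", "GACTACHVGGGTATCTAATCC")
```

On the two sequences posted here, both primers line up at offset 1. The forward window overlaps the first 15 bases of 341F, so the 16 nt stop two bases short of that primer's 3' end. The reverse window contains all 21 bases of 805R, followed by two extra bases. This only shows how the generic primers relate to the read starts; it says nothing definite about what the proprietary primers are.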

As I understand it, QIIME 2 does not have a function to cut a fixed number of base pairs from the 5' end of fastq files, and the cutadapt function needs to be supplied with primer sequences. Am I correct?
Therefore, I am using the DADA2 step to remove the primers, with the following command:

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs trimmed-fastq.qza \
  --p-trim-left-f 16 \
  --p-trim-left-r 24 \
  --p-trunc-len-f 321 \
  --p-trunc-len-r 255 \
  --p-n-threads 1 \
  --p-min-fold-parent-over-abundance 8 \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats denoising-stats.qza

Is it OK to remove primers like this during DADA2?
or
Should I do the trimming outside of QIIME 2 first and then run DADA2 in QIIME 2? I would appreciate it if you could suggest a program to trim fastq files outside QIIME 2.
or
Should I use the generic 341F (5'-CCTACGGGNGGCWGCAG-3') and 805R (5'-GACTACHVGGGTATCTAATCC-3') primers with cutadapt to remove the primer sequences and then run DADA2 in QIIME 2? I suspect the proprietary primers are slightly different, so using 341F/805R might not remove all the primer bases from my fastq files. Would that be an option?

Thank you for your time.
Regards

SoilRotifer · September 19, 2020, 1:37pm

Hi @Sky23,

You're welcome.

Correct, cutadapt needs to be given the primer sequences.

Yep, removing the primers with the DADA2 trim options is perfectly fine.

Using cutadapt with the generic 341F/805R primers will work too. You can just use the --p-trim-* options of DADA2 to remove any excess bases that were not covered by the 3' end of those primer sequences. Anything before the primer on the 5' end will be removed by cutadapt. But the DADA2 trim option is the simplest.
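
To make the fixed-length trimming concrete, here is a small Python sketch of what trimming and truncating do to a single read. It is a conceptual model only, not DADA2's code: `trim_and_truncate` is a made-up helper. It assumes, based on my reading of the DADA2 documentation, that reads shorter than the truncation length are discarded and that surviving reads end up `trunc_len - trim_left` bases long.

```python
def trim_and_truncate(seq, qual, trim_left, trunc_len):
    """Model of fixed-length 5' trimming plus 3' truncation for one read.

    seq, qual : sequence and quality strings of equal length
    trim_left : number of bases to drop from the 5' end (e.g. primer length)
    trunc_len : position at which to cut the 3' end; 0 means no truncation
    Returns (seq, qual) after trimming, or None if the read is shorter
    than trunc_len (assumed here to mean the read is discarded).
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    if trunc_len and len(seq) < trunc_len:
        return None
    end = trunc_len if trunc_len else len(seq)
    return seq[trim_left:end], qual[trim_left:end]


# Toy forward read: the 16 nt primer region followed by made-up insert bases.
toy_seq = "GCCTACGGGAGGCTGC" + "ACGTACGTAC" + "TTTT"
toy_qual = "I" * len(toy_seq)
toy_result = trim_and_truncate(toy_seq, toy_qual, trim_left=16, trunc_len=26)
```

Under that assumption, the trim and truncation values in the command above would give forward reads of 321 - 16 = 305 bases and reverse reads of 255 - 24 = 231 bases, provided the raw reads are at least as long as the truncation lengths.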

Good luck!

Sky23 · September 20, 2020, 10:55am

Thank you very much for your time.
Your suggestions were very helpful for my analysis.
Regards.