Rescript to Train Silva Naive Bayes classifier for 16S data

sfreeman · July 9, 2024, 8:47pm

Hello! I am relatively new to microbiome analysis and I am working on setting up and training a naive bayes feature classifier using the rescript plugin following the Processing, filtering, and evaluating the SILVA database (and other reference sequence data) with RESCRIPt and the Using RESCRIPt's 'extract-seq-segments' to extract reference sequences without PCR primer pairs tutorials (I am planning to use it for analysis of 16S data). I have gotten to the "Expanding Reference Segment Sequence Pool" portion of the second tutorial and I am a little confused about what to make of the results (e.g. whether I did this correctly, and when to stop going through the "extract seq segments" function.

After extracting my primer, the tabulate sequence .qzv file shows the sequence length statistics at sequence count = 279796 with a min length of 54 and mean length of 274.7. After my first iteration of the extract seq segment function (with culling and dereplicating), sequence length: 312378 min length: 98 mean length: 438.6, after my second iteration of the extract seq segment function (with culling and dereplicating), sequence length: 214529 min length:121 mean length: 809.2

The code I have used for the "extract seq segments" function is below. I have been dereplicating and using cull-seqs in between iterations.
qiime rescript extract-seq-segments
--i-input-sequences /taxonomy_nb/silva-138.1-ssu-nr99-seqs-derep-uniq.qza
--i-reference-segment-sequences /taxonomy_nb/silva-138.1-ssu-nr99-seqs-515f-806r-uniq-cull.qza
--p-perc-identity 0.9
--p-min-seq-len 10
--o-extracted-sequence-segments /taxonomy_nb/silva-138.1-ssu-nr99-extracted-seq-segments-01.qza
--o-unmatched-sequences /taxonomy_nb/silva-138.1-ssu-nr99-unmatched-sequences-01.qza

How will I know that my data is ready for the classifier? I am also confused why sequence length decreased after the second iteration of extract seq segment function.

Thank you!

Nicholas_Bokulich · July 10, 2024, 5:08am

Hi @sfreeman ,
As you have 16S V4 region data, I would suggest using only the first tutorial (i.e., to prepare the database) and not the second (using extract-seq-segments). You could also just use one of the pre-trained 16S V4 classifiers from the QIIME 2 website.

extract-seq-segments was designed for cases when you need to assemble a reference database from somewhat patchy starting sequences, e.g., for marker genes that do not have curated databases yet, or which might have incomplete coverage of the target amplicon. However, the SILVA database consists of curated full-length 16S sequences so should not require this step. It might still make sense for primer targets that are near the ends of the reference sequences, but the V4 domain is right in the middle of the 16S so should be covered by all or most SILVA sequences.

I hope that helps!

sfreeman · July 10, 2024, 12:29pm

Hi @Nicholas_Bokulich ,

Thank you for the clarification! As a follow up question, do you recommend that I follow the steps from the first tutorial to make the amplicon-region specific classifier
(with qiime feature-classifier extract-reads
--i-sequences silva-138.1-ssu-nr99-seqs-derep-uniq.qza
--p-f-primer GTGYCAGCMGCCGCGGTAA
--p-r-primer GGACTACNVGGGTWTCTAAT
--p-n-jobs 2
--p-read-orientation 'forward'
--o-reads silva-138.1-ssu-nr99-seqs-515f-806r.qza)
Or is it best to perform the fit-classifier function after initially filtering the sequences by length and taxonomy?

Thank you again!

Nicholas_Bokulich · July 10, 2024, 12:56pm

Hi @sfreeman ,
As you are using V4 you can follow that tutorial exactly (provided that you used the same primer sequences). But this is also how the pre-trained classifiers on the QIIME 2 website were built, so you could download these if you want to save some time.

Well that is also a step in that tutorial correct? There are some abnormally short outliers that should be filtered out before dereplication and training.

sfreeman · July 10, 2024, 1:29pm

Hi @Nicholas_Bokulich ,

I apologize for the confusion. My question was just to confirm that I should be running the feature-classifier extract-reads function and then running the feature-classifier fit-classifier-naive-bayes function, as opposed to running the feature-classifier fit-classifier-naive-bayes function without the extract-reads function, as the tutorial includes both options.

Nicholas_Bokulich · July 10, 2024, 2:24pm

Thanks for clarifying! Understood now.

Yes, do this if you have V4 data. It will slightly improve accuracy

this is also fine. Extracting sequences that match your amplicon region will improve accuracy a bit, but not by a lot, so some people prefer training their classifier on the full 16S (because then they can classify multiple different regions).

But you have V4 data so I suggest using extract-reads prior to training the classifier.

Good luck!

system · August 11, 2024, 12:51am

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.