After extracting my primer, the tabulate sequence .qzv file shows the sequence length statistics at sequence count = 279796 with a min length of 54 and mean length of 274.7. After my first iteration of the extract seq segment function (with culling and dereplicating), sequence length: 312378 min length: 98 mean length: 438.6, after my second iteration of the extract seq segment function (with culling and dereplicating), sequence length: 214529 min length:121 mean length: 809.2
The code I have used for the "extract seq segments" function is below. I have been dereplicating and using cull-seqs in between iterations.
qiime rescript extract-seq-segments
--i-input-sequences /taxonomy_nb/silva-138.1-ssu-nr99-seqs-derep-uniq.qza
--i-reference-segment-sequences /taxonomy_nb/silva-138.1-ssu-nr99-seqs-515f-806r-uniq-cull.qza
--p-perc-identity 0.9
--p-min-seq-len 10
--o-extracted-sequence-segments /taxonomy_nb/silva-138.1-ssu-nr99-extracted-seq-segments-01.qza
--o-unmatched-sequences /taxonomy_nb/silva-138.1-ssu-nr99-unmatched-sequences-01.qza
How will I know that my data is ready for the classifier? I am also confused why sequence length decreased after the second iteration of extract seq segment function.
Hi @sfreeman ,
As you have 16S V4 region data, I would suggest using only the first tutorial (i.e., to prepare the database) and not the second (using extract-seq-segments). You could also just use one of the pre-trained 16S V4 classifiers from the QIIME 2 website.
extract-seq-segments was designed for cases when you need to assemble a reference database from somewhat patchy starting sequences, e.g., for marker genes that do not have curated databases yet, or which might have incomplete coverage of the target amplicon. However, the SILVA database consists of curated full-length 16S sequences so should not require this step. It might still make sense for primer targets that are near the ends of the reference sequences, but the V4 domain is right in the middle of the 16S so should be covered by all or most SILVA sequences.
Thank you for the clarification! As a follow up question, do you recommend that I follow the steps from the first tutorial to make the amplicon-region specific classifier
(with qiime feature-classifier extract-reads
--i-sequences silva-138.1-ssu-nr99-seqs-derep-uniq.qza
--p-f-primer GTGYCAGCMGCCGCGGTAA
--p-r-primer GGACTACNVGGGTWTCTAAT
--p-n-jobs 2
--p-read-orientation 'forward'
--o-reads silva-138.1-ssu-nr99-seqs-515f-806r.qza)
Or is it best to perform the fit-classifier function after initially filtering the sequences by length and taxonomy?
Hi @sfreeman ,
As you are using V4 you can follow that tutorial exactly (provided that you used the same primer sequences). But this is also how the pre-trained classifiers on the QIIME 2 website were built, so you could download these if you want to save some time.
Well that is also a step in that tutorial correct? There are some abnormally short outliers that should be filtered out before dereplication and training.
I apologize for the confusion. My question was just to confirm that I should be running the feature-classifier extract-reads function and then running the feature-classifier fit-classifier-naive-bayes function, as opposed to running the feature-classifier fit-classifier-naive-bayes function without the extract-reads function, as the tutorial includes both options.
Yes, do this if you have V4 data. It will slightly improve accuracy
this is also fine. Extracting sequences that match your amplicon region will improve accuracy a bit, but not by a lot, so some people prefer training their classifier on the full 16S (because then they can classify multiple different regions).
But you have V4 data so I suggest using extract-reads prior to training the classifier.