Hello community,
I apologize in advance this will be a lengthy post, but i will be as concise as possible. Thank you for reading!
I am running Qiime 2023.9 via conda. My reads come from soil samples, bacterial dna, and the 341f-805f primer pair was used. When looking at the demux quality plots, the quality of the reverse reads was too poor, or after cleaning there was not enough overlap, so I decided to discard them and continue only with the forward reads.
I want to perform taxonomic classification, so for that purpose I was following this Rescript tutorial using Silva138.1 database , but then I came across this post that explains how when making an amplicon region specific classifier, sequences that do not contain both primers get omitted.
Based on this picture I think my sequences are similar to line 1 or 3 yet those would not be taken into consideration as they don't contain both primers. So I followed the above tutorial to enlarge the sequence pool of the database and the results changed from the first picture to the second:
Now my issue is, the sequence count did grow larger but so did the sequence length from 50% being at 422 to the 50% mark being at 1029. In one of the discussions in the comments, a moderator explains that lengths in the 1000 range is normal for that gene, but as far as I know for V3-V4 region the length should be around ~460 bp. Does that mean that I should use 'extract-reads' to trim closer to the 460 mark? But if so, I will be losing so many reads as even the 25% mark (=569) is higher than that. Additionally, the training classifiers tutorial over on qiime docs says that in the case of single-end reads trimming is not recommended: "For classification of paired-end reads and untrimmed single-end reads, we recommend training a classifier on sequences that have been extracted at the appropriate primer sites, but are not trimmed".
I also found this pre-trained classifier, that uses the same primer pair and the Greengenes 99 data base thanks to one of the moderators, but from the same concern that my sequences are only the forward reads and since not both primers are contained, the taxonomic assignment might not work properly I don't know if it's appropriate to use for my data.
So to sum everything up:
- I have bacterial dna single end reads using the 341f-805f primers
-I used rescript on silva138.1 database and made a specific classifier for my primer pair
-Because both primers have to be present for the sequence to be included, I decided too enlarge the database using 'extract-seq-segments'
-Sequence count got bigger, but length reaches 1000 (was expecting values around 460 range, trimming is not recommended for single reads)
I would really appreciate it if you could give me some advice as to how to continue from this point as I am quite stuck.
Thank you very much for reading and for the so very helpful forum <3
I look forward to your reply. If anything is unclear, please feel free to let me know. Thank you!
Best regards,
Riva