Training classifier using RESCRIPt and enlarging database with 'extract-seq-segments' for single-end sequences

Riva · January 31, 2024, 11:11am

Hello community,

I apologize in advance this will be a lengthy post, but i will be as concise as possible. Thank you for reading!

I am running Qiime 2023.9 via conda. My reads come from soil samples, bacterial dna, and the 341f-805f primer pair was used. When looking at the demux quality plots, the quality of the reverse reads was too poor, or after cleaning there was not enough overlap, so I decided to discard them and continue only with the forward reads.

I want to perform taxonomic classification, so for that purpose I was following this Rescript tutorial using Silva138.1 database , but then I came across this post that explains how when making an amplicon region specific classifier, sequences that do not contain both primers get omitted.

Based on this picture I think my sequences are similar to line 1 or 3 yet those would not be taken into consideration as they don't contain both primers. So I followed the above tutorial to enlarge the sequence pool of the database and the results changed from the first picture to the second:

Now my issue is, the sequence count did grow larger but so did the sequence length from 50% being at 422 to the 50% mark being at 1029. In one of the discussions in the comments, a moderator explains that lengths in the 1000 range is normal for that gene, but as far as I know for V3-V4 region the length should be around ~460 bp. Does that mean that I should use 'extract-reads' to trim closer to the 460 mark? But if so, I will be losing so many reads as even the 25% mark (=569) is higher than that. Additionally, the training classifiers tutorial over on qiime docs says that in the case of single-end reads trimming is not recommended: "For classification of paired-end reads and untrimmed single-end reads, we recommend training a classifier on sequences that have been extracted at the appropriate primer sites, but are not trimmed".

I also found this pre-trained classifier, that uses the same primer pair and the Greengenes 99 data base thanks to one of the moderators, but from the same concern that my sequences are only the forward reads and since not both primers are contained, the taxonomic assignment might not work properly I don't know if it's appropriate to use for my data.

So to sum everything up:

I have bacterial dna single end reads using the 341f-805f primers
-I used rescript on silva138.1 database and made a specific classifier for my primer pair
-Because both primers have to be present for the sequence to be included, I decided too enlarge the database using 'extract-seq-segments'
-Sequence count got bigger, but length reaches 1000 (was expecting values around 460 range, trimming is not recommended for single reads)

I would really appreciate it if you could give me some advice as to how to continue from this point as I am quite stuck.

Thank you very much for reading and for the so very helpful forum <3

I look forward to your reply. If anything is unclear, please feel free to let me know. Thank you!

Best regards,
Riva

SoilRotifer · January 31, 2024, 2:43pm

Hi @Riva,

Thank you for your detailed post! I think you are off to a great start!

As I mentioned in the post you linked:

Basically, you are getting leaky length extractions as you iterate.

Which I also reference here, that is:

... the initial PCR primer pair extracts only the amplicon region, the later iterations could expand your reference pool to sequences that contain a longer portion of your marker gene...

As I suggested, in the tutorial, and within these other threads, you can start with a different similarity threshold, and change the thresholds as you iterate. That is, you can:

... consider starting with a 90% cutoff, then increase by 2-5 % after each iteration, e.g. 90, 95, 97, ...

In my experience this has helped to cut down on the "length extension creep" that occurs with this approach. That is, you might extract a slightly longer sequence segment from your initial sequence pool. That sequence segment is retained and then used to extract other potentially longer sequence segments in the next iteration. Which means, you can then extract a yet an even longer segment, again, for the following iteration, and so on. So, it is probably better to perform more iterations at a higher similarity threshold. Then, at the end, you can remove any sequences that are longer than your expected target. I think if you are only adding ~10% more length, I think you should be fine. But this will vary from marker gene to marker gene.

I hope this helps! Please do keep us posted!

SoilRotifer · January 31, 2024, 4:28pm

@Riva I forgot to mention it is perfectly fine to use the entirety of the extracted V3V4 region from SILVA as a classifier for your forward reads. You do not need to make a 'forward read' only classifier. Though you could certainly try...

I would think that the following could work to make a forward-read only classifier: decide on on a truncation length when running DADA2 or deblur on your data. For example, you can set a truncation length of 220 base-pars on your forward read using DADA2 denoise-single.

Then, to make a matching forward read classifier you would run qiime feature-classifier extract-reads... being sure to add the --p-trunc-len 220 flag. Then you can use these as your input sequences for extracting your sequence segments from SILVA to make your reference database. I think this should work. Let us know if you give it a shot.

SoilRotifer · February 1, 2024, 3:46am

@Riva, I should have mentioned this... as we have the benefit of accessing a curated alignment from SILVA, another option would be to simply download aligned SILVA sequences, and import like so:

qiime tools import \
  --input-path SILVA_138.1_SSURef_NR99_tax_silva_full_align_trunc.fasta  \
  --output-path silva-138-1-nr99-aln-rna.qza \
  --type 'FeatureData[AlignedRNASequence]'

qiime rescript reverse-transcribe \
    --i-rna-sequences silva-138-1-nr99-aln-rna.qza \
    --o-dna-sequences silva-138-1-nr99-aln-dna.qza

If you know the alignment positions, you can simply use qiime rescript trim-alignment ... using the --p-position-start and --p-position-end to crop out the section of the SILVA alignment that contains the forward read.

Then you can run qiime rescript degap-seqs ... to remove all the alignment gaps (convert to unaligned sequences for purposes of training your classifier). I'd also use the --p-min-length option and set that to ~150 bases. This way any sequences with a length shorter than that, or might only consist of gaps in that region of the alignment, will be removed.

Basically, my old approach outlined here, should provide a better guide on how figure out the alignment positions etc. (skip to step #7 for the relevant content). Much of this old prototype pipeline can now be performed within QIIME 2 & RESCRIPt.

-Cheers!
-Mike

system · March 3, 2024, 9:47am

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.