I have some conceptual doubts about which parameters I should choose to optimize a classifier. Unfortunately, my paired-end analysis did not work, so I decided to try using only the R1 reads or only the R2 reads. We amplified and sequenced the V3-V4 region, so I assumed that the R2 reads represent the V4 region, which is widely used in metagenomics.
These are my questions:
1) Is the V3 region represented by the R1 (forward) reads in the amplification of the V3-V4 region? And vice versa, is the V4 region represented by the R2 reads?
2) When I truncate the R2 sequence, am I cutting at the 5' or the 3' end?
Finally, this leads me to want to create a custom classifier, which raises further questions:
3) Where should I cut the reference sequences (on the right or the left of the sequence), given that the query sequences are R2 reads?
4) Is it advisable to truncate the reference sequences to the size of the query sequences?
For example, my reads should be 300 bp long, but the good quality only extends to about 200 bp (in the R2 reads). Could I therefore conclude that the 100 bp of poor quality correspond to the first part of the V4 region (approximately 515-615F), and that I should cut the reference sequences at the beginning, that is, remove about 100 bp from the left?
Your assumptions and first few questions seem reasonable. You will be cutting the 3' end of your R2 sequences: R2 is still read 5' -> 3' by Illumina machines. Since the reads are treated as single-end sequences, you will still just use a trunc command, which cuts each sequence at the 3' end. This brings us to your next few questions, about making a classifier.
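To make the orientation point concrete, here is a minimal Python sketch. It assumes a made-up 20 bp "amplicon" and a hypothetical truncation length of 8; neither comes from your data. It only illustrates that truncating R2 at its 3' end keeps the bases nearest the reverse primer, so the discarded low-quality tail is the part of R2 that reaches furthest back toward the forward primer.

```python
def revcomp(seq: str) -> str:
    """Reverse-complement a DNA string containing only A, C, G, T."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

# Toy amplicon written 5'->3' on the forward strand (made-up sequence).
amplicon = "ACGTTGCAAGGCTTACCGAT"

# R2 is read 5'->3' on the opposite strand, starting at the reverse-primer end.
r2 = revcomp(amplicon)

# Hypothetical truncation length; truncation keeps the 5' start of the read
# and drops everything after position trunc_len (the 3' tail).
trunc_len = 8
r2_truncated = r2[:trunc_len]

# Map the kept bases back to forward-strand orientation.
kept_on_forward_strand = revcomp(r2_truncated)

# The kept bases are the END of the amplicon as written on the forward strand.
print(kept_on_forward_strand == amplicon[-trunc_len:])
```

In other words, relative to a reference written in the forward orientation, the region that survives truncation of R2 sits at the right-hand (reverse-primer) side of the amplicon.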
You will still use feature-classifier extract-reads to extract the reads from your database sequences. Since you are using your reverse reads as single-end reads, I think you would provide your "reverse primer" (probably 806R?) to --p-f-primer and the "forward primer" used in your sequencing to --p-r-primer, and set --p-read-orientation to reverse. You can then set --p-trunc-len 200 to cut your database reads to the same length as your sequencing reads. Finally, pass these sequences and your taxonomic annotation file to one of the fit-classifier methods.
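As a rough illustration of what that extraction is meant to achieve, the following Python sketch trims a toy reference to the part an R2 read would cover. This is not how extract-reads is implemented: the function name simulate_r2_reference, the primers, and the reference are all hypothetical, and the matching is exact, whereas real primers contain degenerate bases and are matched approximately.

```python
from typing import Optional

def revcomp(seq: str) -> str:
    """Reverse-complement a DNA string containing only A, C, G, T."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def simulate_r2_reference(ref: str, fwd_primer: str, rev_primer: str,
                          trunc_len: int) -> Optional[str]:
    """Hypothetical helper: return the region between the primers,
    oriented like an R2 read and truncated at its 3' end."""
    start = ref.find(fwd_primer)
    if start == -1:
        return None
    insert_start = start + len(fwd_primer)
    # On the forward strand the reverse primer appears reverse-complemented.
    end = ref.find(revcomp(rev_primer), insert_start)
    if end == -1:
        return None
    insert = ref[insert_start:end]
    # Flip to R2 orientation, then keep only the first trunc_len bases.
    return revcomp(insert)[:trunc_len]

# Toy data (made up): flank + forward primer + insert + reverse-primer site + flank.
fwd_primer = "ACGTAC"
rev_primer = "TTGACC"
insert = "AAACCCTTTT"
reference = "GG" + fwd_primer + insert + revcomp(rev_primer) + "CC"

trimmed = simulate_r2_reference(reference, fwd_primer, rev_primer, trunc_len=4)
print(trimmed)
```

The point of the sketch is the order of operations: orient the reference region the same way as the reads first, then truncate, so that the reference and the query sequences cover the same stretch next to the reverse primer.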