Thank YOU for the help and for brainstorming with me!
Here is the artifact from the dataset we discussed previously: https://drive.google.com/drive/folders/1zywQjHyLtP7QxJ053ZGjoClDPJQp6t3t?usp=sharing
Additionally, I have run a couple of tests (as per your suggestions) just to isolate the problem. This time I used the entire dataset (16S and cDNA seqs) with the same primer set, 341F CCTACGGGNGGCWGCAG and 785R GACTACHVGGGTATCTAAKCC.
The different setups are as follows:
- default parameters
- overlap = 10; error rate = 0.2
- overlap = 10; error rate = 0.2; 5’ anchored
- overlap = 10; error rate = 0.2; 5’ anchored; discard untrimmed
- overlap = 10; error rate = 0.2; 5’ anchored; no indels
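To make the five setups unambiguous, here is a rough sketch of how they map onto q2-cutadapt parameters. It assumes paired-end reads trimmed with `qiime cutadapt trim-paired`, with 341F expected on the forward reads and 785R on the reverse reads (for single-end data the primer option would be `--p-front` instead). The `build_args` helper and the `SETUPS` table are hypothetical names used only for illustration, and the input/output artifact arguments are left out.

```python
# Sketch only: maps the five setups listed above onto q2-cutadapt parameters.
# Assumption: paired-end reads, forward primer on R1 and reverse primer on R2.
FWD = "CCTACGGGNGGCWGCAG"      # 341F
REV = "GACTACHVGGGTATCTAAKCC"  # 785R


def build_args(overlap=None, error_rate=None, anchored=False,
               discard_untrimmed=False, no_indels=False):
    """Return the parameter part of a trim-paired call as a list of strings."""
    # In cutadapt, a leading '^' anchors a 5' adapter to the start of the read.
    prefix = "^" if anchored else ""
    args = ["--p-front-f", prefix + FWD, "--p-front-r", prefix + REV]
    if overlap is not None:
        args += ["--p-overlap", str(overlap)]
    if error_rate is not None:
        args += ["--p-error-rate", str(error_rate)]
    if discard_untrimmed:
        args.append("--p-discard-untrimmed")
    if no_indels:
        args.append("--p-no-indels")
    return args


# Setup numbers follow the order of the list above.
SETUPS = {
    1: build_args(),
    2: build_args(overlap=10, error_rate=0.2),
    3: build_args(overlap=10, error_rate=0.2, anchored=True),
    4: build_args(overlap=10, error_rate=0.2, anchored=True,
                  discard_untrimmed=True),
    5: build_args(overlap=10, error_rate=0.2, anchored=True, no_indels=True),
}

if __name__ == "__main__":
    for number, args in SETUPS.items():
        print(f"Setup #{number}: qiime cutadapt trim-paired " + " ".join(args))
```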
Because the resulting artifacts are all very large, I have attached only the trimming logs (see primer-trimming-logs at the Google Drive link), but I can of course provide the artifacts themselves if needed.
Based on preliminary observations, the outputs of Setups #3 and #5 differ only marginally. The --p-no-indels option does appear to restrict trimming to sequences at the target positions, but the output is nevertheless similar for both. The --p-discard-untrimmed option seems to be what causes half of the reads to be lost, but given my lack of experience in this area I cannot quite understand why, even after reading the cutadapt guide several times.
I wonder why the primers were not detected in 50% of the total reads, and what the effect of retaining those untrimmed sequences would be. Perhaps you could help me understand this a little better. I am also keen to learn what you think of the different outputs and which outcome looks more ‘acceptable’ to your more experienced eyes.
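One check I could run to narrow this down is to count how many reads actually begin with each primer. Below is a minimal sketch of that idea, not a reimplementation of cutadapt: it compares only the start of each read against the IUPAC-degenerate primers, counts mismatches (no indels, roughly like an anchored search with --p-no-indels), and allows the same 0.2 error rate. The function names and the `reads` list are hypothetical. If a large share of forward reads turned out to start with 785R, that might point to mixed-orientation reads, but that is only one possible explanation.

```python
from collections import Counter

# IUPAC nucleotide codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

FWD = "CCTACGGGNGGCWGCAG"      # 341F
REV = "GACTACHVGGGTATCTAAKCC"  # 785R


def mismatches(primer, read_start):
    """Count positions where the read base is not allowed by the primer code."""
    return sum(base not in IUPAC[code] for code, base in zip(primer, read_start))


def starts_with_primer(read, primer, max_error_rate=0.2):
    """True if the read begins with the primer within the allowed mismatches."""
    if len(read) < len(primer):
        return False
    allowed = int(max_error_rate * len(primer))
    return mismatches(primer, read[:len(primer)]) <= allowed


def classify(read):
    """Label a read by the primer found at its 5' end, if any."""
    read = read.upper()
    if starts_with_primer(read, FWD):
        return "341F"
    if starts_with_primer(read, REV):
        return "785R"
    return "neither"


def tally(reads):
    """Summarise how many reads start with each primer."""
    return Counter(classify(read) for read in reads)


if __name__ == "__main__":
    # Hypothetical reads for illustration; real input would come from the FASTQ files.
    reads = [
        "CCTACGGGAGGCAGCAGTGGGGAATATTG",
        "GACTACACGGGTATCTAATCCTGTTTGCT",
        "AAAAAAAAAAAAAAAAAAAAAAAAAAAAA",
    ]
    print(tally(reads))
```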
Thank you again for your guidance. As you can probably tell, I need plenty! Looking forward to hearing back from you!