According with the facility, the primers and barcodes have been removed already. Looking at the overrepressented sequences via fastqc, these does not seem to contain primers.
Does anyone have any idea what might be happening here?
Hello,
Could you try to remove biological primers with cutadapt and enabled option "discard untrimmed"?
If your output will contain only few reads/be emty, that means that primers indeed were removed already. If you will get the output that is comparable with input, but with better quality profile at the beginning of the reads, that means that primers were still in sequence.
I would expect higher number of reads discarded (closer to 90%) if indeed the primers had been previously removed. With ~80% in some samples, I think the primers were indeed already removed but a bit unsure if there is some problem here (maybe it's an expectation management issue).
Nonetheless, if indeed the primers have been previously removed, as it seems, what could justify these quality drops in the beginning?
If your reads are in mixed orientation then you'd run as outlined here with the primers and reverse complement for the --p-front-* flags. Most do not need to run cutadapt in this say, so the first example should work.
But, based on the obtained results, it seems to me that the primers were indeed removed. So, any idea why would I have these quality drops in the beginning? or maybe what caused them?
Can you run --verbose on that cutadapt command? If the majority of the reads are being trimmed then the primer is still there. If most reads are discarded, then you're correct, and the primers have been removed. I've seen the drop at the beginning of some reads, but not sure why they occur, I usually just trim those first few bases off.
Also, keep in mind, the sequencing technicians will often confuse "illumina sequencing primer" with "amplicon primer". So, they may have removed the illumina sequencing primer, but not your amplicon primer. Some methods will read through the amplicon primer, and you will need to run cutadapt, other methods do not read through the amplicon primer, and you do not need to use cutadapt.
Here is why:
You can see in the output for the first sample, that ATTAGAWACCCBNGTAGTCC and TTACCGCGGCKGCTGRCAC is only trimmed 6 and 22 times respectively. Whereas GTGYCAGCMGCCGCGGTAA and GGACTACNVGGGTWTCTAAT are trimmed 13,503 and 18,524 times respectively. Indicating the primers have not been removed.
Note, I think --p-discard-untrimmed will discard any sequences without all 4 adapters being found. There should be no reverse compliments in your data, again, unless they are mixed reads.
So, you should keep more data with only using the --p-front-* flags.
There were two additional lines missing from the code snippet I put earlier but, seem to be determinant for the trimming - --wild-cards for both reads and adapters.
So, I decided to remove the --adapters parameters, as suggested by @SoilRotifer and performed 2 different analysis:
One without the --wild-cards argument - meaning N's are not considered for the matching: Reads_full_cutadapt_log_norev.txt (550.9 KB)
With the --wild-cards, the first sample gets the 13,503 forward and 18,529 reverse events with adapters that were trimmed. However, there is still 77% of the reads discarded due to not be trimmed - is this still a valid indication that primers have yet to be removed?
It could also been that cutadapt was previously run but, without --wild-cards arguments, which would mean the reads we got from the service were lacking trimming of reads with N's in the primer areas. However, this would mean we would have variability in the sequence size (properly trimmed and untrimmed), which we don't have...
Maybe I am overthinking this but, the quality profiles seem a bit odd, specially the reverse - the drops in quality go up to 25 bp's!
Interesting. I wonder if your reads are in fact in mixed orientation? Perhaps try the approach I linked where you use the --p-front-* flags with the primer and its reverse compliment?
I wonder if the low quality scores at the beginning are the culprit? Perhaps alter your primer sequence for the forward read to, YCAGCMGCCGCGGTAA, and the reverse to TACNVGGGTWTCTAAT.
If neither of these work, then you can simply use the trim / truncation options from DADA2 / deblur, and forgo the cutadapt trimming?
The results are rather similar, mainly because the reads with the reverse complements are residual (~ tens of reads).
For me, that's the final alternative, no doubt. But I am a bit afraid of starting to cut in the beginning and that having a negative impact in the overlap of reads (Reverse would already be around 26 bp).
Finally, as an additional information, I have just gotten the information these samples were sequenced in a NextSeq 2000 machine.
It was worth a shot ... The other option would be to change the cutadapt --p-error-rate to a higher value. But that might lead to spurious trimming.
Trimming at the 5' (beginning) end will not have an effect on merging. It is the trimming at the 3' end that will have an effect, as that is the region that needs to overlap with the other read for merging.
If cutadapt is not working I'd suggest simply trimming 10 bases of the 5' end of the forward read, and ~40 bases of the 5' end of the reverse read, in addition to any 3' trimming you think is appropriate. Then denoise / merge. If that does not work, then you may simply need to use only the forward reads, which is also a common approach.
I was also wondering - with this quality profile, and considering your trimming suggestions, doesn't this also means that these results may hinder the use of this run with data from other runs? (this is an ongoing project, so we will receive samples continuously).
I am asking this since we will have shorter ASV's than for other runs with better quality, and that may impact diversity and abundance metrics no?
Sadly yes, as this will affect how the ASVs are generated. ideally you need to process each run similarly. Worst case scenario would be to re-prep / sequence these samples, or discard them all together.
Yes these will impact your comparisons, you'll not be able to combine this data with other data as they will be "seen" as very different ASVs. Essentially, behave as if they are different amplicons, and result in these communities looking very different from everything else. Do you have an example of other runs that worked? If so, what settings were used for those samples?