The usual recommendation is to try relaxing the truncation parameters; however, I have tried that and it is still not working. Here is some info about what I've been doing:
My samples are imported and the .qzv is here: samples.qzv (311.3 KB)
You will see that the first 12-13 nts are of bad quality, so I was trimming them.
Then I decided to keep the whole read length and truncate at the last nts, 150-151.
Have you removed the primer sequences from the reads in samples.qza? If you haven't, this could explain the poor quality in that region, and you will want to remove them anyway.
Secondly, what amplicon are you using? In particular, what is its length?
Trim your primers using the cutadapt plugin (I prefer to discard untrimmed sequences), and then remake the visualisation to select new trim parameters.
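For example, the cutadapt step and the new visualisation could look roughly like this (a sketch only; FWD_PRIMER and REV_PRIMER are placeholders for your actual primer sequences, and the output file names are just examples):

```
# Sketch: remove primers from demultiplexed paired-end reads and discard
# read pairs in which the primer was not found.
qiime cutadapt trim-paired \
  --i-demultiplexed-sequences samples.qza \
  --p-front-f FWD_PRIMER \
  --p-front-r REV_PRIMER \
  --p-discard-untrimmed \
  --o-trimmed-sequences trimmed.qza

# Rebuild the quality visualisation from the primer-trimmed reads.
qiime demux summarize \
  --i-data trimmed.qza \
  --o-visualization trimmed.qzv
```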
Selecting the truncation length is a balance: too short and you will lose amplicon reads that are too long; too long and you may include poor-quality bases that will cause the reads to be discarded anyway. Notice how you have a massive drop in quality at around 130 bp? You will definitely want to truncate before this drop, as many of the reads will otherwise be discarded due to poor quality.
If you simply do not have long enough high-quality sequences, you could use just the reverse reads, as their quality is better.
At first I was using cutadapt to trim the primers. The thing is that they could not tell me exactly which primers were used. The genomics platform told me that they used the Nextera kit and sent me the documentation, but I don't understand exactly what I should be removing:
I'm not clear on which sequences I should be trimming... I tried the first parts of the Index 1 (i7) and Index 2 (i5) adapters, but it did not work, and the samples imported into QIIME 2 did not have any reads left in them.
That is why I decided to import the sequences without a prior cutadapt step and just trim the reads based on the quality score graph.
The total length of the amplicon is 150.
I've tried other options and what I see is the following:
I need to trim the first 13 nts, which are probably the primers, so I always trim.
If I truncate the reads at a length < 150, no error is shown (in this case I should be truncating before 130, as you said).
If I decide to truncate at 150, so the whole amplicon is maintained, I get the following error:
```
Error in isBimeraDenovoTable(unqs[[i]], ..., verbose = verbose) :
  Input must be a valid sequence table.
Calls: removeBimeraDenovo -> isBimeraDenovoTable
```
Why is this?
It seems that the best option would be to trim at 13 and truncate at 130... However, the % of reads passing the filter is really low.
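To make sure we are talking about the same parameters, this is roughly how I understand those values map onto denoise-paired (a sketch; I am assuming paired-end denoising with the same trim/trunc values in both directions, and the file names are placeholders):

```
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs samples.qza \
  --p-trim-left-f 13 --p-trim-left-r 13 \
  --p-trunc-len-f 130 --p-trunc-len-r 130 \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats stats.qza
```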
The quality drop at the beginning of your sequences isn't significant enough to warrant trimming.
The thing is that they could not tell me exactly which primers were used.
I'm guessing that "they" refers to the sequencing center that you used. If another party performed the 16S amplification and then sent the amplicons to the sequencing center, then the latter wouldn't know which 16S primers you used. Which 16S primers were used for your amplicons is what needs to be figured out and then used for trimming. Trimming with these primers will allow you to ignore the Illumina adapters. (I'm assuming that these reads are indeed 16S amplicons.)
Yes, exactly! I'm trying to contact the sequencing center to get the exact primers that were used... However, I have a question: what happens if I use cutadapt and provide the wrong sequences as primers? I have tried running cutadapt with 2 "random" primers (which I thought could be the primers that were actually used), and my samples ended up with no reads. Is that normal?
Yes, I understood. The thing is that our genomics platform is the same one that performs both the 16S amplification and the sequencing.
They finally gave me the correct primer sequences.
However, I still have a huge loss of reads per sample (there are almost no non-chimeric reads). I have also checked the samples with FastQC and the % of duplicate reads is enormous...
From looking at your DADA2 stats I can see that the problem isn't the chimera filter but the merging step. Almost none of your reads are merging. Do you know how long your amplicon is or which hypervariable region you're targeting? I also see that you chose truncation lengths of 112 forward and 128 reverse. According to the quality plot you posted earlier, these are far too aggressive and are most likely what's keeping your reads from merging.
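As a rough sanity check (assuming DADA2's default minimum overlap of 12 nt): for paired reads to merge you need roughly trunc-len-f + trunc-len-r ≥ amplicon length + 12. With 112 + 128 = 240 nt retained, any amplicon longer than about 240 − 12 = 228 nt cannot merge at all.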
The gene-specific sequences used in this protocol target the 16S V3 and V4 region. They are selected from the Klindworth et al. publication (Klindworth A, Pruesse E, Schweer T, Peplies J, Quast C, et al. (2013) Evaluation of general 16S ribosomal RNA gene PCR primers for classical and next-generation sequencing-based diversity studies. Nucleic Acids Res 41(1).) as the most promising bacterial primer pair. Illumina adapter overhang nucleotide sequences are added to the gene-specific sequences. The full-length primer sequences, using standard IUPAC nucleotide nomenclature, to follow the protocol targeting this region are:
16S Amplicon PCR Forward Primer = 5' TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGCCTACGGGNGGCWGCAG
16S Amplicon PCR Reverse Primer = 5' GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGGACTACHVGGGTATCTAATCC
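In case it helps, I believe the gene-specific parts are the portions after the Illumina overhangs in the sequences above, i.e. CCTACGGGNGGCWGCAG (forward) and GACTACHVGGGTATCTAATCC (reverse), so the cutadapt step would look roughly like this (a sketch; please correct me if I have split these incorrectly, and the file names are placeholders):

```
# Sketch: trim only the gene-specific primer portions from the 5' end of the reads.
qiime cutadapt trim-paired \
  --i-demultiplexed-sequences samples.qza \
  --p-front-f CCTACGGGNGGCWGCAG \
  --p-front-r GACTACHVGGGTATCTAATCC \
  --p-discard-untrimmed \
  --o-trimmed-sequences primer-trimmed.qza
```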
This is actually a first look at this first sequencing run that they have performed, so we can decide whether we should use a longer amplicon or whether the followed protocol is correct.
Regarding the truncation lengths, which ones would you recommend I use?
I am not sure if this could have an effect, but I am comparing, in the same run, samples that were amplified with different polymerases: AmpliTaq Gold was used for 2 samples and Kappa for the other 2.
The analysis of these results and the comparison was aimed at deciding which polymerase would be better.
If your amplicon is 300 bp on average, then two 150 bp reads are not going to merge, because at least 12 bp of overlap (by default) are needed between the two reads.
To have any shot at merging, you'll have to do basically no truncation. I would try a run with this approach and see what happens.
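Something along these lines (a sketch; 0 disables truncation in denoise-paired, and the file names are placeholders):

```
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs primer-trimmed.qza \
  --p-trim-left-f 0 --p-trim-left-r 0 \
  --p-trunc-len-f 0 --p-trunc-len-r 0 \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats stats.qza

# Then check the merging numbers in the denoising stats.
qiime metadata tabulate \
  --m-input-file stats.qza \
  --o-visualization stats.qzv
```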
I even tried without trimming the primers or truncating the reads, just to see what happens: stats-dada2-paired.qzv (1.2 MB)
Some merged values increase... but I still lose most of my samples...
What could we do now? Is there any technical approach to continue analyzing these sequencing files, or should we move to another approach and sequence the samples again?
You can move forward with just one read direction, using denoise-single. If you need full coverage of this amplicon, unfortunately it looks like you would need to resequence with longer read lengths.
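A single-direction run could look roughly like this (a sketch; denoise-single on a paired-end artifact uses only the forward reads, so to use only the reverse reads you would import them separately as single-end data; the truncation length here is a placeholder to be chosen from your quality plot):

```
qiime dada2 denoise-single \
  --i-demultiplexed-seqs samples.qza \
  --p-trim-left 0 \
  --p-trunc-len 130 \
  --o-table table-single.qza \
  --o-representative-sequences rep-seqs-single.qza \
  --o-denoising-stats stats-single.qza
```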