I have trimmed a dataset with these primers: "The V3–V4 region of 16S rRNA was amplified by PCR with primers 341F 5′‐barcode‐CCTACGGGNGGCWGCAG‐3′ and 785R 5′‐GACTACHVGGGTATCTAATCC‐3′"
However, the output looks really weird... Am I doing something wrong? I wonder if the barcode before the forward primer is the problem (5′‐barcode‐CCTACGGGNGGCWGCAG‐3′), but since the data is already demultiplexed, I don't think it should still be there.
Can you re-run this command using the --verbose option (along the lines of the sketch below)? This will print the cutadapt output to the screen for each sample as it is processed, showing how many sequences had the primers found and removed.
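For reference, the re-run could look roughly like this; the artifact names are placeholders, and I'm assuming the original call used --p-discard-untrimmed:

```
# Hypothetical re-run of the trimming step with --verbose so the per-sample
# cutadapt report is printed to the screen (artifact names are placeholders):
qiime cutadapt trim-paired \
  --i-demultiplexed-sequences demux.qza \
  --p-front-f CCTACGGGNGGCWGCAG \
  --p-front-r GACTACHVGGGTATCTAATCC \
  --p-discard-untrimmed \
  --o-trimmed-sequences trimmed-seqs.qza \
  --verbose
```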
Generally, if the primers are in the sequences, you should see above 90% of the reads being trimmed. If not, then:
double check the primer sequences used (there are several V3V4 primer pairs).
it could be that the sequencing protocol used does not result in reads that contain the primer sequences, meaning you would not need to run cutadapt (a quick check like the one sketched below can help confirm this).
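One quick, hedged way to check for the primers directly in the raw reads is to grep for them with the degenerate IUPAC bases translated into character classes; the file names are placeholders:

```
# Count R1 reads containing the forward primer (N -> ., W -> [AT]):
zcat sample_R1.fastq.gz | grep -c -E "CCTACGGG.GGC[AT]GCAG"

# Count R2 reads containing the reverse primer (H -> [ACT], V -> [ACG]):
zcat sample_R2.fastq.gz | grep -c -E "GACTAC[ACT][ACG]GGGTATCTAATCC"
```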
Thanks for sharing the verbose output @Liviacmg! As I suspected, for quite a few samples nearly 0% of the reads are being trimmed with those primer sequences. For example:
Sample ERR4569889:
=== Summary ===
Total read pairs processed: 65,178
Read 1 with adapter: 14 (0.0%)
Read 2 with adapter: 6 (0.0%)
== Read fate breakdown ==
Pairs that were too short: 78 (0.1%)
Pairs discarded as untrimmed: 65,100 (99.9%)
Pairs written (passing filters): 0 (0.0%)
But some samples, like ERR4569897, have ~99% of their reads trimmed:
=== Summary ===
Total read pairs processed: 62,470
Read 1 with adapter: 62,355 (99.8%)
Read 2 with adapter: 61,947 (99.2%)
== Read fate breakdown ==
Pairs that were too short: 3 (0.0%)
Pairs discarded as untrimmed: 636 (1.0%)
Pairs written (passing filters): 61,831 (99.0%)
I am wondering if there is an inconsistency in how the sequences were generated. Perhaps different primer pairs were used for different variable regions (e.g. V4 vs. V3V4), or some samples were sequenced with a method that does not include the primer sequence as part of the output.
This actually explains why there are two quality dips near the end of the reads prior to cutadapt. I honestly think there are two different amplicons in the data! So after you trim using --p-discard-untrimmed, you only keep the data for the V3V4 sequences that contain the primer. I'm not sure whether the other set is V3V4 without primers, or a different amplicon, perhaps V4?
I am looking at the first sample I mentioned in the GenBank SRA and it looks like 99% of the reads are bacterial reads.
Yeah, this is the study! Jeez... the cutadapt results mirror the grep command, because some samples didn't have the primer pair at all (0 matches), while others had ~60k. Hmm... so you think I should try not using --p-discard-untrimmed?
Or maybe just use the raw sequences without trimming in the subsequent analysis? Or trim only the samples that showed high counts with grep?
In their paper they only describe it briefly in this section: "PCR and Sequencing: The V3–V4 region of 16S rRNA was amplified by PCR with primers 341F 5′‐barcode‐CCTACGGGNGGCWGCAG‐3′ and 785R 5′‐GACTACHVGGGTATCTAATCC‐3′, as previously described.[43] High‐throughput sequencing was performed on an Illumina MiSeq instrument according to the manufacturer's instructions"
I saw that too, but the data does not seem to reflect that. Again, it could be that some of the samples simply do not contain the primer, in which case try not using --p-discard-untrimmed (see the sketch below). I'd still recommend trimming, though; otherwise you'll end up with sequences of different lengths, which would artificially inflate differences between samples with and without the primer sequences.
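A hedged sketch of that variant, i.e. the same trimming call but with --p-discard-untrimmed omitted so untrimmed read pairs are kept (artifact names are placeholders):

```
# Trim primers where present, but keep read pairs in which no primer is found
# (simply omit --p-discard-untrimmed; artifact names are placeholders):
qiime cutadapt trim-paired \
  --i-demultiplexed-sequences demux.qza \
  --p-front-f CCTACGGGNGGCWGCAG \
  --p-front-r GACTACHVGGGTATCTAATCC \
  --o-trimmed-sequences trimmed-keep-untrimmed.qza \
  --verbose
```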
Also, you'd still need to confirm the sequences without the primers are in fact the same gene sequence as those with the primers (i.e. V3V4). Again, I am hoping they are all V3V4, but some with primers and some without. Then generate some beta diversity plots and check for anything out of the ordinary.
I just looked at the sequences manually. It does look like some of the samples were uploaded with the primers already trimmed (or perhaps they were sequenced in such a way that the primers are not sequenced). So, I think simply running without --p-discard-untrimmed would be the first thing to try.
Or... run --p-discard-untrimmed only on those samples which contain the primers. Then merge them with the other samples that do not contain them? Then you can denoise. Either way, you have a few approaches to try.
So, if I run --p-discard-untrimmed only on the samples with the primers, would I have to run cutadapt individually for each sample? And then how would I merge them later? With feature-table merge?
Hmm... that reverse read is bothersome. I think when not using --p-discard-untrimmed it is possible that the forward read gets trimmed but not the reverse, and/or vice versa.
The new figure looks much better. More sequences, and no minimum / median 0s.
Just for the sake of curiosity, I'd try setting your truncation values and denoising. Also, it might be worth considering ignoring the reverse reads and simply working with the forward reads via dada2 denoise-single (sketched below).
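A minimal sketch of the forward-read-only route; the truncation length is a placeholder to be chosen from your quality plots, and the artifact names are illustrative:

```
# Denoise forward reads only; --p-trunc-len should be picked from the quality
# profile (240 is just a placeholder), and file names are illustrative:
qiime dada2 denoise-single \
  --i-demultiplexed-seqs trimmed-seqs.qza \
  --p-trunc-len 240 \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats denoising-stats.qza
```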
I'd simply make two different manifest files: one for the samples with the primer (as suggested by your grep output), and another manifest for the rest (see the sketch below). Then run cutadapt with --p-discard-untrimmed on the reads with the primer. Then, just for the sake of sanity, run cutadapt on the sequences without the primers but without --p-discard-untrimmed, just in case there are spurious primers in there somewhere. Or you can skip running cutadapt on the reads without primers.
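A minimal sketch of the two-manifest approach, assuming the PairedEndFastqManifestPhred33V2 format; the sample ID is taken from the thread but the file paths are placeholders:

```
# with-primer-manifest.tsv (tab-separated; a second manifest would list the
# samples without the primer):
#   sample-id    forward-absolute-filepath      reverse-absolute-filepath
#   ERR4569897   /data/ERR4569897_1.fastq.gz    /data/ERR4569897_2.fastq.gz

qiime tools import \
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path with-primer-manifest.tsv \
  --input-format PairedEndFastqManifestPhred33V2 \
  --output-path with-primer-demux.qza
```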
I do not think there is a way to simply merge sample sequence data after importing. The only point at which the data can be merged is post-denoising. I think it'd be reasonable to denoise these two sets separately and then merge the tables and representative sequences (sketched below). As far as I know, as long as you have a sufficient number of samples for dada2 to train on, you should be fine. But, again... we really know nothing about how or why these samples were processed differently. They may not cluster appropriately on PCoA plots, etc. Have you contacted the authors for clarification on this?
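The post-denoising merge could look along these lines; the artifact names are placeholders for the two separately denoised sets:

```
# Merge the two denoised feature tables and their representative sequences
# (artifact names are placeholders):
qiime feature-table merge \
  --i-tables with-primer-table.qza without-primer-table.qza \
  --o-merged-table merged-table.qza

qiime feature-table merge-seqs \
  --i-data with-primer-rep-seqs.qza without-primer-rep-seqs.qza \
  --o-merged-data merged-rep-seqs.qza
```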
I tried doing this (separating reads with primer and without primer, importing with manifest files, running cutadapt as you said) and the reverse reads with the primer looked weird. Is it maybe because I used --p-adapter when I should have used --p-front? I did a test and I think this was the problem.
(Also, I did the same for the reads without primers and they looked OK; and I tried to email the authors but didn't get any answer.)
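For reference, in q2-cutadapt the 5′ primers are given with --p-front-f / --p-front-r, whereas --p-adapter-f / --p-adapter-r are for sequences expected at the 3′ end of the reads. A corrected call might look roughly like this (artifact names are placeholders):

```
# Remove 5' primers from both reads of the with-primer set
# (artifact names are placeholders):
qiime cutadapt trim-paired \
  --i-demultiplexed-sequences with-primer-demux.qza \
  --p-front-f CCTACGGGNGGCWGCAG \
  --p-front-r GACTACHVGGGTATCTAATCC \
  --p-discard-untrimmed \
  --o-trimmed-sequences with-primer-trimmed.qza
```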
After running DADA2 on the forward reads with primer, 49 samples didn't pass the filter and had a non-chimeric count of 0 (I think they were excluded), but the others produced an output like this:
I wonder if I should exclude the samples with a low percentage of input non-chimeric (roughly 10 to 30%), and whether that is the best strategy... or whether, by excluding these samples, merging with the other DADA2 output would result in some kind of error because there will be a lot of missing samples.
Thanks for sharing these @Liviacmg. For the samples in which you're losing more sequences, the loss appears to come proportionally more from quality filtering than from chimera removal: the drop from denoised to non-chimeric is not as large as the drop at the quality-filtering step. This could probably be helped by truncating a little more (the per-sample stats can be inspected as sketched below).
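One way to inspect those per-sample columns, assuming the denoising stats artifact name below, is to tabulate them:

```
# View the DADA2 per-sample stats (input, filtered, denoised, non-chimeric)
# as a sortable table; the artifact name is a placeholder:
qiime metadata tabulate \
  --m-input-file denoising-stats.qza \
  --o-visualization denoising-stats.qzv
```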
Another option is to adjust --p-min-fold-parent-over-abundance to 8 or 16; I wouldn't suggest going any higher than that. This is based on the following:
Wow, it improved so much with --p-min-fold-parent-over-abundance set to 8 :0 I'm impressed!! I didn't know about this parameter!
But still, 49 samples didn't pass the filter and had 0 or very low values in the input/filtered/percentage of input passed filter/denoised/non-chimeric columns, which I think is consistent with the trimmed_sequences.qza, since a lot of samples seemed to be removed there. Do you think that would result in an error when trying to merge the tables after running DADA2 on the other reads? I'm thinking this could be a problem.
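In case it helps, one hedged option before the merge sketched earlier is to drop the samples that ended up effectively empty; the artifact names are placeholders, and --p-min-frequency 1 simply removes samples with zero retained features:

```
# Drop samples with no retained features before merging the two denoised sets
# (artifact names are placeholders):
qiime feature-table filter-samples \
  --i-table with-primer-table.qza \
  --p-min-frequency 1 \
  --o-filtered-table with-primer-table-nonempty.qza
```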