Different taxonomic classification: with or without primers trimming

Hey, I am seeking a little guidance in understanding, if I am doing anything wrong while trimming primers using q2 cutadapt trim-paired command.

A brief overview, I am using qiime2-2018.11. I sequenced 16S rRNA (V3-V4 region using Pro341F/Pro805R primer, degenerate primers). Sequences were already demultiplexed when provided by the sequencing facility. So, I imported them using Casava 1.8 paired-end demultiplexed fastqformat. Then,

I followed 3 different routes just to get an idea how accurate I am proceeding with my data analysis

  1. In First route, I ran DADA2 without trimming primers, followed by qiime feature-classifier extract-reads command to remove primer sequences after DADA2 from my repseq prior to taxonomic analysis

qiime dada2 denoise-paired
–i-demultiplexed-seqs demux-paired-end_805.qza
–p-trim-left-f 0
–p-trim-left-r 0
–p-trunc-len-f 284
–p-trunc-len-r 235
–o-table featuretable_primer.qza
–o-representative-sequences repseqs_primer.qza
–o-denoising-stats denoisingstats_primer.qza

qiime feature-classifier extract-reads
–i-sequences repseqs_primer.qza
–p-f-primer CCTACGGGNBGCASCAG
–p-r-primer GACTACNVGGGTATCTAATCC
–o-reads rep-seqs_extracted.qza

qiime feature-table filter-features
–i-table featuretable_primer.qza
–m-metadata-file rep-seqs_extracted.qza
–o-filtered-table table_extracted_Filetered.qza

  1. In second route, I ran q2 cutadapt trim-paired command, to remove primer, followed by DADA2 step (using same parameters as mentioned above to have fair comparison between both routes)

qiime cutadapt trim-paired
–i-demultiplexed-sequences demux-paired-end_805.qza
–p-front-f CCTACGGGNBGCASCAG
–p-front-r GACTACNVGGGTATCTAATCC
–o-trimmed-sequences Cutadapt_regular.qza
–verbose

  1. In 3rd route, I ran DADA2 by specifying, --p-trim-left-f and --p-trim-left-r parameters to trim/remove primers

qiime dada2 denoise-paired
–i-demultiplexed-seqs demux-paired-end_805.qza
–p-trim-left-f 17
–p-trim-left-r 21
–p-trunc-len-f 284
–p-trunc-len-r 235
–o-table featuretable_primercut_DADA.qza
–o-representative-sequences repseqs_primer_cut_DADA.qza
–o-denoising-stats denoisingstats_primer_cut_DADA.qza

For taxanomic analysis, I used

qiime feature-classifier classify-sklearn
–i-classifier gg-13-8-99-nb-classifier.qza
–i-reads repseqs_primer.qza
–o-classification taxonomy_primer.qza

So, after comparing DADA2 output and taxa-bar-plots of these 3 routes, I got very strange results:

  1. Route 1, resulted in loss of almost 80-85% reads as Chimeras during DADA2 step (which is probably due to the presence of primers), However, filtration, denoising and merging looks fine.
Route 1 Roue 2 Route 3
With primers Primer cut

(after DADA2 with extract-reads)|Primer cut (Cutadapt)|Primer cut (DADA2)|
|input|55444|55444|55444|55444|
|filtered|43924|Cannot have this information as extract-reads step was done after DADA2|35430|44046|
|denoised|43924|35430|44046|
|merged|29642|26575|32310|
|non-chimeric|7113|20001|22963|

  1. Route 2 and 3 resulted in similar taxonomic classification, but, completely opposite to route 1

Relative Abundance
Route 1 Roue 2 Route 3
Phylum With primers Primer cut

(after DADA2 with extract-reads) Primer cut (Cutadapt) Primer cut (DADA2)
Proteobacteria 23 % 23 % 54 % 51 %
Bacteroidetes 31 % 31 % 18 % 20 %
Firmicutes 31 % 31 % 19 % 18 %

So, by looking at this comparison, I think, it is useless to remove primers after DADA2!! (It’s better if we do it before), but on the other hand I am not sure, if removing primers with either Cutadapt or directly through DADA2 results in best results as I am not highly convinced with such huge differences in taxonomy with/without primers. Can you help me in understanding, why there are such huge differences!

I will be happy to share more results, if needed.

Hi @NidaAmin,
Welcome to the forum!
You should certainly remove primers from your reads BEFORE dada2. This is in fact a requirement of this pipeline. The choice of doing it either directly with --p-trim within dada2 itself or using cutadapt is up to you and dependent on your set up. For example if the primers are all the same length then you can simply use trim in dada2, but if they are of variable length then cutadapt is your better option. You should also double check to make sure that these have not already been removed by your facility, a lot of times core facilities remove the primers during the demultiplexing step as well.

The feature-classifier extract-reads plugin you used is not meant to remove primers at all. This tool is used to extract a specific region you are using (i.e v3-v4) from a reference database (ex. SILVA or greengenes) of full 16S reads. This is then used to train a classifier that is specific to your reads and finally you use that to assign taxonomy to your reads. Have a closer read through the tutorial here on how to train a classifier and this diagram that summarize it nicely as well.

Good luck!

1 Like

Thanks @Mehrbod_Estaki for your quick response.

I did understand the things you have already mentioned. About the feature-classifier extract-reads plugin, I used this to remove primers by following this discussion on qiime2 forum Separating two different amplicons from demultiplexed data

But, thanks for correcting me, (This plugin is meant to be used for another purpose (extracting specific region and training classifier etc)

Now my major concern is such high variability in the relative abundance of phylum (With or without primers trimming). As far as, i have understood Cutadapt worked perfectly fine as it detected 98-99% of primer sequences in my reads and removed them with (93.7% efficiency, which is great). But, when i try to compare final taxa-bar-plots of with primers and without primers (Cutadapt method), As you can see here there are huge difference in percentage of phylum median relative abundance

With Primers: p__Proteobacteria (23%), p__Bacteroidetes (31%), p__Firmicutes (31%)
Without Primers (Cutadapt): p__Proteobacteria (54%), p__Bacteroidetes (18%), p__Firmicutes (19%)

So, i am kind of scared now, as normally people say that there are not so many differences with/without primers and removing primers can only helps you to have better low taxonomic classifications (you can classify until the level of genus or even species sometimes), Correct me please if i am wrong!!.

Moreover, if i try to compare my results with other studies, its very hard to do comparison as Cutadapt method is quite new and normally 70-90% of already published data didn’t remove primers. So, if i compare my with primers results with reference studies its quite similar to them. But, if i try to be very accurate and use Cutadapt to remove primers, I know i am following right path, but on the other hands my results look strange. What do you think; Am i doing any mistake in my analysis or its normal to have at least some differences in final taxonomy with/without primers.

Any help from your side will be highly appreciated.

Hi and welcome in the forum!

If I can add my cent, I think that your choice of keeping the same trimming settings for dada2 to compare with and without trimming is a bit unfair toward the with-trimming route. After the trimming the sequences are shorter, so you may end up with less sequences passing the following dada2 filtering step. I would see the quality profile for the reads after the primer trimming to chose the specific length. Otherwise, you may disable the length trimming and use the quality score filter only in both cases.

Luca

1 Like

Hi @llenzi, Thanks for the input on my query.

I did lot of comparisons already using same as well as different trimming settings for DADA2 (before and after trimming primers, depending on the reads quality profile). (sorry, here, i just showed the results with the same trimming lengths to make things easier and clear to understand).

But, the fact is that, trimming length has not much effect on the final output in my case. The major effect is with/without primers as both cases are resulting in different taxonomic classification (irrespective of the trimming lengths; as i have already tried best trimming parameters separately for with/without primer cases).

Hi,
ok, it is kind of reassuring for me to know that the trimming length has not much effect on the final results (still in my mind should affect the filtering and the merging steps so I am a bit buffed by that result …).
Do you have any positive control/mock community in your dataset? Did you try to use a more recent version of qiime2 (although I don’t expect much on this … )

Luca

Hi @llenzi,

I didn’t have positive control in my dataset (i just have negative controls and they are getting effected in the same way as other samples with/without primer trimming).

I didn’t try to use the recent version of qiime2 (as i am using 2018.11 which is not too old i guess). But, if you think its worth trying the latest version. I can give it a try as-well. Waiting for more suggestion on this issue…

Nida

Hi @NidaAmin,
Those findings are unsettling for sure if true. Let's do some more digging.
Do your reads contain non-biological sequences such as overhang adapters, spacer, etc? The methods you currently have set up will search for the locus-specific portion of your primers in your reads and remove them and all preceding sequences. This is what we want, because those non-biological sequences will just lead to spurious ASVs and will have poor taxonomic assignment because well they are not real. How do your unassigned taxa look across your 3 scenarios?
If you do have those non-biological sequences then when you are using dada2 to trim, you're not really removing primers and in fact probably will have left some non-biological sequences in. This can certainly be the reason behind the differences you see. If this isn't the case with your data, we would need to look at the actual data and the exact commands you've performed to find the source of this variation. You can post your final artifacts and we could follow their provenance to look for potential issues.

The original cutadapt paper is from 2011 so it's not all that new! And I'm not sure about the stat you are reporting...where is that from exactly?
Eitherway, removing non-biological sequences is a must but the debate as to whether keep the loci-specific portion of the primers in your reads has some merit. I can't remember the source from the top of my head but I do believe removing those redundant primer sequences has been shown to slightly improve classification.

As for some of the other points:

  1. Your trimming of the 5' sequences in dada2 shouldn't really affect the filtering and merging steps unless the quality score in those initial sequences are particularly bad which would cause DADA2 to discard them. Keeping the same truncating parameters however is key in these comparisons, so the way you've done is good.
  2. Using the newest version of Qiime2 is always recommended. Dada2 has had performance upgrades in the most recent version decreasing processing time by multiple folds! And the classifier has also been trained using a newer version of scikit bio. That being said, I don't think the differences would be quite as large as you are seeing. Something else must be driving your results.

Hi @Mehrbod_Estaki

That makes me wonder on my datasets, I usually get a worst looking quality profile after cutadapt PCR primer trimming because, I think, the resulting sequences are out-of-sync.
As if the primer would not be always at the same position.
But that is on a different thread so leave it here!

Thanks
Luca

Hi @Mehrbod_Estaki, Sorry for the delayed response i was collecting all the necessary information to make things more clearer.

I talked to the sequencing facility, and they told me that my sequences were dual indexed for sequencing, demultiplexed by the sequencing facility and they have already removed the Adapters and Indexes during demultiplexing step and the only thing that I have in my raw reads is 16S primers at the beginning of the reads (It is to be noted that the primers I used were degenerate ones).

I am attaching the final taxa-bar-plots, I didn’t have any unassigned taxa (with all 3 routes).

Since, as I have mentioned, I didn’t have overhang adapters, indexes etc, and just have primer sequences in the start of RAW reads. So, it makes sense that if I don’t remove primers from my raw reads prior to DADA2 step, I am introducing biasness in my analysis by following (route 1), where I did directly DADA2 using just –p-trunc-len parameters and It will not remove primers, and will results in poor filtrations, denoising, and merging and may remove many sequences as chimeras, and may also result in repseq with primers sequences inside as I observed (may be that is why I am having low non-chimeric sequences with route 1 in comparison to route 2 and 3)!. What do you think if I am interpreting it in a right way or as per your experience (presence of primers should not need to have such high effects on the denoising stats !!!)

Since, i am quite amazed to see such huge differences between (with/without primer routes), I would like to get an idea what is triggering such huge differences....... So, let dig deeper into the issue, I am pasting all the commands that I used along with their artifacts

  1. Sequences were imported in Casava 1.8 paired-end demultiplexed fastq format. Then, I followed 3 routes of analysis:

With primers (no primer removal step was involved in this route)

qiime dada2 denoise-paired
--i-demultiplexed-seqs demux-paired-end_805.qza
--p-trim-left-f 0
--p-trim-left-r 0
--p-trunc-len-f 284
--p-trunc-len-r 235
--o-table featuretable_primer.qza
--o-representative-sequences repseqs_primer.qza
--o-denoising-stats denoisingstats_primer.qza

Primers cut in DADA2 step

qiime dada2 denoise-paired
--i-demultiplexed-seqs demux-paired-end_805.qza
--p-trim-left-f 17
--p-trim-left-r 21
--p-trunc-len-f 284
--p-trunc-len-r 235
--o-table featuretable_primercut_DADA.qza
--o-representative-sequences repseqs_primer_cut_DADA.qza
--o-denoising-stats denoisingstats_primer_cut_DADA.qza

Primers cut with cutadapt followed by DADA2

qiime cutadapt trim-paired
--i-demultiplexed-sequences demux-paired-end_805.qza
--p-front-f CCTACGGGNBGCASCAG
--p-front-r GACTACNVGGGTATCTAATCC
--o-trimmed-sequences Cutadapt_regular.qza
--verbose
Cutadapt_output.txt (2.6 KB) Cutadapt.qzv (278.6 KB) demux-paired-end_805.qzv (273.3 KB)

Followed by DADA2

qiime dada2 denoise-paired
--i-demultiplexed-seqs Cutadapt_regular.qza
--p-trim-left-f 0
--p-trim-left-r 0
--p-trunc-len-f 284
--p-trunc-len-r 235
--o-table featuretable_primercut_cutadapt_R.qza
--o-representative-sequences repseqs_primercut_cutadapt_R.qza
--o-denoising-stats denoisingstats_primercut_cutadapt_R.qza

Then, files from all three routes were processed for taxonomic analysis by following command

qiime feature-classifier classify-sklearn
--i-classifier gg-13-8-99-nb-classifier.qza
--i-reads repseqs_primer.qza
--o-classification taxonomy_primer.qza

qiime taxa barplot
--i-table featuretable_primer.qza
--i-taxonomy taxonomy_primer.qza
--m-metadata-file metadata.txt
--o-visualization taxa-bar-plots_primer.qzv

Here, I showed results of analysis done by using only 1 sample and I used same trimming length in DADA2 and used greengenes as reference database for taxonomic classification. But, side by side, I have also done comparisons using all my samples (including negative control), used DADA2 parameters best fit for each route depending on quality plots and I also tried to use SILVA database as reference taxonomy as well. But, the differences that you can see in the attached files, they stayed same in each way of analysis. If you want, I can send you the raw sequence file (that I have used in this analysis)!.

Sorry for the long post, but i hope that i can understand if it is really normal to have such big differences in taxonomy with/without trimming primer... Thanking in advance for the patience.

NIDA

1 Like

Thanks for sharing extensive data, @NidaAmin!

To be frank, it looks like you have solved this in a sense, and I don't think you need to dig too deep!

You are seeing such massive differences when primers are not trimmed because:

  1. you are losing a very large proportion of reads at joining (merging) — this actually does not look great with any of them, but appears particularly bad when primers are not trimmed (no clue why that should have an impact on read overlap at the 3' ends). Honestly, I would worry a bit about the read joining with all of them, and you should perhaps be paying closer attention to your truncation parameters. Losing this many reads at joining indicates that you are selecting for shorter amplicons, systematically excluding specific taxa that have longer amplicons.
  2. You are getting phenomenal read loss at chimera checking. I am also not sure why having primers in the reads would cause this, but it clearly does.

So route 1 looks very different from routes 2 + 3 (which are similar enough), quite simply because having the primers in the reads (for whatever reason) is causing chaos and the outputs are severely skewed. There is no need to ponder any longer which is the correct route: route 1 is clearly bad and flawed, routes 2 + 3 give similar results and both look decent (route 2 looks better based on your denoising stats results).

Why having the primers in there is causing chaos for you (this is abnormal; we do not usually see impacts like this) is an interesting question, and something perhaps the dada2 developers should tinker with; but for your purposes you have clearly shown why route 1 is a bad approach and that the results are skewed.

I think the real issue you should be looking at is optimizing your truncation parameters so that you do not lose so many reads at merging. It would be better to lose reads at the filtering step — and the taxonomic compositions from those optimizations would be interesting to compare. Please open a new topic if you run into problems with that optimization!

Otherwise I'd say pick route 2 and don't look back...

Thanks!

1 Like

Hi @NidaAmin,

Thanks for sharing and as @Nicholas_Bokulich said it is a huge works!
There is only one thing on I am suspicious but it may be only my mind,
I am looking at the ‘Cutadap.qzv’, there are still few sequences of 300bp, I am not sure if this is right. I would think that the sequences should contain the PCR primer am I right? If so, they should be trimmed. So I am not sure what are the 300bp sequences. I would test cutadapt using the ‘–p-discard-untrimmed’ to see if there are weird sequences in the set.
Just to double check, can you confirm that the sequences orientation is not mixed?

For the rest, I agree with Nicholas, the primer removing route is the way to go, I usually used your number 3 but I think it does not really matter between 2 or 3 as long as you treat all the samples at the same way!

Luca

2 Likes

Hi @NidaAmin,
Just to echo the others, thanks for the extensive data and investigation before posting! This is a cool mystery indeed.
I am going to agree with what @Nicholas_Bokulich said but I just wanted to offer a potential explanation as why Route 1 is acting so iffy.

Route 2 and 3 are very similar, which is great. The differences between those two probably lies in the fact that route 2 (cutadapt) had many more reads filtered initially. It's hard to pinpoint why but probably has something to do with reads remaining after cutadapt removes primers. Length summary from those reads might shed some light on the matter. As @llenzi hinted and your cutadapt summary shows there wasn't 100% efficiency so some small percentage of your reads either had no expected primer sequences in them, they may be too short, or too low quality tails to have their primers called properly. When you look at the tail end of your cutadapt (route2) quality plots you can see those really poor quality tail ends. So overall you're left with some reads that DADA2 is filtering whereas in the other method they are retained and get assigned some taxonomy. But as Nick mentioned thees two routes are so similar that it is not worth losing any sleep over.

That is good! Though reads being assigned as Bacteria only may be indicative of the same issue, but again these are so small that it is not worth exploring.

As for why including primers has such as huge effect on merging and chimera removal I have some thoughts. Take it with a grain of salt cause I can't prove these but just my feeling...

Merging: It probably doesn't have much to do with the merging itself per se but rather with the fact DADA2 is tossing a lot more singletons. Since you have degenerate primers which weren't trimmed, the likelihood of singletons appearing are much much higher and by default DADA2 discards those (and their respected pair). So now you just have less reads in general to merge. I'm guessing the merging rate itself isn't really changing. So while fine-tuning your truncating parameters may help a little, I'm not optimistic that you would actually see any major differences. Hopefully I'm wrong.

Non-chimeric: The same problem is probably leading to many more chimeras being detected here. Since chimeras are more likely to be called with singletons (which can form again after merging) and very low abundant features, a lot more of your reads are being detected as chimeras, probably incorrectly. This is again because we are including degenerate primers. I would think non-degenerate primers would have less of an issue here or increasing --p-min-fold-parent-over-abundance value might help but that's just academic, don't actually use this method.

Overall this is a good reminder about the benefits of removing primers before further processing. They are redundant, provide no real new information, and just get in the way of other scripts.

3 Likes

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.