Removal of Overrepresented sequences from 16S metagenomic sequences

Hi,

I am working with a 16S metagenomic sample targeting the V3-V4 region, with paired-end reads that have varying lengths. The forward read length is 260 bp and the reverse read length is 134 bp. Due to the shorter read length of the reverse reads, I am encountering difficulties while trying to perform DADA2 and merge the R1 and R2 reads. I received the following error message:
"No reads passed the filter. trunc_len_f (260) or trunc_len_r (134) may be individually longer than read lengths, or trunc_len_f + trunc_len_r may be shorter than the length of the amplicon + 12 nucleotides (the length of the overlap). Alternatively, other arguments (such as max_ee or trunc_q) may be preventing reads from passing the filter."
I have tried various combinations of truncation lengths, but the same error persists.
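
The second condition in that message can be checked with simple arithmetic. The sketch below assumes a V3-V4 amplicon of roughly 460 nt, which is a commonly quoted figure and not something measured from this run, so `amplicon_len` is an assumption. The read lengths and the 12 nt overlap are taken from the message above.

```shell
# Rough merge-feasibility check for paired-end reads.
# ASSUMPTION: amplicon_len=460 is a typical V3-V4 length, not a measured value.
trunc_len_f=260
trunc_len_r=134
amplicon_len=460   # assumed
min_overlap=12     # overlap quoted in the DADA2 error message

total=$(( trunc_len_f + trunc_len_r ))
needed=$(( amplicon_len + min_overlap ))
shortfall=$(( needed - total ))

echo "combined read length: ${total}"
echo "needed to merge:      ${needed}"
if [ "${shortfall}" -gt 0 ]; then
  echo "short by ${shortfall} nt: the reads cannot overlap enough to merge"
else
  echo "reads are long enough to merge"
fi
```

Under that assumption the reads fall 78 nt short, so no choice of truncation lengths would let them overlap. A shorter true amplicon would shrink that gap.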

I then decided to analyze both reads independently as single-end data.
When I checked the quality of both reads with FastQC, the Basic Statistics for both were reasonably good, with high per base sequence quality.
However, other modules such as Per tile sequence quality, Per base sequence content, and Per sequence GC content look very bad.

The biggest issue I'm facing is with the Sequence duplication levels and Overrepresented sequences, where ~64% of the sequences are overrepresented.

So, my questions are:

  • Does DADA2 also take care of removing overrepresented sequences while denoising?

  • How can we address these "overrepresented sequences" within QIIME 2?

  • And is there another way to work with this kind of data as paired-end,
    or do you have any specific suggestions?

Thank you in advance for any help or suggestions!

Best,
Rashmi Ira
AS_S120_V3V4_L001_R1_R2_001_fastqc (2).zip (927.7 KB)

There are several issues with the 16S V3-V4 data that I am analysing:

  1. The reverse (r) read (around 124 nt) is shorter than the forward (f) read (260 nt), which makes pair joining fail in the DADA2 step. Would it be advisable to analyse such data as single-end instead of paired-end?
  2. The FastQC reports raise red flags for certain modules, including overrepresented sequences, per tile quality, and per base sequence content. In such a case, should one perform any pre-processing steps before proceeding with the QIIME 2 analysis?

Please advise.

Hi All,
I merged these two questions because they are asking very similar things. Let me do a little bit more research on FastQC and I'll be back with more answers!

Hi all,

I think analyzing the data as single-end is a reasonable approach if you absolutely cannot merge the reads.

What kind of sequencing machine did this come off of? I am not familiar with reads being different lengths straight off the machine. Is this caused by the truncation parameter? If so, I might play around with it to see if I could merge the reads!

Could y'all post your dada2 command so that I can get a better idea of what's happening?


Alright, let's take a DEEP dive into FastQC with amplicon data!

When I ran FastQC on the moving pictures demux data, this is what I saw. So kinda the same things that y'all are reporting. I think we are seeing this because FastQC is a metagenomics-focused tool, not an amplicon tool.

[Screenshot: FastQC summary for the moving pictures demux data]

Overrepresented Sequences
FastQC assumes shotgun metagenomic sequencing, where there is basically zero chance that the exact same sequence shows up twice, because all the DNA in a whole genome (sometimes a whole microbiome) is being sequenced. In amplicon data we expect duplicated sequences: if we sequence the same genus twice, we end up with the same amplicon sequence twice. That's how we count abundance with amplicon data! FastQC raises a warning as soon as any single sequence makes up more than 0.1% of the reads, and a failure above 1%, so any reasonably abundant genus will trigger it. In summary, I would say this failure from FastQC is expected for this type of data.

To answer your specific question about this:

No, DADA2 doesn't remove overrepresented sequences, because we expect duplicated sequences! If we didn't, we wouldn't be able to use amplicon sequence variant counts as estimated abundance.

cite: Overrepresented Sequences
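
To make the duplication point concrete, here is a small sketch that counts identical reads in a toy FASTQ file. The four reads are made up for illustration, and this is plain counting, not FastQC's actual procedure, which is more involved.

```shell
# Toy FASTQ (hypothetical reads) to illustrate duplicate counting in amplicon data.
fastq="$(mktemp)"
cat > "${fastq}" <<'EOF'
@read1
ACGTACGTAC
+
IIIIIIIIII
@read2
ACGTACGTAC
+
IIIIIIIIII
@read3
ACGTACGTAC
+
IIIIIIIIII
@read4
TTGCATGCAA
+
IIIIIIIIII
EOF

# Sequence lines are every 4th line, starting at line 2.
total_reads=$(awk 'NR % 4 == 2' "${fastq}" | wc -l | tr -d ' ')

# Count how often each distinct sequence occurs, most frequent first.
counts=$(awk 'NR % 4 == 2' "${fastq}" | sort | uniq -c | sort -rn)
echo "${counts}"

top_count=$(echo "${counts}" | head -n 1 | awk '{print $1}')
top_pct=$(( 100 * top_count / total_reads ))
echo "most common sequence: ${top_count}/${total_reads} reads (${top_pct}%)"
rm -f "${fastq}"
```

In this toy file one sequence accounts for 3 of 4 reads. In amplicon data a dominant sequence like that is what an abundant taxon looks like, not evidence of a problem.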

GC content
Looking at my GC content for the moving pictures data:


My peak is narrower (more conserved) than the normal distribution FastQC expects, which is why I get a warning here. This isn't surprising to me given that we are targeting a conserved region (a 16S amplicon as compared to the whole genome). I think amplicon data is intrinsically biased toward a narrow GC range, and FastQC is identifying this bias, but it's not a sequencing bias. It's again a data-type bias.
cite: Per Sequence GC Content

Per Base Sequence content
Okay, so same story here. 16S is a relatively conserved region, and FastQC is expecting a random distribution of bases (because that is what would be expected in whole-genome fragments). In their docs, they talk about 3 things that could cause Per Base Sequence Content bias. Our data type has 2 out of the 3.

The 2 culprits:

  1. Overrepresented sequences: which we know we have because of the previous failure, and as discussed we should probably expect this.
  2. Biased composition libraries: which we also have because we are targeting a specific region of the genome.
    So again, FastQC is assuming randomness and we are using a fragment of the genome that is not really random.

Interestingly, we can see in my data that it's actually extremely biased toward the beginning of the sequence, probably because that's near the really conserved part that we target!

cite: Per Base Sequence Content

Final thoughts on sequence content, GC content and duplicated sequences
I would probably not take these 3 quality tests too seriously when working with amplicon data. These tests are checking for a bias that amplicon data almost certainly has.

To directly answer your question:

I don't think there are preprocessing steps that should be done (besides running dada2, which I will discuss later). Like I said above, these are expected biases; FastQC just doesn't know that.

Per Base Sequence Quality

This seems reasonable enough for amplicon data but it is almost exactly the output of qiime demux summarize.

My data failed on this test:

This indicates that I should trim my data during dada2! This is the first FastQC step that I think can be applied to our data type, and QIIME 2 does have the tools to correct this!

cite: Per Base Sequence Quality

Per Tile Sequence Quality
This tells a similar story but would be more helpful if you were debugging issues with a sequencing machine. If you are not doing that, I would just get this quality info from Per Base Sequence Quality (I think it's easier to interpret).

cite: Per Tile Sequence Quality

Lastly Adapter Content

I don't think y'all mentioned failing this step, but I think this could be applied to amplicon data. It could help you identify whether adapters are in your sequences. Of course, for 16S you would also have to identify your primers at that point and remove them.

cite: Adapter Content

Conclusion
You definitely can use FastQC to investigate the sequence quality of amplicon data, but a lot of its tests check for biases that amplicon data inherently has, so they will fail and there isn't anything to fix. FastQC is probably best used for metagenomic sequencing, because its assumptions of randomness are more accurate for whole-genome sequencing.

Sorry for the info dump!
I hope this helps!
:turtle:

Dear @cherman2,

Thanks for your prompt response.

We used a MiSeq (Illumina), and the library was prepared with the QIAseq kit (16S/ITS Region Panel). Originally the run chemistry was 301*2, but because of some network issues the run cycles were interrupted and we could not retrieve the complete data. We then created a new run for the same interrupted data to demultiplex it on the system, changing the read lengths to match the completed run cycles, i.e.,

yes, here is the command:
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs Aged-sludge-16S/paired-end-demux-trimmed-exp-jyoyika.qza \
  --p-trim-left-f 2 \
  --p-trim-left-r 0 \
  --p-trunc-len-f 258 \
  --p-trunc-len-r 132 \
  --p-max-ee-f 2 \
  --p-max-ee-r 4 \
  --p-n-reads-learn 1000000 \
  --o-table table-dada2.qza \
  --o-representative-sequences rep-seqs-dada2.qza \
  --o-denoising-stats stats-dada2.qza
(likewise for other values of truncation length)
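
If merging stays out of reach, the single-end route raised earlier in the thread could look roughly like the sketch below. It reuses the forward-read values from the command above, and the output file names are placeholders. As far as I know, `denoise-single` uses only the forward reads when it is given a paired-end artifact. The guard simply skips the run when `qiime` or the input file is missing.

```shell
# Sketch: denoise only the forward reads with DADA2.
# ASSUMPTION: output names are placeholders; parameter values are copied from
# the paired-end command in this thread and are not tuned recommendations.
demux="Aged-sludge-16S/paired-end-demux-trimmed-exp-jyoyika.qza"

run_denoise_single() {
  qiime dada2 denoise-single \
    --i-demultiplexed-seqs "$1" \
    --p-trim-left 2 \
    --p-trunc-len 258 \
    --p-max-ee 2 \
    --o-table table-single.qza \
    --o-representative-sequences rep-seqs-single.qza \
    --o-denoising-stats stats-single.qza
}

# Only run when QIIME 2 and the input artifact are actually available.
if command -v qiime >/dev/null 2>&1 && [ -f "${demux}" ]; then
  run_denoise_single "${demux}"
  status="ran"
else
  echo "qiime or ${demux} not found; skipping"
  status="skipped"
fi
```

The reverse reads would be left out entirely in this route, so the resulting sequence variants cover only the first ~256 nt of the amplicon.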

Oh, okay. But is there any range for considering duplicated sequences while performing denoising? In our case, we are getting lots of duplicate sequences.

Thanks for your time and great explanation.

Best,
Rashmi

Thank you @cherman2 for the detailed explanation and prompt response to the queries.
In reference to the same, I request you to please look into the following:

  1. The percentage of overrepresented sequences is very high (in the range 50-70%) according to the FastQC files. Is it normal to have such a high number of overrepresented sequences?
  2. Further, these overrepresented sequences are repetitive sequences. Could these be sequencing artifacts or a part of some microbial taxa?

The library preparation kits used were QIAGEN QIAseq 16S/ITS 348 Index (Sets A,B were used of the 2 A,B,C,D) and QIAseq 16S/ITS Region Panel (96).

I would really appreciate it if you could look into these queries and advise on the same.

Hello Pratyusha,

I'm not Chloe, but hopefully, I can help answer some questions.

FastQC is looking for problems in untargeted (shotgun) data. :warning:
But these are not problems at all for targeted (amplicon) data! :white_check_mark:

  1. Yes, high duplication rates are normal for amplicons.
  2. Yes, they could be sequencing artifacts, but high duplication rates are normal for amplicons.

I like Chloe's explanation:

Finally, I want to remind you that only you can attest to the quality of your data.

Hello Rashmi,

I'm sorry to hear about the interrupted run. I think your solution makes a lot of sense and should work okay to salvage this run. Good thinking!

Getting lots of duplicate sequences is expected for targeted amplicon sequencing.

In fact, in amplicon sequencing we expect all real reads to appear more than once!
Duplication is GOOD when working with amplicons!

QIIME 2 provides a run quality report that makes more sense for amplicon analysis.
The Atacama Soils tutorial provides an example of this report for paired-end data.
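
Assuming the demultiplexed reads are already imported as an artifact, that report comes from `qiime demux summarize`. The file names below are placeholders, and the guard skips the run when `qiime` or the input is not present.

```shell
# Sketch: build the QIIME 2 read-quality report for demultiplexed reads.
# ASSUMPTION: both file names are hypothetical placeholders.
demux="paired-end-demux.qza"
report="demux-summary.qzv"

if command -v qiime >/dev/null 2>&1 && [ -f "${demux}" ]; then
  qiime demux summarize \
    --i-data "${demux}" \
    --o-visualization "${report}"
  status="ran"
else
  echo "qiime or ${demux} not found; skipping"
  status="skipped"
fi
```

The resulting `.qzv` file can be opened with `qiime tools view`. Its interactive quality plots are what I would use to pick truncation lengths.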

Hi @colinbrislawn

Thanks for your response and clarification. It’s very helpful in highlighting the importance of duplicate or overrepresented sequences in amplicon data.

I’ll definitely review this tutorial again to better understand the read-quality assessment.

Thanks again to both @colinbrislawn and @cherman2.

Best Regards,
Rashmi
