My sequences do not merge

Hello everyone,
One of the issues I'm experiencing in my analysis is due to the low sequencing quality. I believe that the sequencing company to which I sent my samples does not have an optimized protocol for the reverse reads, as I'm seeing a very poor quality graph with the same pattern as another analysis conducted with the same company on different samples a few years ago.

I performed an initial analysis with QIIME 2, obtaining the following read count summary and quality graph:

| |forward reads|reverse reads|
|---|---|---|
|Minimum|32764|32764|
|Median|59342.0|59342.0|
|Mean|66898.966667|66898.966667|
|Maximum|165126|165126|
|Total|4013938|4013938|

After filtering with DADA2, the percentage of merged reads is lower than 1%.

I decided to try using two filtering tools separately. First, I used Trimmomatic, and then I made another attempt with Cutadapt.

Trimmomatic

To achieve a quality score of 20, I had to trim the sequences with DADA2 at positions 230F and 152R.
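
For reference, this is roughly the DADA2 command I ran with those truncation positions; the file names here are placeholders for my actual artifacts:

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux.qza \
  --p-trunc-len-f 230 \
  --p-trunc-len-r 152 \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats denoising-stats.qza \
  --verbose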

After using Trimmomatic, I observed an improvement in the quality graph without significantly affecting the number of obtained reads. This allowed me to perform filtering with DADA2 using a length of 285F/232R. However, the issue is that the resulting table after filtering has very low merge percentages, with a maximum of 6.66%.

After processing my sequencing data with Trimmomatic, I obtained a file containing sequences that have been successfully paired. However, when I proceeded with the analysis in QIIME 2, the percentage of sequence pairing was unexpectedly low. This is puzzling to me since Trimmomatic specifically selects sequences that can be paired.

I have carefully reviewed the analysis steps and ensured that the parameters used in QIIME 2 are appropriately configured for my data, including the pairing settings for sequences processed by Trimmomatic. I have also considered the possibility of biological variability in my samples.

Demultiplexed sequence counts summary

| |forward reads|reverse reads|
|---|---|---|
|Minimum|29362|29362|
|Median|55000.0|55000.0|
|Mean|61351.066667|61351.066667|
|Maximum|154057|154057|
|Total|3681064|3681064|

I don't understand why all my sequences are considered chimeric, and I don't know if I am doing anything wrong...

Files with just DADA2:
demuxPrueba.qzv (322.8 KB)
denoising_stats_sintrim..qzv (1.2 MB)

Files trimmed with Trimmomatic:
denoising_stats_trimmed.qzv (1.2 MB)
demuxtrimeado.qzv (324.4 KB)

Could you please help me understand why QIIME 2 is unable to successfully pair the sequences, even though Trimmomatic has already selected them for pairing? I would greatly appreciate any insights or guidance you can provide to resolve this issue.

Hi @VerheulJulia,
This is interesting.
Could you provide a little bit more information so that I can help.

  1. What primers are you using for these sequences (e.g., 515F-806R)?
  2. Can you provide more information on the sequences that Trimmomatic said should merge?

Thanks!
:turtle:

Thank you very much for your prompt response. The primers used for sequencing the V3-V4 region are as follows:

Trimming 16S V3V4:
Specific primer + adapter: ATTAGAWACCCBNGTAGTCC
Specific primer + adapterRead2: CTGSTGCVNCCCGTAGG

Sequences that are not matched by Trimmomatic are discarded and stored in a separate file. The selected sequences are included in the demux file that I have uploaded in the Trimmomatic section.
May I kindly inquire about the specific additional information you require regarding the sequences? Unfortunately, I am unable to upload the sequence files due to their large size, and the qzv files do not provide detailed observations about the sequences.

To potentially shed some light on the issue, I would like to mention that a few weeks ago, I attempted to use the trimmomatic sequences. However, I mistakenly used the v4 adapters instead of the v3-v4 adapters in the DADA2 step. Surprisingly, the merge percentage remained identical in both cases.

Please let me know what specific details or information you need regarding the sequences, and I will do my best to assist you further.

Please let me know if you need any further information. Thank you.

Hi @VerheulJulia,
I am a little confused. This could be my lack of expertise with Trimmomatic, but I am not sure how Trimmomatic provided a file containing sequences that had successfully paired. I looked through the docs and couldn't find any mention of merging sequences in the documentation. I also looked at the demux file, and those sequences are not merged, so I am just a little confused.

If you are looking to trim out adapters, I would recommend cutadapt because it stays in the qiime2 system and your commands would be tracked through provenance. :qiime2:
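
A minimal sketch of what that could look like, where the file names and primer sequences are placeholders you would swap for your own:

qiime cutadapt trim-paired \
  --i-demultiplexed-sequences demux.qza \
  --p-front-f FORWARDPRIMERSEQ \
  --p-front-r REVERSEPRIMERSEQ \
  --p-discard-untrimmed \
  --o-trimmed-sequences demux-trimmed.qza \
  --verbose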

I asked for your primers so that I could calculate the rough sequence length of your sequences. Most of the time sequences do not merge because they are not long enough to have overlap.

I am a little confused about your primers, and this might be affecting trimming:
The first primer you posted is an 806R primer; this is typically the end of the V4 region.

The second looks like the reverse complement of the above adapter, so it doesn't look like you have the beginning adapter.

I hope that helps!
:turtle:

Thank you very much for your detailed response. The Trimmomatic manual I have read does not provide a detailed explanation of the process. However, it does explain that the sequences are paired, and usually, the small fragments that cannot be merged are saved in a separate file. I have previously used Cutadapt, and I do acknowledge that it is very practical. However, after using it, I noticed two things: 1) the quality graph remained exactly the same as when not using it, and 2) the merge percentage was 0%.

The only way I have managed to slightly improve the quality of the sequences, as seen in the graphs, is by using Trimmomatic, achieving a merge of at least 16% (16% more than with Cutadapt).

The truth is that the primer sequence was provided by the sequencing company, and I can't even be 100% sure if it's correct. A couple of years ago, when we asked for the adapter sequences, they gave us v4 sequences. However, when I contacted them again last week, they said it was a copy-paste mistake and provided me with the sequence I included in my latest post. So, considering they have already given two different responses over time, I'm not sure if it's reliable information, but the primers they used are 341f – 806bR for the region V3-V4.

Despite that, I don't understand how they are complementary except for the last 3 bases.

The trimmomatic manual
http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/TrimmomaticManual_V0.32.pdf

I have BLASTed the first overrepresented sequence (from FastQC) from a non-trimmed sample (TGAGGAATATTGGTCAATGGCCGGAAGGCTGAACCAGCCAAGTCGCGTGA), and it does not seem to be a primer.


I don't know if this gives any clue about what's going on and why I don't get many merged sequences.

Thank you very much.
Julia

Hello @VerheulJulia,
I am not sure about your primers, so I cannot really speak to that. All I know is that this website says your primer is the reverse complement.

It does not surprise me that you have better quality scores in your demux file after Trimmomatic. You are doing some preliminary filtering, so it makes sense that the quality would look better. I am not sure that it's necessary to do that, because DADA2 has very effective filtering and denoising methods.

Looking at the Trimmomatic manual, it says:

For paired-end data, two input files, and 4 output files are specified, 2 for the 'paired' output
where both reads survived the processing, and 2 for corresponding 'unpaired' output where a
read survived, but the partner read did not

This is just saying that one of the paired reads did not make it through the filtering, not that they were going to merge. From what I understand, Trimmomatic does not try to merge the sequences; it just does some filtering and trimming. So it's saying that some of your pairs of sequences made it through the filtering and some did not. It has no bearing on whether the "pairs" that made it through are going to merge.
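
For illustration, a typical Trimmomatic paired-end call looks roughly like this (the file names, version number, and settings here are placeholders, not your actual parameters):

java -jar trimmomatic-0.39.jar PE \
  input_R1.fastq.gz input_R2.fastq.gz \
  output_R1_paired.fastq.gz output_R1_unpaired.fastq.gz \
  output_R2_paired.fastq.gz output_R2_unpaired.fastq.gz \
  SLIDINGWINDOW:4:20 MINLEN:100

The "paired" output files only mean that both reads of a pair survived filtering; nothing in this step checks whether the pair overlaps enough to merge.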

As for your merging problem: I am going to do some rough math to understand the length of the region and the overlap necessary to be able to merge. Before I get into this math, it is important to know that it does not take into account variance in length, so it's very rough math.

If you are using 341F–806R primers, your sequences are roughly 465 bp in length (806 − 341 = 465). You also need about 12 bases of overlap in order to successfully merge: 465 + 12 = 477. So, adding your forward and reverse reads together, they need to be roughly 477 bp long in order to merge.

You said in an earlier response that you are trimming at 230F and 152R. That won't merge because it adds up to only 382 bp and you need roughly 477 bp to merge. That's why in this screenshot we see that you are losing all your sequences in the merging step.

For the Trimmomatic run, you are trimming at 285 and 232, respectively.

If we add those up you get 517 bp, so they should realistically merge. And when we look at your stats.qzv, we can see that you are losing the majority of your sequences in the filtering step. For example, for the first sample 15-P, you have 55361 reads and then 4232 reads after filtering, but you retain the majority of your 4010 denoised reads at the merging step and come out with 3685. It says that 6.6% are making it through merging, but that is 6.6% of the total input, not 6.6% of the remaining sequences. You are losing most of your reads at the filtering step.

Hope this helps!
:turtle:

Dear Julia,
Trimmomatic does save paired sequences if both reads survive the filtering steps. If you want to check the % of joining sequences, you could try the qiime vsearch join-pairs command (use --verbose). Also, as Chloe already stated, try to keep the sum of your lengths above 460 (the average V3-V4 length) + 20 (the overlap needed for DADA2).
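
For example (file names are placeholders):

qiime vsearch join-pairs \
  --i-demultiplexed-seqs demux.qza \
  --o-joined-sequences demux-joined.qza \
  --verbose

qiime demux summarize \
  --i-data demux-joined.qza \
  --o-visualization demux-joined.qzv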

Kind regards,

Hi @VerheulJulia and @Cobaya417
I just wanted to chime in. I would not suggest running the qiime vsearch join-pairs command without properly filtering first. I believe the issue with the Trimmomatic run is that it is not really getting past the DADA2 filtering step, and the issue is not the joining step. Since the issue is low-quality data and not merging, trying to merge data that hasn't been properly filtered will probably cause more issues.

Also, as I described in my post above, DADA2 only needs 12 bp of overlap. It used to be 20 but it is now only 12 bp.

Dear @cherman2,
Thank you for the clarification and correction. Yes, I was suggesting to perform join-pairs after Trimmomatic to check the joining % after QC; if this percentage is low, it probably won't matter which parameters you use. As you suggest, the data quality seems to be the issue.

Kind regards,

Hi @cherman2

Thank you very much for your response and your time. It is indeed true that I overlooked the fact that the V3-V4 region is larger than 300 bp. I hadn't done the calculations, and it makes perfect sense that if the fragment is 477 bp long and I leave fragments of 382 bp after trimming, the complete sequence cannot be obtained and therefore cannot be merged. However, what has confused me is that the quality graph in QIIME 2 only represents sequences up to 300 bp in length, and in the size table, the maximum fragment size shown is 301 bp.

I am wondering if those are the actual sizes of the sequences or if QIIME2 has predetermined values for these parameters. I was expecting to have slightly longer sequences.

For larger regions like V3-V4, the truncation values would need to stay around 240/240 or 250/230, for example. However, based on other posts I have seen, it seems that very few quality graphs maintain quality scores (QS) of 20 across the longer read lengths that would allow for a reliable analysis.
If I were to use very lenient trimming values in the DADA2 step, which means including sequences with QS values of 10-15, can I rely on the denoising performed by DADA2 to trust that there are not too many sequencing errors in my data? What do you suggest I do with the analysis considering the quality that I have?

Should I continue with the lenient trimming approach or should I consider stricter quality control measures to ensure the reliability of my results? Thank you for your guidance.

Thank you for your attention. I look forward to your response.
Julia

Hi @VerheulJulia,
These are great questions!

The length of your reads is determined by the sequencing approach. It seems like you got 2x300 (which means you got forward and reverse reads 300 bp in length). There are also 2x250, 2x150, etc. This isn't determined by :qiime2:; it's determined at the sequencing center.

It does not matter that your reads individually are not equal to 477; it only matters that together your forward and reverse reads add up to 477.

I am not quite sure what would be the best approach. I would suggest playing around with the truncation values and seeing if that gives you better results. If you could try different values and then upload some stats.qzv and table.qzv files, I could take a look and discuss what might be a reasonable course of action.
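
For example, something along these lines, where 270/210 is just one hypothetical combination that keeps the total above ~477 bp and the file names are placeholders:

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux.qza \
  --p-trunc-len-f 270 \
  --p-trunc-len-r 210 \
  --o-table table-270-210.qza \
  --o-representative-sequences rep-seqs-270-210.qza \
  --o-denoising-stats stats-270-210.qza

qiime metadata tabulate \
  --m-input-file stats-270-210.qza \
  --o-visualization stats-270-210.qzv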

Hope that helps!
:turtle:

Hi @cherman2

Following your advice, I have conducted several attempts using three different truncation values. However, the maximum merge percentage I have achieved does not exceed 30%. I have attached a screenshot displaying the values obtained from the various attempts using DADA2. Additionally, I include the corresponding table and statistics from the experiment where I obtained more relevant results.


table_picasso260_217.qzv (443.4 KB)

DADA2picasso260_217.qzv (1.2 MB)

One of the alternatives that I have been exploring in parallel is to perform a single-end analysis using only the R1 sequences. In this case, the percentage of reads that passed the filter increased to 68-76%, which is a more reasonable percentage for analysis. However, if possible, I would like to avoid this alternative because even though the forward sequences are not as bad, they still contain errors, and without the consensus of both reads, many errors could be introduced.
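
For the single-end attempt I used roughly the following; the truncation value and file names shown here are placeholders for the ones I actually used:

qiime dada2 denoise-single \
  --i-demultiplexed-seqs demux-single.qza \
  --p-trunc-len 260 \
  --o-table table-single.qza \
  --o-representative-sequences rep-seqs-single.qza \
  --o-denoising-stats stats-single.qza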

Since the paired-end analysis left idle time due to its long computation, I decided to perform the single-end analysis in parallel. I used the following databases/classifiers for taxonomic classification, all of them using the same rep-seqs file and the same table:
table-single.qzv (454.6 KB)
rep-seqs-single.qzv (318.0 KB)

Initially, I used the pre-trained classifier provided by QIIME 2 (SILVA 138.99).
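
Roughly like this (the file names are placeholders for my actual artifacts):

qiime feature-classifier classify-sklearn \
  --i-classifier silva-138-99-nb-classifier.qza \
  --i-reads rep-seqs-single.qza \
  --o-classification taxonomy-silva.qza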

Then, I attempted to train a classifier with my own data, following the tutorial "Training feature classifiers with q2-feature-classifier" and using the SILVA 138.99 FeatureData[Sequence] and FeatureData[Taxonomy] files provided on the resources page. However, the result was that it only identified one species, and the rest remained unassigned.


I also made the same attempt with version 132 of SILVA at 99% identity, and almost all fragments were classified as unassigned.

Then, I made an attempt with the new QIIME 2 plugin, Greengenes2, following the instructions provided in the section for regions other than V4. I'm not sure if this section is optimized, but I was able to generate the .qzv file for taxonomy. However, when I tried to generate the taxa bar plot, I encountered the following error.
taxonomy.gg2.1.qzv (1.2 MB)

Traceback (most recent call last):
  File "/home/julia/anaconda3/envs/qiime2-2023.2/lib/python3.8/site-packages/q2cli/commands.py", line 352, in __call__
    results = action(**arguments)
  File "<decorator-gen-485>", line 2, in barplot
  File "/home/julia/anaconda3/envs/qiime2-2023.2/lib/python3.8/site-packages/qiime2/sdk/action.py", line 234, in bound_callable

    outputs = self._callable_executor_(scope, callable_args,
  File "/home/julia/anaconda3/envs/qiime2-2023.2/lib/python3.8/site-packages/qiime2/sdk/action.py", line 443, in _callable_executor_
    ret_val = self._callable(output_dir=temp_dir, **view_args)
  File "/home/julia/anaconda3/envs/qiime2-2023.2/lib/python3.8/site-packages/q2_taxa/_visualizer.py", line 41, in barplot
    collapsed_tables = _extract_to_level(taxonomy, table)
  File "/home/julia/anaconda3/envs/qiime2-2023.2/lib/python3.8/site-packages/q2_taxa/_util.py", line 42, in _extract_to_level
    collapsed_table = _collapse_table(table, taxonomy, level, max_obs_lvl)
  File "/home/julia/anaconda3/envs/qiime2-2023.2/lib/python3.8/site-packages/q2_taxa/_util.py", line 19, in _collapse_table
    raise ValueError('Feature IDs found in the table are missing from the '
ValueError: Feature IDs found in the table are missing from the taxonomy: {'0ea89af03489cd0042e244da95717306', 'd31d618b00510a936ad75bf753bd8111', 'b3b318ac62506795e8a41ee289431779', '60fc710df6345a9b2b7f31ac1500231c', '0640579fca999d04069119566d6bea23', '5cfc2e63284826be749078d590de2e8a', '7006270b2117277ace9e3eb44fed625d', '567c2a18b33de5a199fcc707626b04c7', '8b49a8e2010e6d41a7a18c1ce2e9b0ba', '487405dc03bc3775979159b5c5a54af4',
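
I was wondering whether filtering the table down to the features that are present in the taxonomy would be an acceptable workaround before running the bar plot, along these lines (file names are placeholders for my actual artifacts):

qiime feature-table filter-features \
  --i-table table-single.qza \
  --m-metadata-file taxonomy-gg2.qza \
  --o-filtered-table table-single-matched.qza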

I understand that I have just provided you with a lot of information at once. I'm not sure if it would be preferable for me to create an additional post explaining the issues I am encountering in classification using my own classifier or using Greengenes2.

I also tried to follow the classifier-training tutorial using the old Greengenes 13_5 version, and these were the results:

In those attempts where I used the trained classifier, I used this code for the extraction of the reads:

qiime feature-classifier extract-reads \
  --i-sequences ****** \
  --p-f-primer ATTAGAWACCCBNGTAGTCC \
  --p-r-primer CTGSTGCVNCCCGTAGG \
  --p-max-length 500 \
  --o-reads ****
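
After the extraction, I trained and applied the classifier roughly like this (file names are placeholders for my actual artifacts):

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads ref-seqs-extracted.qza \
  --i-reference-taxonomy ref-taxonomy.qza \
  --o-classifier classifier-v3v4.qza

qiime feature-classifier classify-sklearn \
  --i-classifier classifier-v3v4.qza \
  --i-reads rep-seqs-single.qza \
  --o-classification taxonomy-custom.qza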

Thank you once again for going above and beyond in your support.
Warmest regards,
Julia

Hi @VerheulJulia,
It seems like 260-217 is getting around 30% past the filtering step, and then almost all of those reads are merging from there. Although that is not ideal, it is probably the best you can get with this quality.

As for running single-end (just the R1 sequences), that could help get more reads through the filtering step because the forward read is usually better quality. I am not sure what you mean by merging in this step: since you are using just the R1 sequences, there is no merging step, because you are not merging forward and reverse reads. This is definitely a valid way to continue if you are concerned that you are losing too many reads.

Would you mind making a new question about your classifier under User Support? It is a little off topic, and it would be helpful to answer these questions separately.
Thanks!
:turtle:

Hi @cherman2,

My bad, I meant the percentage of sequences that passed the filter.

OK thank you a lot!