Only 1/73 sample left after dada denoise?

Hi

I am totally new user of qiime2 and don’t have experience in using Qiime1.

Recently I start to use QIIME2(2019.10) to do 16S analysis. And I met a problem regarding filtering reads I guess.

I used conda to install it. (as describing in this linkage about linux system:https://docs.qiime2.org/2019.10/install/native/#install-qiime-2-within-a-conda-environment)

Import command:import pair-end reads(clean data)

qiime tools import --type 'SampleData[PairedEndSequencesWithQuality]' \

--input-path se-64-manifest \

--output-path /sbidata/lzhang/201911_hydractinia/RawData/16S_data/output/demux.qza \

--input-format PairedEndFastqManifestPhred64

se-64-manifest format

sample-id,absolute-filepath,direction

10,/sbidata/lzhang/201911_hydractinia/RawData/F19FTSEUHT1464_METzccM/BGI_results/Clean_Data/10/10_1.fq.gz,forward

10,/sbidata/lzhang/201911_hydractinia/RawData/F19FTSEUHT1464_METzccM/BGI_results/Clean_Data/10/10_2.fq.gz,reverse

11,/sbidata/lzhang/201911_hydractinia/RawData/F19FTSEUHT1464_METzccM/BGI_results/Clean_Data/11/11_1.fq.gz,forward

11,/sbidata/lzhang/201911_hydractinia/RawData/F19FTSEUHT1464_METzccM/BGI_results/Clean_Data/11/11_2.fq.gz,reverse

...,...,...

Description of my sample after import into qiime2


I totally have 73 samples before doing denoising
Qualtiy as belows

Command as follows: denoise and visualize step

based on the qulity plots, i set the trim length 300, which means I don’t want to trim reads. Cause it seems the reads have good quality. Right?

qiime dada2 denoise-paired \
   --i-demultiplexed-seqs demux.qza \
   --p-trunc-len-f 300 \
   --p-trunc-len-r 300 \
   --o-table table.qza \
   --o-representative-sequences rep-seqs.qza \
   --o-denoising-stats denoising-stats.qza


qiime feature-table summarize \
 --i-table table.qza \
 --o-visualization table.qzv \ 
 --m-sample-metadata-file 201912_hydractinia_metadatafile_Qiime2.tsv

Then I used qiime tools view to check the results

I got only one sample left …

stats of this procedure

It seems that other samples don’t pass the filter parameters?
So I check the parameter of overlap, the default overlap is 20.

In order to check if this is the problem. I rerun the denoise procedure by setting 0.

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs /sbidata/lzhang/201911_hydractinia/RawData/16S_data/output/demux.qza \
  --p-trunc-len-f 0 \
  --p-trunc-len-r 0 \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats denoising-stats.qza

And this time, I got results of total 77 samples in table.

I have a few questions :slight_smile:
I have read some questions relating this, but it seems that no results are like mine…only 1 sample left.

Q1. I am not sure if I set the parameter to 0, then the quality of the result will be low? Can I set it to 0? And will affect the results? If it affects results, do you have any suggestions for me?
Q2. I also have merged reads which are results from the sequencing company. Do I need to analyse directly with the merged reads?
Q3. Should I using deblur instead and try again?

Trully thanks for your help!

Hello @Lu_Zhang,

Welcome to the forums! :qiime2:

Thank you for your detailed question and all the dada2 output logs. I think I can answer all your questions.

If you set both --p-trunc-len-f and --p-trunc-len-r to zero, no trimming will be performed. This is why this setting worked well for your data set. If you set these to 300, then reads will be trimmed at 300 and reads under 300 bp will be removed. I think this is why so many samples were lost when you set these to 300.

Nope! It’s best to start with raw data with as little pre-processing as possible. In fact, dada2 needs the reads before joining to work.

You could. Dada2 and deblur are different complementary methods. You can could read about why they were made and their approach to denoising, and choose which method you like best. Or you could try them both!

Colin

2 Likes

Hi @colinbrislawn,

So excited to see your reply and appreciated for your reply!:smiley:

What I feel confused is that the totaly length of the reads is 300bp, so I guess there is no reads has been removed if I set 300bp?


And also if I set it to 0, is there any affects on the next denoise step? Casue if I set it to 0, the overlap 20 parameter is also disabled.

Thanks very much!

1 Like

Let’s figure out what trimming at 300 bp does. We need all the data we have. Can you post the Demultiplexed sequence length summary from that demux.qzv file?

Maybe… but I think something else might be happening. If any of your reads are less than 300 bp, then these will be removed. And the sequence length summary will tell us how long your reads are and how many of them would be removed at 300 bp.

I’m not sure… The zero setting will cause reads not to be truncated at all, and this could indirectly affect pairing and denoising. This is a good question for the dada2 developer, @benjjneb

Colin

I don’t see why the reads are getting removed with the trunc-len of 300 either, but maybe the reads are for some technical reason just less than 300 nts? (e.g., 299)? Then they would get removed. This sometimes happens if sequencing centers include a couple of technical bases at the start of the reads and then strip them out before returning the data.

Based on your quality plots though, the data looks high quality, and a high fraction of reads are passing through your workflow when using the non-truncated data, so going with trunc-len 0 appears to be just fine here.

My one concern is that you are losing such a high fraction of reads as chimeras, which almost always means that you have unremoved primers at the start of your reads. You should check whether primers are present on these reads, and remove them either with cutadapt, or easily by using the trim-left parameter in the dada2 denoise workflow.

1 Like

Hi :slight_smile:

Yes, you’re totally right, some of my reads are less than 300bp. Here attached is my sequence length summary.

So the principle is that if my parameter is 300bp, it will remove all the reads less than 300bp and then merge them? I am not very clear about the reason.

In that case, I should set a number that less than the length of the shortest reads?
I tried command as follows:

qiime dada2 denoise-paired \
      --i-demultiplexed-seqs /sbidata/lzhang/201911_hydractinia/RawData/16S_data/output/demux.qza \
      --p-trunc-len-f 240 \
      --p-trunc-len-r 220 \
      --o-table table.qza \
      --o-representative-sequences rep-seqs.qza \
      --o-denoising-stats denoising-stats.qza

results show as below,

Trully thanks.

As @benjjneb mentioned, the high fraction of reads as chimeras should be worried. Next I would check this. Thanks:)

1 Like

That’s right. Because 98% of your reads are <300 bp, trimming at 300 will remove 98% of your reads. (If you take a look at the dada2 paired docs, you can read the full description of this setting and see that “Reads that are shorter than this value will be discarded.”)

Maybe… but I would take Ben’s advice here:

The goal is to pick lengths that remove errors and find a setting that the most reads get merged. And I think 0 might be perfect!

Great work! I think you are making good progress.

Colin

Hi :smiley:

Acutually @benjjneb is totally right. I rechecked the data, and contact the sequecing company. They told me that only the merged fasta files are without primers. While the clean pair-end data files are with primers. And I simply trim it by using trim-left in the dada2 denoise workflow. It is very convenient!

So after using the command below:

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs /sbidata/lzhang/201911_hydractinia/RawData/16S_data/output/demux.qza \
  --p-trunc-len-f 0 \
  --p-trunc-len-r 0 \
  --p-trim-left-f 20 \
  --p-trim-left-r 20 \
  --p-n-threads 24 \
  --o-table table20.qza \
  --o-representative-sequences rep-seqs20.qza \
  --o-denoising-stats denoising-stats20.qza

Statistics as follows:


Really thanks your great help! @colinbrislawn @benjjneb

However, for the problem of the percentage of chimeric, I still fee confused. The overall non-chimeric rate is imporved definitely after removing the primer.
But some of the samples(10+/73) percentage is still not very high around 50%.

I list the samples with low percentage of non-chimeric.

Do you think this is okay? or I should set it R pacakge dada2.
I have also tried these data in R package by setting minimum overlap 12

image

However, it seems not be improved…

Thanks

2 Likes

Hello @Lu_Zhang,

Glad you got trimming and filtering working well on your data!

Looks like both the R package and Qiime 2 plugin are getting pretty good results, but have 50% or so flagged as chimera. This is not impossible, especially with higher PCR cycle counts. Did you do 15x, 20x, or 30x PCR cycles?

If you want to lower the chimeric numbers, you could change the --p-min-fold-parent-over-abundance or --p-chimera-method described here.

Colin

3 Likes

Dear @colinbrislawn,

As you recommended, I tried to set the --p-min-fold-parent-over-abundance
command as follows:

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs /sbidata/lzhang/201911_hydractinia/RawData/16S_data/output/demux.qza \
  --p-trunc-len-f 0 \
  --p-trunc-len-r 0 \
  --p-trim-left-f 20 \
  --p-trim-left-r 20 \
  --p-n-threads 24 \
  --p-min-fold-parent-over-abundance 8\
  --o-table table8.qza \
  --o-representative-sequences rep-seqs8.qza \
  --o-denoising-stats denoising-stats8.qza

And the table is as follows,the nonchimera level is improved, but limited to low merge percentage?


I also tried it in R package dada2 setting the maxmismatch to 3.

I got the results as follows:
image
I can see the merge percentage is improved.

I am not sure if I could just use this result ( with some sample 50% non-chimera ?)

To test the results, I also do taxonomy analysis with the model in Moving picture tutorial:

use the default classifier

command as follows:

qiime feature-classifier classify-sklearn \
  --i-classifier gg-13-8-99-515-806-nb-classifier.qza \
  --i-reads rep-seqs20.qza \
  --o-classification taxonomy20.qza

qiime metadata tabulate \
   --m-input-file taxonomy20.qza \
   --o-visualization taxonomy20.qzv

qiime taxa barplot \
  --i-table table20.qza \
  --i-taxonomy taxonomy20.qza \
  --m-metadata-file 201912_hydractinia_metadatafile_Qiime2.tsv \
  --o-visualization taxa-bar-plots20.qzv

Taxonomy barplots as follows:
taxa-bar-plots20.qzv (2.2 MB)

I found lots of unassigned taxonomy…

Then as the tutorial said, I also train my own classfier

command as follows:

qiime tools import \
  --type 'FeatureData[Sequence]' \
  --input-path gg_13_8_otus/rep_set/99_otus.fasta \
  --output-path 99_otus.qza

qiime tools import \
  --type 'FeatureData[Taxonomy]' \
  --input-format HeaderlessTSVTaxonomyFormat \
  --input-path gg_13_8_otus/taxonomy/99_otu_taxonomy.txt \
  --output-path ref-taxonomy.qza

qiime feature-classifier extract-reads \
  --i-sequences 99_otus.qza \
  --p-f-primer ACTCCTACGGGAGGCAGCAG \
  --p-r-primer GGACTACHVGGGTWTCTAAT \
  --o-reads ref-seqs.qza

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads ref-seqs.qza \
  --i-reference-taxonomy ref-taxonomy.qza \
  --o-classifier classifier.qza

qiime feature-classifier classify-sklearn \
  --i-classifier training-feature-classifiers/classifier.qza \
  --i-reads rep-seqs20.qza \
  --o-classification taxonomy20_new.qza

qiime metadata tabulate \
  --m-input-file taxonomy20_new.qza \
  --o-visualization taxonomy20_new.qzv

qiime taxa barplot \
  --i-table table20.qza \
  --i-taxonomy taxonomy20_new.qza \
  --m-metadata-file 201912_hydractinia_metadatafile_Qiime2.tsv \
  --o-visualization taxa-bar-plots20_new.qzv

Results as follows:
taxa-bar-plots20_new.qzv (2.4 MB)

They seem to have very different results…

It makes me confused, as I couldn’t solve the problem of low percentage of merge as well as chimeria. I don’t know if the downstream results are confident or not. Also, the big difference between taxonomy classifier, is that normal?
Also the sample6 has very high non-chimeric level, but in the taxonomy level, most of this sample is only assigned to one genus. I thought high quality reads would be assigned to more information?

Sorry to bother you again…

Best,
Lu

1 Like

Hello @Lu_Zhang,

You are making great progress! I appreciate your detailed posts. Let’s dive in! :swimming_man:

So we have lowered the percent chimeric (great!). In your first post you got merged percentages around 80% and 90%. Maybe you could run those settings again, now with the --p-min-fold-parent-over-abundance 8 setting.

It looks like that helped. I think you could safely use any one of these inputs for your downstream analysis. :+1:

I believe you can use those tables for downstream analysis just fine. Let’s see what other forum users agree!


Yes. Training your own classifier to matches your primers should give the best results, and I think that’s what we see here. Proteobacteria remain the 2nd most common microbe, and the unclassified k__bacteria;p__ now become p_Tenericutes. This is great news!

It’s possible that this sample was contaminated, or maybe there was just a lot of p_Tenericutes inside of it.


I think you are doing great and are getting very good results. It sounds like you are still worried if you are on the right path, and I think this care is good. I’m glad you are thinking about this.

If you want to make sure your methods work, the best method is to use positive control samples with known compositions. Did you include any positive control samples in your study?

Colin

2 Likes