different feature counts after denoising

Melissa_Soh · March 24, 2021, 5:16am

Hello all,

I am studying fish-associated microbes and did three sequencing runs. The sequencing runs contained the following samples:

sequencing run 1: wild fish
sequencing run 2: wild fish and salmon
sequencing run 3: wild fish and salmon

In other words:

Salmon samples were run in sequencing runs 2 and 3
Wild fish samples were run in sequencing runs 1, 2 and 3

This is what I did:

import
use dada2 to denoise each sequencing run individually
use summarise to visualise the table.qzv

I did this for:

salmon-only data (I selected only the salmon data from sequencing runs 2 and 3)
salmon and wild fish data (I selected all the salmon and wild fish data from all three runs)

I used QIIME 2 View to look at table.qzv for both the salmon-only and the salmon+wildfish data. In the “Interactive Sample Detail” tab, I noticed that individual salmon samples had a different feature count in the two datasets. For example, for Salmon1, the salmon-only data showed 5529 features, while the salmon+wildfish data showed 5472 features.

My question is, why does the feature count change when the total number of samples being denoised changes?

Thank you for your time.

thermokarst · March 24, 2021, 2:45pm

Hi @Melissa_Soh!

I'm not quite sure I follow - do you think you could share some commands, or perhaps a visualization or two? The provenance would be pretty helpful here to help show us the big picture of what you have done to this point.

My initial interpretation of this question is that you have the same sample sequenced in more than one sequencing run, and you want to know why the feature counts are different. There are a whole host of possible reasons: PCR bias, sequencing bias, different q2-dada2 trim/trunc parameters, variation in the sample aliquots, etc.

I hope to see some visualizations to help address this for you!

Melissa_Soh · March 25, 2021, 5:57am

Hello @thermokarst

For the salmon and wild fish dataset, this is what I did:

#import three sequencing runs
qsub -q batch HPC-1b.sh
qiime tools import \
  --type 'SampleData[SequencesWithQuality]' \
  --input-path M1_251bp_salmon_wild_manifest \
  --output-path single-end-demuxM1_251bp_salmon_wild.qza \
  --input-format SingleEndFastqManifestPhred33
qiime tools import \
  --type 'SampleData[SequencesWithQuality]' \
  --input-path M2_151bp_salmon_wild_manifest \
  --output-path single-end-demuxM2_151bp1_salmon_wild.qza \
  --input-format SingleEndFastqManifestPhred33
qiime tools import \
  --type 'SampleData[SequencesWithQuality]' \
  --input-path M3_151bp_salmon_wild_manifest \
  --output-path single-end-demuxM3_151bp_salmon_wild.qza \
  --input-format SingleEndFastqManifestPhred33

#denoised separately
qsub -q batch HPC-3.sh
qiime dada2 denoise-single \
  --i-demultiplexed-seqs single-end-demuxM1_251bp_salmon_wild.qza \
  --p-trim-left 12 \
  --p-trunc-len 150 \
  --p-n-threads 8 \
  --o-table tableM1_251bp_salmon_wild.qza \
  --o-representative-sequences rep-seqsM1_251bp_salmon_wild.qza \
  --o-denoising-stats denoising-statsM1_251bp_salmon_wild.qza
qiime dada2 denoise-single \
  --i-demultiplexed-seqs single-end-demuxM2_151bp1_salmon_wild.qza \
  --p-trim-left 12 \
  --p-trunc-len 150 \
  --p-n-threads 8 \
  --o-table tableM2_151bp1_salmon_wild.qza \
  --o-representative-sequences rep-seqsM2_151bp1_salmon_wild.qza \
  --o-denoising-stats denoising-statsM2_151bp1_salmon_wild.qza
qiime dada2 denoise-single \
  --i-demultiplexed-seqs single-end-demuxM3_151bp_salmon_wild.qza \
  --p-trim-left 12 \
  --p-trunc-len 150 \
  --p-n-threads 8 \
  --o-table tableM3_151bp_salmon_wild.qza \
  --o-representative-sequences rep-seqsM3_151bp_salmon_wild.qza \
  --o-denoising-stats denoising-statsM3_151bp_salmon_wild.qza

#visualise
qsub -q batch HPC-4.sh
qiime metadata tabulate \
  --m-input-file denoising-statsM1_251bp_salmon_wild.qza \
  --o-visualization denoising-statsM1_251bp_salmon_wild.qzv
qiime metadata tabulate \
  --m-input-file denoising-statsM2_151bp1_salmon_wild.qza \
  --o-visualization denoising-statsM2_151bp1_salmon_wild.qzv
qiime metadata tabulate \
  --m-input-file denoising-statsM3_151bp_salmon_wild.qza \
  --o-visualization denoising-statsM3_151bp_salmon_wild.qzv
qiime feature-table tabulate-seqs \
  --i-data rep-seqsM1_251bp_salmon_wild.qza \
  --o-visualization rep-seqsM1_251bp_salmon_wild.qzv
qiime feature-table tabulate-seqs \
  --i-data rep-seqsM2_151bp1_salmon_wild.qza \
  --o-visualization rep-seqsM2_151bp1_salmon_wild.qzv
qiime feature-table tabulate-seqs \
  --i-data rep-seqsM3_151bp_salmon_wild.qza \
  --o-visualization rep-seqsM3_151bp_salmon_wild.qzv
qiime feature-table summarize \
  --i-table tableM1_251bp_salmon_wild.qza \
  --o-visualization tableM1_251bp_salmon_wild.qzv \
  --m-sample-metadata-file salmon_wild_metadata.csv
qiime feature-table summarize \
  --i-table tableM2_151bp1_salmon_wild.qza \
  --o-visualization tableM2_151bp1_salmon_wild.qzv \
  --m-sample-metadata-file salmon_wild_metadata.csv
qiime feature-table summarize \
  --i-table tableM3_151bp_salmon_wild.qza \
  --o-visualization tableM3_151bp_salmon_wild.qzv \
  --m-sample-metadata-file salmon_wild_metadata.csv

For the salmon-only data, I did this:

#import
qsub -q batch HPC-1b.sh
qiime tools import \
  --type 'SampleData[SequencesWithQuality]' \
  --input-path M2_151bp1_salmononly_manifest \
  --output-path single-end-demuxM2_151bp1_salmononly.qza \
  --input-format SingleEndFastqManifestPhred33
qiime tools import \
  --type 'SampleData[SequencesWithQuality]' \
  --input-path M3_151bp_salmononly_manifest \
  --output-path single-end-demuxM3_151bp_salmononly.qza \
  --input-format SingleEndFastqManifestPhred33

#denoise separately
qsub -q batch HPC-3.sh
qiime dada2 denoise-single \
  --i-demultiplexed-seqs single-end-demuxM2_151bp1_salmononly.qza \
  --p-trim-left 12 \
  --p-trunc-len 150 \
  --p-n-threads 8 \
  --o-table tableM2_151bp1_salmononly.qza \
  --o-representative-sequences rep-seqsM2_151bp1_salmononly.qza \
  --o-denoising-stats denoising-statsM2_151bp1_salmononly.qza
qiime dada2 denoise-single \
  --i-demultiplexed-seqs single-end-demuxM3_151bp_salmononly.qza \
  --p-trim-left 12 \
  --p-trunc-len 150 \
  --p-n-threads 8 \
  --o-table tableM3_151bp_salmononly.qza \
  --o-representative-sequences rep-seqsM3_151bp_salmononly.qza \
  --o-denoising-stats denoising-statsM3_151bp_salmononly.qza

#visualise
qsub -q batch HPC-4.sh
qiime metadata tabulate \
  --m-input-file denoising-statsM2_151bp1_salmononly.qza \
  --o-visualization denoising-statsM2_151bp1_salmononly.qzv
qiime metadata tabulate \
  --m-input-file denoising-statsM3_151bp_salmononly.qza \
  --o-visualization denoising-statsM3_151bp_salmononly.qzv
qiime feature-table tabulate-seqs \
  --i-data rep-seqsM2_151bp1_salmononly.qza \
  --o-visualization rep-seqsM2_151bp1_salmononly.qzv
qiime feature-table tabulate-seqs \
  --i-data rep-seqsM3_151bp_salmononly.qza \
  --o-visualization rep-seqsM3_151bp_salmononly.qzv
qiime feature-table summarize \
  --i-table tableM2_151bp1_salmononly.qza \
  --o-visualization tableM2_151bp1_salmononly.qzv \
  --m-sample-metadata-file salmon_metadata.csv
qiime feature-table summarize \
  --i-table tableM3_151bp_salmononly.qza \
  --o-visualization tableM3_151bp_salmononly.qzv \
  --m-sample-metadata-file salmon_metadata.csv

When I looked at tableM3_151bp_salmononly.qzv and tableM3_151bp_salmon_wild.qzv, I saw that some samples have different feature counts in each table. For instance, for sample 111 (a salmon sample), there were 5529 features in tableM3_151bp_salmononly.qzv and 5472 features in tableM3_151bp_salmon_wild.qzv. I double-checked my metadata and confirmed that there is only one sample 111, so this is not a case of two samples being merged together.

My guess is that the feature count is affected by the number of samples denoised together.
I denoised run 3 by using the previously mentioned script:
qiime dada2 denoise-single \
  --i-demultiplexed-seqs single-end-demuxM3_151bp_salmon_wild.qza \
  --p-trim-left 12 \
  --p-trunc-len 150 \
  --p-n-threads 8 \
  --o-table tableM3_151bp_salmon_wild.qza \
  --o-representative-sequences rep-seqsM3_151bp_salmon_wild.qza \
  --o-denoising-stats denoising-statsM3_151bp_salmon_wild.qza

and

qiime dada2 denoise-single \
  --i-demultiplexed-seqs single-end-demuxM3_151bp_salmononly.qza \
  --p-trim-left 12 \
  --p-trunc-len 150 \
  --p-n-threads 8 \
  --o-table tableM3_151bp_salmononly.qza \
  --o-representative-sequences rep-seqsM3_151bp_salmononly.qza \
  --o-denoising-stats denoising-statsM3_151bp_salmononly.qza

I suspect that having more samples denoised together (i.e. denoising the salmon+wildfish data together instead of just the salmon data) affects the final feature count. Perhaps DADA2 removes more reads as noise when more samples are denoised together? This is just my suspicion. Is it possible?

Basically, my question is:
Why does adding wild samples to denoise together with salmon samples affect the individual salmon sample feature count (compared to when denoising the salmon samples only)?

I hope I managed to make myself understandable. Thank you for your time!

thermokarst · March 25, 2021, 3:03pm

I think you might be attributing the differences in feature counts to the number of samples in the run. While that is one factor, there are many others at play here.

You can't attribute that discrepancy to any one factor, in my opinion.

To be honest, it is pretty remarkable to me that the counts are as close as they are! I wouldn't have been surprised if one run turned up 5000 counts, and another 10000, for the same sample. There can be a lot of variation in these techniques, unfortunately.

@jwdebelius and @colinbrislawn - do you have anything you would like to add to this?

Melissa_Soh · March 26, 2021, 3:13am

Hello @thermokarst ,
Thank you for the reply.

I ran the same sequencing results in two different denoising steps (one with only the salmon samples and one with both the salmon samples and the wild fish samples). In other words, the salmon only data is a subset of the salmon and wild fish data. The salmon and wild fish data was denoised together and the salmon subset denoised in another script.

Can I still say that the differences in feature count for individual samples are expected?

Cheers,
Mel

thermokarst · March 26, 2021, 4:07am

Can you clarify what that means with respect to your initial post:

The initial post sounds like there were three separate sequencing runs, but this most recent post sounds like you might be working with just one sequencing run that you have subsampled?

With DADA2, it is usually best to run DADA2 once per sequencing run, which means that for multiple sequencing runs you would denoise each run separately and merge the results at the end. If you take samples out of a sequencing run prior to denoising, you can subtly change the error model that DADA2 produces, because DADA2 consumes some of the reads as part of the error model tuning step. You won't usually see a drastic change when removing a few samples, but it is still usually considered best practice to denoise the entire run at once, if possible: more data usually means that DADA2 will perform better. I recommend reading the DADA2 paper and docs to familiarize yourself with the tool and the algorithm (if you haven't already!): DADA2: Fast and accurate sample inference from amplicon data with single-nucleotide resolution
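
A minimal sketch of the "denoise each run separately, then merge" step, assuming the per-run tables and representative sequences are named as in the commands earlier in this thread. The merged output names and the small `run` dry-run helper are hypothetical, added only for illustration. It also assumes that every run was denoised with the same trim/trunc settings (as above) and that sample IDs are unique across runs.

```shell
# Hypothetical helper: prints the command by default; set RUN=1 to execute it.
run() {
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  else
    printf '%s\n' "$*"
  fi
}

merge_runs() {
  # Combine the three per-run feature tables into a single table.
  run qiime feature-table merge \
    --i-tables tableM1_251bp_salmon_wild.qza \
    --i-tables tableM2_151bp1_salmon_wild.qza \
    --i-tables tableM3_151bp_salmon_wild.qza \
    --o-merged-table table_merged_salmon_wild.qza

  # Combine the matching representative sequences.
  run qiime feature-table merge-seqs \
    --i-data rep-seqsM1_251bp_salmon_wild.qza \
    --i-data rep-seqsM2_151bp1_salmon_wild.qza \
    --i-data rep-seqsM3_151bp_salmon_wild.qza \
    --o-merged-data rep-seqs_merged_salmon_wild.qza
}

merge_runs
```

Run as written, this only prints the two commands so they can be checked first; setting `RUN=1` in an environment where QIIME 2 is active would execute them.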

As for whether the differences are still expected: I think so!

Melissa_Soh · March 26, 2021, 6:04am

I have three different sequencing runs:

sequencing run 1: wild fish
sequencing run 2: wild fish and salmon
sequencing run 3: wild fish and salmon

I denoised the wild fish and salmon data as such:

1 denoising step for sequencing run 1: wild fish
1 denoising step for sequencing run 2: wild fish and salmon
1 denoising step for sequencing run 3: wild fish and salmon
merge all three denoising steps together

I denoised the salmon data as such:

1 denoising step for sequencing run 2: salmon (subset of whole data set)
1 denoising step for sequencing run 3: salmon (subset of whole data set)
merge both denoising steps together

I noticed differences in the feature counts of individual salmon samples when comparing the results of
"1 denoising step for sequencing run 3: wild fish and salmon" and
"1 denoising step for sequencing run 3: salmon (subset of whole data set)".

I think this part answers my question. The presence or absence of the wild samples from sequencing run 3 in the two denoising steps resulted in DADA2 producing a different error model, and hence gave a different feature count for the individual salmon samples. Please correct me if I am wrong.

Does this mean that it is not right to subset my data before denoising? If I am only interested in the salmon data for one part of my analysis, should I denoise the whole sequencing run (1 denoising step per sequencing run) and then remove the non-salmon samples? Or is it okay to do it my current way, which is to remove the non-salmon samples before denoising the sequencing runs (1 denoising step per sequencing run)?

Thank you for your patience!

thermokarst · March 26, 2021, 2:32pm

Perfect, thanks for clearing that up - that workflow makes sense - thanks for bearing with me!

Yes indeed, that interpretation is correct!

Not necessarily "not right"; it's just usually best practice to use as much data as possible in this step. Personally, I would denoise the whole sequencing run (or your portion of it, if you shared the run with another user), and then filter the feature table downstream to just the samples you care about.
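
A minimal sketch of that downstream filtering, assuming `salmon_metadata.csv` (the salmon-only metadata file used earlier in this thread) lists only the salmon sample IDs and that those IDs match the ones in the table. When `--p-where` is not given, `qiime feature-table filter-samples` keeps the samples that appear in the supplied metadata. The `run` dry-run helper, the `filter_to_salmon` function, and the output file name are hypothetical.

```shell
# Hypothetical helper: prints the command by default; set RUN=1 to execute it.
run() {
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  else
    printf '%s\n' "$*"
  fi
}

# Keep only the samples listed in salmon_metadata.csv.
# $1 = table denoised from the whole run, $2 = filtered output table.
filter_to_salmon() {
  run qiime feature-table filter-samples \
    --i-table "$1" \
    --m-metadata-file salmon_metadata.csv \
    --o-filtered-table "$2"
}

filter_to_salmon tableM3_151bp_salmon_wild.qza tableM3_151bp_salmon_filtered.qza
```

The same function could be pointed at the run 2 table, or at a table merged across runs, so that the salmon subset always comes from tables denoised on the full runs.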

Thank you!
