Totally same procedure, totally different results

Hello!
I am analyzing microbiome data that I received from NCBI. This is my first time doing a microbiome analysis, so I compared my results with the results from last year, but they are completely different.

To figure out what went wrong, I went through the exact same process in the same location, with the exact same data, and with the exact same commands I used last year. Everything matched up through import and cutadapt, but after denoising I was getting completely different results:
Last year, there were 60 sequences left in the rep-seqs after running dada2, but this time, despite the exact same run, there were only 12 sequences left, and I had no idea why.

To summarize,

  • I used the exact same data, the exact same method, and even the exact same location.
  • I get completely different results.

Can you explain the possible reasons?

  • qiime2 version is 2023.5.1
  • The commands I used are:
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs IN.qza \
  --p-trunc-len-f 232 --p-trunc-len-r 230 \
  --o-table OUT_table.qza \
  --o-representative-sequences OUT_repseq.qza \
  --o-denoising-stats OUT_stats.qza

This is the one I had run before:

This is the one from my current run:

If you need any more information, please tell me.
EY

Hello @eysung,

Yes, we can help you investigate this.

Are you able to post some of the output .qza files? The OUT_stats.qza files from both these runs would be helpful. I understand that some information is sensitive, but the provenance inside the .qza files gives us the information we need to investigate further.

Hi @colinbrislawn
I really appreciate your reply!

This is the part I had done before:


and these are the entire stats from my current run:

I'll upload the two .qza files together, just in case.
Any other things you need, please tell me!
Thank you :slight_smile:

EY
A_V3_stats.qza (18.1 KB)
denoise-table_mockA2.qza (20.7 KB)

Thank you for posting these files!

Direct links to qiime2-view:

It looks like the outputs are different; one is a FeatureTable[Frequency] file and the other is a SampleData[DADA2Stats] file.

Their inputs are different too...

I'm thinking about the best way to double-check this:

Totally same procedure

Okay, here's what I think would be most helpful.
On the dada2 stats outputs from both runs, use qiime metadata tabulate to make that table of stats. (You included a picture, so you may have this already!)

Then upload that .qzv file.
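A sketch of that command, assuming your stats artifacts are named like the outputs in your first post (substitute your actual file names):

```shell
# Turn the DADA2 stats artifact into a viewable .qzv table.
# "OUT_stats" is an assumption taken from the command above.
qiime metadata tabulate \
  --m-input-file OUT_stats.qza \
  --o-visualization OUT_stats.qzv
```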

Thanks!

Sorry, I posted the wrong file :frowning:

For both the file and the picture, the upper one is from my current run, and the other one is from last year.
The inputs are the same, but I renamed them differently to avoid confusion.


Thank you!


Thanks for posting that!

One cool feature of QIIME 2 is that the files it saves (.qza and .qzv) include detailed logs of how everything in the file was made.

To view this history, I uploaded the files to https://view.qiime2.org, clicked 'Provenance', and selected an artifact from the flow chart.
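If you prefer the command line, you can read the same provenance records directly, because a .qza/.qzv file is just a zip archive (the file name below matches your upload; the exact inner paths may vary between QIIME 2 versions):

```shell
# A .qza is a zip archive; provenance lives under <uuid>/provenance/.
# action/action.yaml records the plugin, action, and every parameter
# used to create the artifact.
unzip -o A_V3_stats.qza -d A_V3_stats_extracted
cat A_V3_stats_extracted/*/provenance/action/action.yaml
```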

For example, we can check this:

I've opened both files with Qiime2 View and compared the input data side by side. I can see that the same 6 fastq files were used because their names and md5 hashes are identical. Good start!

I looked for differences in other steps and found slightly different settings when running DADA2.

Those truncation lengths are only 5 base pairs shorter, so I'm surprised that would cause almost all of the reads to be filtered out.

(I don't think there's a problem with cutadapt, because the same settings were used and we can see the same number of input reads to DADA2.)

Here's my best guess: maybe the reads after running cutadapt are exactly 180 bp long, and setting trunc_len_r: 182 drops everything shorter during the filtering step?
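One way to check that guess would be to pull the trimmed reads out of the cutadapt output and tally their lengths. A sketch, where the artifact and fastq file names are assumptions you'd adjust to your own data:

```shell
# Unpack the trimmed demux artifact, then histogram the reverse-read
# lengths; in FASTQ, every 4th line starting at line 2 is a sequence.
qiime tools extract \
  --input-path trimmed-seqs.qza \
  --output-path trimmed-extracted

zcat trimmed-extracted/*/data/*_R2_001.fastq.gz \
  | awk 'NR % 4 == 2 { print length($0) }' \
  | sort -n | uniq -c
```

If every reverse read comes back as 180 bp, then `trunc_len_r: 182` would indeed discard all of them during filtering.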

Do the clues that I found make sense to you? I'm pretty confused myself... :melting_face:


OH!!! I finally got the same result!!!!!
I really appreciate your help :face_holding_back_tears:
I had no idea what provenance was... Now I know!

I have one real last question:
I want to truncate at the first position where the q-score drops below 20. If the .qzv shows that this position is 180, can I just write 180 for --p-trunc-len-f/r?
Or since the .tsv file shows positions starting at 0, should I write 181?

Again, I really thank you for your advice!!!

EY

Okay cool! I'm glad we figured it out!

If you want to trim where the average q-score is around 20, you can look for that position in the quality plot and then use --p-trunc-len-f/r just like you described!

(Each base pair of each read has its own quality score. If you want to trim each read where quality < 20, you can do that, but variable-length reads come with their own issues...)

I used to think a lot about q-scores before using DADA2. Now I just try to select trim and trunc lengths so that a maximum number of my reads merge. :person_shrugging:
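On the 180-vs-181 question: as I understand it, --p-trunc-len is a length in base pairs rather than a zero-based index, so a value of 180 keeps the first 180 bases and discards any read shorter than that. A toy sketch of that rule (not DADA2's actual code):

```shell
# Toy model of the trunc-len rule: keep reads of at least L bases,
# truncated to exactly L; anything shorter is dropped entirely.
printf 'ACGTACGTAC\nACGTACG\n' \
  | awk -v L=8 'length($0) >= L { print substr($0, 1, L) }'
```

With L=8, only the 10-base read survives (cut to 8 bases); the 7-base read is dropped, which is exactly how a trunc-len slightly longer than your reads can wipe out a whole run.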
