Hello!
I am analyzing a microbiome with data I downloaded from NCBI. Since this is my first time doing this kind of analysis, I compared my results against the results from last year, but they are completely different.
To figure out what went wrong, I went through the exact same process in the same location, with the exact same data and the exact same commands I used last year. Everything was identical up through import and cutadapt, but after denoising I got completely different results:
Last year, there were 60 sequences left in the representative sequences (repseq) after running dada2, but this time, despite the exact same run, there were only 12 sequences left, and I have no idea why.
To summarize,
I used the exact same data, the exact same method, and even the exact same location.
Are you able to post some of the output .qza files? The OUT_stats.qza files from both these runs would be helpful. I understand that some information is sensitive, but the provenance inside the .qza files gives us the information we need to investigate further.
It looks like the outputs are different artifact types; one is a FeatureTable[Frequency] and the other is a SampleData[DADA2Stats] file.
Their inputs are different too...
I'm thinking about the best way to double-check this claim of a "totally same procedure".
Okay, here's what I think would be most helpful.
On the dada2 stats outputs from both runs, use qiime metadata tabulate to make that table of stats. (You included a picture, so you may have this already!)
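For example, something like this (the filenames here are just placeholders for your two OUT_stats.qza files):

```
# Turn each DADA2 stats artifact into a viewable table.
# OUT_stats_lastyear.qza / OUT_stats_thisyear.qza are placeholder names.
qiime metadata tabulate \
  --m-input-file OUT_stats_lastyear.qza \
  --o-visualization OUT_stats_lastyear.qzv

qiime metadata tabulate \
  --m-input-file OUT_stats_thisyear.qza \
  --o-visualization OUT_stats_thisyear.qzv
```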
Here are both the files and pictures; the upper one is from my current run, and the other is from last year.
The inputs are the same, but I renamed them differently to avoid confusion.
One cool feature of QIIME 2 is that the files it saves (.qza and .qzv) include a detailed log of how everything in them was made.
To view this history, I uploaded the files to https://view.qiime2.org, clicked 'Provenance', and selected an artifact from the flow chart.
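(Side note: a .qza/.qzv is just a zip archive, so if you prefer the shell, you can also list the provenance files directly; the filename below is a placeholder.)

```
# Provenance lives under <uuid>/provenance/ inside the archive.
unzip -l OUT_stats_thisyear.qza | grep provenance | head
```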
For example, we can check this:
I've opened both files in QIIME 2 View and compared the input data side by side. I can see that the same 6 fastq files were used, because their names and md5 hashes are identical. Good start!
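(If you ever want to double-check those hashes outside of the provenance view, you can hash your local copies; a minimal sketch, assuming the fastq files are in the current directory:)

```
# Compare these against the md5sums recorded in provenance.
md5sum *.fastq.gz
```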
Those truncation settings differ by only 5 base pairs, so I'm surprised that would cause almost all the reads to be filtered out.
(I don't think there's a problem with cutadapt, because the same settings were used and we can see the same number of input reads to DADA2.)
Here's my best guess: maybe the reads after running cutadapt are exactly 180 bp long, and setting trunc_len_r: 182 drops every read shorter than that during the filtering step? (DADA2 discards any read shorter than the truncation length, so 180 bp reads would all be thrown out.)
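If that's what happened, re-running denoising with the reverse truncation length lowered to 180 should confirm it. A rough sketch (the input filename, the forward truncation value, and the output names are all placeholders; the point is only --p-trunc-len-r 180):

```
# Re-run DADA2 with trunc-len-r lowered so 180 bp reads survive filtering.
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs trimmed-seqs.qza \
  --p-trunc-len-f 180 \
  --p-trunc-len-r 180 \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats denoising-stats.qza
```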
Do the clues that I found make sense to you? I'm pretty confused myself...
OH!!! I finally got the same result!!!!!
I really appreciate your help!
I had no idea what provenance was... now I know.
I have one last question:
I want to truncate at the first position where the q-score drops below 20. If the .qzv shows that this position is 180, can I just write 180 for --p-trunc-len-f/r?
Or, if I download the .tsv file and see that positions start at 0, should I write 181?
If you want to truncate where the average q-score is around 20, you can look for that position in the quality plot and then use --p-trunc-len-f/r just like you described!
(Each base pair of each read has its own quality score. If you want to trim each read wherever its quality drops below 20, you can do that, but variable-length reads come with their own issues...)
I used to think a lot about q-scores before using DADA2. Now I just try to pick trim and trunc lengths so that as many of my reads as possible merge.
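For reference, the interactive quality plot you're reading comes from demux summarize, so you can always regenerate it on the post-cutadapt reads; a minimal sketch, assuming the trimmed reads live in trimmed-seqs.qza:

```
# Summarize the demultiplexed reads; the resulting .qzv contains the
# interactive quality plot used to pick trim/trunc lengths.
qiime demux summarize \
  --i-data trimmed-seqs.qza \
  --o-visualization trimmed-seqs.qzv
```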