So I've played around with Qiime2 for a little bit, but I'm still very new to doing anything with sequence data.
I'm running q2 2021.11 on VM virtualbox on my laptop.
To the problem:
I have paired-end demultiplexed sequences from illumina sequencing run, where 27F and 519R 16S rDNA primers were used. I used the code from importing data Casava 1.8 paired-end demultiplexed fastq. I get my demux.qza file and I do the summary, where I see this:
Why are my sequences suddenly only 301nts? I'm expecting them to be a bit longer, atleast in the 400-s. Is this what happens during importing and generating the demux file?
(I also run into an issue at taxonomy stage, where I have very large portion of unassigned, which makes me come back to the very beginning and thinking this might be one of the issues..)
I also used Mothur through Galaxy.org to see if there are slight differences, and where I combine the forward and reverse reads I get a summary like this:
Start End NBases Ambigs Polymer NumSeqs
Minimum:1 293 293 0 3 1
2.5%-tile: 1 391 391 0 4 248117
25%-tile: 1 490 490 0 5 2481164
Median: 1 498 498 0 5 4962328
75%-tile: 1 514 514 0 5 7443491
97.5%-tile:1 588 588 13 18 9676538
Maximum:1 602 602 157 300 9924654
Mean: 1 501.061 501.061 1.18359 6.20446
of Seqs: 9924654
So, the number of sequences is the same. However, the length is what worries me.
If I now go on with the 301 truncation in dada2, I would lose a considerable amount of data in my mind.
The denoising I would run next:
qiime dada2 denoise-paired
--i-demultiplexed-seqs demux.qza
--p-trim-left-f 6
--p-trim-left-r 6
--p-trunc-len-f 301
--p-trunc-len-r 301
--o-table table.qza
--o-representative-sequences rep-seqs.qza
--o-denoising-stats denoising-stats.qza
Should I worry about the 301nt length? I understand the quality score interactive blot is a subsample of all of my samples, but shouldn't some of my reads at least go past 301? Perhaps the original data is at fault?
I'm still learning and trying to find different solutions to the problems I encounter using Q2, but this is something I can't seem to figure out.
Cheers,
El
Can you post all of the commands that you ran? That does seem a bit short. A Casava import should not result in a change in the length of the sequences. Based on what I know right now, I would say that your sequencing center may have already done pre-processing to your sequences. It is technically true that you will lose some data truncating it. However, keeping all of the sequences long is less important than having data that will denoise well without dropping more sequences than necessary because of poor quality in later positions, meaning that I think you might need to truncate the reads even more. DADA 2 can end up dropping reads if the average quality score gets too low, even if the beginning of the read would have been usable. A general practice that seems to work well for many applications is to truncate to the base position at which the bottom quartile(visually, the bottom of the box at that position) score falls below 30.
Thanks for the explanation!
I see about the truncation.
My data is microbial community data at specific timepoints over time, if this helps in any way.
In the beginning there's less biomass and at the end of the trial there's more biomass. I'm really interested to see how the composition changes over time. That's why I'm even worried about the length in the firsts steps.
The raw data that I have downloaded from the sequencing center was labelled like this:
T0-A-1_S15_L001_R1_001.fastq.qz
T0-A-1_S15_L001_R2_001.fastq.qz
etc.
The codes I ran were:
qiime tools import
--type 'SampleData[PairedEndSequencesWithQuality]'
--input-path pHtrial-sequences-RF
--input-format CasavaOneEightSingleLanePerSampleDirFmt
--output-path 5aug/demux.qza
Truncating will cut any retained reads to the length specified and exclude any that are shorter.
For a more in-depth exploration of the denoising process check out this lecture and hands on tutorial from a workshop we put on in January.
With the primers you are using, it appears you are targeting a region that is 454 bp long, since each read is truncated to 301 bps, you are left with an overlap of ~74 bps (301 - 454/2) , which is plenty to join the read pairs together.
Thank you, @Keegan-Evans! Your help has put me on the right track.
I have taken a look at the lecture and tutorial.
I have decided with the following:
qiime dada2 denoise-paired
--i-demultiplexed-seqs demux.qza
--p-trunc-len-f 296
--p-trunc-len-r 250
--o-table table.qza
--o-representative-sequences rep-seqs.qza
--o-denoising-stats denoising-stats.qza
I got a pretty decent looking taxonomy bar plots, which I'm happy with and what I kind of expected to see. Previously, it has failed (a lot of unassigned), which I believe was because of my mistakes with the truncation steps.