Qiime dada2 denoise-paired not the same length

lindd · January 13, 2018, 7:12pm

Hi,

I found that the reads denoised by dada2 with a fixed truncation length do not have the sequence length I set. Here is how I imported and denoised the data:
Data import:
qiime tools import \
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path pe-33-manifest \
  --output-path paired-end-demux.qza \
  --source-format PairedEndFastqManifestPhred33

qiime demux summarize \
  --i-data paired-end-demux.qza \
  --o-visualization paired-end-demux.qzv

The quality plot looks like this:

I want to truncate the sequences to 220 bp with the following command:
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs paired-end-demux.qza \
  --p-trim-left-f 0 \
  --p-trunc-len-f 220 \
  --p-trim-left-r 0 \
  --p-trunc-len-r 220 \
  --o-representative-sequences ./seqlen220/rep-seqs-dada2.qza \
  --o-table ./seqlen220/table-dada2.qza

The reads saved in rep-seqs-dada2.qza are not all the length I wanted (220 bp). Some are even longer than the original read length (301 bp). Did I do something wrong or miss a step?

Thanks.

Best,

Dong

Mehrbod_Estaki · January 14, 2018, 5:45am

Hi @lindd,
It looks like everything is working as expected. Besides denoising and removing chimeras, dada2 also joins your paired ends together. So even though you truncated your forward and reverse reads to a fixed length separately, once they are joined the merged sequence is longer than either read alone (see my masterpiece drawing below). This is what you want! Of course, if your primer set is designed so that the two reads overlap 100%, then there might be an issue, but as it is, I'd say you are good to go to your next step!

10bp Forward ==========
10bp Reverse *****==========
Total = 15 bp

Edit: * = blank spaces. I had to add * since the forum formatting wouldn't let me have that many spaces in a row.
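
The arithmetic behind the drawing can be written out as a small sketch. Two 10 bp reads that merge into 15 bp imply a 5 bp overlap. The function name `merged_length` and the 20 bp overlap in the second call are assumptions chosen for illustration; the actual overlap depends on the amplicon length and is not given in this thread.

```python
def merged_length(forward_len: int, reverse_len: int, overlap: int) -> int:
    """Length of a merged read pair whose ends share `overlap` bases."""
    if overlap < 0 or overlap > min(forward_len, reverse_len):
        raise ValueError("overlap must be between 0 and the shorter read length")
    # Overlapping bases are counted once, not twice.
    return forward_len + reverse_len - overlap


# The drawing: two 10 bp reads sharing 5 bp.
print(merged_length(10, 10, 5))      # 15

# Two reads truncated to 220 bp with a hypothetical 20 bp overlap
# merge into a sequence longer than a single 301 bp raw read.
print(merged_length(220, 220, 20))   # 420
```

This is why representative sequences can be longer than both the truncation length and the original read length.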

lindd · January 14, 2018, 6:27am

Hi @Mehrbod_Estaki,

Many thanks for the explanation and the nice drawing :wink:. Following up on this, do you have any idea how to select the truncation length (e.g., 220) from the plot? Are there any criteria we can follow? In addition, in which step are the forward/reverse primer sequences removed from the reads? I don't remember entering those sequences in any command.

Thanks.

Best,

Dong

Mehrbod_Estaki · January 15, 2018, 5:43pm

Hi @lindd,
No problem!

On choosing the truncation length: this is a question I've been pondering for a while now myself. The short answer is no, there are no hard criteria for picking your truncation parameters, though there are some guidelines. See this explanation that may help you decide.
In general, you want to keep the reads as long as you can without letting in too many poor-quality bases. I've heard on this forum that keeping the median quality score above 20 is a good starting point. Luckily your reads look pretty good to me, so your current parameters are probably OK and long enough for an overlap to occur.
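
One way to apply the "median above 20" rule of thumb is sketched below. The per-position medians are invented for illustration; real values would be read off the quality plot in paired-end-demux.qzv. `suggest_trunc_len` and `min_median` are hypothetical names, not part of QIIME 2 or DADA2, and the result is only a starting point that still has to leave enough overlap for merging.

```python
def suggest_trunc_len(median_quality, min_median=20):
    """Return how many leading bases to keep before the median
    quality first drops below `min_median`."""
    for position, quality in enumerate(median_quality):
        if quality < min_median:
            # `position` is 0-based, so it equals the number of
            # bases that passed before this one.
            return position
    return len(median_quality)


# Made-up medians for a 7-base read, for illustration only.
example_medians = [38, 37, 36, 30, 25, 19, 18]
print(suggest_trunc_len(example_medians))  # 5
```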

On primer removal: I'm not entirely sure I understand what you mean by this; could you clarify? If you are referring to your barcodes and adapter sequences, you have to make sure you have removed those prior to DADA2. This is very important, otherwise those 'non-biological' sequences will force most of your reads to be discarded. If the barcodes are all the same length, you can use the --p-trim-left-f and --p-trim-left-r options to remove them easily.
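
To make the effect of the trimming and truncation parameters concrete, here is a minimal sketch of what they do to a single read, as I understand the DADA2 documentation: reads shorter than the truncation length are discarded, and the kept read ends up with length trunc_len minus trim_left. The function `trim_and_truncate` is a hypothetical stand-in, not DADA2's own code.

```python
def trim_and_truncate(read: str, trim_left: int, trunc_len: int):
    """Mimic trim-left / trunc-len on one read.

    Returns None when the read is too short to be truncated,
    which corresponds to the read being discarded.
    """
    if trunc_len and len(read) < trunc_len:
        return None
    kept = read[:trunc_len] if trunc_len else read
    # Remove the fixed-length prefix (for example a barcode).
    return kept[trim_left:]


# A 10-base toy read: keep the first 8 bases, then drop the first 2.
print(trim_and_truncate("ACGTACGTAC", trim_left=2, trunc_len=8))  # GTACGT
# A read shorter than trunc_len is dropped.
print(trim_and_truncate("ACGT", trim_left=0, trunc_len=8))        # None
```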

lindd · January 16, 2018, 12:33am

I read the reads from fastq files that had already been separated by sample in Illumina BaseSpace. Here is the mapping file ('mapinfo.csv') used to read those fastq files:
sample-id,absolute-filepath,direction
9_L001,$PWD/9_S75_L001_R1_001.fastq.gz,forward
9_L001,$PWD/9_S75_L001_R2_001.fastq.gz,reverse
...

Each fastq file looks like this:

@M02921:78:000000000-B5T5R:1:1101:13599:1282 1:N:0:CACCATCG+CTTGATTC
GGGATTAGATACCCCAGAACTGGAGG.....
+
CCCCCGGGGGGGFGFFEGGGGGGFG....
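
The header line of the record above can be split into its parts with a short sketch. It assumes the common Illumina layout in which the text after the space is read:filtered:control:index; depending on how the run was set up, the last field can hold something other than the index sequence, so treat the field names as assumptions. `parse_illumina_header` is a hypothetical helper.

```python
def parse_illumina_header(header: str) -> dict:
    """Split an Illumina-style FASTQ header into named fields."""
    identifier, _, comment = header.strip().lstrip("@").partition(" ")
    # Assumed comment layout: <read>:<is filtered>:<control number>:<index>
    read, is_filtered, control, index = comment.split(":", 3)
    return {
        "identifier": identifier,
        "read": int(read),
        "is_filtered": is_filtered == "Y",
        "control_number": int(control),
        # Dual indexes are written as INDEX1+INDEX2.
        "index": index.split("+"),
    }


header = "@M02921:78:000000000-B5T5R:1:1101:13599:1282 1:N:0:CACCATCG+CTTGATTC"
fields = parse_illumina_header(header)
print(fields["read"], fields["index"])
```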

So I think the barcode sequence is CACCATCG+CTTGATTC? It has already been extracted, as listed in the fastq header, right? These files are imported by the following command:

qiime tools import \
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path mapinfo \
  --output-path paired-end-demux.qza \
  --source-format PairedEndFastqManifestPhred33

The output is then passed to dada2 for denoising. Did I miss some steps, like removing the barcode and primer sequences?

Thanks.

Dong

Mehrbod_Estaki · January 16, 2018, 6:47pm

Hi @lindd,

Your input commands and formatting look fine to me! From the looks of it, I would say you are right: your facility has removed the barcode and adapter sequences and left the barcodes in the 1st line of each fastq record for record keeping. You would have to check with your facility whether that is actually the case, though, because as far as I know there is no requirement for that 1st line to include the barcodes once the data has been demultiplexed. That may just be how your facility sets it up. Whatever it refers to doesn't really affect your downstream analysis at this point anyway, since the reads are already demultiplexed. But it is good practice to know the nature of your reads before you start, to avoid having to backtrack.
If the samples were barcoded in-house, then you could also manually check whether the first few characters of line 2 match any of your barcodes, i.e. if you used 8 bp barcodes, check the first 8 characters.
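
The manual check described here can be automated with a few lines. This is only a sketch: `count_reads_starting_with_barcode` is a hypothetical helper, the barcodes are the two index sequences from the header shown earlier in the thread, and the second read is invented to show what a positive hit looks like.

```python
def count_reads_starting_with_barcode(reads, barcodes):
    """Count reads whose first bases exactly match any known barcode."""
    barcode_set = {barcode.upper() for barcode in barcodes}
    lengths = {len(barcode) for barcode in barcode_set}
    count = 0
    for read in reads:
        # Compare the read prefix of each barcode length against the set.
        if any(read[:n].upper() in barcode_set for n in lengths):
            count += 1
    return count


barcodes = ["CACCATCG", "CTTGATTC"]
reads = [
    "GGGATTAGATACCCCAGAACTGGAGG",   # start of the read shown in the thread
    "CACCATCGGGGATTAGATAC",         # invented read that still carries a barcode
]
print(count_reads_starting_with_barcode(reads, barcodes))  # 1
```

A count near zero across a whole file suggests the barcodes are already gone; a count near the total number of reads suggests they are still attached.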

One final check comes when you try to assign taxonomy to these sequence variants. If there are any barcodes and adapters left then you'll know for sure as they will fail to be assigned to any meaningful taxon due to those non-biological sequences.

lindd · January 16, 2018, 7:20pm

Hi @Mehrbod_Estaki,
The barcode sequences appear to have been removed, since I cannot find them in any of the reads. The same goes for the forward primer (GTGCCAGCMGCCGCGGTAA) and the reverse primer (CGACRRCCATGCANCACCT). So can I assume that all non-biological sequences (barcode, primer, and adapter sequences) have been removed from the reads in the fastq files?
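
Because both primers contain degenerate bases (M, R, N), a plain text search can miss real matches. A minimal sketch of a primer check that expands the IUPAC codes into a regular expression follows. The helper names are hypothetical, and the first two reads are invented to show a match; only the third is the start of the read shown in the thread.

```python
import re

# IUPAC nucleotide codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def primer_regex(primer: str):
    """Compile a degenerate primer into a regular expression."""
    return re.compile("".join(IUPAC[base] for base in primer.upper()))


def fraction_starting_with(reads, primer):
    """Fraction of reads that begin with the (degenerate) primer."""
    if not reads:
        return 0.0
    pattern = primer_regex(primer)
    hits = sum(1 for read in reads if pattern.match(read.upper()))
    return hits / len(reads)


forward_primer = "GTGCCAGCMGCCGCGGTAA"
reads = [
    "GTGCCAGCAGCCGCGGTAATACG",      # invented, primer present (M = A)
    "GTGCCAGCCGCCGCGGTAAGGG",       # invented, primer present (M = C)
    "GGGATTAGATACCCCAGAACTGGAGG",   # start of the read shown in the thread
]
print(fraction_starting_with(reads, forward_primer))
```

The same function can be run on the reverse reads with the reverse primer. A fraction close to zero is consistent with the primers having been removed already.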

One last question: these sequences are from mouse tissues. Do I need to do any special preprocessing besides the steps in the 'Moving Pictures' tutorial?

Thanks a lot for your patience.

Best,

Dong

Mehrbod_Estaki · January 16, 2018, 8:20pm

It sure looks that way to me!

Nope! The source of the samples doesn't matter here since you have amplified a target (I'm guessing bacteria in your case) and not anything else related to the mouse.

I'd say you're ready for the next step! Have fun!

lindd · January 16, 2018, 8:32pm

That sounds awesome. Thanks @Mehrbod_Estaki.

Dong
