Deblur analysis of merged sequences

Hi, community,

I have merged fastq files from FLASH, and I want to use Deblur for the analysis, because DADA2 does not support analysis of already-merged sequences.
I used the commands below.
I put all the merged fastq files in a folder named fj-joined, and made a manifest as below.
sample-id,absolute-filepath,direction
101C1,$PWD/test/101C1r.fastq.gz,forward
101C2,$PWD/test/101C2.fastq.gz,forward
101C3,$PWD/test/101C3.fastq.gz,forward

Then I used the newly released import command.

qiime tools import \
  --input-path fj-joined/manifest \
  --output-path fj-joined-demux.qza \
  --type 'SampleData[JoinedSequencesWithQuality]' \
  --source-format SingleEndFastqManifestPhred33
qiime demux summarize \
  --i-data fj-joined-demux.qza \
  --o-visualization fj-joined-demux.qzv
As a result, my reads are shown as follows:

Then I ran the following:
mv fj-joined-demux.qza demux-joined.qza

qiime quality-filter q-score-joined \
  --i-demux demux-joined.qza \
  --o-filtered-sequences demux-joined-filtered.qza \
  --o-filter-stats demux-joined-filter-stats.qza

qiime deblur denoise-16S \
  --i-demultiplexed-seqs demux-joined-filtered.qza \
  --p-trim-length 300 \
  --o-representative-sequences rep-seqs.qza \
  --o-table table.qza \
  --p-sample-stats \
  --o-stats deblur-stats.qza \
  --p-jobs-to-start 10
Finally, I only got the following sequence numbers, which look similar to the results of the earlier, incorrect procedure I ran with DADA2, which treated the merged sequences as single-end sequences.

I do not know what is wrong with my procedure.
Thanks in advance.

Hi @Lu_Yang,
Sorry to hear QIIME2 has been giving you trouble! This workflow looks fine to me (and it sounds
like you are following this tutorial) — but @wasade may be able to give better guidance on deblur inputs.

Is this the full file? What is the total sequence count vs. the total number of demultiplexed sequences? Sharing your fj-joined-demux.qzv, deblur-stats.qzv, and table.qzv could help us assess this issue more fully.

Why not take advantage of the read merging performed in dada2/deblur within QIIME 2? It does not sound like FLASH performs any additional quality checks, but I wonder if it is somehow interfering (e.g., the alignments are not quite right, causing sequences to look like chimeras?). I'm not asking you to defend your approach (each user has their own unique needs before sending their data into QIIME 2), nor am I trying to push a particular approach, but it might be worth comparing your current workflow against importing non-merged reads and processing with dada2 or deblur (either as PairedEndSequencesWithQuality, or as JoinedSequencesWithQuality in deblur following the tutorial linked above), just to make sure that FLASH isn't causing some sort of incompatibility. It's a "sanity check" worth exploring.
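For instance, a paired-end import would be a sketch along these lines (the manifest entries and file names below are placeholders, not your actual files):

sample-id,absolute-filepath,direction
101C1,$PWD/raw/101C1_R1.fastq.gz,forward
101C1,$PWD/raw/101C1_R2.fastq.gz,reverse

qiime tools import \
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path manifest-paired.csv \
  --output-path demux-paired.qza \
  --source-format PairedEndFastqManifestPhred33

The resulting artifact could then go into dada2 paired-end denoising, or into read joining followed by deblur as in the tutorial.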

The final possibility is just that the sequences are very noisy and deblur is working as intended. Are they 16S sequences? What types of samples? @wasade may have more insight on what could be causing large numbers of sequences to be filtered.

(You could also try OTU picking in QIIME 2 for comparison; this does not perform any denoising and so should deliver a higher yield of seqs/sample, but you must then ask yourself: do I trust these data? Are there peculiarities about your data (e.g., a novel marker gene) that lead you to believe that denoising methods are not optimal for your special use case?)
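If you do try the OTU-picking comparison, a rough sketch using the demux-joined-filtered.qza artifact from above (the 97% identity threshold and output names here are just illustrative) could be:

qiime vsearch dereplicate-sequences \
  --i-sequences demux-joined-filtered.qza \
  --o-dereplicated-table table-derep.qza \
  --o-dereplicated-sequences rep-seqs-derep.qza

qiime vsearch cluster-features-de-novo \
  --i-sequences rep-seqs-derep.qza \
  --i-table table-derep.qza \
  --p-perc-identity 0.97 \
  --o-clustered-table table-dn-97.qza \
  --o-clustered-sequences rep-seqs-dn-97.qza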

Thanks for your patience! :sun_behind_small_cloud:

Thanks, @Nicholas_Bokulich. One comment: Deblur does not explicitly join reads, but it is agnostic to whether reads were joined upstream.

@Lu_Yang, is it possible the trim length is too long, causing many reads to drop? Just to verify, are these Illumina data? The quality scores seem quite high along the length of the read, so I just want to make sure; Deblur is not defined for non-Illumina data.

Best,
Daniel


Hi, @wasade and @Nicholas_Bokulich,

Thanks for your reply. My data are Illumina 16S data generated with the 515F-806R primers.
The following are the results from the other qzv files produced during processing.



The results in deblur-stats look strange.
Thanks.

Hi @Lu_Yang,
The deblur stats indicate that 96-99% of the reads in each sample do not pass the min-size threshold (see the fraction-artifact-with-minsize column, third from left). The deblur help docs describe this parameter:

  --p-min-size INTEGER            [default: 2]
                                  In each sample, discard all
                                  features with an abundance less than
                                  min_size.

So, in other words, almost every sequence is observed fewer than 2 times in each sample (@wasade, please correct me if I'm misinterpreting this). This would indicate very noisy sequences, and I suspect that FLASH or other upstream processing steps may be generating them. I would recommend following one of the alternative steps I suggested above as a sanity check (i.e., is it the data, or is it the upstream processing, that is causing problems?).
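If you want to confirm that the min-size filter is what is removing your reads, one diagnostic-only option (not something I would use for a final analysis, since it keeps singleton noise) is to rerun Deblur with that filter relaxed; the output names below are just placeholders so you don't overwrite your earlier results:

qiime deblur denoise-16S \
  --i-demultiplexed-seqs demux-joined-filtered.qza \
  --p-trim-length 300 \
  --p-min-size 1 \
  --p-sample-stats \
  --p-jobs-to-start 10 \
  --o-representative-sequences rep-seqs-minsize1.qza \
  --o-table table-minsize1.qza \
  --o-stats deblur-stats-minsize1.qza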

I hope that helps!


Hi, @Nicholas_Bokulich,

Thanks for the info. Honestly, I cannot import my raw sequences into QIIME2.
Here is my sequencing situation (the sequencing was done by a company): I received non-demultiplexed R1 and R2 files, plus a txt file with the barcode for each sample. The barcodes occur randomly in either R1 or R2, and likewise the forward primer occurs randomly in either R1 or R2, as does the reverse primer.
They told me there is no way to obtain separate demultiplexed forward and reverse sequences; the only thing I can do is merge the sequences and then demultiplex.
They also gave me the already merged, demultiplexed sequences for each sample (merged by FLASH).
So I used 3 of the demultiplexed samples to try Deblur before running all my samples.
Honestly, I have rarely seen this kind of R1 and R2. Is there a way for me to deal with this kind of data?
Thanks.

Hi, @wasade,

I have been thinking about this again and again. May I ask one more question?
In the command
qiime deblur denoise-16S \
  --i-demultiplexed-seqs demux-joined-filtered.qza \
  --p-trim-length 300 \
  --o-representative-sequences rep-seqs.qza \
  --o-table table.qza \
  --p-sample-stats \
  --o-stats deblur-stats.qza \
  --p-jobs-to-start 10
does '--p-trim-length 300' mean truncate the reads at 300 nt, so that a read longer than 300 nt will be dropped, while all sequences shorter than or equal to 300 nt will be kept? Is that the case? I chose the truncation point here based on where most of my sequences fall, but honestly, the 515F-806R primers should give sequences around 250 nt long.
So do you have other suggestions about my truncation? Which number would be safer?

Enclosed is my demux.qzv file.
fj-joined-demux.qzv (283.1 KB)

Looking forward to your suggestions.

Hi @Lu_Yang, my understanding of Deblur is that it needs all the sequences to be exactly the same length for it to work.

So, I think when you choose the trim length 300, all the reads longer than 300bp will be trimmed so that they are 300bp long, and any reads shorter than 300bp will be discarded (kind of the opposite of what you described in your last post).

That would explain why you’re not getting many sequences back out of Deblur, because most of your reads are shorter than the trim length, so they’re not being used. I think this is what @wasade meant when he asked if the trim length was too long and causing reads to drop.

Your quality scores look really good throughout, so if your minimum sequence length is 249, maybe you could try setting your trim length to around 249 and then you should retain a lot more sequences?
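For example (just a sketch; the only change from your earlier command is the trim length, and the output names are placeholders so you don't overwrite your previous results):

qiime deblur denoise-16S \
  --i-demultiplexed-seqs demux-joined-filtered.qza \
  --p-trim-length 249 \
  --p-sample-stats \
  --p-jobs-to-start 10 \
  --o-representative-sequences rep-seqs-249.qza \
  --o-table table-249.qza \
  --o-stats deblur-stats-249.qza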


Hi, @Matilda_H-D,

Your suggestions are really helpful!
But another problem pops up. I have tried setting --p-trim-length 100 and also --p-trim-length 200, and they gave me different results, shown below. My input joined reads have all been merged with relatively high quality (setting aside for now whether FLASH merged them well or not). May I get more information from you? How should I choose the most appropriate number?
Thanks in advance.


I’m just a user trying to work this stuff out too, so you might want to wait and hear what the developers have to say as well. :slight_smile: But here are my thoughts.

As far as I know, the current advice for selecting a trim length is pretty much that it depends on the individual case and that there are a number of factors to balance – no perfect answer! Some of the things we have to think about:

  1. your range of sequence lengths – you don’t want to lose too many shorter sequences by setting the trim length too high, but you also don’t want to cut out a lot of information from your longer reads by setting the trim length too low
  2. your quality scores along the read – often the quality scores will drop towards the end of a read and you will want to trim off the low-quality bases, again without losing too much information (it doesn’t look like this will be a problem with your dataset, though, as the quality scores stay really high)

For these reasons, you hope that your dataset will consist of reads with a similar length and similar quality scores along the whole read, as it makes trimming easy. If not, you have to choose a balance.

If you have time, you can always test out a couple of different trimming lengths and see whether it changes your results.
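If it helps, here is a rough sketch of one way to run that comparison (the trim lengths and output names are just illustrative, and --p-jobs-to-start should match your machine):

for LEN in 100 150 200 249; do
  qiime deblur denoise-16S \
    --i-demultiplexed-seqs demux-joined-filtered.qza \
    --p-trim-length ${LEN} \
    --p-sample-stats \
    --p-jobs-to-start 10 \
    --o-representative-sequences rep-seqs-${LEN}.qza \
    --o-table table-${LEN}.qza \
    --o-stats deblur-stats-${LEN}.qza
  qiime deblur visualize-stats \
    --i-deblur-stats deblur-stats-${LEN}.qza \
    --o-visualization deblur-stats-${LEN}.qzv
done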

Having said that, I don’t see really huge differences between your results using a trim length of 100 and 200 – the unique 16S reads identified (unique-reads-hit-reference) are pretty similar. Can you please clarify which trim length is which for the results that you posted?

Also, the fraction of artifactual reads has dropped a lot with the new trimming strategy, so that’s good!


Hi, @Matilda_H-D,
Thanks for your detailed explanation. It gives me a better understanding. Let's wait for @wasade to give us a more definitive answer.

In the pictures above, the first is the 100 trim, while the second is the 200 trim.
Again thanks so much for your help.

Hi @Lu_Yang, reads greater than the trim length will be trimmed to the trim length. Reads shorter than the trim length will be dropped. Given how noisy the data seem, it may make sense to try a trim of, say, 100 nt first to see if the results make more sense, and if they do, then try a longer trim.

Best,
Daniel

@Matilda_H-D’s responses were great! I hadn't read through the full thread before I posted my last response; I apologize. I agree that the results seem to make more sense with the shorter trim.

Best,
Daniel

Hi, @wasade,
Great! Thanks so much for your confirmation. Also, great thanks to @Matilda_H-D. I will continue working on it. :blush:
