Figuring out why sequences are lost in Deblur

To Whom It May Concern,

I just wanted to double-check that I am running Deblur correctly, and to ask how to interpret the stats output so I can figure out why I lost a good chunk of my sequences.

I am using the 515-806 primers from the EMP protocol. I trimmed the primers and truncated reads to a maximum length of 250 using Trim Galore and FASTX, then moved into QIIME 2.

I ran the quality-filtering command as follows:

qiime quality-filter q-score-joined \
  --i-demux reads_trimmed_joined.qza \
  --o-filtered-sequences reads_trimmed_joined_filtered.qza \
  --o-stats reads_trimmed_joined_filter_stats.qza

Sequences went from 14,560,145 to 14,556,787 with that run (visualization attached). Based on that visualization, which reported a minimum read length of 243, I chose a trim length of 242. Then I ran the following command for Deblur:

qiime deblur denoise-16S \
  --i-demultiplexed-seqs reads_trimmed_joined_filtered.qza \
  --p-trim-length 242 \
  --p-sample-stats \
  --p-jobs-to-start \
  --p-min-reads 1 \
  --o-table deblur_table.qza \
  --o-representative-sequences deblur_rep_seqs.qza \
  --o-stats deblur_stats.qza

I went from 14,556,787 to 5,713,447 sequences (table visualization too big to attach). I am unsure how to interpret the Deblur stats visualization (attached) to find out why my sequences were dropped. I would appreciate any assistance in interpreting it, as well as any indication of whether I did something wrong to lose so many sequences.

Thank you for your time and help.

Attachments: deblur-stats.qzv (221.4 KB), reads_trimmed_joined_filtered.qzv (300.1 KB)

Sincerely,

David Bradshaw

Hi @David_Bradshaw,
In the deblur stats you can mouse over the column headings to see a description of that column.

In your case, it looks like the issue is that a large number of reads are being filtered out as potential artifact. See the “fraction-artifact-with-minsize” column.

You should re-examine your sequences and make sure all adapters are being trimmed. I wonder if Trim Galore and FASTX could also be somehow involved. Note that q2-cutadapt can be used for trimming primers and adapters inside QIIME 2.

I hope that helps!

Dear Dr. Bokulich,

Thank you for your reply. We attempted to use cutadapt before and it did not remove all of the primers, so we switched to Trim Galore to remove the number of bases corresponding to each primer instead of searching for the primer sequence itself. To my knowledge the sequencing center removes the barcodes and adapters before sending the data to us. Should I check anyway? If so, what cutadapt command would you suggest running? Thank you for your time and help.

Sincerely,

David Bradshaw

If you wanted to use dada2 instead of deblur, there is a trim parameter that will do this for you, making this considerably easier. Unfortunately, deblur does not (yet) have such a parameter.
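For reference, a minimal sketch of such a DADA2 call is below; the input file name, output directory, 19/20 nt trim values, and 250 nt truncation lengths are placeholders (chosen to mirror the primer-length clipping and read length mentioned in this thread), not confirmed settings:

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs reads_demux.qza \
  --p-trim-left-f 19 \
  --p-trim-left-r 20 \
  --p-trunc-len-f 250 \
  --p-trunc-len-r 250 \
  --output-dir dada2_output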

That is usually what happens, so it is probably okay. But I would recommend looking just in case. A large number of sequences is being filtered as artifact, so a quick sanity check may save backtracking later.

The easiest thing to do may just be to look at your sequences before you import them to QIIME 2 — you can run blastn or another local aligner to see if you can find fragments of your adapter in your sequences.
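For example, a rough command-line check for adapter fragments might look like this; the file name is a placeholder, and the adapter fragment shown (AGATCGGAAGAGC) is just the common Illumina adapter used as an example:

# hypothetical file name; count reads containing a fragment of the Illumina adapter
zcat sample1_R1.fastq.gz | grep -c "AGATCGGAAGAGC"

# total number of reads, for comparison (FASTQ records are 4 lines each)
echo $(( $(zcat sample1_R1.fastq.gz | wc -l) / 4 ))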

Another possibility is that there is something wrong with the read joining — you may want to review your method/parameters to make sure that low-quality joins are excluded.

Use trim-paired for paired-end reads, or trim-single for single-end (or joined) reads.
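As a sketch (not a prescription), a trim-single call on already-joined reads might look like the following. The input/output file names are placeholders, and the primer sequences shown are the standard EMP 515F/806R primers (the 3' sequence is the reverse complement of 806R), so double-check them against the primers you actually used:

qiime cutadapt trim-single \
  --i-demultiplexed-sequences reads_joined_demux.qza \
  --p-front GTGYCAGCMGCCGCGGTAA \
  --p-adapter ATTAGAWACCCBNGTAGTCC \
  --o-trimmed-sequences reads_joined_primer_trimmed.qza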

Dear Dr. Bokulich,

Thanks for the help. We like the way Deblur does analysis more than DADA2 but thank you for the suggestion.

Is losing that many sequences (14.5 million down to 5.7 million, so more than 50%) typical of what you see with Deblur?

I will double-check for adapters. What other reasons might cause something to be flagged as an artifact? We essentially ran the following Trim Galore command. It was automatically detecting the Illumina adapter (AGATCGGAAGAGC) in the sequences, so we passed -a X to prevent that, but when we searched a representative sample we could not find that adapter. Maybe that is where things are going wrong?

trim_galore -a X --fastqc -q 20 --length 20 --clip_R1 19 --clip_R2 20 --dont_gzip -o $(dirname {input[0]}) --paired {input[0]} {input[1]}

Thank you for your time and help.

Sincerely,

David

Dear David,

Thank you for sending the stats output at the start of this thread. It looks like, in some samples, the number of singletons is rather high. This is affected by --p-min-size. You can see how --p-min-size is used here, where it is handed off to vsearch as part of the dereplication.

These sequences appear to be flagged as artifacts because they are singletons prior to applying Deblur.
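For illustration only, that dereplication step is roughly equivalent to a vsearch call like the sketch below; the file names are hypothetical and the exact flags Deblur constructs internally may differ. With a min-size of 2 (the setting that drops singletons), anything observed only once is discarded here:

# hypothetical file names; --minuniquesize 2 discards sequences observed only once
vsearch --derep_fulllength sample_reads.fasta \
  --minuniquesize 2 \
  --sizeout \
  --output sample_derep.fasta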

It is possible that those samples have a very high number of truly unique sequences. It’s also possible that those samples have a high amount of error creating a large number of unique sequences. It’s important to note that reads which pass the quality filter are not assured to be error free.

What I recommend is running the forward reads through without joining and seeing how your results look. For many analyses the forward read is sufficient anyway, and the reverse read tends to have more error, so if the forward read alone answers the questions you're asking, it may be feasible to proceed from there. If you want to keep these singletons in Deblur, I believe that is possible by setting --p-min-size 0.
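A minimal sketch of that forward-read run, assuming placeholder file names and a placeholder trim length (choose one from your own quality plot), with --p-min-size 0 included only to show where the singleton setting goes:

qiime deblur denoise-16S \
  --i-demultiplexed-seqs reads_trimmed_forward_filtered.qza \
  --p-trim-length 150 \
  --p-min-size 0 \
  --p-sample-stats \
  --o-table deblur_fwd_table.qza \
  --o-representative-sequences deblur_fwd_rep_seqs.qza \
  --o-stats deblur_fwd_stats.qza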

In terms of what is normal, I’m observing 885,044 singletons at 150nt out of 12M raw reads on the latest American Gut MiSeq run. I’ve never had a need to join reads, so I cannot comment on what the number of singletons would be if operating on joined data.

Best,
Daniel


Dear Dr. McDonald and Dr. Bokulich,

Thank you both for your advice. I have tested the forward reads and run Deblur at multiple trim lengths to see how the results change, but I am unsure of the best cutoff based on the quality graph (any suggestions?): reads_trimmed_forwards_filtered.qzv (295.4 KB). I used 242 for the joined reads because the lowest length was 243 and the graph looked great: reads_trimmed_joined_filtered.qzv (300.1 KB).

The number of singletons is very high, and I am happy to leave --p-min-size at the default going forward. My sequencing depth was not sufficient to keep those with confidence.

Also, on a somewhat unrelated note, would you toss out any samples based on a sequence-count cutoff? I have been unable to find a specific number in a brief literature review. If so, what number? These are sediment samples, if that changes anything.

Thank you both very much.

Sincerely,

David

Dear Dr. Bokulich,

Another side question, sorry. In your paper Quality-filtering vastly improves diversity estimates from Illumina amplicon sequencing, you suggest filtering out any OTUs that do not have enough sequences across all samples to represent at least 0.005% of the total number of sequences. Would you suggest using a similar threshold with Deblur if I did not have a mock community?

Sincerely,

David

The forward read quality looks pretty good — 242 sounds reasonable but see how deblur behaves.

deblur already does that depth filtering for you. As far as I know this is already set based on benchmarks performed for deblur but maybe @wasade has more to add on that note.

No, I would not recommend that. Those recommendations were based on OTU clustering — deblur should already be removing most of that noise, so it should not need additional filtering, or at least not the same depth filter. Let's see what @wasade recommends.

Dear Dr. Bokulich,

Thank you very much for your help. I look forward to wasade’s response, but I will continue with the analysis as you have suggested.

Sincerely,

David


Hi @David_Bradshaw,

We generally drop samples that fall below the rarefaction depth being used. And for some analyses (e.g., ANCOM), it is important, irrespective of the OTU assessment method, to filter out features that are not represented across many samples.
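For example, the following sketch shows both filters; the file names and thresholds are placeholders, so substitute your own rarefaction depth and a minimum-sample count appropriate to your design:

# drop samples with fewer reads than the chosen rarefaction depth (placeholder depth of 5000)
qiime feature-table filter-samples \
  --i-table deblur_table.qza \
  --p-min-frequency 5000 \
  --o-filtered-table deblur_table_min5000.qza

# for ANCOM-style analyses, keep only features observed in at least 2 samples (placeholder)
qiime feature-table filter-features \
  --i-table deblur_table_min5000.qza \
  --p-min-samples 2 \
  --o-filtered-table deblur_table_filtered.qza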

Best,
Daniel

Dear Dr. McDonald,

Thank you very much for your help.

Sincerely,

David Bradshaw

