Adjustment of max error rate for long read data

jholmes5 · May 21, 2020, 8:01pm

Hi,

We are trying to analyze Loop Genomics data, which are full-length 16S sequences, within QIIME2 (v. 2020.2). The resulting data are assembled contigs of high accuracy. We also noted that the quality scores from the demux summary were very high across reads (Phred score > 30), so we only truncated to 1500 nucleotides to ensure that most reads were of a similar length. We tried running dada2 denoise-single with the command below, but it filtered out all but 5% of our reads, which is far too few to keep. Do we need to raise the value of --p-max-ee? We don't have the ability to do experimental testing on a mock community, but we were hoping that someone may have encountered long-read data before and could give a rough recommendation on what to set this value to. Or are there additional things we should try instead?

qiime dada2 denoise-single \
  --i-demultiplexed-seqs single-end-demux.qza \
  --p-trim-left 0 \
  --p-trunc-len 1500 \
  --o-representative-sequences rep-seqs-dada2.qza \
  --o-table table-dada2.qza \
  --o-denoising-stats stats-dada2.qza

Thank you,
Jessica
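
One thing that may be worth checking alongside --p-max-ee: DADA2 discards any read shorter than the truncation length, so with --p-trunc-len 1500 every contig under 1500 nt is dropped before denoising. The sketch below counts how many reads would survive a given length cutoff. It assumes an uncompressed FASTQ with exactly four lines per record (for example, a per-sample file obtained with qiime tools export and then decompressed); the function name count_long_reads is purely illustrative.

```shell
# Count reads in a 4-line-per-record FASTQ that are at least min_len long.
# Usage: count_long_reads reads.fastq 1500
# Assumption: no line-wrapped sequences, i.e. line 2 of each record is the read.
count_long_reads() {
  awk -v min="$2" '
    NR % 4 == 2 { total++; if (length($0) >= min) kept++ }
    END { printf "%d of %d reads are >= %d nt\n", kept, total, min }
  ' "$1"
}
```

If only a small fraction of reads reaches 1500 nt, the truncation length rather than the error filter may account for much of the loss.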

llenzi · May 21, 2020, 9:11pm

Hi @jholmes5

Welcome to the forum!
I am very curious about your results because we are testing Loop Genomics as well!
I have only one thought about your question: dada2 may not be a suitable tool given that you are using assembled contigs. My main concern is that the quality scores of the assembled sequences do not meet the assumptions of the dada2 error model. I would therefore suggest testing deblur instead of dada2.
Hope it helps!

Luca

Mehrbod_Estaki · May 22, 2020, 7:55am

Hi @jholmes5,
I think @llenzi is right that the current q2-dada2 settings are not suitable for your Loop Genomics data for various reasons; the maxEE being unreasonably low for 1500 nt reads is just one of them. However, I will also add that Deblur will not be a good solution either, as its error model is pre-trained on error profiles of Illumina short reads (V4), and I'm afraid that with 1500 nt sequences you may end up with even less than 5% of your reads, given how conservative the algorithm gets as reads get longer.
To be honest, I had never heard of Loop Genomics until now, and I haven't really dug into it to see how they use Illumina short-read sequencing to get full-length V1-V9 reads. What does the company recommend using for this type of data?
Also, the dada2 developer @benjjneb has implemented an algorithm specifically for long reads using PacBio data (paper here), which is currently not available via the QIIME 2 plugin but may be in the stand-alone R package. I'm not sure it would be suitable here, though, so I'll let Ben comment further on whether it would be.
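
To illustrate the maxEE point: the expected number of errors in a read is the sum over its bases of 10^(-Q/10), so at a fixed quality it grows in proportion to read length. The sketch below makes the simplifying assumption that every base has the same Phred score, which real reads do not; the helper name expected_errors is illustrative only.

```shell
# Expected errors for a read, assuming every base has the same Phred score Q:
#   EE = length * 10^(-Q/10)
# Real reads vary per base, so treat this as a rough estimate only.
expected_errors() {
  awk -v len="$1" -v q="$2" 'BEGIN { printf "%.2f\n", len * 10 ^ (-q / 10) }'
}

expected_errors 1500 30   # full-length 16S read at a uniform Q30
expected_errors 250 30    # a typical short read at the same quality
```

Under this assumption, a 1500 nt read carries six times the expected errors of a 250 nt read of equal per-base quality, so a maxEE threshold chosen for short reads is much stricter when applied to full-length 16S.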

jholmes5 · May 22, 2020, 2:49pm

Thank you all for your help! I'm not entirely certain what Loop Genomics does behind the scenes. They do provide their own count and taxonomy data, and they say they use QIIME to produce it, but they give no specifics. We were trying to do the analysis ourselves from scratch, but based on what you're telling me, I think it might be easier to just import their results into QIIME2 and go from there. Otherwise, the dada2 PacBio algorithm also sounds promising.

Mehrbod_Estaki · May 22, 2020, 3:44pm

Hi @jholmes5,
I'm guessing they give you OTU-clustered data, which is not ideal, but workable. You can do your own clustering in QIIME 2 with vsearch if you want. But I am curious whether there is a way to get ASVs out of this type of data.
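
A minimal sketch of the vsearch clustering route, wrapped in a shell function so the inputs are explicit. It assumes a feature table and a set of feature sequences already exist as QIIME 2 artifacts (for example, from qiime vsearch dereplicate-sequences); the function name and output file names are hypothetical, and the identity threshold is left to the caller rather than recommended here.

```shell
# De novo OTU clustering with q2-vsearch.
# Usage: cluster_de_novo table.qza rep-seqs.qza 0.99
#   $1 = FeatureTable[Frequency] artifact
#   $2 = FeatureData[Sequence] artifact
#   $3 = percent identity as a fraction (e.g. 0.97 or 0.99)
# Output names are placeholders; adjust to taste.
cluster_de_novo() {
  qiime vsearch cluster-features-de-novo \
    --i-table "$1" \
    --i-sequences "$2" \
    --p-perc-identity "$3" \
    --o-clustered-table "table-dn-$3.qza" \
    --o-clustered-sequences "rep-seqs-dn-$3.qza"
}
```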

llenzi · May 22, 2020, 6:02pm

Hi,
I agree, it may be easiest to import what they provide into QIIME 2 for downstream analysis.
My understanding of the Loop Genomics pipeline is that they return OTUs after clustering at 100% similarity.
I am not sure whether there is a denoising step before or after the clustering; the last time I looked into this, their pipeline was still a bit of a work in progress. dada2 is mentioned in the result set I have, but it was actually abandoned in the version of the pipeline I used. I think QIIME 2 is relevant only in that they use a QIIME 2-formatted Silva database (which may still be handy for importing into QIIME 2 proper).
Taxonomy is assigned by BLAT.
As I said, we had a go with their kit last year, but we had a few issues and have abandoned it since.
I have to add that this is my understanding from looking at the results and a few discussions around them, based on information from last year, so by now it may be totally wrong or obsolete.

Luca

benjjneb · May 23, 2020, 1:33pm

We should have a preprint up on Loop Genomics data pretty soon, which will include some updated guidance on using DADA2 to process LoopSeq data. In our testing it looks very promising, and DADA2 works almost out-of-the-box with it, but the q2-dada2 plugin may also need updating to handle it ideally.

It is necessary to pre-process the Loop data before q2-dada2 to cut the sequences down to the region between the forward and reverse primers. If you haven't done that, you could try that, and then see if the plugin results significantly improve.
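
A rough sketch of what that pre-processing could look like using q2-cutadapt, followed by a rerun of denoise-single. The choice of q2-cutadapt is an assumption, since the thread does not name a tool, and the primer arguments are placeholders: supply the forward primer and the reverse complement of the reverse primer for your kit. The sketch also assumes all reads are in the same orientation. Note that once primers are removed the reads get shorter, so --p-trunc-len is set to 0 (no truncation or length filtering) rather than 1500. The --p-max-ee value is passed in by the caller; no particular value is recommended here. Flags are worth checking against --help for your QIIME 2 release.

```shell
# Trim reads down to the region between the primers with q2-cutadapt.
# Usage: trim_primers demux.qza FWD_PRIMER REV_PRIMER_RC trimmed.qza
# FWD_PRIMER and REV_PRIMER_RC are placeholders, not real primer sequences.
trim_primers() {
  qiime cutadapt trim-single \
    --i-demultiplexed-sequences "$1" \
    --p-front "$2" \
    --p-adapter "$3" \
    --p-times 2 \
    --p-discard-untrimmed \
    --o-trimmed-sequences "$4"
}

# Rerun DADA2 on the trimmed reads without truncation.
# Usage: denoise_untruncated trimmed.qza MAX_EE
denoise_untruncated() {
  qiime dada2 denoise-single \
    --i-demultiplexed-seqs "$1" \
    --p-trim-left 0 \
    --p-trunc-len 0 \
    --p-max-ee "$2" \
    --o-representative-sequences rep-seqs-dada2.qza \
    --o-table table-dada2.qza \
    --o-denoising-stats stats-dada2.qza
}
```

Here --p-times 2 lets cutadapt remove both the 5' and the 3' primer from the same read, and --p-discard-untrimmed drops reads in which no primer was found at all. Comparing the denoising stats before and after trimming should show whether the fraction of retained reads improves.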
