Deblur --p-trim-length

Hi, community,

I have a question about --p-trim-length in Deblur. I merged my paired-end reads with vsearch join-pairs, but I have no idea at which length it is best to trim the sequences, so I tried several different lengths.
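
For reference, the join step looked something like this (the file names are just placeholders for my artifacts):

qiime vsearch join-pairs --i-demultiplexed-seqs demux.qza --o-joined-sequences demux-joined.qza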

I cannot understand why almost all of my sequences were dropped; only a few thousand remain in the end. Based on the Moving Pictures tutorial, I know the trim length is chosen subjectively, based on the point where the quality scores drop. I have also checked some related topics on the forum, but still cannot find the answer.

Why is this the opposite of what I see in my own results? I am looking forward to an answer about which length is appropriate. It is confusing.

Many thanks.
Enclosed are my results at the different lengths.
Archive.zip (3.0 MB)

Hi @Brandon,

Any sequence that contains fewer nucleotides than the trim length will be dropped. The trim length matters if read lengths are heterogeneous in your data, as the Deblur algorithm requires that all reads be the same length, or if you are performing a meta-analysis between studies with different read lengths and you want to normalize that study effect. You should be able to get an idea of the read lengths through qiime demux summarize if that is unclear. Does that help?
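
For example (demux.qza here stands in for your demultiplexed artifact):

qiime demux summarize --i-data demux.qza --o-visualization demux.qzv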

Best,
Daniel


Hi, @wasade ,

Thanks for your answer. Indeed, I want to do a meta-analysis of reads generated with the same primer. Some reads are from the EMP dataset and some are from other datasets; some are paired-end and some are single-end, so some reads are longer while others are shorter.

I plan to use Deblur to analyze them separately, merge the rep-seqs.qza files into one big file, and likewise merge the table.qza files. Then I would use q2-fragment-insertion to create the tree and assign taxonomy on the combined data. In this procedure, the different lengths and qualities of the different datasets led me to use a different --p-trim-length for each dataset, but I am not sure whether that is reasonable. A sketch of the merging steps I have in mind is below.
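
The artifact names here are placeholders, and I understand the exact merge syntax differs between releases (older releases take --i-table1/--i-table2 instead of repeated --i-tables, and fragment-insertion may also require a reference database argument in some versions):

qiime feature-table merge --i-tables table-1.qza --i-tables table-2.qza --o-merged-table table.qza
qiime feature-table merge-seqs --i-data rep-seqs-1.qza --i-data rep-seqs-2.qza --o-merged-data rep-seqs.qza
qiime fragment-insertion sepp --i-representative-sequences rep-seqs.qza --o-tree insertion-tree.qza --o-placements insertion-placements.qza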

You mentioned normalizing the study effect: does that mean trimming all datasets at the same length? Or is there another benchmark by which to evaluate it?

I would really appreciate hearing more suggestions from you.

Thanks.


Thank you for the additional detail! I recommend trimming reads to the same length (i.e., to the shortest study) as the read length can contribute to study bias. There is some discussion of this in Debelius et al., which may be of interest if you haven't seen it, as the manuscript concerns meta-analysis. If your data span multiple primers, then it may be worth exploring trimming followed by closed-reference OTU picking. You may also want to track the sample -> study associations in the sample metadata so that you can test whether, for instance, the effect size of sample relationships is better explained by study than by a covariate of interest.
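
For example, if 120 nt were the length supported by the shortest study, each study could be run with the same trim length (the file names and the value 120 are only illustrative):

qiime deblur denoise-16S --i-demultiplexed-seqs demux-joined-filtered.qza --p-trim-length 120 --p-sample-stats --o-representative-sequences rep-seqs.qza --o-table table.qza --o-stats deblur-stats.qza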

Best,
Daniel

Hi, @wasade,

I see. Thanks for the paper. I have read it, and it makes sense to me. Really helpful.

I tested closed-reference OTU picking on a dataset containing two primer sets (341F-785R and 357F-806R), but the results do not make sense. I am wondering whether I went wrong at some point. My procedure is as follows.

(1) I followed the Moving Pictures protocol for Deblur to produce rep-seqs.qza and table.qza.

(2) I trained the classifier following "Training feature classifiers with q2-feature-classifier" (QIIME 2 2017.10.0 documentation), choosing the outer boundaries of the two primer sets (341-806).
This gave me classifier-341-806.qza and ref-seqs.qza. Roughly, the training commands were as follows.
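
The primer sequences below are the common 341F/806R variants and may not match mine exactly; the reference file names are placeholders:

qiime feature-classifier extract-reads --i-sequences 99_otus.qza --p-f-primer CCTACGGGNGGCWGCAG --p-r-primer GGACTACHVGGGTWTCTAAT --o-reads ref-seqs.qza
qiime feature-classifier fit-classifier-naive-bayes --i-reference-reads ref-seqs.qza --i-reference-taxonomy ref-taxonomy.qza --o-classifier classifier-341-806.qza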
(3)

qiime vsearch cluster-features-closed-reference --i-table table.qza --i-sequences rep-seqs.qza --i-reference-sequences ref-seqs.qza --p-perc-identity 1 --o-clustered-table table_99.qza --o-unmatched-sequences unmatched.qza --o-clustered-sequences seqs_99.qza --p-threads 4

(4)

qiime feature-classifier classify-sklearn --i-classifier classifier-341-806.qza --i-reads seqs_99.qza --o-classification taxonomy_99_vsearch.qza

In the end I got a BIOM file with only 52 representative sequences (52 OTUs), and 10%~30% unknown bacteria.

May I know the reason? Is my procedure wrong?

Thanks.

Hi @Brandon,

I just want to make sure I understand the flow here.

First, is each sample composed of sequence data from multiple primers, or do samples use a single primer where some samples used 341f-785r and some used 357f-806r?

Second, just to verify, the input data were deblur’d. The same input sequence data were then run through a closed reference OTU picking, using the representative sequences of the Deblur process as the reference database. Is that accurate? If so, then it may be worth testing an alternative process. For instance, one strategy would be to take all of your input data and run them through a closed reference approach against an existing 16S reference database like SILVA or Greengenes.
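
As a minimal sketch, assuming you have the Greengenes 99% reference sequences as a FASTA file (the file names are placeholders), importing the reference would look something like:

qiime tools import --type 'FeatureData[Sequence]' --input-path 99_otus.fasta --output-path 99_otus.qza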

The reason I'm inquiring about the first one is that if the samples are single-primer, then you should be set to test for primer effects down the line by using the primer as a categorical variable. The second is because I don't think the nesting of the OTU methods is necessary. For instance, in Qiita when we integrate across primers, we just use closed reference OTU picking against Greengenes at 97%, which was the strategy used in Debelius et al for the HMP data. There will be a primer effect, so I recommend testing for it if feasible. But hopefully the biological question you're asking has a stronger effect than the primers.

Best,
Daniel

Hi, @wasade,

Thanks for your suggestions.

First, is each sample composed of sequence data from multiple primers, or do samples use a single primer where some samples used 341f-785r and some used 357f-806r?

Yes, half of the samples were sequenced with the 341F-785R primers, while the rest were sequenced with the 357F-806R primers.

Second, just to verify, the input data were deblur’d. The same input sequence data were then run through a closed reference OTU picking, using the representative sequences of the Deblur process as the reference database. Is that accurate?

Yes, that is right. The input data were deblur'd: rep-seqs-1.qza and table-1.qza were produced from the 341F-785R data, and rep-seqs-2.qza and table-2.qza from the 357F-806R data.

If so, then it may be worth testing an alternative process. For instance, one strategy would be to take all of your input data and run them through a closed reference approach against an existing 16S reference database like SILVA or Greengenes.
For instance, in Qiita when we integrate across primers, we just use closed reference OTU picking against Greengenes at 97%, which was the strategy used in Debelius et al for the HMP data.

I want to make sure I understand your suggestions. I would import the raw reads into QIIME 2, join the reads from each primer set separately, run qiime quality-filter q-score on each, run qiime vsearch dereplicate-sequences, merge seqs1.qza and seqs2.qza into one seqs.qza, likewise merge table1.qza and table2.qza, and then do the vsearch closed-reference OTU picking with the following command:

qiime vsearch cluster-features-closed-reference --i-table table.qza --i-sequences rep-seqs.qza --i-reference-sequences 99_otus.qza --p-perc-identity 0.97 --o-clustered-table table-cr-99.qza --o-unmatched-sequences unmatched.qza --o-clustered-sequences seqs_99.qza --p-threads 4

Is this procedure correct? Do I need to check for chimeras with qiime vsearch uchime-denovo BEFORE the closed-reference OTU picking?
Also, there are three taxonomy assignment methods: (1) classify-consensus-blast: BLAST+ consensus taxonomy classifier, (2) classify-consensus-vsearch: VSEARCH consensus taxonomy classifier, and (3) classify-sklearn: pre-fitted sklearn-based taxonomy classifier.
I have tried

qiime feature-classifier classify-consensus-vsearch --i-query rep-seqs.qza --i-reference-taxonomy ref-taxonomy-99.qza --i-reference-reads 99_otus.qza --p-maxaccepts 2 --p-perc-identity 0.97 --o-classification taxonomy-consensus.qza --p-threads 4

However, in a test with only 6 samples, 5 hours have passed and it is still running.
I have seen different suggestions in different places, like here and here.

May I get some suggestions on which taxonomy assignment method I should use? What are the differences between them?

There will be a primer effect, so I recommend testing for it if feasible.

May I know how to test for the primer effect? What does that mean?

Thanks so much for your patience. :stuck_out_tongue:

Best.

Brandon

Hi @Brandon,

Thank you for the additional information. I just looked back at qiime vsearch cluster-features-closed-reference and saw that it is not a general-purpose clustering method, and is different from classic closed-reference OTU picking in QIIME 1, which is what I'm more used to. Anyway, I think your earlier approach largely makes sense, but the two changes you may want to employ are a) using a relaxed similarity threshold when clustering if you're obtaining far fewer features than you expect, and b) assigning taxonomy per primer as opposed to on the merged table. I do not think you need to check for chimeras since DADA2 and Deblur both already implement a bimera filter.

An example of the primer effect can be found in figure 1 of Debelius et al. I recommend describing in your mapping file which sample used which primer, and making sure to assess that categorical variable when exploring for significant differences in your data.
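
As a sketch, once a primer column exists in the sample metadata, a test along these lines could be run (the artifact and column names are placeholders; depending on your release the option may be --m-metadata-category rather than --m-metadata-column):

qiime diversity beta-group-significance --i-distance-matrix unweighted-unifrac-distance-matrix.qza --m-metadata-file sample-metadata.tsv --m-metadata-column primer --o-visualization primer-effect.qzv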

Best,
Daniel

Hi, @wasade,

Thanks for the additional suggestions. I appreciate it! Some more questions:

b) assigning taxonomy per primer as opposed to the merged table.

Does that mean assigning the taxonomy separately and then comparing the results?
May I ask how I can get matching OTU IDs and taxonomy if I want to compare these data? Just for my understanding: if I assign taxonomy for the two primers separately, won't the OTU IDs corresponding to the taxonomy be different?

May I know which taxonomy assignment method is better for me to use: (1) classify-consensus-blast: BLAST+ consensus taxonomy classifier, (2) classify-consensus-vsearch: VSEARCH consensus taxonomy classifier, or (3) classify-sklearn: pre-fitted sklearn-based taxonomy classifier, as we did in the Moving Pictures tutorial?

Thanks so much.
Again, thanks for your patience.

Best.

Brandon

I'm not sure whether a classifier trained on both regions will impact sensitivity. You should be able to assign taxonomy separately and merge the taxonomy files, but I confess I'm not entirely sure how that would be done using QIIME 2; @jairideout, do you know by chance? It is entirely possible that different OTU IDs will be observed when assigning taxonomy, but that does not mean the lineage information will differ dramatically.
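
If your release includes an action for merging taxonomy artifacts, an unverified sketch would be something like (the artifact names are placeholders):

qiime feature-table merge-taxa --i-data taxonomy-1.qza --i-data taxonomy-2.qza --o-merged-data taxonomy.qza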

I’m assuming the naive Bayes classifier makes the most sense here, but I’m not sure if @Nicholas_Bokulich would like to weigh in or not.

Best,
Daniel


It would probably have a small effect (for 16S... other marker genes might matter more), but I'm really not sure. It's going to vary case by case so I don't know about these regions specifically.

While I'd predict a small effect, I'd expect it to be strongest for the naive Bayes classifier (since that's the one that benefits slightly from trimming reference sequences to the target gene).

So I think that classify-consensus-blast or classify-consensus-vsearch would probably be the easiest/most transparent to use here @Brandon . It would still help to trim your reference sequences (to cut down on runtime, mostly).

Honestly, that's probably the easier thing to do, rather than splitting by sub-region/classifying separately, then merging back together.

I hope that helps!


Hi, @wasade and @Nicholas_Bokulich,

Thanks so much for all the help along the way. I will try that and see how it goes.

Best.

Brandon

