Can you pre-filter data by raw sequence read length before denoising/truncating/clustering?

reige012 · October 15, 2018, 8:33pm

Hello,

I have a set of sequences for a host-associated microbiome. I've determined that some of my raw reads that are actually host mitochondrial DNA are coming back after denoising/classification as "Bacteria" with no further designations. When I blasted these raw reads against the NCBI database they are in fact host mitochondrial DNA that is somehow slipping through and being IDed as Bacteria (via Silva database) by QIIME.

This is concerning for many reasons, but to get to my question: I have discovered that the host DNA has a slightly longer, but consistent, raw sequence length than the bacterial V4 region I'm interested in. Thus, I'm thinking I may be able to eliminate this issue if I can filter out these longer reads earlier in the process. I've searched but can't seem to find a command for this. Can you please advise me if something like this exists?

Thank you,
Alicia

Yos.Dos · October 15, 2018, 9:20pm

Hey, @reige012
If I understand correctly, you are quite sure about the length of your reads.
Then I think you should use Trimmomatic with the 'MINLEN:length' flag: USADELLAB.org - Trimmomatic: A flexible read trimming tool for Illumina NGS data
unless there is a better way...

Mehrbod_Estaki · October 15, 2018, 11:33pm

Hi @reige012,
The issue of host contamination is common enough, especially in samples from low-microbe, high-host-DNA environments. In my experience, even mouse colon tissues, which have loads of microbes, can suffer from this if the extraction protocol heavily disrupts the host cells. Anecdotally, I see this especially around the V4 region. Instead of trimming these prior to denoising (and risking introducing bias), I would recommend simply filtering these host-associated sequences from your feature table after the fact. In Deblur this is done automatically using a positive filter based on the Greengenes database, but you would have to do this manually if you are using DADA2 for denoising.
This has been discussed on the forum before, most recently here, and you can also see a lengthier discussion of it here. Both links will have examples of how to do the filtering I believe. Let us know if that helps.

reige012 · October 16, 2018, 2:21pm

Hi @Mehrbod_Estaki,

Thanks for your reply. I don't think this quite answers my question. I'm not thinking about trimming them before denoising. What I would like to do, before denoising, is exclude raw reads of a specific length. All of my host DNA that sneaks through is of a specific bp length and all of my bacterial DNA is another length. I want to stop the host DNA from slipping through and subsequently being taxonomically IDed as "Bacteria" by QIIME. I thought that if I could keep reads of that very specific length from even entering the denoising step, it might solve my problem.

Does that clarify what I'm asking for?

reige012 · October 16, 2018, 2:26pm

Thanks @Yos.Dos. I'm not sure I want to trim anything just yet. I want to eliminate specific reads of a very specific bp length. I was hoping there was a way to do that in QIIME, but perhaps not. I may need to use R or Python to eliminate them if no one knows any QIIME commands.

Nicholas_Bokulich · October 16, 2018, 2:36pm

Hi @reige012!
Short answer: no, there is not a QIIME 2 method for size exclusion (except for minimum length), but I am not entirely convinced that's what you need here.

A slight difference would worry me for size exclusion. There is some natural length heterogeneity in the 16S (including V4), so I would personally want a pretty clear margin for size exclusion.

You make a good point about wanting to remove these reads as early as possible. So size exclusion is an interesting idea, but we just have not yet had any use for this in QIIME 2 or any user demand, so no QIIME 2 method exists yet.
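In the meantime, size exclusion on raw reads can be done outside QIIME 2. The following is only a minimal sketch in plain Python, not a QIIME 2 method: `read_fastq` and `filter_by_length` are hypothetical helper names, and it assumes a standard four-lines-per-record FASTQ with no blank lines between records. The length bounds are placeholders that you would choose from your own data.

```python
from typing import Iterator, TextIO, Tuple

FastqRecord = Tuple[str, str, str, str]


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield (header, sequence, plus, quality) tuples from a FASTQ stream.

    Assumes exactly four lines per record and no blank lines in between.
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file
            return
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        yield header, seq, plus, qual


def filter_by_length(src: TextIO, dst: TextIO,
                     min_len: int, max_len: int) -> Tuple[int, int]:
    """Copy reads whose length is within [min_len, max_len] from src to dst.

    Returns (kept, dropped) so you can check how many reads were excluded.
    """
    kept = dropped = 0
    for header, seq, plus, qual in read_fastq(src):
        if min_len <= len(seq) <= max_len:
            dst.write(f"{header}\n{seq}\n{plus}\n{qual}\n")
            kept += 1
        else:
            dropped += 1
    return kept, dropped
```

For gzipped files you could open both handles with `gzip.open(path, "rt")` and `gzip.open(path, "wt")` and pass them in. With paired-end data, the mate of every dropped read would also need to be removed so the forward and reverse files stay in sync; this sketch does not handle that.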

You have some other options:

I suppose there are no mitochondrial reads in SILVA; in Greengenes, mitochondrial and plastid 16S are annotated properly, allowing explicit taxonomy-based filtering. I would recommend filtering out anything that does not classify beyond kingdom/phylum level anyway (these are usually non-target sequences in my experience), so you can just perform such filtering with your SILVA results.

You can also exclude by alignment against a reference, e.g., of mitochondrial sequences. But that method only handles sequences after denoising, not fastq reads.
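To illustrate the taxonomy-based option, here is a rough sketch of the selection logic in plain Python. It is not QIIME 2 code: `assigned_ranks` and `features_to_keep` are hypothetical helpers, the input is assumed to be a mapping from feature ID to a semicolon-delimited taxonomy string (SILVA-style `D_0__...` or Greengenes-style `k__...`), and the default threshold of two assigned ranks is just one reading of "beyond kingdom" (use 3 to require something beyond phylum).

```python
from typing import Dict, Iterable, List


def assigned_ranks(taxonomy: str) -> List[str]:
    """Return the non-empty rank names in a semicolon-delimited taxonomy string."""
    ranks = []
    for level in taxonomy.split(";"):
        level = level.strip()
        # Drop a rank prefix such as "k__" or "D_0__" if one is present.
        name = level.split("__", 1)[-1] if "__" in level else level
        if name and name.lower() != "unassigned":
            ranks.append(name)
    return ranks


def features_to_keep(
    taxonomy_by_feature: Dict[str, str],
    min_ranks: int = 2,
    exclude: Iterable[str] = ("mitochondria", "chloroplast"),
) -> List[str]:
    """Select feature IDs that are classified deeply enough and not excluded.

    A feature is dropped if it has fewer than min_ranks assigned ranks, or if
    its taxonomy string contains any of the exclude terms (case-insensitive).
    """
    terms = [term.lower() for term in exclude]
    keep = []
    for feature_id, taxonomy in taxonomy_by_feature.items():
        if len(assigned_ranks(taxonomy)) < min_ranks:
            continue
        lowered = taxonomy.lower()
        if any(term in lowered for term in terms):
            continue
        keep.append(feature_id)
    return keep
```

The resulting list of IDs could then be used to subset the feature table and representative sequences. A feature labeled only "Bacteria" has a single assigned rank, so it falls below the default threshold and is dropped.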

If you put together some code to do this, it could be a great contribution to QIIME 2.

We have not had any demand for a size exclusion method, but I am sure this could be useful for others.
