Filtering out ASVs from DADA2 based on length

dada2
(Camilla Nesbo) #1

Hi,

I am wondering whether an option to filter out features (ASVs) based on length has been added.
I found some discussion about it (Comparison of DADA2 with Deblur) but could not find a solution anywhere.

Thanks,
Camilla

(Devon O'rourke) #2

Hi @camillaln,
I’m not sure if you’re talking about 16S data, ITS data, or something else, but I ran into a similar question when processing my COI amplicons. One point I’ve seen made on the forum before is that filtering for a single specific length is dangerous, because along with whatever bioinformatic noise you’re hoping to remove it also discards the (likely) true biological length variation that exists in the marker gene.
In the link you provided, qiime deblur denoise-16S is run with --o-stats trimmed/deblur/deblur-stats.qza, and that stats artifact is then used to generate the table showing the frequency of feature lengths. Unfortunately that doesn’t show you the length of each individual feature, and I don’t know of any QIIME function that does.
You could, however, export the --o-representative-sequences artifact to a FASTA file, then count the length of each feature with whatever program you want (Python, R)… here’s one way to do it using just bash:

cat dna-sequences.fasta | paste - - | sed 's/>//' | awk '{ print $1,length($2) }' | sort -k2,2n

You’ll get a two-column table of feature ID and length, sorted with the shortest at the top. That should give you a sense of how many (and which) features you might want to filter out. Once you’ve made a list (droplist.txt) of the features you want to remove, you can use existing QIIME functions to create the filtered table and sequence .qza files:

qiime feature-table filter-features \
--i-table original.table.qza \
--m-metadata-file droplist.txt \
--p-exclude-ids \
--o-filtered-table filtered.table.qza

qiime feature-table filter-seqs \
--i-data original.seqs.qza \
--m-metadata-file droplist.txt \
--p-exclude-ids \
--o-filtered-data filtered.seqs.qza
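
For anyone who would rather script the drop list than build it by hand, here is a minimal sketch, not a tested pipeline. It assumes the exported FASTA is named dna-sequences.fasta as above; the min_len and max_len thresholds, the temporary working directory, and the three toy sequences are all made up for illustration. The feature-id header line is written because QIIME 2 expects a metadata file to begin with an ID column header.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical thresholds, kept tiny so the toy data below shows both cases.
# Pick real values from your own length distribution.
min_len=5
max_len=10

# Toy stand-in for the exported dna-sequences.fasta (assumption: your real
# file comes from exporting the representative-sequences artifact).
workdir=$(mktemp -d)
fasta="$workdir/dna-sequences.fasta"
droplist="$workdir/droplist.txt"
printf '>asv1\nACGTACGT\n>asv2\nACG\n>asv3\nACGTACGTACGTACGT\n' > "$fasta"

# Write a header line, then every feature ID whose sequence length falls
# outside [min_len, max_len]. Lengths are summed per record, so sequences
# wrapped over several lines are handled as well.
awk -v min="$min_len" -v max="$max_len" '
  BEGIN { print "feature-id" }
  /^>/  {
    if (id != "" && (len < min || len > max)) print id
    id = substr($1, 2)   # drop the leading ">" and any description
    len = 0
    next
  }
  { len += length($0) }
  END { if (id != "" && (len < min || len > max)) print id }
' "$fasta" > "$droplist"

cat "$droplist"
```

On the toy data, asv2 (too short) and asv3 (too long) end up in the drop list while asv1 is kept. The resulting file can then be passed to the two --m-metadata-file options shown above.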

(Camilla Nesbo) #3

Thank you! This is really helpful,
Camilla

(Matthew Ryan Dillon) #4

Just wanted to toss out one more option for length-based filtering:

qiime feature-table filter-seqs \
    --i-data seqs.qza \
    --m-metadata-file seqs.qza \
    --p-where 'length(sequence) > 4' \
    --o-filtered-data filtered-seqs.qza 

This works by viewing the FeatureData[Sequence] as Metadata, and then using a SQL WHERE clause to compute the length. The advantage is that you don’t have to export anything. The command above would keep any sequences longer than 4 nt.
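
The same idea can be stretched to a length window, and the feature table can be brought in line with the surviving sequences. What follows is a sketch under assumptions rather than a verified recipe: the min_len and max_len values and the file names table.qza, filtered-seqs.qza and filtered-table.qza are hypothetical, and the second command relies on filter-features keeping only the feature IDs present in the metadata it is given. The qiime calls are guarded so the snippet does nothing beyond printing the clause when QIIME 2 or the inputs are missing.

```shell
# Hypothetical length bounds; choose them from your own data.
min_len=250
max_len=256

# SQLite-style WHERE clause to hand to --p-where.
where="length(sequence) >= ${min_len} AND length(sequence) <= ${max_len}"
echo "$where"

# Only run when QIIME 2 and the (assumed) input artifacts are present.
if command -v qiime >/dev/null 2>&1 && [ -f seqs.qza ] && [ -f table.qza ]; then
  # Keep sequences whose length falls inside the window.
  qiime feature-table filter-seqs \
    --i-data seqs.qza \
    --m-metadata-file seqs.qza \
    --p-where "$where" \
    --o-filtered-data filtered-seqs.qza

  # Keep only the table features whose sequences survived the length filter.
  qiime feature-table filter-features \
    --i-table table.qza \
    --m-metadata-file filtered-seqs.qza \
    --o-filtered-table filtered-table.qza
fi
```

Filtering the sequences alone leaves the feature table untouched, which is why the second step is included here.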

(Devon O'rourke) #5

I just wanted to show off my Bash chops @thermokarst - way to burst my bubble with your one liner! :smile:
