Filtering out ASVs from DADA2 based on length

camillaln · April 12, 2019, 1:48am

Hi,

I am wondering whether an option to filter out features (ASVs) based on length has been added.
I found some discussion about it (Comparison of DADA2 with Deblur) but could not find a solution anywhere.

Thanks,
Camilla

devonorourke · April 12, 2019, 1:39pm

Hi @camillaln,
I'm not sure if you're talking about 16S data, ITS data, or something else, but I've run into a similar question about how to process my COI amplicons. One thing I've seen posted in the forum before is that it's dangerous to filter for a single specific length, because doing so discards the (likely) true biological variation that exists in the marker gene along with whatever bioinformatic noise you might be hoping to remove.
In the link you provided, the command qiime deblur denoise-16S is run with the parameter --o-stats trimmed/deblur/deblur-stats.qza, and that artifact is then used to generate the table showing the frequency of feature lengths. Unfortunately that doesn't show you the length of each feature, and I don't know of any QIIME function to do that.
You could, however, export the --o-representative-sequences artifact into a fasta file, then count the length of each feature with whatever program you want (Python, R)... here's one way to do it using just bash:

cat dna-sequences.fasta | paste - - | sed 's/>//' | awk '{ print $1,length($2) }' | sort -k2,2n

You'll get a two-column table of the feature ID and the length, sorted with the shortest at the top of the list. That should give you a sense of how many (and which) features you might want to filter out. Once you make a list (droplist.txt) of the features you want to remove, you can use existing QIIME functions to create the filtered table and sequence .qza's you want:

qiime feature-table filter-features \
--i-table original.table.qza \
--m-metadata-file droplist.txt \
--p-exclude-ids \
--o-filtered-table filtered.table.qza

qiime feature-table filter-seqs \
--i-data original.seqs.qza \
--m-metadata-file droplist.txt \
--p-exclude-ids \
--o-filtered-data filtered.seqs.qza
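
If you'd rather build droplist.txt in Python than by hand, here is a rough sketch of one way to do it. It reads the exported fasta, tallies the length of each feature (it also copes with sequences wrapped over several lines, which the bash one-liner above does not), and writes the IDs that fall outside a length window. The function names, the tiny example file, and the 5 to 8 nt window are all made up for illustration, and the feature-id header line is my assumption about what the metadata file needs, so check it against your QIIME version:

```python
import os
import tempfile


def read_fasta_lengths(path):
    """Return {feature_id: sequence_length} for a FASTA file (wrapped lines allowed)."""
    lengths = {}
    current_id = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # The feature ID is the first whitespace-delimited token of the header.
                current_id = line[1:].split()[0]
                lengths[current_id] = 0
            elif current_id is not None:
                lengths[current_id] += len(line)
    return lengths


def write_droplist(lengths, out_path, min_len, max_len):
    """Write the IDs whose length is outside [min_len, max_len]; return them sorted."""
    dropped = sorted(fid for fid, n in lengths.items() if n < min_len or n > max_len)
    with open(out_path, "w") as out:
        # Header line: assumed to be an ID column name QIIME 2 metadata accepts.
        out.write("feature-id\n")
        for fid in dropped:
            out.write(fid + "\n")
    return dropped


if __name__ == "__main__":
    # Hypothetical three-feature file standing in for dna-sequences.fasta.
    workdir = tempfile.mkdtemp()
    fasta = os.path.join(workdir, "dna-sequences.fasta")
    with open(fasta, "w") as fh:
        fh.write(">asv1\nACGT\nAC\n>asv2\nACGTACGTAC\n>asv3\nA\n")

    lengths = read_fasta_lengths(fasta)
    dropped = write_droplist(lengths, os.path.join(workdir, "droplist.txt"), 5, 8)
    print(lengths)
    print(dropped)
```

With your real data you would point read_fasta_lengths at the exported dna-sequences.fasta, pick a window that makes sense for your amplicon, and pass the resulting droplist.txt to the two filter commands above.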

camillaln · April 12, 2019, 5:11pm

Thank you! This is really helpful,
Camilla

thermokarst · April 12, 2019, 6:20pm

Just wanted to toss out one more option for length-based filtering:

qiime feature-table filter-seqs \
    --i-data seqs.qza \
    --m-metadata-file seqs.qza \
    --p-where 'length(sequence) > 4' \
    --o-filtered-data filtered-seqs.qza

This works by viewing the FeatureData[Sequence] as Metadata, and then using a SQL WHERE clause to compute the length. The advantage here is that you don't have to export anything. The command above would keep any sequences longer than 4 nts.
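
To get a feel for what a clause like that selects, here is a small standalone sketch using Python's built-in sqlite3 module. It is only an illustration of how a SQLite WHERE clause with length() behaves, not QIIME's own code: the table name, the function name, and the three toy sequences are invented for the example.

```python
import sqlite3


def filter_ids_by_where(sequences, where):
    """Return the IDs of rows matching a SQLite WHERE clause over a Sequence column."""
    conn = sqlite3.connect(":memory:")
    try:
        # Hypothetical two-column table: a feature ID and its sequence.
        conn.execute(
            'CREATE TABLE metadata ("feature-id" TEXT PRIMARY KEY, "Sequence" TEXT)'
        )
        conn.executemany("INSERT INTO metadata VALUES (?, ?)", sequences.items())
        # The clause is dropped in verbatim, the way a --p-where string would be.
        # SQLite column names are case-insensitive, so `sequence` matches "Sequence".
        rows = conn.execute(
            f'SELECT "feature-id" FROM metadata WHERE {where} ORDER BY "feature-id"'
        )
        return [row[0] for row in rows]
    finally:
        conn.close()


if __name__ == "__main__":
    toy = {"asv1": "ACG", "asv2": "ACGTA", "asv3": "ACGTACGT"}
    # Same shape as the clause in the command above.
    print(filter_ids_by_where(toy, "length(sequence) > 4"))
    # Two conditions joined with AND give a length window instead of a lower bound.
    print(filter_ids_by_where(toy, "length(sequence) >= 5 AND length(sequence) <= 6"))
```

The second call shows the part that is probably most useful in practice: joining two length() conditions with AND keeps only sequences inside a window, rather than just those above a minimum.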


devonorourke · April 12, 2019, 6:44pm

I just wanted to show off my Bash chops @thermokarst - way to burst my bubble with your one liner!