Feature-table merge and merge-seqs do not eliminate duplicate sequences

Nicholas_Bokulich · October 31, 2018, 9:27pm

Hi @Byron_C_Crump,
The “easy” way to do this would be to go back and re-process your own data with deblur, but add the --p-no-hashed-feature-ids parameter, which will cause feature IDs to be unhashed (i.e., the sequence will become its own feature ID and match the QIITA data you have).

But I will offer another solution that I hope will save some time. Deblur has caused you enough headaches so I slapped this together in hope that it helps! There are more elegant ways to do this, but a bash one-liner is probably easiest for you to use right now.

Run qiime metadata tabulate on your sequences and download the resulting table as a TSV file. The file should look something like this:

Feature ID	Sequence
#q2:types	categorical
ACTGATCGATCG	ACTGATCGATCG
ACTGATCTTTCG	ACTGATCTTTCG
ACTGGGGGCTCG	ACTGGGGGCTCG

Remove the first two lines of that file.
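
Dropping those two lines can also be done from the terminal. This is only a minimal sketch: the name sequences_download.tsv is a hypothetical stand-in for whatever you called the downloaded file, and the demo data is written to a scratch directory so nothing real gets overwritten.

```shell
# Work in a scratch directory so no real files are overwritten.
demo_dir=$(mktemp -d)

# Hypothetical download (sequences_download.tsv) mimicking the file shown above.
printf 'Feature ID\tSequence\n#q2:types\tcategorical\n' > "$demo_dir/sequences_download.tsv"
printf 'ACTGATCGATCG\tACTGATCGATCG\nACTGATCTTTCG\tACTGATCTTTCG\n' >> "$demo_dir/sequences_download.tsv"

# tail -n +3 prints from line 3 onward, dropping the header and #q2:types lines.
tail -n +3 "$demo_dir/sequences_download.tsv" > "$demo_dir/input_sequences_as_metadata.tsv"

cat "$demo_dir/input_sequences_as_metadata.tsv"
```

On your own data, only the tail line is needed, pointed at your real file.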
Then run the following loop in your terminal, adjusting the file paths to match your own:

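# md5 is the macOS command; on Linux, replace "md5" with "md5sum | cut -d ' ' -f 1".
while IFS=$'\t' read -r id _
do
   # printf (unlike echo) adds no trailing newline, so only the sequence itself is hashed
   newid=$(printf '%s' "$id" | md5)
   printf '%s\t%s\n' "$id" "$newid" >> feature_id_map.tsv
done < input_sequences_as_metadata.tsv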

This will create a file that maps your sequences to their md5 hashes. Something like this:

ACTGATCGATCG     46c363d67c1b8ced9e320081ad09914f
ACTGATCTTTCG     867a92b54ad55292e5e88660238ac920
ACTGGGGGCTCG     69a84eec85419cea96eece49ae926ea5
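
If you are not sure which MD5 tool your machine has, the mapping can be built with a small helper that tries both. This is a rough sketch, not a tested recipe: md5_hex is a made-up helper name, the demo input is invented, and it assumes that the hashed feature IDs you want to match are MD5 digests of the bare sequence string (no trailing newline).

```shell
# Print the MD5 hex digest of a string, hashing no trailing newline.
# Uses md5sum where available (typical on Linux), else md5 -q (macOS/BSD).
md5_hex() {
    if command -v md5sum >/dev/null 2>&1; then
        printf '%s' "$1" | md5sum | cut -d ' ' -f 1
    else
        printf '%s' "$1" | md5 -q
    fi
}

# Scratch input with the header lines already removed (demo data only).
demo_dir=$(mktemp -d)
printf 'ACTGATCGATCG\tACTGATCGATCG\nACTGATCTTTCG\tACTGATCTTTCG\n' > "$demo_dir/input_sequences_as_metadata.tsv"

# Split each line on tabs; the first field is the sequence used as feature ID.
tab=$(printf '\t')
while IFS="$tab" read -r id _rest; do
    [ -n "$id" ] || continue   # skip blank lines
    printf '%s\t%s\n' "$id" "$(md5_hex "$id")"
done < "$demo_dir/input_sequences_as_metadata.tsv" > "$demo_dir/feature_id_map.tsv"

cat "$demo_dir/feature_id_map.tsv"
```

A quick sanity check is to compare one or two of the resulting hashes against the feature IDs in your own hashed table; if they do not match, the assumption about how the IDs were hashed is wrong for your data.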

Run the following command:

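# "md5" must match the name of the hash column in the header line of feature_id_map.tsv
qiime feature-table group \
    --i-table table.qza \
    --m-metadata-file feature_id_map.tsv \
    --m-metadata-column md5 \
    --p-axis feature \
    --p-mode sum \
    --o-grouped-table grouped-table.qza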

I think that should do the trick of relabeling the feature IDs in your feature table, which should then be mergeable with your other feature table. But there will probably be some kinks to iron out. For example, you will probably need to add a header line to your feature_id_map.tsv file, with the first column named Feature ID and a name of your choosing for the hash column. I have not tested this all the way through.
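
Adding the header line might look something like the sketch below. The column name md5 is just an arbitrary choice, the output file name is hypothetical, and the demo row reuses a value from the example mapping above.

```shell
# Scratch copy of a header-less map (demo data only).
demo_dir=$(mktemp -d)
printf 'ACTGATCGATCG\t46c363d67c1b8ced9e320081ad09914f\n' > "$demo_dir/feature_id_map.tsv"

# Write the header first, then the existing rows, into a new file.
{
    printf 'Feature ID\tmd5\n'
    cat "$demo_dir/feature_id_map.tsv"
} > "$demo_dir/feature_id_map_with_header.tsv"

cat "$demo_dir/feature_id_map_with_header.tsv"
```

Writing to a new file keeps the original map intact; whichever file carries the header is the one to pass to --m-metadata-file.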

Let us know if you get stuck and we’ll give you a hand!