Feature-table merge and merge-seqs do not eliminate duplicate sequences

I merged two datasets that were each run through deblur and cut to 100bp sequences following this tutorial:
https://docs.qiime2.org/2018.4/tutorials/fmt/#merging-denoised-data

Then when I calculated beta-diversity, all samples from the two datasets were 0% similar to one another. I figured out that when I used merge-seqs, the program did not eliminate duplicate sequences, so one dataset used one set of sequences and the other dataset used the other set of sequences, even though they were merged. Does that make sense?

Did I miss a step somewhere? The names of the sequences in the two datasets are different. Does merge-seqs use the sequence names to eliminate duplicates rather than the sequences themselves?

After further reading I learned that one of my datasets has MD5 hashes for feature IDs (i.e., the names of the DNA sequences) and the other dataset uses the DNA sequences themselves as feature IDs. The latter of these datasets is a “deblurred” dataset I downloaded from the Earth Microbiome Project FTP site. How can I change the feature IDs from the DNA sequences to the MD5 hashes in the EMP dataset so that it can be merged properly with my own dataset? Note that I cannot deblur the EMP dataset myself.

Hi @Byron_C_Crump,
The “easy” way to do this would be to go back and re-process your own data with deblur, but add the --p-no-hashed-feature-ids parameter, which will cause feature IDs to be unhashed (i.e., the sequence will become its own feature ID and match the QIITA data you have).
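For reference, the re-run would look something like this (the input filename and trim length here are placeholders based on your description, not values I can verify from your setup):

```shell
qiime deblur denoise-16S \
    --i-demultiplexed-seqs demux.qza \
    --p-trim-length 100 \
    --p-no-hashed-feature-ids \
    --o-representative-sequences rep-seqs.qza \
    --o-table table.qza \
    --o-stats deblur-stats.qza
```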

But I will offer another solution that I hope will save some time. Deblur has caused you enough headaches so I slapped this together in hope that it helps! There are more elegant ways to do this, but a bash one-liner is probably easiest for you to use right now.

  1. Use qiime metadata tabulate and download your sequences as metadata. The file should look something like this:
Feature ID   Sequence
#q2:types	categorical
ACTGATCGATCG         ACTGATCGATCG
ACTGATCTTTCG         ACTGATCTTTCG
ACTGGGGGCTCG         ACTGGGGGCTCG
  2. Remove the first two lines of that file.
  3. Run the following command in your terminal (alter the filepaths):
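For example (the filenames here are placeholders, chosen to match the loop in the next step):

```shell
# drop the "Feature ID / Sequence" header row and the "#q2:types" row
tail -n +3 sequences_as_metadata.tsv > input_sequences_as_metadata.tsv
```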
while IFS=$'\t' read -r id _
do
   # hash the bare sequence; printf (unlike echo) adds no trailing
   # newline, so the hash matches QIIME 2's hashed feature IDs
   newid=$(printf '%s' "$id" | md5)   # use `md5sum | cut -d ' ' -f 1` on Linux
   printf '%s\t%s\n' "$id" "$newid" >> feature_id_map.tsv
done < input_sequences_as_metadata.tsv

This will create a file that maps your sequences to their md5 hashes. Something like this:

ACTGATCGATCG     46c363d67c1b8ced9e320081ad09914f
ACTGATCTTTCG     867a92b54ad55292e5e88660238ac920
ACTGGGGGCTCG     69a84eec85419cea96eece49ae926ea5
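One subtle thing worth checking as you go: as far as I know, QIIME 2's hashed feature IDs are the MD5 of the bare sequence with no trailing newline. `echo` appends a newline and silently changes the hash, which is why `printf` is the safer choice here:

```shell
seq=ACTGATCGATCG

# MD5 of the bare sequence -- this is what should match a hashed feature ID
good=$(printf '%s' "$seq" | md5sum | cut -d ' ' -f 1)   # `md5` on macOS

# MD5 of the sequence plus the newline that `echo` appends -- a different value
bad=$(echo "$seq" | md5sum | cut -d ' ' -f 1)

echo "$good"
echo "$bad"
```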
  4. Run the following command:
qiime feature-table group \
    --i-table table.qza \
    --m-metadata-file feature_id_map.tsv \
    --p-axis feature \
    --o-grouped-table grouped-table.qza

I think that should do the trick of relabeling the feature IDs in your feature table. This should then be mergeable with your other feature table. But there will probably be some kinks to iron out, e.g., you will probably need to add a header line to your feature_id_map.tsv file. I have not tested this all the way through.

Let us know if you get stuck and we’ll give you a hand!

Thanks for this! It worked up to the last command. I had to add headers “feature-id” for the first column and “md5” for the second column to get the command to start.

I also found that the “qiime feature-table group” command requires two other parameters: --m-metadata-column and --p-mode. So I added these and ran this command:

qiime feature-table group \
    --i-table emp_deblur_100bp.release1.biom.qza \
    --m-metadata-file emp.100.min25.deblur.seq.fa.feature_id_map.tsv \
    --p-axis feature \
    --o-grouped-table emp_deblur_100bp.release1.md5.biom.qza \
    --m-metadata-column Sequence \
    --p-mode sum

It has been running for 24 hours now on a Mac and on a 20-core server. I also started a new deblur run on the dataset that had hashes for feature IDs using the --p-no-hashed-feature-ids parameter. I’ll check in when these commands finish.

I think that should be “md5” instead: you are “grouping” on that column and relabeling your feature IDs by group. Since that column contains all new values, it effectively lets you relabel every feature. Setting it to “Sequence” would just relabel each feature with its original label.

The # of cores will not matter, since this step does not run in parallel. It will be time-consuming since you are relabeling a large # of features, though 24 hr does seem surprisingly long. Still, if it’s running, that probably means it is working.

Let us know when something either succeeds or fails!

Sorry, I had that wrong: I actually did use --m-metadata-column md5 in the qiime feature-table group command. It has been running for a few days now and has not finished. In the meantime I’m using the other strategy of running deblur on my own sequences with the --p-no-hashed-feature-ids parameter so that the feature IDs match those in the EMP dataset. I merged my dataset with the EMP dataset, and am working my way through all the commands. I’ll keep you updated.

Thank you again for trying to help me fix the feature IDs in the EMP dataset. I don’t know why the qiime feature-table group command never finishes, so I have given up on that approach.

I still want to figure out why I cannot use deblur or dada2 on the EMP dataset, and I’m concerned about inconsistencies between the deblurring of my sequences and the pre-deblurred EMP sequences that I’m merging them with. But I don’t know how to investigate that issue, and I’m hoping that it will not express itself in my final dataset.

I’ve spent a couple of months struggling with QIIME 2 and trying to analyze this dataset, and I appreciate all the help I’ve received from this forum. I think I would be farther along if I did not have to use EMP data. I’m going to move forward with the only pipeline that seems to be working. If I have success I will send updates to this post and my other (longer) post about deblur to wrap this up. Thanks again!

It sounds like you have tons of features, and relabeling each one will be time-consuming. If re-deblurring takes less time, just go that route!

I agree, the issues you reported on that other thread are concerning — in my opinion it seems like this boils down to issues with the EMP sequences, since we have never had issues like that reported before and since deblur was working fine on your own sequences (right?).

I propose just looking at the final data — e.g., create a PCoA plot and see whether your samples cluster as expected against the EMP samples (I assume you are comparing similar sample types). Sampling, study, and run bias, among numerous other factors, can cause your samples to cluster separately, but the same sample types should still sit very close to each other (e.g., as sister subclusters) when compared against more dissimilar sample types.
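A minimal sketch of that check (the metric and filenames are placeholders; `qiime diversity core-metrics` wraps these same steps if you prefer one command):

```shell
# Bray-Curtis distances -> PCoA -> interactive Emperor plot
qiime diversity beta \
    --i-table merged-table.qza \
    --p-metric braycurtis \
    --o-distance-matrix distance.qza
qiime diversity pcoa \
    --i-distance-matrix distance.qza \
    --o-pcoa pcoa.qza
qiime emperor plot \
    --i-pcoa pcoa.qza \
    --m-metadata-file sample-metadata.tsv \
    --o-visualization pcoa-emperor.qzv
```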

I’m sorry deblur/EMP has been giving you so much grief! I agree, working with EMP seems like the stumbling block here and it is still a mystery to me what’s going on. Please follow up in that separate topic — since this seems to be a deblur/EMP issue (as opposed to something else in QIIME 2), that thread already has all the relevant developers in the conversation.

But I hope everything goes smoothly from here! :crossed_fingers:

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.