Different IDs when creating FeatureTable[Frequency] artifact and FeatureData[Sequence] artifact

appleyield · February 24, 2022, 1:12pm

Dear QIIME2-community,

I create a FeatureTable[Frequency] artifact and a FeatureData[Sequence] artifact after importing a FASTA file using the command qiime vsearch dereplicate-sequences (as described here: Clustering sequences into OTUs using q2-vsearch — QIIME 2 2019.10.0 documentation). Unfortunately this adds the sample-ID behind the feature-ID in the FeatureData[Sequence] artifact:

9acae648366f4ef6a3b3f47740ff6c839288815f SRS014911.SRX020546_211858915

I need it this way: >9acae648366f4ef6a3b3f47740ff6c839288815f

This only happens in the FeatureData[Sequence] artifact, not in the FeatureTable[Frequency] artifact, which I both create with the same command. In later steps I run into problems because the IDs in both artifacts do not correspond.

Is there an option to tell the qiime vsearch dereplicate-sequences to not add sample-IDs behind the feature IDs? Question feature ids much be present as tip names in phylogeny - #7 by thermokarst touches the same topic, but doesn’t answer how to avoid adding the sample-IDs.

v13_sequences_small5_single1.qza (186.4 KB)

This is a small example read in FASTA file (this is data from the Human microbiome project, if someone also knows about a document where someone analyses this in QIIME2 please let me know). I read it in with:
qiime tools import --input-path v13_sequences_small5_single1.fna --output-path v13_sequences_small5_single1.qza --type 'SampleData[Sequences]'

Then I apply the vsearch command:
qiime vsearch dereplicate-sequences
--i-sequences v13_sequences_small5_single1.qza
--o-dereplicated-table table_v13_sequences_small5_single1.qza
--o-dereplicated-sequences rep-seqs_v13_sequences_small5_single1.qza

The files:
rep-seqs_v13_sequences_small5_single1.qza (206.7 KB)
table_v13_sequences_small5_single1.qza (95.4 KB)

My current workaround is to export the seq artefact to an R-readable format, delete the sample-IDs in R and import it back to QIIME. But I would like to avoid this step.

Thank you in advance
Lukas

colinbrislawn · March 1, 2022, 3:00am

Hello Lukas,

You can get around this by using vsearch directly. Using just --relabel_sha1 will make only that leading hash, and not the sample name and read number that follows it. Let us know if you need a hand crafting this command.

Vsearch supports a couple of different relabeling methods. These are covered in great detail in manual on the release page. I've summarized them here:

--relabel string
Relabel sequences using the prefix string and a ticker (1, 2, 3, etc.) to construct the new
headers. Use --sizeout to conserve the abundance annotations. (--sizeout works with many of the options too)

--relabel_keep
When relabelling, keep the old identifier in the header after a space

--relabel_md5 and --relabel_sha1
details in the vsearch manual

--relabel_self
Relabel sequences using each sequence itself as a label.

--xsize and --xee
Remove size and expected error labels from the headers

If you want to get clever, vsearch also supports reading and writing from pipes, so you could pipe to standard out and parse labels with sed or your unix tool of choice.

Mehrbod_Estaki · March 1, 2022, 8:26am

Another little hacky way around this is to use the feature-table to create a new rep-seqs file:

Export the biom table from the feature-table:

mkdir extracted-feature-table
qiime tools extract \
  --input-path feature-table.qza \
  --output-path extracted-feature-table

Then the below will give you a rep-seqs .fna file which then you can re-import back into QIIME 2.

biom summarize-table --observations -i extracted-feature-table/feature-table.biom \
| tail -n +16 | awk -F ':' '{print ">"$1"\n"$1}' > rep_seqs.fna
 
qiime tools import \
  --input-path rep_seqs.fna \
  --type 'FeatureData[Sequence]' \
  --output-path rep-seqs.qza

Note that the feature-table.biom name here is what I think is the biom table is named by default from the top of my head, but please double check that when running the code.

appleyield · March 5, 2022, 6:37am

Hi Colin,
thanks for the hint to use vsearch directly. I run the following command on my original fasta-file:

vsearch
--relabel_sha1
--derep_prefix $path$datafile$".fasta"
--output $path$datafile$"_derep.fasta"

Now I got a dereplicated fasta-file and I can basically import it as a FeatureData[Sequence] artifact. What I'm doing to get the FeatureTable[Frequency] artifact is just running the qiime vsearch dereplicate-sequences command on the original fasta-file again (Maybe I'm missing that vsearch can create my table as well. I tried to create one with --biomout and nothing happened.). I checked whether the dereplicated sequences from the direct vsearch command and the qiime vsearch command are identical. For a small fasta-file with only 16 original sequences I get 10 sequences with vsearch and 11 sequences with qiime vsearch, whereas the 10 sequences are a subset of the 11. If my original fasta-file contains 59002 sequences then I get 36119 dereplicated sequences with vsearch and 36795 with qiime vsearch out of which 36119 match. So vsearch produces a smaller subset of the qiime vsearch output. I wonder whether I can run into problems when using the FeatureTable[Frequency] artifact from qiime vsearch consisting of more sequences than the FeatureData[Sequence] artifact from direct vsearch?

colinbrislawn · March 5, 2022, 10:06pm

Let's see if we can solve this, as we should be able to perfectly replicate our results.

Take a look a how vsearch is run under the hood of the q2 plugin.

github.com

qiime2/q2-vsearch/blob/dev/q2_vsearch/_cluster_sequences.py

# ----------------------------------------------------------------------------
# Copyright (c) 2016-2025, QIIME 2 development team.
#
# Distributed under the terms of the Modified BSD License.
#
# The full license is in the file LICENSE, distributed with this software.
# ----------------------------------------------------------------------------

import tempfile

from q2_types.feature_data import DNAFASTAFormat
from q2_types.per_sample_sequences import QIIME1DemuxDirFmt
import biom
from biom.parse import parse_uc
import skbio.io

from ._cluster_features import run_command


def dereplicate_sequences(sequences: QIIME1DemuxDirFmt,

This file has been truncated. show original

Note the addition of --qmask none, which disables the dust masking that vsearch uses by default. By replicating the command used by the plugin, you should be able to get identical outputs.

Let me know what you find. Hopefully with identical results, we can move on to the next challenge.

appleyield · March 6, 2022, 5:55am

I run,

vsearch
--relabel_sha1
--derep_fulllength $path$datafile$".fasta"
--output $path$datafile$"_derep.fasta"
--qmask none

then the q2-vsearch output and the vsearch output are identical.

Looking into the code directly can be so hepful, if it wasn't so much work to search for the right documents for someone who isn't involved in writing the software package . Thank you for your help!

Sorry Mehrbod for not following your suggestion. Maybe I will have a look at it later and give feedback. For now I mark Colin's answer as the solution.

system · April 6, 2022, 11:56am

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.