Different OTU identifiers between rep-seqs and feature table files

Hello everyone,

I am currently trying to analyze MinION sequencing data with QIIME2. Most of my workflow follows the q2ONT pipeline, except for the taxonomic assignment of OTUs, which I carried out outside of QIIME2 with the blastn command of the BLAST CLI applications.
However, when I tried to link each OTU's taxonomy to its frequency, I found that the OTU identifiers are not the same in the rep-seqs and feature table files produced by the vsearch clustering command.

Here is the clustering command I ran:

qiime vsearch cluster-features-open-reference \
--i-table uchime_ref_out/table-nonchimeric-wo-borderline.qza \
--i-sequences uchime_ref_out/rep-seqs-nonchimeric-wo-borderline.qza \
--i-reference-sequences Reference_sequences.qza \
--p-perc-identity 0.85 \
--p-threads 46 \
--o-clustered-table table-op_ref-85.qza \
--o-clustered-sequences rep-seqs-op_ref-85.qza \
--o-new-reference-sequences new-ref-seqs-op_ref-85.qza

I then exported the table and rep-seqs .qza files:

qiime tools export \
--input-path rep-seqs-op_ref-85.qza \
--output-path .

mv dna-sequence.fasta rep-seqs-op_ref-85.fasta

qiime tools export \
--input-path table-op_ref-85.qza \
--output-path .

biom convert -i feature-table.biom -o table-op_ref-85.tsv --to-tsv

However, when I inspected the content of these files, I noticed that OTU identifiers are not the same between them, thus preventing me from linking OTU taxonomies to their frequencies in each sample:

head -n 4 rep-seqs-op_ref-85.fasta

>000002d61b9b7a0ef325434391c0158d966ebfc7
GTGCGAAGGTAGCATAATCATTGGATTTTAATTGAAAGCTGGTATGAATGGTTTGATGAAAAATTAACTGTCTCATTTTAATTTTATTAGAATTTTATTTTTAAGTTAAAATGCTTAAATGTTTTATAAAGGCAAGAAGACCCTATAGAGTTTAATATTATAATAATTTATTTATTTTATGTTTTTAATTTAGATTTTTTGTTTTGGTATTTGCTGGGGCGGTTAGAGAAATTTATTTAACTTTTCTTTTATTTTTACATTTATTTTTGAGTTTATGATCCTTTTATTGATTTTAAGATTAAATTACCTTAGGGATAACAGCGTAATTTTTGGAAAGTTCATATTTATAAAAAGTTTGCGACCCCGATGTTGAAC
>000027f5b9a1cd16cf6dbe82b2f5829f02a3071f
GTTCAACATCGGGGTCGCAAACTTTTATAAATATGAACTTTCCAAATTACGCTGTTATCCCTAAAGATGACCCAATCTTAAAATCCAATAAAAAGGATCATAAACTCAAAAATAAATGTAAAAATAAAGAAAAGTTAAATAAATTTTCTATAACCGCCCCAGCAAAATACACCAAAACAAAAAAATCTAAATTAAAAAACATAAAATAAATAAGTATTATAATATTAAACTCTATAGGGTCTTCTCGTCTTTATAAAACATTTAAAAGCATTTTAACTTAAAATAAAATTCTAATAAAATTAAAATGAGACAGTTAATTTTTCATCAAACCATTCATACCAGCTTTCAATTAAAAAACTAATGATTATGCTACCTTCG

head table-op_ref-85.tsv

# Constructed from biom file
#OTU ID output_reads_barcode1   output_reads_barcode2   output_reads_barcode3   output_reads_barcode4
MK820720.1      9.0     5.0     41454.0 59611.0
KT425071.1      1.0     0.0     0.0     0.0
KT272776.1      7.0     3.0     44193.0 63891.0
MG584727.1      3.0     2.0     0.0     1.0
KX461803.1      38.0    19.0    0.0     1.0
MK614510.1      3.0     1.0     0.0     0.0
JX412842.1      4.0     2.0     1.0     1.0
KX087316.1      7.0     1.0     0.0     2.0

Is there a way to get around this problem and obtain rep-seqs and feature-table files with identical OTU identifiers?

Thank you in advance!

Ben

Hi @BDubois, sorry for the slow reply.

Open-reference clustering first checks whether a sequence clusters against the reference database. If it does, it is assigned the reference’s ID. If it doesn’t, de novo clustering is performed, and the sequence ID of the most prevalent sequence in the cluster is used. That means you’ll have two types of IDs in your outputs: some reference IDs and some input IDs. That looks like what is going on here, so nothing is wrong. To confirm, you can search for one of those reference IDs in the rep-seqs-op_ref-85.fasta file.
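
As a rough sketch of that check, the shell functions below compare the full set of IDs in the exported FASTA file with the first column of the exported TSV, rather than looking up a single ID. The function names (fasta_ids, table_ids, compare_ids) are made up for illustration and are not part of QIIME 2 or vsearch. The sketch assumes the TSV is tab-separated, as written by biom convert --to-tsv, and that comment and header lines start with "#".

```shell
#!/usr/bin/env bash
# Illustrative helpers (hypothetical names, not part of QIIME 2).

# Print the sorted, unique IDs found on FASTA header lines:
# the text after '>' up to the first whitespace.
fasta_ids() {
    grep '^>' "$1" | sed 's/^>//' | awk '{print $1}' | LC_ALL=C sort -u
}

# Print the sorted, unique IDs from the first tab-separated column of the
# table, skipping lines that start with '#' (comment and header lines).
table_ids() {
    grep -v '^#' "$1" | cut -f1 | LC_ALL=C sort -u
}

# Count the IDs shared by both files and the IDs unique to each one.
compare_ids() {
    local fasta="$1" table="$2"
    local f t
    f=$(mktemp)
    t=$(mktemp)
    fasta_ids "$fasta" > "$f"
    table_ids "$table" > "$t"
    echo "shared: $(LC_ALL=C comm -12 "$f" "$t" | wc -l | tr -d ' ')"
    echo "fasta_only: $(LC_ALL=C comm -23 "$f" "$t" | wc -l | tr -d ' ')"
    echo "table_only: $(LC_ALL=C comm -13 "$f" "$t" | wc -l | tr -d ' ')"
    rm -f "$f" "$t"
}
```

With the files from the question, this would be called as compare_ids rep-seqs-op_ref-85.fasta table-op_ref-85.tsv. The first few lines shown by head only reflect the order within each file, so a non-zero "shared" count with small "fasta_only" and "table_only" counts would be consistent with the mixed-ID explanation.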


Hi @thermokarst, thank you for your reply!

I wasn’t expecting such “mixed” types of IDs, but it is indeed the right explanation. I found reference IDs in the rep-seqs-op_ref-85.fasta file, as well as ASV IDs in the table-op_ref-85.tsv file.

Thanks again for the support!

Ben
