Importing deblur data from QIITA

I’m working on a meta-analysis of datasets from Qiita for analysis in qiime2. I’ve been able to download datasets processed using deblur from Qiita, as well as their respective metadata files. I’ve merged the BIOM and metadata files, but in order to continue with my analysis I require a rep-seqs file. It looks like this file isn’t generated by Qiita for data processed with deblur. Is there a way I could extract the rep-seqs file from the deblur BIOM file in order to run "qiime diversity core-metrics-phylogenetic"?


Hi @Muslih,

First off, we are working on having Qiita generate a fragment-insertion tree for meta-analyses so you don’t have to do this yourself. We hope to have this ready for the next release (mid to end of March). In the meantime, you can run this command:

biom summarize-table --observations -i your_biom.biom | tail -n +16 | awk -F ':' '{print ">"$1"\n"$1}' > rep_seqs.fna

This will produce an fna file with the rep seqs. It works because in a deblur table the observation IDs are the sequences themselves, so each ID can serve as both the FASTA header and the sequence (the tail -n +16 part simply skips the header lines of the biom summarize-table output). I would suggest importing this file into Qiime2 and then using the fragment-insertion plugin to create a tree; see the sketch below. Just to be clear, I’m suggesting fragment-insertion over building a de-novo tree.
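For reference, a minimal sketch of that import-and-insert step; the artifact names here are placeholders, and the exact fragment-insertion parameters may differ between Qiime2 releases:

qiime tools import --type 'FeatureData[Sequence]' --input-path rep_seqs.fna --output-path rep-seqs.qza

qiime fragment-insertion sepp --i-representative-sequences rep-seqs.qza --o-tree insertion-tree.qza --o-placements insertion-placements.qza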

Hope this helps.

Cheers,


Thanks for the answer, @antgonza

So I tried your script and used the rep_seqs.fna output to generate a qza file, but was unable to do so due to an error requiring all the characters to be in capitals. I then converted all the characters in the rep_seqs.fna file to capitals and tried importing again; that too failed, as it kept finding duplicates. I removed the duplicates with the dedupe script from bbmap and was only then able to import successfully (the clean-up steps are sketched at the end of this post). I then ran the fragment-insertion plugin and was able to generate the insertion-tree.qza file. But when I fed that into core-metrics-phylogenetic, I got the following error.

qiita_biom_deblur100 muslih$ qiime diversity core-metrics-phylogenetic --i-phylogeny insertion-tree.qza --i-table Metanalysis_global6study_deblur100.filtered-feature.qza --p-sampling-depth 2814 --m-metadata-file Metanalysis_global6study_metadata_deblur100_filtered.txt --output-dir core-metrics-results_insertion

Plugin error from diversity:

All feature_ids must be present as tip names in phylogeny. feature_ids not corresponding to tip names (n=24): GACggggggggCAAGTGTTCTTCGGAATGACTGGGCGTAAAGGGCACGTAGGCGGTGAATCGGGTTGAAAGTGAAAGTCGCCAACAAGTGGCGGAATGCT GACggggggggtggggCAAGTGTTCTTCGGAATGACTGGGCGTAAAGGGCACGTAGGCGGTGAATCGGGTTGAAAGTGAAAGTCGCCAAAAAGTGGCGGA TACggggggggCAAGTGTTCTTCGGAATGACTAGGCGTAAAGGGCACTACGGCAGTGAATCGGGTTGCAAGTTCAGGTCGCCAAAAACCGGCGGAATGCT GACggggggggCAAGTGTTCTTCGGAATGACTGGGCGTAAAGGGCACGTAGGCGGTGAATCGGGTTGAAAGTGAAAGTCGCCAAACAGTGGCGGAATGCT TACggggggggCAAGTGTTCTTCGGATGACTAGGCGTAAAGGGCACTACGGCAGTGAATCGGGTTGCAAGTTCAAGTCGCCAAAAACCGGCGGAATGCTC TACggggggggCAAGCGTTGTTCGGAATTACTGGGCGTAAAGGGCTCGTAGGCGGCCAACTAAGTCGGACGTGAAATCCCTCGGCTTAACCGGGGAACTG GACggggggggCAAGTGTTCTTCGGAATGACTGGGCGTAAAGGGCACATAGGCGGTGAATCGGGTTGAAAGTGAAAGTCGCCAAAAAGTGGCGGAATGCT GACggggggggCAAGTGTTCTTCGGAATGACTGGGCGTAAAGGGCACGTAGGCGGTGAATCGGGTTGAAAGTGAAAGTCGCCAAAAAGTGGCGGAATGCT TACggggggggCAAGTGTTCTTCGGAATGACTAGGCGTAAAGGGCACTACGGCAGTGAATCGGGTTGCAAGTTCAAGTCGCCAACAACCGGCGGAATGCT TACggggggggCAAGTGTTCTTCGGAATGACTAGGCGTAAAGGGCACTACGGCAGTGAATCGGGTTGCAAGTTCAAGTCGCCAAAAACCGGCGGAATGCT TACggggggggCAAGTGTTCTTCGGAATGACTGGGCGTAAAGGGCACGTAGGCGGTGAATCGGGTCGAAAGTGAAAGTCGCCAAAACGTGGCGGAATGCT GACggggggggCAAGTGTTCTTCGGAATGACTAGGCGTAAAGGGCACGTAGGCGGTGAATCGGGTTGAAAGTGAAAGTCGCCAACAAGTGGCGGAATGCT GACggggggggCAAGTGTTCTTCGGAATGACTGGGCGTAAAGGGCACGTAGGCGGTGAATCGGGTTGAAAGTGAAAGTCGCCAAAAACTGGTGGAATGCT GACggggggggCAAGTGTTCTTCGGAATGACTAGGCGTAAAGGGCACGTAGGCGGTGAATCGGGTTGAAAGTGAAAGTCGCCACAAAGTGGCGGAGTGCT GACggggggggCAAGTGTTCTTCGGAATGACTGGGCGTAAAGGGCACGTAGGCGGTGAATCGGGTTGAAAGTGAAAGCCGCCAAAAACTGGCGGAATGCT TACggggggggCAAGTGTTCTTCGGAATGACTCGGCGTAAAGGGCACTACGGCAGTGAATCGGGTTGCAAGTTCAAGTCGCCAACAACCGGCGGAATGCT TACggggggggCAAGTGTTCTTCGGAATGACTGGGCGTAAAGGGCACGTAGGCGGTGAATCGGGTTGAAAGTGAAAGTCGCCAAAAAGTGGCGGAATGCT GACggggggggCAAGTGTTCTTCGGAATGATTGGGCGTAAAGGGCACGTAGGCGGTGAATCGGGTTGAAAGTGAAAGTCGCCAAAAAGTGGCGGAATGCT GACggggggggCAAGTGTTCTTCGGAATGACTAGGCGTAAAGGGCACGTAGGCGGTGAATCGGGTTGAAAGTGAAAGTCGCCAAAAAGTGGCGGAATGCT GACggggggggCAAGTGTTCTTCGGAATGACTGGGCGTAAAGGGCACGTAGGCGGTGAATCGGGTTGAAAGTGAAAGTCGCCAACAACTGGTGGAATGCT TACggggggggCAAGTATTCTTCGGAATGACTAGGCGTAAAGGGCACTACGGCAGTGAATCGGGTTGCAAGTTCAAGTCGCCAAAAACCGGCGGAATGCT GACgggggggggCAAGTGTTCTTCGGAATGACTGGGCGTAAAGGGCACGTAGGCGGTGAATCGGGTTGAAAGTGAAAGTCGCCAAAAAGTGGCGGAATGC GACggggggggCAAGTGTTCTTCGGAATTACTGGGCGTAAAGGGCACGTAGGCGGTGAATCGGGTTGAAAGTGAAAGTCGCCAACAACTGGCGGAATGCT GACggggggggCAAGTGTTCTTCGGAATGACTGGGCGTAAAGGGCACGTAGGCGGTGAATCGGGTTGAAAGTGAAAGTCGCCAAAAACTGGCGGAATGCT

Debug info has been saved to /var/folders/rw/tf58wydj50sc_7h_sqf66sq40000gn/T/qiime2-q2cli-err-1qxkurvz.log

error_log_file.txt (8.6 KB)
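For completeness, the clean-up steps amounted to something like this (the file names are illustrative; dedupe.sh is the deduplication tool that ships with bbmap, and tr uppercases every character, which is fine here since the FASTA headers are the sequences themselves):

tr '[:lower:]' '[:upper:]' < rep_seqs.fna > rep_seqs_upper.fna
dedupe.sh in=rep_seqs_upper.fna out=rep_seqs_dedup.fna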

Hi @Muslih,

You basically hit an error that we detected in deblur and have since fixed system-wide in Qiita. Note that this was also reported and fixed within the q2-deblur plugin. The error is that some tools within deblur would not uppercase all of their sequences; this is now enforced.

Anyway, I see 2 options here:

  1. Redo your meta-analysis in Qiita; you should not encounter this problem anymore.
  2. Fix your biom with this Python code, taken from this patch:
from biom import load_table
from biom.util import biom_open

biom = [the path to your current biom]
out_biom = [the path to your fixed biom]

t = load_table(biom)
current = set(t.ids('observation'))
updated = set(map(lambda x: x.upper(), current))
difference = current ^ updated

if difference:
    if len(current) != len(updated):
        duplicates = defaultdict(list)
        for key in difference:
            if key in current:
                duplicates[key.upper()].append(key)
        for key in duplicates.keys():
            if key in current:
                duplicates[key].append(key)

        to_merge = {}
        ids_to_replace = {}
        for k, v in duplicates.items():
            if len(v) == 1:
                ids_to_replace[v[0]] = k
            else:
                for vv in v:
                    to_merge[vv] = k
        merge_fn = (lambda id_, x: to_merge[id_]
                    if id_ in to_merge else id_)
        t = t.collapse(merge_fn, norm=False, min_group_size=1,
                       axis='observation', collapse_f=collapse_f)
    else:
        ids_to_replace = {c: c.upper() for c in current
                          if c != c.upper()}

    t.update_ids(ids_to_replace, axis='observation', strict=False,
                 inplace=True)

    with biom_open(out_biom, 'w') as f:
        t.to_hdf5(f, t.generated_by)

Hope this helps.

Thanks for your reply, @antgonza

While option 1 sounds good, I’ve already spent a considerable amount of time on cleaning up the metadata and qza files, so I’d rather continue with them.

I tried option 2 and got the following error at the end of the script.

The script I ran:

import numpy
import h5py

from biom import load_table
from biom.util import biom_open

biom = ("feature-table.biom")
out_biom = ("feature-table-fixed.biom")

t = load_table(biom)
current = set(t.ids('observation'))
updated = set(map(lambda x: x.upper(), current))
difference = current ^ updated

def collapse_f(table, axis):
    return table.sum(axis=axis)

from collections import defaultdict

if difference:
    if len(current) != len(updated):
        duplicates = defaultdict(list)
        for key in difference:
            if key in current:
                duplicates[key.upper()].append(key)
        for key in duplicates.keys():
            if key in current:
                duplicates[key].append(key)

        to_merge = {}
        ids_to_replace = {}
        for k, v in duplicates.items():
            if len(v) == 1:
                ids_to_replace[v[0]] = k
            else:
                for vv in v:
                    to_merge[vv] = k
        merge_fn = (lambda id_, x: to_merge[id_]
                    if id_ in to_merge else id_)
        t = t.collapse(merge_fn, norm=False, min_group_size=1,
                       axis='observation', collapse_f=collapse_f)
    else:
        ids_to_replace = {c: c.upper() for c in current
                          if c != c.upper()}

    t.update_ids(ids_to_replace, axis='observation', strict=False,
                 inplace=True)

    with biom_open(out_biom, 'w') as f:
        t.to_hdf5(f, t.generated_by)

checksum = compute_checksum(biom)

Error:

python fix_biom.py

Traceback (most recent call last):
  File "fix_biom.py", line 52, in <module>
    t.to_hdf5(f, t.generated_by)
  File "/Users/muslih/miniconda3/envs/qiime2-2018.2/lib/python3.5/site-packages/biom_format-2.1.6-py3.5-macosx-10.7-x86_64.egg/biom/table.py", line 4113, in to_hdf5
    h5grp.attrs['generated-by'] = generated_by
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper (/Users/travis/miniconda3/conda-bld/h5py_1491364506370/work/h5py-2.7.0/h5py/_objects.c:2846)
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (/Users/travis/miniconda3/conda-bld/h5py_1491364506370/work/h5py-2.7.0/h5py/_objects.c:2804)
  File "/Users/muslih/miniconda3/envs/qiime2-2018.2/lib/python3.5/site-packages/h5py/_hl/attrs.py", line 93, in __setitem__
    self.create(name, data=value, dtype=base.guess_dtype(value))
  File "/Users/muslih/miniconda3/envs/qiime2-2018.2/lib/python3.5/site-packages/h5py/_hl/attrs.py", line 171, in create
    htype = h5t.py_create(original_dtype, logical=True)
  File "h5py/h5t.pyx", line 1543, in h5py.h5t.py_create (/Users/travis/miniconda3/conda-bld/h5py_1491364506370/work/h5py-2.7.0/h5py/h5t.c:18118)
  File "h5py/h5t.pyx", line 1565, in h5py.h5t.py_create (/Users/travis/miniconda3/conda-bld/h5py_1491364506370/work/h5py-2.7.0/h5py/h5t.c:17938)
  File "h5py/h5t.pyx", line 1620, in h5py.h5t.py_create (/Users/travis/miniconda3/conda-bld/h5py_1491364506370/work/h5py-2.7.0/h5py/h5t.c:17839)
TypeError: Object dtype dtype('O') has no native HDF5 equivalent

Furthermore, does the Python script deduplicate and merge the observed features once it converts the letters to all caps?

Hi again @Muslih,

Just for clarity, note that the sample names will not change if you restart the analysis in Qiita, so your current mapping file should work fine.

Anyway, my guess is that the problem is the indentation in your script; to make it easier, I'm attaching a script with the code: fix_biom.py (1.4 KB). To confirm: yes, it merges duplicated sequences after uppercasing. Now you can simply do: python fix_biom.py [your_biom] output.biom.
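The attachment is not reproduced here, but given that invocation it presumably reads the two paths from the command line instead of hard-coding them, along the lines of:

import sys

# input biom and path for the fixed output biom
biom = sys.argv[1]
out_biom = sys.argv[2]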

BTW, if you found metadata mistakes in public studies and want to fix them, please send an email to [email protected] with the correction; thanks.

Cheers

Hi again!

The indentation is fine, as I still get the same error when using the attached script.

python fix_biom.py feature-table.biom output.biom

Traceback (most recent call last):
  File "fix_biom.py", line 19, in <module>
    duplicates = defaultdict(list)
NameError: name 'defaultdict' is not defined
(qiime2-2018.2) DC02QN09TGG7V:qiita_biom_deblur100 muslih$ nano fix_biom.py
(qiime2-2018.2) DC02QN09TGG7V:qiita_biom_deblur100 muslih$ python fix_biom.py feature-table.biom output.biom
Traceback (most recent call last):
  File "fix_biom.py", line 40, in <module>
    axis='observation', collapse_f=collapse_f)
NameError: name 'collapse_f' is not defined
(qiime2-2018.2) DC02QN09TGG7V:qiita_biom_deblur100 muslih$ nano fix_biom.py
(qiime2-2018.2) DC02QN09TGG7V:qiita_biom_deblur100 muslih$ python fix_biom.py feature-table.biom output.biom
Traceback (most recent call last):
  File "fix_biom.py", line 52, in <module>
    t.to_hdf5(f, t.generated_by)
  File "/Users/muslih/miniconda3/envs/qiime2-2018.2/lib/python3.5/site-packages/biom_format-2.1.6-py3.5-macosx-10.7-x86_64.egg/biom/table.py", line 4113, in to_hdf5
    h5grp.attrs['generated-by'] = generated_by
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper (/Users/travis/miniconda3/conda-bld/h5py_1491364506370/work/h5py-2.7.0/h5py/_objects.c:2846)
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (/Users/travis/miniconda3/conda-bld/h5py_1491364506370/work/h5py-2.7.0/h5py/_objects.c:2804)
  File "/Users/muslih/miniconda3/envs/qiime2-2018.2/lib/python3.5/site-packages/h5py/_hl/attrs.py", line 93, in __setitem__
    self.create(name, data=value, dtype=base.guess_dtype(value))
  File "/Users/muslih/miniconda3/envs/qiime2-2018.2/lib/python3.5/site-packages/h5py/_hl/attrs.py", line 171, in create
    htype = h5t.py_create(original_dtype, logical=True)
  File "h5py/h5t.pyx", line 1543, in h5py.h5t.py_create (/Users/travis/miniconda3/conda-bld/h5py_1491364506370/work/h5py-2.7.0/h5py/h5t.c:18118)
  File "h5py/h5t.pyx", line 1565, in h5py.h5t.py_create (/Users/travis/miniconda3/conda-bld/h5py_1491364506370/work/h5py-2.7.0/h5py/h5t.c:17938)
  File "h5py/h5t.pyx", line 1620, in h5py.h5t.py_create (/Users/travis/miniconda3/conda-bld/h5py_1491364506370/work/h5py-2.7.0/h5py/h5t.c:17839)
TypeError: Object dtype dtype('O') has no native HDF5 equivalent

All right, I was missing an import and a function definition; @thermokarst figured that part out, thanks! Here is the new version: fix_biom_v2.py (1.6 KB), now even validated with flake8. Anyway, if this fails again, only 2 things come to mind (see also the sketch after this list):

  1. Are you sure your input file is a valid HDF5 biom table? Note that this is what Qiita generates.
  2. Are you sure you are running in the right environment? From your output it looks like you are, and both @thermokarst and I are using that same environment to test.
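One more note for readers who hit the same final TypeError: h5py cannot store None as an HDF5 attribute, and t.generated_by can be None on some tables, which would explain the "Object dtype dtype('O') has no native HDF5 equivalent" error at the to_hdf5 call. (You can check whether a file really is an HDF5 biom table with, e.g., python -c "import h5py; h5py.File('feature-table.biom', 'r')".) Below is a consolidated sketch of the fixed script under that assumption, with the missing defaultdict import, the collapse_f definition, and a fallback for generated_by; it is a reconstruction, not the actual fix_biom_v2.py attachment:

import sys
from collections import defaultdict

from biom import load_table
from biom.util import biom_open


def collapse_f(table, axis):
    # sum the counts of features that collapse to the same uppercased ID
    return table.sum(axis=axis)


# input and output paths from the command line
biom, out_biom = sys.argv[1], sys.argv[2]

t = load_table(biom)
current = set(t.ids('observation'))
updated = set(map(lambda x: x.upper(), current))
difference = current ^ updated

ids_to_replace = {}
if difference:
    if len(current) != len(updated):
        # two or more IDs differ only by case: group them under their
        # uppercased form, then merge those groups in the table
        duplicates = defaultdict(list)
        for key in difference:
            if key in current:
                duplicates[key.upper()].append(key)
        for key in duplicates.keys():
            if key in current:
                duplicates[key].append(key)

        to_merge = {}
        for k, v in duplicates.items():
            if len(v) == 1:
                # unique after uppercasing: a simple rename is enough
                ids_to_replace[v[0]] = k
            else:
                for vv in v:
                    to_merge[vv] = k
        merge_fn = (lambda id_, x: to_merge[id_]
                    if id_ in to_merge else id_)
        t = t.collapse(merge_fn, norm=False, min_group_size=1,
                       axis='observation', collapse_f=collapse_f)
    else:
        # no collisions: every ID just needs to be uppercased
        ids_to_replace = {c: c.upper() for c in current
                          if c != c.upper()}

    t.update_ids(ids_to_replace, axis='observation', strict=False,
                 inplace=True)

    # t.generated_by is None on some tables, and h5py refuses to store
    # None as an attribute, hence the fallback string (an assumption)
    with biom_open(out_biom, 'w') as f:
        t.to_hdf5(f, t.generated_by or 'fix_biom.py')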

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.