.FASTA files & metadata files not jibing

jhines1 · March 22, 2019, 7:46pm

Hello again! I'm back with yet another roadblock that I don't seem to understand. I have imported my dereplicated Sanger sequences from a .fasta file:

>HCl_49
TGGTTNT...
>HCl_50
TGGTTCT...

to a .qza, and I have a metadata file:

#SampleID	Series	Salt	Concentration
#q2:types	categorical	categorical	categorical
HCl_49	HCl	NaCl	High
HCl_50	HCl	NaCl	High

and have used q2-vsearch, q2-ghost-tree, etc. But the sequences and the metadata do not seem to jibe when I run something like chimera filtering, diversity analyses, or anything involving a metadata file. For example, when I try to run:

qiime diversity alpha-group-significance \
  --i-alpha-diversity Hines_observed_otus_vector.qza \
  --m-metadata-file Hines_metadata.txt \
  --o-visualization Hines_observed_otus_vector.qzv \
  --verbose

I get the error:

Traceback (most recent call last):
  File "/Users/haselkornlab/anaconda3/envs/qiime2-2019.1/lib/python3.6/site-packages/q2cli/commands.py", line 274, in __call__
    results = action(**arguments)
  File "</Users/haselkornlab/anaconda3/envs/qiime2-2019.1/lib/python3.6/site-packages/decorator.py:decorator-gen-389>", line 2, in alpha_group_significance
  File "/Users/haselkornlab/anaconda3/envs/qiime2-2019.1/lib/python3.6/site-packages/qiime2/sdk/action.py", line 231, in bound_callable
    output_types, provenance)
  File "/Users/haselkornlab/anaconda3/envs/qiime2-2019.1/lib/python3.6/site-packages/qiime2/sdk/action.py", line 427, in _callable_executor_
    ret_val = self._callable(output_dir=temp_dir, **view_args)
  File "/Users/haselkornlab/anaconda3/envs/qiime2-2019.1/lib/python3.6/site-packages/q2_diversity/_alpha/_visualizer.py", line 38, in alpha_group_significance
    metadata = metadata.filter_ids(alpha_diversity.index)
  File "/Users/haselkornlab/anaconda3/envs/qiime2-2019.1/lib/python3.6/site-packages/qiime2/metadata/metadata.py", line 727, in filter_ids
    ids_to_keep)
  File "/Users/haselkornlab/anaconda3/envs/qiime2-2019.1/lib/python3.6/site-packages/qiime2/metadata/metadata.py", line 203, in _filter_ids_helper
    % (', '.join(repr(e) for e in sorted(missing_ids))))
ValueError: The following IDs are not present in the metadata: 'HCO3', 'HCl', 'LCO3', 'LCl', 'SW'

Plugin error from diversity:

  The following IDs are not present in the metadata: 'HCO3', 'HCl', 'LCO3', 'LCl', 'SW'
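
One way to see this mismatch directly is to compare the IDs named in the error with the IDs in the metadata file. Below is a minimal Python sketch, not part of QIIME 2: the helper names `metadata_ids` and `missing_from_metadata` are hypothetical, and it assumes a tab-separated metadata file whose first non-empty line is the header row and whose comment or directive rows (such as `#q2:types`) start with `#`.

```python
def metadata_ids(metadata_text):
    """Return the IDs found in the first column of tab-separated metadata.

    Assumption: the first non-empty line is the header row, and any later
    row whose first cell starts with '#' is a comment or directive.
    """
    ids = []
    header_seen = False
    for line in metadata_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        first_cell = line.split("\t")[0].strip()
        if not header_seen:
            header_seen = True  # e.g. the '#SampleID' header row
            continue
        if first_cell.startswith("#"):
            continue  # e.g. the '#q2:types' directive row
        ids.append(first_cell)
    return ids


def missing_from_metadata(data_ids, metadata_text):
    """Return the sorted IDs that appear in the data but not in the metadata."""
    known = set(metadata_ids(metadata_text))
    return sorted(set(data_ids) - known)


# The two metadata rows shown in this post, and the IDs named in the error.
metadata = (
    "#SampleID\tSeries\tSalt\tConcentration\n"
    "#q2:types\tcategorical\tcategorical\tcategorical\n"
    "HCl_49\tHCl\tNaCl\tHigh\n"
    "HCl_50\tHCl\tNaCl\tHigh\n"
)
error_ids = ["HCO3", "HCl", "LCO3", "LCl", "SW"]
print(missing_from_metadata(error_ids, metadata))
```

None of the IDs in the error carry the `_49`-style suffix that the metadata IDs have, so every one of them is reported as missing.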

I have tried to rename my sample IDs in the original .fasta file & the metadata from 'HCl_49' to 'HCl49' or 'HCl.49', but in doing so I get the import error:

There was a problem importing Hines_SeqData_Final.fasta:

  Hines_SeqData_Final.fasta is not a(n) QIIME1DemuxFormat file

It seems to only accept it as a demux file when I use the "_" in the headers.

So, I'm honestly not sure how to proceed. I feel like I'm either missing something really simple/obvious, or my data need to be completely reformatted to fit into the q2 workflow.

Any help/insight is GREATLY appreciated!

thermokarst · March 22, 2019, 8:13pm

How did you import? Did you dereplicate the sequences after import? Where did you get the feature table for computing the observed otus vector?

jhines1 · March 22, 2019, 9:08pm

I should start by correcting my OP. I started with DEMULTIPLEXED sequences, not dereplicated sequences.

To import, I used the code:

qiime tools import \
  --type 'SampleData[Sequences]' \
  --input-path Hines_SeqData_Final.fasta \
  --output-path Hines_RepSeqs.qza

Yes, I used q2-vsearch to dereplicate the sequences after import. Before that, I followed the q2-ghost-tree community tutorial guidelines for importing & rooting a pre-built tree and downloaded the UNITE database. I then clustered my sequences into 99% OTUs & created artifacts for my OTU table.

qiime vsearch dereplicate-sequences \
  --i-sequences Hines_Seqs.qza \
  --o-dereplicated-table Hines_table.qza \
  --o-dereplicated-sequences Hines_rep-seqs.qza

qiime vsearch cluster-features-closed-reference \
  --i-table Hines_table.qza \
  --i-sequences Hines_RepSeqs.qza \
  --i-reference-sequences sh_refs_qiime_ver7_99_01.12.2017_dev.qza \
  --p-perc-identity 0.99 \
  --o-clustered-table Hines_table_99.qza \
  --o-clustered-sequences Hines_RepSeqs_99.qza \
  --o-unmatched-sequences Hines_unmatched_99.qza

I then used the q2-diversity plugin to create that specific table using commands from this post:

qiime diversity alpha \
  --i-table Hines_table_99.qza \
  --p-metric observed_otus \
  --o-alpha-diversity Hines_observed_otus_vector.qza

thermokarst · March 22, 2019, 9:58pm

Perfect, thanks for clarifying! Just a minor note:

The SampleData[Sequences] input is technically multiplexed, not demultiplexed. Multiplexed means that all the reads from all the samples are in the same file.

Okay, so, I think I know what is happening here: your sample IDs are being mangled on import. Technically, the format uses the number after the underscore as the read number, so 'HCl_49' is read as sample 'HCl', read 49. That effectively strips the last bit off of your sample IDs, which is why the error lists 'HCl', 'HCO3', etc. instead of the IDs in your metadata. This is a shortcoming in documentation on our part, but if you followed the OTU Clustering Tutorial, there is a brief reference to the format description.
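
To make the mangling concrete, here is a small Python sketch of how a header of the form sample-ID_read-ID would be split. It assumes the split happens at the rightmost underscore. That is consistent with the IDs in the error above, but it is an assumption about the importer, not code taken from QIIME 2, and `split_header` is a hypothetical helper.

```python
def split_header(header):
    """Split a FASTA header ID into (sample_id, read_id).

    Assumption: the rightmost underscore separates the sample ID from the
    read ID, so sample IDs may themselves contain underscores.
    """
    tokens = header.lstrip(">").split()
    if not tokens:
        raise ValueError("empty FASTA header")
    seq_id = tokens[0]  # ignore any description after whitespace
    sample_id, sep, read_id = seq_id.rpartition("_")
    if not sep or not sample_id or not read_id:
        raise ValueError(f"{seq_id!r} is not of the form <sample-id>_<read-id>")
    return sample_id, read_id


print(split_header(">HCl_49"))    # ('HCl', '49')
print(split_header(">HCl_49_1"))  # ('HCl_49', '1')
```

Under the same assumption, a header with no underscore at all (such as 'HCl49' or 'HCl.49') has nothing to split on, which would fit the import error you saw after renaming.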

Taking a step back - do you have access to the original sequencing data? That is, sequences with quality scores? If so, it might be best to start there. If not, you will need to update the ID lines of Hines_SeqData_Final.fasta to include a read ID. It doesn't matter what that ID is, you just need something there.
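
If you only have the FASTA file, the header update could be scripted along these lines. This is a minimal Python sketch, assuming that appending `_<n>` (a running number) to each existing ID is an acceptable read ID. `add_read_ids` is a hypothetical helper, not a QIIME 2 command, and the sequences in the example are placeholders.

```python
def add_read_ids(lines):
    """Return FASTA lines with a running read number appended to each header ID."""
    out = []
    read_number = 0
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # Separate the ID from any free-text description after whitespace.
            parts = line[1:].split(None, 1)
            if not parts:
                raise ValueError("FASTA header with no ID")
            read_number += 1
            description = f" {parts[1]}" if len(parts) > 1 else ""
            out.append(f">{parts[0]}_{read_number}{description}")
        else:
            out.append(line)  # sequence lines pass through unchanged
    return out


# Placeholder records using the header IDs from the original post.
example = [">HCl_49", "ACGT", ">HCl_50", "ACGT"]
print("\n".join(add_read_ids(example)))
```

To apply this to Hines_SeqData_Final.fasta, read the file's lines, pass them through the function, write the result to a new file, and import that file instead. If the importer splits at the last underscore, the sample IDs should then come through as HCl_49, HCl_50, and so on, matching the metadata as it is.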

jhines1 · March 22, 2019, 10:13pm

Awesome! Thanks for the reply & the help. I'll take a look at this & try to correct things Monday.

Thanks again!
