Issue assigning taxonomy to Oxford Nanopore reads

Dear QIIME2 developers,
I am trying to import Oxford Nanopore MinION reads in qiime2-2019.7 for assigning taxonomy without doing any clustering. I have already demultiplexed and quality filtered them, and I am working with 3 fastq.gz files. I know that, after importing sequences as SampleData[SequencesWithQuality] I am supposed to dereplicate them, in order to get a FeatureData[Sequence] and a FeatureData[Frequency] artifacts. However, qiime vsearch dereplicate-sequences is changing the reads IDs, and this is causing issues with the downstream analysis.So, I kept only the table.qza FeatureData[Frequency] artifact produced by vsearch dereplicate-sequences and re-imported my reads converted to fasta as a FeatureData[Sequence] type artifact, but qiime feature-classifier classify-consensus-blast is complaining about the format, giving this error:

Plugin error from feature-classifier:

/tmp/qiime2-archive-lq34dzub/9f66d5cf-09b9-4b09-94f3-1960b1ab2f0b/data/dna-sequences.fasta is not a(n) DNAFASTAFormat file:

Invalid characters on line 22 (does not match IUPAC characters for a DNA sequence).

Debug info has been saved to /tmp/qiime2-q2cli-err-dgnhghd_.log

The commands I am using are:

DB=/home/simone/QIIME2_analysis/database/PRJNA33175_Bacterial_16S_sequence.qza
TAXONOMY=/home/simone/QIIME2_analysis/database/PRJNA33175_Bacterial_16S_taxonomy.qza
READS_MERGED=/home/simone/QIIME2_analysis/MinION_test/reads/reads.fasta
MANIFEST=/home/simone/QIIME2_analysis/MinION_test/manifest.txt

qiime tools import
–type ‘SampleData[SequencesWithQuality]’
–input-path $MANIFEST
–input-format ‘SingleEndFastqManifestPhred33V2’
–output-path sequences.qza

qiime vsearch dereplicate-sequences
–i-sequences sequences.qza
–o-dereplicated-table table.qza
–o-dereplicated-sequences rep-seqs.qza

rm rep-seqs.qza #removing them because of renaming issue

qiime tools import
–type ‘FeatureData[Sequence]’
–input-path $READS_MERGED
–input-format ‘DNAFASTAFormat’
–output-path rep-seqs.qza

qiime feature-classifier classify-consensus-blast
–i-query rep-seqs.qza
–i-reference-reads $DB
–i-reference-taxonomy $TAXONOMY
–p-perc-identity 0.8
–p-maxaccepts 1
–o-classification taxonomy.qza

What do you think I should do?
Thank you very much

Welcome to the forum @MaestSi!

How so? Is it actually mutating an existing feature ID or because the feature ID assigned to the dereplicated sequences is a hash ID? Since you are dereplicating sequences, the original feature IDs will become largely meaningless (unless if your ideal is to pick a single representative sequence and use its original feature ID downstream). Using the table but importing sequences with different feature IDs will give you a whole other pack of troubles downstream… I recommend sticking with both the table and sequences output by q2-vsearch, and there may be other tricks to map these feature IDs to your original sequences later on.

This is a very clear and well-intentioned error. You have invalid characters in your fasta file! You should look at those raw sequence files and remove the invalid characters. These could be non-IUPAC characters masquerading as DNA sequence, they could be spaces on the lines, or they could be a sign of a corrupted file. So a very noble error to warn you of potential data corruption before you get too far into your analysis!

I strongly recommend using the dereplicated sequences output by q2-vsearch — the issues it may raise with changing the feature IDs are solvable and minor compared to the issues caused by discarding these sequences.

Good luck!

1 Like

The reason why I am trying to do dereplication (with sequences which have a 10% mean error rate) is just for having artifacts with type FeatureData[Frequency] and FeatureData[Sequence].
If I have a look at imported sequences in file sequences.qza artifact after unzipping, the header of the first sequence is like this: @f144ba72-79c2-41c4-aded-9e9d9350afa2 runid=370af2633dbc2e9880e8a8c0c6aaab56fc358516 read=1007 ch=488 start_time=2018-08-01T10:38:38Z barcode=barcode01. Sequences are exactly as the ones that I have imported, with all bases on a single line.
Then, if I have a look at the rep-seqs.qza artifact after unzipping, only the first part of the original header is retained and a portion related to the sample name is added, e.g. >62eea7f79bf968e86bd062607a24db3577ee0a17 BC01_0; however, I noticed that in this fasta file (which, after unzipping is called dna-sequences.fasta), sequences have been reformatted, and the length of each row is now 80 nt.
IDs in the corresponding table.qza (after unzipping the file is called feature_table.biom) are instead of this kind: 62eea7f79bf968e86bd062607a24db3577ee0a17, so they include only the first part of the header).
If I then run taxonomic assignment, the error is the one I previously reported:

/tmp/qiime2-archive-lq34dzub/9f66d5cf-09b9-4b09-94f3-1960b1ab2f0b/data/dna-sequences.fasta is not a(n) DNAFASTAFormat file:
Invalid characters on line 22 (does not match IUPAC characters for a DNA sequence).

My guess is that the reformatting or the renaming, maybe due to spaces in the original headers, is causing the issue.
What do you think?
Thanks

Dear @Nicholas_Bokulich ,
I think I solved it, since Blast has now been running for a few hours without giving any error messages.
The problem was that the invalid characters on line 22 was not in my sequences rep-seqs.qza (which were called dna-sequences.fasta too), but it was in the reference database I downloaded from NCBI, since it contained newlines between each entry.
Just for your information, as you kindly advised I sticked to using the output from q2-vsearch.
Thanks,
Simone

Thanks for confirming @MaestSi! Sounds like you solved the issue since it is running.

Dear @Nicholas_Bokulich,
unfortunately there are still some issues in the taxonomic classification step.
From what I can see:

  • In the rep-seqs.qza, sequences headers all have the ‘samplename_counter’ suffix
  • In the table.qza, sequences header don’t have any suffix
  • In the taxonomy.qza, sequences are duplicated: for each sequence, there is one entry with the original header (with its own taxonomy assigned) and one entry with the ‘samplename_counter’ suffix with ‘Unassigned’ taxonomy.

Here I report the first lines of taxonomy.qzv visualized in qiime view.

Do you think I am doing something wrong?
Thanks

I was able to reproduce the same behavior also with a subset of Illumina data.
I just considered R1 of a paired-end experiment, imported them, dereplicated them with vsearch dereplicate-sequences and assigned taxonomy with feature-classifier classify-consensus-blast.

qiime tools import
–type ‘SampleData[SequencesWithQuality]’
–input-path $MANIFEST
–input-format ‘SingleEndFastqManifestPhred33V2’
–output-path sequences.qza

qiime vsearch dereplicate-sequences
–i-sequences sequences.qza
–o-dereplicated-table table.qza
–o-dereplicated-sequences rep-seqs.qza

qiime feature-classifier classify-consensus-blast
–i-query rep-seqs.qza
–i-reference-reads $DB
–i-reference-taxonomy $TAXONOMY
–p-perc-identity 0.8
–p-maxaccepts 1
–o-classification taxonomy.qza

qiime taxa barplot
–i-table table.qza
–i-taxonomy taxonomy.qza
–m-metadata-file sample-metadata.tsv
–o-visualization taxa-bar-plots.qzv

The error message is:

Plugin error from taxa:

Feature IDs found in the table are missing from the taxonomy: {‘cfaff36ed515430d643873a56626a183b326c1ce’, ‘3fc92e025a4125880880be1b42384bfb219d5768’, ‘299421424595900e8a263ec60a7ea157b074d22f’, ‘c84db8a467d652c507c7630fcdae70ccc2a0fde3’, ‘aebf2999f8979a930246432c5d09c16d0b56e951’, ‘ebccbef702c6bd78a8f299418151293bb3dd88c2’, ‘4ed60e4f1287b3ab63b808b1e8b9ce66c2bc156b’, ‘1f7f5575bf3ffcb9b308e39c4801d223549b9ba1’, ‘4634584b1b4694c07e6914a9b5c3e96c4d4bee91’, ‘26308f5b262017a473ca6c5b33cab471c06e9d8d’, ‘d0133a16754f6243c396fc7448d05c1dbf234340’, ‘3e9d8d3da4d66ce3c4930dced40fae38189cba15’, ‘e7dab4d5d64b5357e5ab654ee49543703b7ac660’, ‘059270ccb6720dd24498a0ba865e96f5fec80a33’, ‘e563469fa469f3de2ef805e5b4668a16c37cf8fa’, ‘afb2b81a7d574b212ba508058fa25993927e8341’, ‘580bf4dc33e418e34c9b9f5870c6e5481f85d6be’, ‘de919037d061903d7d0bf696ce60e93e4ee932ef’, ‘1450422365785c09d219dcb7d9e483f53a8383b7’, ‘7447f3e06da87fe124f09c749f707995931949b0’, ‘986c73541e0e119ea764db0598959e458d4b1b6f’, ‘9ed63a6d683b67ec9ebdcb18dca4ce23f4c535c2’, ‘84b715cc41a0dd6bd2e930c58c16fc7908c0dc4d’, ‘1c78399bea401b25d14d125d921fc6e288118c4d’, ‘93a8d7bde40bae71985d7952eee89dcc35450895’, ‘4b2f9c9e435953f550b782f5be23126e1cce7458’, ‘7908efec5bcce7cde2b4f1ae1d5dee544ac42431’, ‘edcdfeb5027a6b54ffb4d43a94fc5454baae27ea’, ‘4225bcbb9bf3c407fa5821e134d95487e2de5bb6’, ‘9adf74c412fc1bcc798d2d4071ee5c203708e024’, ‘c820667b9c82a94af9879a2683b905e030253337’, ‘f2d063bf6a9c683bb52371f5127bec2866af6b99’, ‘9abe695ea139d10bf724a0b7159a2a0255d485a9’, ‘3e47f8d81e9b2b64b242d74dd18ff59f4cd86144’, ‘7735476ea47d6aaf83eb392709f5be2751493f4e’, ‘31071cd61baa488ae9a4dff4c572808ca3d17c94’, ‘d0a9904589a412e0a639c3d6c546c22decbd4ef3’, ‘adbba612a0f99485c3ca0cab5f4011ce3950b78c’, ‘b347d7b4c183366fcad6585b5c849edfb242c72e’, ‘be598d6c718bb9a5ca3c5c93fbc5079ed8c381cd’, ‘d9925f5e577e200e6b57987891bd4eb5f46ad45a’}

This is a different error from what you are describing above.

Could you please share some of your QZA files so that we can attempt to debug locally?

You are right, I thought the error was related to failures in dereplications due to high error rate, but found out that it is not the case actually.
Attached, you can find some qza artifacts: subset_analysis.zip (2.6 MB)
Let me know if you need some more files or information.
Thanks

Hi, I am writing to inform you that I found a temporary fix to the issue I am facing.
I unzipped taxonomy.qza, renamed taxonomy.tsv to taxonomy_original.tsv and ran this:

SAMPLE_NAME=Mock_Sample
cat taxonomy_original.tsv | grep -v $SAMPLE_NAME > taxonomy_assigned.tsv
cat taxonomy_assigned.tsv | grep -v Feature | cut -f1 > assigned_IDs.txt
cat taxonomy_original.tsv | grep $SAMPLE_NAME | grep Unassigned | sed ‘s/ ‘"$SAMPLE_NAME"[0-9]*’//g’ > taxonomy_unassigned_tmp.tsv
cat taxonomy_original.tsv | grep -v Feature | sed 's/ '"$SAMPLE_NAME"
[0-9]*’//g’ | grep -v -f assigned_IDs.txt > taxonomy_unassigned.tsv
cat taxonomy_assigned.tsv taxonomy_unassigned.tsv > taxonomy_fixed.tsv
rm taxonomy_unassigned_tmp.tsv assigned_IDs.txt

Finally, I reimported the taxonomy_fixed.tsv file as a FeatureData[Taxonomy] artifact, and I was then able to obtain the barplots.
Thanks,
Simone

Hi @MaestSi,
Glad to hear you found a workaround!

It looks like this is a known bug, so will be fixed in a future release of QIIME 2: https://github.com/qiime2/q2-vsearch/issues/57

I also have a theory about what could be an easier workaround: what if you use dereplicate and then cluster-features-de-novo --p-perc-identity 1.0 (cluster into 100% OTUs). The second step would be technically unnecessary, but I wonder if it may reformat the feature IDs? (I do not know — have not dug into the code — but wonder if that may be occurring since we do not see this error more commonly, since most users using dereplicate is using OTU clustering afterwards). Would you mind giving that a spin and seeing what happens?

Hi @Nicholas_Bokulich,
the solution you proposed works too and can be implemented in a pipeline, so it is much better, thank you!
Simone

1 Like

Thank you @MaestSi for confirming!