Tracing ASV to .fastq sequences to ID Unassigned

Hi there Q2 White Wizards:

I have ran the following commands:
qiime feature-classifier classify-consensus-vsearch --i-query rep-seqs.qza --i-reference-taxonomy taxonomy_7_levels.99.qza --i-reference-reads silva_132_99.qza --o-classification Baha_taxa.qza --p-maxaccepts 1 --p-threads 25

qiime taxa barplot --i-table table.qza --i-taxonomy Baha_taxa.qza --m-metadata-file Bahamas_Metadata.txt --o-visualization Baha_taxa-bar-plots.qzv

And used the Qiime2 View to see these Baha_taxa-bar-plots.qzv. Unfortunately, my sequences are 70% unclassified. Boo!! So, looking at past forums such as this one: Classify-consensus-vsearch - high number of unassigned features I can see I am not alone in this. In his answer, Nicholas stated: "_Pull out some of the sequences that are not classifying, and run an NCBI blast search on them to see what these seqs could be." So, I'm on a dark, Alice in Wonderland quest to do just this.

I am trying to pull out all of my fastq sequences associated with the Unassigned ASV's to either 1) run blast on them or; 2) re-run another way of annotation w/in Qiime2. I must note that I have done this with Qiime 1 and I ended up with ~30% Unclassified's, but I know the answer bc I read it in another post - "maybe this is the correct annotation and not Qiime 1". So, onward on my quest:

After following this tutorial: Exporting and modifying BIOM tables (e.g. adding taxonomy annotations) on how to put your ASV, Taxa, and sample counts together (by the way at the end of that tutorial you get 2 files, one with ASV, the other with sample counts with ASV. I had to merge them in excel, but I digress ...).

Armed now with the ASV's, I found this tutorial Filtering fasta sequences file based on a list of OTUs - #5 by Ina that tries to filter fasta sequences from a list of OTUs. So, my data is is .fastq, but I don't think this matters. I noticed the --m-metadata-file was the list of OTUs, so

I tried to do just this:

  1. Used grep to remove all ASV that were labeled Unassigned and saved it to ASV_Unassigned.txt
    $head ASV_Unassigned.txt #These are all ASV's that are Unassigned
    #Sample ID
    63fe5b9af4b8589c284b7197a5b0abe2cb207a06
    6dd61a4f4eb1c52de616c5ba367a9eb18aa9c21e
    fc92df8284c115b32eb8b9aa193c27559186a18c
    d82a416ecea8d18af1ed0b5d48f2c43ddbce73c6

Next, I ran this, which promptly gave me the following error:

qiime tools import –input-path seqs.fna –output-path seqs.qza –type FeatureData[Sequence]

qiime feature-table filter-seqs --i-data seqs.qza --m-metadata-file ASV_Unassigned.txt --o-filtered-data ASV_filtered.qza
Plugin error from feature-table:

All features were filtered out of the data.

Debug info has been saved to /var/folders/z4/jfbc76zs7_l8q_4xb8rgtmf00000gn/T/qiime2-q2cli-err-uk08gy61.log

(qiime2-2018.8) ssrb-vpn1-5-82:Testing $ cat Debug info has been saved to /var/folders/z4/jfbc76zs7_l8q_4xb8rgtmf00000gn/T/qiime2-q2cli-err-uk08gy61.log
cat: Debug: No such file or directory
cat: info: No such file or directory
cat: has: No such file or directory
cat: been: No such file or directory
cat: saved: No such file or directory
cat: to: No such file or directory
Traceback (most recent call last):
File "/Users/miniconda2/envs/qiime2-2018.8/lib/python3.5/site-packages/q2cli/commands.py", line 274, in call
results = action(**arguments)
File "", line 2, in filter_seqs
File "/Users/miniconda2/envs/qiime2-2018.8/lib/python3.5/site-packages/qiime2/sdk/action.py", line 231, in bound_callable
output_types, provenance)
File "/Users/miniconda2/envs/qiime2-2018.8/lib/python3.5/site-packages/qiime2/sdk/action.py", line 362, in callable_executor
output_views = self._callable(**view_args)
File "/Users/miniconda2/envs/qiime2-2018.8/lib/python3.5/site-packages/q2_feature_table/_filter.py", line 112, in filter_seqs
raise ValueError('All features were filtered out of the data.')
ValueError: All features were filtered out of the data.

Ok .... I don't speak code hieroglyphic, so I'm wondering 1) Is this way of trying to ID the .fastq sequences of the ones annotated as Unassigned is correct and; 2) What does that error mean?

Many Thanks!!
Purrsia

1 Like

Hi @Purrsia_Felidae,
Thanks for the detailed workflow explanation! That is one way to do this, but I think I can propose some easier routes.

So at the end of the day you want to know which ASVs are unassigned? And maybe reclassify those with a different method to get a "second opinion"?

One easy, interactive way to do this is use qiime metadata tabulate exactly as described here to merge your sequences with your taxonomy. You will get a visualization that you can search or sort however you like (e.g., by taxonomy). This is useful, e.g., if you want to look at different assignments. Maybe unassigned seqs aren't the only problem? Maybe you want to see what sequences were only classified to phylum level? This is the method for you.

Another easy thing to do is use qiime taxa filter-seqs as described here to keep only sequences that are unassigned. Then:

  1. You can reclassify just those sequences using a different classifier
  2. You can then use qiime feature-table tabulate-seqs as described here to get a visualization of those sequences. Click on any sequence to link out to an NCBI BLAST search of that sequence... convenient for quick exploration.

I think this should do what you want, with a whole lot less trouble. But please let me know if I'm missing something!

Now on to the problem itself: you have 70% unclassified! You have browsed around the forum so have seen the reasons why this may be occurring. Having not seen your barplots, it sounds like you have a mixture of well-classified sequences (e.g., genus? species?) but lots and lots of unclassified. In those cases it usually is not an issue with the classifier (unlike when all sequences are poorly classified, in which case it is usually user error, e.g., wrong reference sequences). When there is a mixture, the unclassified sequences are usually:

  1. something that should not be classified! E.g., plant DNA or other non-target DNA that is not in your reference database.
  2. your sequences are in mixed orientations (i.e., forward and reverse complement relative to the reference database). The classify-sklearn method can currently only handle sequences in a single orientation... the classify-consensus methods can handle mixed orientations, though! So give those a try if you know your sequences are mixed orientation.

I hope that helps!

Dear Nicholas:

Thank you very much for you reply. My apologies for responding so late - I had several jobs to run before I could answer w/ the results of your suggestions.

First off, thank you for your suggestion to use the classify-consensus method. As you can see above, this was the method that I used: qiime feature-classifier classify-consensus-vsearch.

Secondly - upstream, my sequences were pulled out using SortMeRNA tool against the SILVA_132_SSURef_Nr99_tax_silva.fasta (downloaded from the SILVA website directly) at a .97 cut off. So, there isn't any plant or non-target DNA. All of these sequences aligned with that reference database. So, theoretically, they should also be in the correct orientation.

Third, I wanted to say a HUGE, BIG thank you for the qiime metadata tabulate command. It was exactly what I was looking for. I also wanted to point that I ran into a bit of an issue with it. Here is what i did:

First: how I got my rep-seqs.qza
Qiime_1: split_libraries.py command to get a seqs.fna

Qiime_2: qiime tools import to import the seqs.fna as --type SampleData[Sequences] --ouput-path seqs.qza --input-path seqs.fna

qiime vsearch dereplicate-sequences --i-sequences seqs.qza --o-dereplicated-table table.qza --o-dereplicated-sequences rep-seqs.qza

As you can see above, the Baha_taxa.qza was generated by the qiime feature-classifier classify-consensus-vsearch command. And the --i-reference-reads silva_132_99.qza I got by importing the silva_132_99.fna that is available on the SILVA website under their directory Qiime.

Now I ran the command you suggested to get the following error:
qiime metadata tabulate --m-input-file rep-seqs.qza --m-input-file Baha_taxa.qza --o-visualization tabulated-feature.metadata.qzv
There was an issue with viewing the artifact SB_taxa.qza as QIIME 2 Metadata:

CategoricalMetadataColumn does not support values with leading or trailing whitespace characters. Column 'Taxon' has the following value: 'D_0__Bacteria;D_1__Planctomycetes;D_2__Phycisphaerae;D_3__Phycisphaerales;D_4__Phycisphaeraceae;D_5__uncultured;D_6__uncultured bacterium '

As you can see there is a white space here. So it appears that the taxonomy_7_levels.99.qza (which is just the taxonomy_7_levels.99.txt on the SILVA website under the Qiime directory, imported as a .qza file) has trailing white spaces.

So, I inputted the taxonomy_7_levels.99.txt into R and ran the command:
taxonomy$Taxon <- trimws(taxonomy$Taxon)
colnames(taxonomy)[1] <- "Feature ID"

Next, I imported back into Qiime_2
qiime tools import --input-path taxonomy.tsv --type FeatureData[Taxonomy] --output-path SB2_taxa.qza
Imported taxonomy.tsv as TSVTaxonomyDirectoryFormat to SB2_taxa.qza

Now I could run the tabulate command:
qiime metadata tabulate --m-input-file rep-seqs.qza --m-input-file SB2_taxa.qza --o-visualization tabulated-feature.metadata.qzv

Ok - after a few nights and TONS of coffee later ....
I was able to hunt down all my unclassified sequences through awk and grep commands to end up with a file called: unassigned.fna. This was all of my fasta sequences that came back as 'Unassigned' by the classify-consensus-vsearch (which as 73% of my data, not 70% as I mentioned above, that was a typo).

I ran the following command on this to get a "second opinion" of sorts and see if I can classify more of these sequences, so I used qiime tools import to import my unassigned.fna, dereplicate them (got a unassigned_rep-seqs.qza) via the qiime vsearch dereplicate-sequences command, and re-ran with the following:

qiime feature-classifier classify-consensus-blast --i-query unassigned_rep-seqs.qza --i-reference-taxonomy taxonomy_7_levels.99.qza --i-reference-reads silva_132_99.qza --o-classification unassigned_taxa.qza --p-maxaccepts 1

This ended up classifying only 5% of my unassigned sequences.

So, before using the classify-consensus-vsearch, I had 27% assigned (bc 73% unassigned) + 5% assigned, brings me to a total of 32% assigned (assuming I merge the 2 ASV tables).

Next, I blasted some of these unassigned sequences via good old fashioned blast, and these are good sequences with insanely low e-values and % identities of above the cut-off threshold for the commands I use.

Ok - At this point, I went "old-school" and busted out with Qiime_1 to see if I was having the same issue there bc previously, I've used Qiime_1 with the exact same workflow (I've published this already) with nowhere near as many unassigned sequences.

So, using the same database as I used for Qiime_2 (the one downloaded in the SILVA website under their Qiime directory), I ran the following command:

pick_open_reference_otus.py -i seqs.fna -p parameters.txt -r silva132_99.fna -o SB_Qiime1_taxa
where my parameters.txt was:
align_seqs.py:template_fp 80_core_alignment.fna
assign_taxonomy:reference_seqs_fp silva132_99.fna
assign_taxonomy:id_to_taxonomy_fp taxonomy_7_levels.99.txt

All of my files in the parameters.txt were included in the Qiime directory in the SILVA website.

Here are my results:
Using the otu_table_mc2_w_tax_no_pynast_failures.biom generated, from the command, I ended up with 2225 sequences as Unassigned or 6.03%

I am not sure why there is such a large discrepancy. I see that qiime_1 uses the uclust picking method by default while the classify-consensus-vsearch performs a VSEARCH global alignment. Based on the results I am getting, I get way more assignment using the uclust picking method for my own personal data.

Lastly, I wanted to say thanks for your suggestions!!

Thanks for clarifying, I missed that above. Yes, that is much more surprising — I would expect fewer unclassified sequences with vsearch (if only because it is more transparent to use and does not have the mixed read orientation issue).

There are lots of differences here, so this is effectively comparing apples vs. oranges. It is difficult to discern whether these differences are due to classification issues or something else.

  1. vsearch and uclust are both global aligners, but different default consensus taxonomy parameters are being used for classification in qiime1 vs. qiime2 classify-consensus-vsearch
  2. You are looking at the otu_table_mc2_w_tax_no_pynast_failures table (some of your QIIME 2 unclassified sequences could be failing alignment in QIIME 1)
  3. You are comparing OTUs vs. ASVs (I assume). This is going to dramatically change the # of sequences and distort this comparison all out of proportion — e.g., OTU picking will not remove noisy sequences, so you will get a long tail of low-abundance (e.g., abundance = 1) OTUs that might not be unassignable (because they will be different length or > 3% different from other OTU centroids so become their OTUs, even though they may just be error-riddled copies of other classifiable sequences). You should collapse both qiime1 and qiime2 results on taxonomy and then determine what proportion of taxa are unassigned, this will be a quantitative comparison, though it does not deconvolute this comparison.

I recommend adjusting the classify-consensus-vsearch parameters, e.g., you could change maxaccepts to 1, which will just find the top hit for each sequence. This is not recommended for accurate classification unless if you at least set perc-identity to a really high value, but you can at least do that as a sanity check (unassigned sequences should go way down unless if something is wrong with (most likely) the reference db or the query seqs (which I think you've ruled out)).

Please give these recommendations a spin and let us know what you find!

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.