Features associated with ASV

megladds · August 9, 2018, 6:15pm

Hello,

So I'm trying to BLAST some of my sequences. I was able to use filter_seqs to separate out the taxonomies that I wanted to target, and then combine the sequences and taxonomies via feature ID using metadata tabulate and view them.

However I have multiple ASVs that are pretty similar, but I am assuming they are in fact different species.

What I want to do is be able to find which feature IDs and sequences are associated with each of these ASVs so I can then BLAST the correct one to check it.

Is this possible?

Thank you,
Megan

Nicholas_Bokulich · August 10, 2018, 6:10pm

Hi @megladds!

I'm a little confused here — it sounds like you are already using metadata tabulate to merge your sequence and taxonomy feature data... that would contain the feature IDs for each ASV and all other information that you need. Could you elaborate a little more, maybe upload the output QZV to help describe the issue?

Keep in mind that NCBI BLAST will tell you what the top hit is, but that may not actually be the "correct" classification. For short amplicon marker genes it is often difficult to ascertain species, and the taxonomy classifiers used in QIIME 2 are already working hard to give you the most reliable, deepest taxonomy classification possible (of course much of that also depends on the reference database quality). It is often useful to get a second opinion from NCBI BLAST (I often recommend that, and do so myself when something doesn't seem right or when I'm focusing on a specific ASV), but I would be wary about if/how I report that assignment.

You can use qiime feature-classifier classify-consensus-blast with the parameter --p-maxaccepts 1 to get a top-hit assignment to whatever reference database you want to use... so you can use that to replicate the functionality in NCBI blast (same algorithm, and same database if you download and format it correctly). So it's all possible to automate blast searches against the NCBI nucleotide database — but we just don't recommend it because it's not that accurate for short amplicon seqs.

Not to discourage you from what you are doing. You are probably already aware of the issues above; I just feel compelled to mention them.

megladds · August 10, 2018, 6:59pm

Sorry, maybe I am confused then.

My final output after running through a classifier using the SILVA database has ASVs that are labelled like this:

D_0__Eukaryota;D_1__Archaeplastida;D_2__Chloroplastida;D_3__Chlorophyta;D_4__Chlorophyceae;Ambiguous_taxa;Ambiguous_taxa;Ambiguous_taxa;D_8__;D_9__;D_10__;D_11__;D_12__;D_13__;D_14__

D_0__Eukaryota;D_1__Archaeplastida;D_2__Chloroplastida;D_3__Chlorophyta;D_4__Chlorophyceae;Ambiguous_taxa;D_6__;D_7__;D_8__;D_9__;D_10__;D_11__;D_12__;D_13__;D_14__

I'm assuming that the final classified results are each unique species, but they are ambiguous since that level didn't quite match.
What I want to do is use NCBI BLAST on some of the sequences associated with the similar ASVs above, and I just want to be confident that I'm putting the right sequences for each ASV into BLAST.

I guess what I don't know is whether all the features and their assigned taxonomies are matching the overall ASV output (as in the features will match the ASV taxonomy output exactly).

I'm sorry this is a confusing question.

Nicholas_Bokulich · August 10, 2018, 7:18pm

Maybe, maybe not. It is impossible to tell, since these are ambiguous levels in the SILVA database.

SILVA has different taxonomy reference files (depending on which version of the qiime-compatible release you are using). Looks like you used "all levels", but there should also be "consensus" and "majority" taxonomy files with seven levels. I would recommend using one of those instead, as these should have fewer ambiguous labels.

These are the labels in SILVA, not inexact matches with q2-feature-classifier — it actually looks like you have a species-level match! But that species has no species label in SILVA.

Got it. I would recommend re-classifying with a different taxonomy file and see if that changes things — you could also filter out those specific features and try reclassifying with a different database or just export the sequences and do a batch upload to NCBI blast.

But I agree, when the databases have ambiguous labels like this I would turn to blast (or elsewhere) to get a better idea of the ID if those are important ASVs... if these were truncated classifications from q2-feature-classifier (e.g., something like D_0__Eukaryota;D_1__Archaeplastida;D_2__Chloroplastida;D_3__Chlorophyta), that would indicate there are several good matches and the taxonomy cannot be resolved so I would probably not bother with re-classifying in that case.
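To tell those two cases apart across many features, something like the following Python sketch could help. It is only an illustration under stated assumptions: it treats a rank as uninformative if it is exactly `Ambiguous_taxa` or ends in a bare `__` (as in `D_8__`), and the helper names are hypothetical, not part of QIIME 2.

```python
def is_uninformative(rank):
    """Assumption: 'Ambiguous_taxa' and empty ranks such as 'D_8__' carry no label."""
    return rank == "Ambiguous_taxa" or rank.endswith("__")


def has_ambiguous_labels(taxon):
    """True if any rank in the taxonomy string is an ambiguous/empty database label."""
    return any(is_uninformative(rank) for rank in taxon.split(";"))


def deepest_informative_rank(taxon):
    """Return the last rank that carries a real label, or None if there is none."""
    informative = [rank for rank in taxon.split(";") if not is_uninformative(rank)]
    return informative[-1] if informative else None


# Taxonomy string with ambiguous database labels (from the post above).
ambiguous = (
    "D_0__Eukaryota;D_1__Archaeplastida;D_2__Chloroplastida;D_3__Chlorophyta;"
    "D_4__Chlorophyceae;Ambiguous_taxa;D_6__;D_7__;D_8__;D_9__;D_10__;D_11__;"
    "D_12__;D_13__;D_14__"
)
# Truncated classification, where the classifier stopped at a shallower rank.
truncated = "D_0__Eukaryota;D_1__Archaeplastida;D_2__Chloroplastida;D_3__Chlorophyta"

for taxon in (ambiguous, truncated):
    print(has_ambiguous_labels(taxon), deepest_informative_rank(taxon))
```

Features flagged by `has_ambiguous_labels` would be the candidates for a second opinion from BLAST; the truncated ones would not.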

Looks like these particular taxa are chloroplast! Maybe you are looking at plants with 18S... but if you are looking at bacteria you probably want to just filter these out anyway (I'm assuming!) since it would most certainly be host plastid DNA.

The ASVs are labeled by the feature IDs... usually this is a unique string of numbers and letters like a6f8da6gb6b8g6ads86gj09fdfa. Those unique feature IDs will be used to label all ASVs in the feature table, in the sequences, and in the taxonomy classification.

I am not 100% certain what you mean here but I think yes, that same feature ID will be used in the feature table and all feature data files.
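To make that correspondence concrete, here is a minimal Python sketch. It assumes the sequences and taxonomy have already been exported into plain dictionaries keyed by feature ID; the IDs, sequences, and the `fasta_for_taxon` helper are all hypothetical and not part of QIIME 2. The point is only that one feature ID ties a sequence to its taxonomy, so you can pull out exactly the sequences you want to paste into BLAST.

```python
# Hypothetical exported data; both mappings share the same feature IDs.
sequences = {
    "asv_a": "ACGTGATCGTAGCTGAC",
    "asv_b": "TTGTGATCCTGAAGCTG",
    "asv_c": "TGTGTGACGTAGCTGAC",
}
taxonomy = {
    "asv_a": "D_0__Eukaryota;D_1__Archaeplastida;D_2__Chloroplastida;D_3__Chlorophyta",
    "asv_b": "D_0__Eukaryota;D_1__Archaeplastida;D_2__Chloroplastida;D_3__Chlorophyta",
    "asv_c": "D_0__Eukaryota;D_1__Archaeplastida;D_2__Chloroplastida;D_3__SomethingElse",
}


def fasta_for_taxon(sequences, taxonomy, target):
    """Return FASTA text for every feature whose taxonomy string equals `target`."""
    records = []
    for feature_id, taxon in sorted(taxonomy.items()):
        if taxon == target and feature_id in sequences:
            # The feature ID goes in the header so each BLAST hit can be traced back.
            records.append(f">{feature_id} {taxon}\n{sequences[feature_id]}")
    return "\n".join(records)


target = "D_0__Eukaryota;D_1__Archaeplastida;D_2__Chloroplastida;D_3__Chlorophyta"
fasta = fasta_for_taxon(sequences, taxonomy, target)
print(fasta)
```

Two features share the target taxonomy here, so two separate records come out, each still labeled with its own feature ID.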

Not at all! Sorry I am having trouble interpreting. We are here to sort out confusing problems.

megladds · August 10, 2018, 7:29pm

Alright I think I understand now.

So how it works is my sequences are run through the classifier I built using SILVA (you're correct, I'm looking at 18S plant data) and then each sequence is assigned a taxonomy based on that classifier and then all the assigned sequences are combined (added up) and that would be my final counts? So the taxonomic assignments would be the same for all of the same sequences? Is that right?

Nicholas_Bokulich · August 10, 2018, 7:58pm

Let's say you have a feature table that looks like this:

SampleID     feature1    feature2    feature3
s1                7          1          9
s2                3          6          2
s3                0          13         0

The accompanying sequences file would look like this:

feature1       ACGTGATCGTAGCTGAC
feature2       TTGTGATCCTGAAGCTG
feature3       TGTGTGACGTAGCTGAC

and if you attempt to classify those sequences you will get a FeatureData[Taxonomy] artifact that looks like this on the inside:

feature1       D_0__Eukaryota;D_1__Archaeplastida;D_2__Chloroplastida;D_3__Chlorophyta
feature2       D_0__Eukaryota;D_1__Archaeplastida;D_2__Chloroplastida;D_3__Chlorophyta
feature3       D_0__Eukaryota;D_1__Archaeplastida;D_2__Chloroplastida;D_3__SomethingElse

So the feature IDs in the feature table, sequences, and taxonomy all correspond. If you use qiime taxa collapse to collapse your feature table by taxonomy, you would get something like this:

SampleID     D_3__Chlorophyta    D_3__SomethingElse
s1                  8                   9
s2                  9                   2
s3                 13                   0

So that command will aggregate ASVs that share the same taxonomy, summing their frequencies. But unless you use that command, the ASVs will all remain as distinct features in the feature table.
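The arithmetic behind that collapsed table can be sketched in a few lines of plain Python. This is only an illustration of the summing idea using the toy tables above, not how qiime taxa collapse is actually implemented; the `collapse_by_taxonomy` helper is hypothetical.

```python
from collections import defaultdict

# Toy feature table from the example above: sample -> {feature ID -> count}.
feature_table = {
    "s1": {"feature1": 7, "feature2": 1, "feature3": 9},
    "s2": {"feature1": 3, "feature2": 6, "feature3": 2},
    "s3": {"feature1": 0, "feature2": 13, "feature3": 0},
}
chlorophyta = "D_0__Eukaryota;D_1__Archaeplastida;D_2__Chloroplastida;D_3__Chlorophyta"
something_else = "D_0__Eukaryota;D_1__Archaeplastida;D_2__Chloroplastida;D_3__SomethingElse"
taxonomy = {
    "feature1": chlorophyta,
    "feature2": chlorophyta,
    "feature3": something_else,
}


def collapse_by_taxonomy(table, taxonomy):
    """Sum counts of all features that share the exact same taxonomy string."""
    collapsed = {}
    for sample, counts in table.items():
        summed = defaultdict(int)
        for feature_id, count in counts.items():
            summed[taxonomy[feature_id]] += count
        collapsed[sample] = dict(summed)
    return collapsed


collapsed = collapse_by_taxonomy(feature_table, taxonomy)
for sample, counts in collapsed.items():
    print(sample, counts[chlorophyta], counts[something_else])
```

feature1 and feature2 are summed because their taxonomy strings are identical, while feature3 stays on its own. The original `feature_table` is left untouched, which mirrors the point that ASVs remain distinct unless you collapse them.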

If you use qiime taxa collapse, yes, those ASVs given the same assignment are combined. Otherwise, no. (qiime taxa barplot also sums these together when creating taxa barplots)

All of your ASVs have unique sequences... that's what makes them distinct ASVs. Their feature IDs are actually unique names that will always be given to that exact same sequence! So if you did the same analysis twice, or merged multiple runs of the same read/trim length and amplicon, you would be able to compare the same ASV in each run.
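As a rough illustration of why the same sequence always ends up with the same ID: as far as I know, the denoising plugins hash each sequence with MD5 by default and use the digest as the feature ID. Assuming your IDs were generated that way, the idea can be sketched with Python's standard `hashlib`; the `feature_id_for` helper is hypothetical.

```python
import hashlib


def feature_id_for(sequence):
    """Assumption: the feature ID is the MD5 hex digest of the exact sequence string."""
    return hashlib.md5(sequence.encode("ascii")).hexdigest()


# The same sequence seen in two separate runs gets the same ID...
id_run1 = feature_id_for("ACGTGATCGTAGCTGAC")
id_run2 = feature_id_for("ACGTGATCGTAGCTGAC")
# ...while a different sequence gets a different ID.
id_other = feature_id_for("TTGTGATCCTGAAGCTG")

print(id_run1 == id_run2)
print(id_run1 == id_other)
```

This also shows why the trim length matters: a sequence trimmed to a different length is a different string, so it would hash to a different ID.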

A single ASV should ideally be assigned the same taxonomy if you attempted to classify it multiple times. Multiple ASVs can receive the same assignment — even at species level — because strain-level differences can exist within the species (I do not know how variable 18S is for plants, but this is usually the case for other amplicons). So ASVs do not always represent unique species, just unique sequences.

I hope that clarifies!

megladds · August 10, 2018, 10:08pm

Yes that clarifies things a bit. I used taxa barplot so things are combined/summed for me at the end, which is I think where I was getting the most confused (since the original feature table has each one individually).

So, when I use taxa barplot it would combine any that have EXACTLY the same taxonomic assignment or does it just need to be similar enough..?

Nicholas_Bokulich · August 10, 2018, 10:11pm

It combines all ASVs that have the exact same taxonomic assignment up to level X.
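A small Python sketch of what "exact same assignment up to level X" means, under the assumption that a level is simply the number of semicolon-separated ranks kept. The ASV names, counts, and helper functions are hypothetical and only illustrate the grouping logic, not the plugin's actual code.

```python
from collections import defaultdict


def truncate_taxonomy(taxon, level):
    """Keep only the first `level` semicolon-separated ranks of a taxonomy string."""
    return ";".join(taxon.split(";")[:level])


def group_at_level(counts, taxonomy, level):
    """Sum counts of ASVs whose taxonomy is identical once truncated to `level` ranks."""
    grouped = defaultdict(int)
    for asv, count in counts.items():
        grouped[truncate_taxonomy(taxonomy[asv], level)] += count
    return dict(grouped)


prefix = "D_0__Eukaryota;D_1__Archaeplastida;D_2__Chloroplastida"
asv_counts = {"asvA": 5, "asvB": 2, "asvC": 4}
asv_taxonomy = {
    "asvA": prefix + ";D_3__Chlorophyta;D_4__Chlorophyceae",
    "asvB": prefix + ";D_3__Chlorophyta;D_4__SomethingElse",
    "asvC": prefix + ";D_3__SomethingElse;D_4__SomethingElse",
}

level4 = group_at_level(asv_counts, asv_taxonomy, 4)
level5 = group_at_level(asv_counts, asv_taxonomy, 5)
print(level4)
print(level5)
```

At four ranks, asvA and asvB share an identical string and are summed; at five ranks they differ and stay separate. Nothing is merged for being merely "similar".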

If you want to view the abundance of each ASV separately, you can use qiime feature-table heatmap. (heatmap displays abundance of each feature in each sample — so those features can be taxa if you use qiime taxa collapse, or ASVs if you do not)

megladds · August 10, 2018, 10:16pm

Alright, thank you!
