Can I turn fasta into qza for taxonomic classification?

microbiotaphyto · October 4, 2022, 12:14pm

Hello everyone,

I have a database ready to use for taxonomic classification, but it's a fasta file. Is it possible to convert it into qza to use it in qiime2?

colinbrislawn · October 4, 2022, 1:16pm

Yes, yes you can. And we have a plugin, RESCRIPt, that will help you prepare your database for use with QIIME 2.

Not only will that make your FASTA file a qza file, it will also process and benchmark it, so you know how well it will work!
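For comparison, a plain import without any of the extra processing might look like the sketch below. It assumes you also have a taxonomy table for your database (a headerless, two-column TSV of sequence ID and taxonomy string), because a FASTA file alone is not enough to train a classifier. The function name and the output file names are placeholders, and the commands are wrapped in a function so nothing runs until you call it.

```shell
# Sketch: import a reference FASTA plus its taxonomy table as QIIME 2 artifacts.
# import_reference and the output file names are placeholders, not part of QIIME 2.
import_reference() {
    local fasta="$1"         # reference sequences in FASTA format
    local taxonomy_tsv="$2"  # no header; columns: sequence ID <tab> taxonomy string

    # Sequences become a FeatureData[Sequence] artifact.
    qiime tools import \
        --type 'FeatureData[Sequence]' \
        --input-path "$fasta" \
        --output-path ref-seqs.qza || return 1

    # The matching lineages become a FeatureData[Taxonomy] artifact.
    qiime tools import \
        --type 'FeatureData[Taxonomy]' \
        --input-format HeaderlessTSVTaxonomyFormat \
        --input-path "$taxonomy_tsv" \
        --output-path ref-taxonomy.qza
}
```

With a QIIME 2 environment active, you would call it as `import_reference my-database.fasta my-taxonomy.tsv`. The two artifacts can then be passed to `qiime feature-classifier fit-classifier-naive-bayes` or to the consensus classifiers.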

microbiotaphyto · October 4, 2022, 1:31pm

Thank you!!

I have been reading the QIIME 2 docs to learn about taxonomic classification, and I found this page, which provides a few classifiers: Data resources — QIIME 2 2022.8.3 documentation

The thing is, I don't understand the difference between them: there's Naive Bayes at the beginning, then weighted classifiers, then marker gene reference databases.
Can anyone help me understand how they differ?

Also, I see there are "pre-formatted SILVA reference sequence and taxonomy files processed using RESCRIPt". But what is the difference between a reference sequence file and a taxonomy file?

I thought I just had to get one classifier that works for my data (V4 region), but now I'm super confused about all those kinds of classifiers.

microbiotaphyto · October 4, 2022, 1:48pm

PS.: I had previously tried to do a taxonomic classification which generated a taxonomy.qzv, but the IDs in this output file didn't relate the identified taxa to the read where it was identified.

An example of an ID is 4178779a98f7fe74a58952d9dbba5434 - I can't tell what it means or to which fastq or which read it's related.

Now that I see so many kinds of files involved in the taxonomic classification, I know I probably did it wrong... Can anyone help me with that, please?

SoilRotifer · October 4, 2022, 2:52pm

Hi @microbiotaphyto,

You can use either set of these classifiers. Many people simply use the standard Naive Bayes classifiers. It does not hurt to try the weighted classifiers and see if they help improve taxonomy assignment. You can read more about the utility of weighted classifiers in this manuscript.

In brief, the standard classifiers assume that all species in the reference database are equally likely to be observed. The weighted classifiers attempt to address this issue by incorporating environment-specific taxonomic abundance information to improve taxonomic classification. If your samples come from one of the 14 environments studied in the manuscript, then these classifiers may help.

The pre-formatted SILVA reference sequence and taxonomy files are simply the FASTA and taxonomy files that have been processed through RESCRIPt, generally following the approach described in the tutorial that @colinbrislawn linked to earlier. The reference sequence file holds the sequences, and the taxonomy file maps each sequence ID to its lineage. Together, these are the files used to make the classifiers. We provide these files so users can leverage other taxonomy assignment methods, such as classify-consensus-blast or classify-consensus-vsearch.

Often users opt to make and use an amplicon-specific classifier (e.g. V4 for your data) to improve taxonomy assignment (Werner et al., Bokulich et al.). But it is perfectly fine to use the classifier trained on the full-length sequence data. You can always try both. I'd suggest simply using the pre-made V4 classifier to start with.
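As a rough sketch of that starting point, assuming you have downloaded a pre-trained classifier that matches your QIIME 2 release and that your representative sequences are in an artifact such as rep-seqs.qza, classification might look like this. The function name and output file names are placeholders.

```shell
# Sketch: assign taxonomy with a pre-trained Naive Bayes classifier.
# classify_reads and the output file names are placeholders, not part of QIIME 2.
classify_reads() {
    local classifier="$1"  # pre-trained classifier artifact (.qza)
    local rep_seqs="$2"    # FeatureData[Sequence] artifact with your ESV/OTU sequences

    qiime feature-classifier classify-sklearn \
        --i-classifier "$classifier" \
        --i-reads "$rep_seqs" \
        --o-classification taxonomy.qza || return 1

    # Visualization listing Feature ID, Taxon and Confidence.
    qiime metadata tabulate \
        --m-input-file taxonomy.qza \
        --o-visualization taxonomy.qzv
}
```

You would call it as `classify_reads <classifier>.qza rep-seqs.qza`. Swapping in the full-length or a weighted classifier only changes the first argument, which makes it easy to compare them.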

If you would like to look at the sequence associated with a feature ID such as 4178779a98f7fe74a58952d9dbba5434, you can cross-reference the ID by generating a QZV using the qiime feature-table tabulate-seqs command.
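A minimal sketch of that cross-referencing, assuming your denoising step produced a representative-sequences artifact (called rep-seqs.qza here) alongside taxonomy.qza. IDs like the one above are typically an MD5 hash of the sequence itself, so they identify a unique sequence rather than a single read or fastq file. The function name and output file names are placeholders.

```shell
# Sketch: look up the sequences behind the hashed feature IDs.
# show_feature_sequences and the output file names are placeholders.
show_feature_sequences() {
    local rep_seqs="$1"  # FeatureData[Sequence] artifact
    local taxonomy="$2"  # FeatureData[Taxonomy] artifact

    # Feature ID -> sequence.
    qiime feature-table tabulate-seqs \
        --i-data "$rep_seqs" \
        --o-visualization rep-seqs.qzv || return 1

    # Feature ID -> sequence plus taxon and confidence, in one table.
    qiime metadata tabulate \
        --m-input-file "$rep_seqs" \
        --m-input-file "$taxonomy" \
        --o-visualization feature-metadata.qzv
}
```

Call it as `show_feature_sequences rep-seqs.qza taxonomy.qza` and open the resulting QZV files in QIIME 2 View.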

microbiotaphyto · October 4, 2022, 6:03pm

That shows me the Feature ID, the Taxon and the Confidence, right? But how can I relate this Feature ID to the fastq/sample it came from? Sorry if I'm getting it all wrong, I think the whole idea is still confusing to me.

microbiotaphyto · October 4, 2022, 6:24pm

qiime taxa barplot

This generates an interactive barplot that allows me to visualize the taxonomy and frequency in each sample. Can I get this in a table instead of a qiime view interactive plot?

SoilRotifer · October 4, 2022, 10:26pm

There are a few ways to investigate this...

First, you can create a visualization of your feature-table with qiime feature-table summarize, and click on the 'Feature Detail' Tab at the top. This will list all the features and provide how many samples and how many times each feature appears in your data set. You can view the example on the QIIME 2 View page.
Secondly, you can run the following command to generate a tabulated view of your feature-table:

qiime metadata tabulate \
    --m-input-file table.qza \
    --o-visualization table.qzv

This will show you the features, as columns, and the number of times they appear in each sample. You can click on the "Download metadata as TSV file" tab if you'd like to view it in Excel, etc.
Bonus tip: You can combine this tabulation step with the qiime taxa collapse command. That is, collapse your features (i.e. combine them if they have the same taxonomy) first, then feed that output into the tabulate command. This will provide taxonomy, as columns:

qiime taxa collapse \
    --i-table table.qza \
    --i-taxonomy taxonomy.qza \
    --p-level 7 \
    --o-collapsed-table table-l7.qza

qiime metadata tabulate \
    --m-input-file table-l7.qza \
    --o-visualization table-l7.qzv

Optionally, you can transpose the feature table before making the visualization with qiime feature-table transpose.

Finally, if you would like a table along with its taxonomy, you can export the table and append the taxonomy (i.e. old QIIME 1-like format) by following the instructions referenced here, or here.
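In outline, that export route might look like the sketch below. It assumes the biom command-line tool is available (it ships with QIIME 2 environments), and it rewrites the taxonomy header because biom expects the first column to be named #OTUID. The function name and the biom-taxonomy.tsv and table-with-taxonomy.* file names are placeholders.

```shell
# Sketch: export a feature table and append taxonomy as an extra column.
# export_table_with_taxonomy and the intermediate file names are placeholders.
export_table_with_taxonomy() {
    local table="$1"     # FeatureTable[Frequency] artifact
    local taxonomy="$2"  # FeatureData[Taxonomy] artifact
    local outdir="$3"    # directory that receives the exported files

    # Writes feature-table.biom and taxonomy.tsv into $outdir.
    qiime tools export --input-path "$table" --output-path "$outdir" || return 1
    qiime tools export --input-path "$taxonomy" --output-path "$outdir" || return 1

    # Replace the exported header line with the column names biom expects.
    { printf '#OTUID\ttaxonomy\tconfidence\n'; tail -n +2 "$outdir/taxonomy.tsv"; } \
        > "$outdir/biom-taxonomy.tsv" || return 1

    # Attach the taxonomy as observation metadata, then convert to TSV.
    biom add-metadata \
        -i "$outdir/feature-table.biom" \
        -o "$outdir/table-with-taxonomy.biom" \
        --observation-metadata-fp "$outdir/biom-taxonomy.tsv" \
        --sc-separated taxonomy || return 1

    biom convert \
        -i "$outdir/table-with-taxonomy.biom" \
        -o "$outdir/table-with-taxonomy.tsv" \
        --to-tsv --header-key taxonomy
}
```

Calling `export_table_with_taxonomy table.qza taxonomy.qza exported` should leave a TSV with features as rows, samples as columns, and a final taxonomy column.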

Does this help?

-Cheers!
-Mike

microbiotaphyto · October 5, 2022, 12:30pm

I believe this is what I need!

I need a table with taxonomy as rows, the samples as columns and the read counts as intersections. Can I get this with this tutorial Link BIOM table with taxonomy - #3 by Ghaz? Or do I need a BIOM table first to do this one (I have no idea how to use biom)?

This is what I have tried:
I ran qiime taxa barplot, which gave me a QZV of interactive barplots. From that, I downloaded a CSV with samples as rows and taxonomy as columns. I transposed that CSV using Pandas to change taxonomy into rows and samples into columns. It seemed to have worked out.

BUT the taxonomy is written in an ugly way, like this example:
k__Bacteria;p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Bacteroides;s__ovatus

Is there a way to make this more readable?
Do you have any thoughts on the way I did it?

SoilRotifer · October 5, 2022, 1:41pm

Yep, this is essentially the same output as option 3.

But this is the taxonomy. You can use qiime rescript edit-taxonomy to edit or truncate the taxonomy strings before using them to make your intended output.
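If you only want to tidy the strings in a table you have already downloaded, rather than edit the taxonomy artifact, plain sed can also strip the rank prefixes. This is just a sketch outside of QIIME 2; it assumes the prefixes are a single lowercase letter followed by two underscores, as in your example.

```shell
# Sketch: strip rank prefixes (k__, p__, ...) and space out the separators.
taxon='k__Bacteria;p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Bacteroides;s__ovatus'

# First expression: drop "<letter>__" at the start of the string or after a ';'.
# Second expression: add a space after each ';' for readability.
pretty="$(printf '%s\n' "$taxon" | sed -E 's/(^|;)[a-z]__/\1/g; s/;/; /g')"

printf '%s\n' "$pretty"
# prints: Bacteria; Bacteroidetes; Bacteroidia; Bacteroidales; Bacteroidaceae; Bacteroides; ovatus
```

The same sed expression can be run over a whole downloaded TSV, writing the result to a new file, since it only touches text that looks like a rank prefix or a semicolon.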

The table you describe (taxonomy as rows, samples as columns) is exactly what option 2 will provide when you run it in the order I suggested:

qiime taxa collapse...
qiime feature-table transpose ...
qiime metadata tabulate ...

Not only does this provide the visualization, but then you can download the TSV file from this visualization.
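Spelled out in full, that order might look like the following sketch. The function name and the intermediate file names are placeholders, and level 7 is simply the level used earlier in this thread.

```shell
# Sketch: taxonomy as rows, samples as columns, counts in the cells.
# taxonomy_by_sample_table and the output file names are placeholders.
taxonomy_by_sample_table() {
    local table="$1"       # FeatureTable[Frequency] artifact
    local taxonomy="$2"    # FeatureData[Taxonomy] artifact
    local level="${3:-7}"  # taxonomic level to collapse to (defaults to 7)

    # 1. Sum features that share the same taxonomy string.
    qiime taxa collapse \
        --i-table "$table" \
        --i-taxonomy "$taxonomy" \
        --p-level "$level" \
        --o-collapsed-table "table-l${level}.qza" || return 1

    # 2. Swap the axes so taxonomy ends up as rows in the tabulated view.
    qiime feature-table transpose \
        --i-table "table-l${level}.qza" \
        --o-transposed-feature-table "table-l${level}-transposed.qza" || return 1

    # 3. Build the visualization, from which the TSV can be downloaded.
    qiime metadata tabulate \
        --m-input-file "table-l${level}-transposed.qza" \
        --o-visualization "table-l${level}-transposed.qzv"
}
```

For example, `taxonomy_by_sample_table table.qza taxonomy.qza 6` would collapse to level 6 instead.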

microbiotaphyto · October 5, 2022, 9:08pm

Sorry, I didn't understand it before. It worked out perfectly! Thank you!!

I still don't understand what this does tho. I ran your suggested option 2 with and without collapsing and it seems to have resulted in the same counts, the same final table.

SoilRotifer · October 5, 2022, 9:51pm

Awesome!

Normally a feature table contains ESVs / OTUs. Many of these features (typically with odd-looking feature IDs like 4178779a98f7fe74a58952d9dbba5434) can be different sequences that share the exact same taxonomy (i.e. they may only differ by a few nucleotides). Running the collapse command essentially combines all of the features with the same taxonomy into one unit. I guess, in this case, you can call it a "taxonomy-feature". Anyway, the individual features for a given taxonomy have their counts summed together. So, the total feature and sequence counts are not going to change. Also, a feature ID must be unique.

For example, we have this feature table:

feature-id	sample-1	sample-2	sample-3
feature-A	15	0	12
feature-B	1	15	1
feature-C	0	18	22

Let's assume features A & B have the same exact taxonomy. When we run the collapse command on feature-table.qza and its associated taxonomy.qza we'll have a new table that looks like:

feature-id	sample-1	sample-2	sample-3
d__Bacteria;p__Firmicutes;...	16	15	13
d__Bacteria;p__Bacteroidetes;...	0	18	22

You can now see that feature-A and feature-B were combined under their identical, and unique, taxonomy string (d__Bacteria;p__Firmicutes;...), and their sample counts summed. This is essentially what the taxonomy barplots are showing, except in the form of percentages. But here, we are keeping the counts as they appear by taxonomy and not the individual features.
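To make the arithmetic concrete, here is a toy re-creation of that summing step with awk, run on plain TSV copies of the example above. It only illustrates the idea and is not how qiime taxa collapse is implemented; the file names are placeholders, and the taxonomy strings are kept abbreviated exactly as in the example.

```shell
# Sketch: sum feature counts per taxonomy string, as in the example tables.
printf 'feature-id\tsample-1\tsample-2\tsample-3\n'  > table.tsv
printf 'feature-A\t15\t0\t12\n'                     >> table.tsv
printf 'feature-B\t1\t15\t1\n'                      >> table.tsv
printf 'feature-C\t0\t18\t22\n'                     >> table.tsv

# Feature ID -> taxonomy (features A and B share the same string).
printf 'feature-A\td__Bacteria;p__Firmicutes;...\n'     > taxonomy.tsv
printf 'feature-B\td__Bacteria;p__Firmicutes;...\n'    >> taxonomy.tsv
printf 'feature-C\td__Bacteria;p__Bacteroidetes;...\n' >> taxonomy.tsv

awk -F'\t' -v OFS='\t' '
    NR == FNR { tax[$1] = $2; next }   # first file: remember each taxonomy
    FNR == 1  { print; next }          # second file: pass the header through
    {
        t = tax[$1]
        if (!(t in seen)) { seen[t] = 1; order[++n] = t }   # keep first-seen order
        for (i = 2; i <= NF; i++) sum[t, i] += $i           # add counts per sample
        ncol = NF
    }
    END {
        for (k = 1; k <= n; k++) {
            row = order[k]
            for (i = 2; i <= ncol; i++) row = row OFS sum[order[k], i]
            print row
        }
    }
' taxonomy.tsv table.tsv > collapsed.tsv

cat collapsed.tsv
```

The Firmicutes row comes out as 16, 15 and 13, and the per-sample totals are the same before and after collapsing; only the number of rows goes down.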

Does this make sense?

Also, notice what information you might be losing when viewing through the lens of taxonomy and not unique sequences (i.e. ESVs). That is, the ESVs might reveal different patterns compared to lumping data by taxonomy. Check out this paper from @jwdebelius for example.
