alexcussigh0 · October 14, 2021, 9:07am

Thank you for the reply!
I finally managed to demultiplex my sequences with cutadapt, then imported them and denoised them with DADA2!

Now I am at the taxonomy assignment step. Do you have any tips or links to tutorials in which classify-consensus-vsearch is used?

I don't know which format the reference database and the taxonomy labels need to be in, or how to import them.

Thanks for the tips!

colinbrislawn · October 15, 2021, 9:47pm

Hello!

Check out the documentation for classify-consensus-vsearch. It's not a full tutorial, but you can follow the other tutorials and use this plugin instead when you get to the feature-classifier step.

You could take a look at this test data on GitHub, or at the RESCRIPt tutorial, which discusses the database formats in great detail.

Let us know if you have any questions,

alexcussigh0 · October 18, 2021, 1:36pm

I was wondering if there is a way to build the FeatureData[Taxonomy] artifact from a multi-FASTA file that has the accession number and taxid in each header, without downloading all the sequences again with RESCRIPt.
My multifasta/database looks like this:

(>)KT715809 count=34; merged_taxid={1732171: 3, 143503: 2, 106584: 3, 57301: 3, 323800: 3, 163617: 3, 163619: 3, 136868: 3, 323764: 3, 316406: 3, 137271: 5}; species_name=###; family=8113; family_name=Cichlidae; scientific_name=Haplochromini; reverse_match=CATAGTGGGGTATCTAATCCCAGTTTG; taxid=319058; rank=tribe; forward_error=1; forward_tm=55.08; genus_name=###; forward_match=GCCGGTAAAACTCGTGCCAGC; reverse_tm=58.71; genus=-1; reverse_error=0; species=-1; strand=D; Fossorochromis rostratus mitochondrion, complete genome
caccgcggttatacgagaggctcaagttgatagacatcggcgtaaagggtggttaggaaa
tttttaaactaaagccgaacgccctcagaactgttatacgtacccgagagcaagaagccc
cactacgaaagtggctttatacccccgaccccacgaaagctgcgaaa
(>)AB250108 count=4; merged_taxid={143612: 4}; species_name=Biwia zezera; family=2743714; family_name=Gobionidae; scientific_name=Biwia zezera; reverse_match=CATAGTGGGGTATCTAATCCCAGTTTG; taxid=143612; rank=species; forward_error=1; forward_tm=55.08; genus_name=Biwia; forward_match=GCCGGTAAAACTCGTGCCAGC; reverse_tm=58.71; genus=143611; reverse_error=0; species=143612; strand=D; Biwia zezera mitochondrial DNA, complete genome, country: Japan:Shiga, Moriyama, Lake Biwa
caccgcggttaaacgagaggccctagttgatactactacggcgtaaagggtggtttaggg
aggaaaataataaagccaaatggccctttggccgtcatacgcttctaggtgtccgaagcc
caacccaacgaaagtagctttagtaagacccacctgaccccacgaaagctgagaaa

Thank you for the help!

colinbrislawn · October 18, 2021, 1:55pm

Thanks for showing an example of the data you have.

Can you help me understand the format you want your data to be in? Do you just want two columns like this?

(>)KT715809 taxid=319058;
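
In case it helps, here is a minimal Python sketch for pulling those two columns out of headers like the ones posted above. It assumes the accession is the first token of the header and that the taxid appears as its own `taxid=NNN` field; the function names are made up for illustration, so adapt them as needed.

```python
import re

# Accession: first whitespace-delimited token after ">" (or the "(>)" escape used in this thread).
HEADER_RE = re.compile(r"^\(?>\)?(\S+)")
# "taxid=<digits>" only when it starts a field, so "merged_taxid={...}" is never picked up.
TAXID_RE = re.compile(r"(?:^|[\s;])taxid=(\d+)")


def parse_header(line):
    """Return (accession, taxid) from one annotated FASTA header line."""
    acc = HEADER_RE.match(line)
    taxid = TAXID_RE.search(line)
    if acc is None or taxid is None:
        raise ValueError(f"cannot parse header: {line[:60]!r}")
    return acc.group(1), int(taxid.group(1))


def accession_to_taxid(lines):
    """Map accession -> taxid for every header line in an iterable of FASTA lines."""
    return dict(parse_header(l) for l in lines if l.startswith((">", "(>)")))
```

Passing an open file handle to `accession_to_taxid` should give a dictionary with one entry per record; sequence lines are skipped.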

alexcussigh0 · October 18, 2021, 4:09pm

I want to write the taxonomy file for the taxonomic assignment in the format:
AccessionNumber Order;Family;Genus;Species

I am starting from a multi-FASTA file in which the header contains the accession number (AN) and the taxid, as I posted above.
I will also use this multi-FASTA file as my reference database.

colinbrislawn · October 18, 2021, 5:01pm

I think I can help you make a list of just your Accession Numbers (AN) from that FASTA file, but we will have to use a separate taxonomy database, say from NCBI, to get Order;Family;Genus;Species.
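
As a rough sketch of that second part: assuming you have downloaded and unpacked NCBI's taxonomy dump (the `nodes.dmp` and `names.dmp` files), something like the following could walk from a taxid up to the root and collect the ranks you want. The function names are hypothetical, and ranks that are missing from a lineage (for example, genus and species for a tribe-level taxid) are simply left empty.

```python
def split_dmp_line(line):
    """Split one taxdump row; fields are separated by tab-pipe-tab and rows end in tab-pipe."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]
    return line.split("\t|\t")


def load_taxdump(nodes_path, names_path):
    """Read parent, rank and scientific name for every taxid in the dump."""
    parent, rank, name = {}, {}, {}
    with open(nodes_path, encoding="utf-8") as handle:
        for line in handle:
            fields = split_dmp_line(line)  # tax_id, parent tax_id, rank, ...
            taxid = int(fields[0])
            parent[taxid] = int(fields[1])
            rank[taxid] = fields[2]
    with open(names_path, encoding="utf-8") as handle:
        for line in handle:
            fields = split_dmp_line(line)  # tax_id, name, unique name, name class
            if fields[3] == "scientific name":
                name[int(fields[0])] = fields[1]
    return parent, rank, name


def lineage(taxid, parent, rank, name,
            ranks=("order", "family", "genus", "species")):
    """Return 'Order;Family;Genus;Species' for a taxid, with blanks for missing ranks."""
    found, seen = {}, set()
    # Walk up the tree; the root is its own parent, so stop on any repeat.
    while taxid in parent and taxid not in seen:
        seen.add(taxid)
        if rank.get(taxid) in ranks:
            found[rank[taxid]] = name.get(taxid, "")
        taxid = parent[taxid]
    return ";".join(found.get(r, "") for r in ranks)
```

Loading the full dump into dictionaries takes some memory, but it keeps each lookup to a handful of dictionary hops.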

Do you have a database source already downloaded, or are you asking for help finding one?

alexcussigh0 · October 19, 2021, 8:32am

I have already created a database of 12S sequences from vertebrates. I downloaded all the VRT sequences from EMBL and then ran ecoPCR with my metabarcoding primers. So I have this multi-FASTA file containing all unique 12S sequences, annotated as above.

I need to create the taxonomy file starting from this FASTA file, in which every header starts with the AN and also reports the taxid as taxid=XXXXX.
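
To make the goal concrete, here is a sketch in Python of what I am after. It assumes a helper (called `lineage_for_taxid` here, purely hypothetical) that turns a taxid into an `Order;Family;Genus;Species` string, and it writes an ID-only FASTA file plus a headerless two-column, tab-separated taxonomy file, which as far as I understand is the layout expected when importing as FeatureData[Sequence] and FeatureData[Taxonomy] (HeaderlessTSVTaxonomyFormat). Uppercasing the sequences is only a precaution on my side.

```python
import re


def build_reference(fasta_in, fasta_out, taxonomy_out, lineage_for_taxid):
    """Write an ID-only FASTA and an accession<TAB>lineage file from annotated headers.

    lineage_for_taxid is a hypothetical callable: taxid (int) -> lineage string.
    Returns the number of records written.
    """
    taxid_re = re.compile(r"(?:^|[\s;])taxid=(\d+)")  # skips "merged_taxid={...}"
    count = 0
    with open(fasta_in) as src, \
            open(fasta_out, "w") as seqs, \
            open(taxonomy_out, "w") as tax:
        for line in src:
            line = line.rstrip("\n")
            if line.startswith(">"):
                accession = line[1:].split(None, 1)[0]
                match = taxid_re.search(line)
                if match is None:
                    raise ValueError(f"no taxid in header for {accession}")
                # Keep only the accession so both files share the same IDs.
                seqs.write(f">{accession}\n")
                tax.write(f"{accession}\t{lineage_for_taxid(int(match.group(1)))}\n")
                count += 1
            elif line:
                seqs.write(line.upper() + "\n")
    return count
```

The important point is that the IDs in the two output files match exactly, since the classifier looks up each reference sequence's taxonomy by its ID.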

SoilRotifer · October 19, 2021, 3:41pm

Hi @alexcussigh0,

Check out our RESCRIPt 12S db tutorial. It's just a simple example of what you can do, but it should help you get started...

It does mean re-downloading everything, potentially in batches by taxonomic group, but everything will be formatted properly.

The notebook specifically looks for records of 12S reads, but you can perform a separate search that downloads only genome records and then use feature-classifier extract-reads to extract the amplicon region from those genomes. Then you can merge the result with the other batches of data.

Anyway, I just wanted to provide another option that you can try running in parallel.

-Mike
