Hi @asr17,
What format are your sequences in at the moment? If they are GenBank records, we'll need to convert them to FASTA first; otherwise you should be in good shape.
Our goal is to create a FeatureData[Sequence] artifact and a FeatureData[Taxonomy] artifact. You are already halfway there on the FeatureData[Sequence], but you will probably want each sequence to have only the accession ID as the FASTA ID. Depending on whether your seqs are from GenBank or RefSeq, your data might look like this:
>KY676659.1 Eurycea nerea isolate MO42 cytochrome oxidase subunit 1 (CO1) gene, partial cds; mitochondrial
CCCCTCTTCTCCGGATTTACCCTCCACCCAACATGATCNAAAATCCACTTCGGAGTAATGTTTATTG
...
This is perfect: the GenBank ID is the FASTA ID (everything after the first space is the description), so you can import this without issue.
Or it might look like this example, where a colon followed by coordinates indicates the range of the genome we're looking at:
>NC_006922.1:5177-6724 Pogona vitticeps mitochondrial DNA, complete genome
ATGTCAACCATAAACCGATGACTACTATCCACAAACCACAAAGATATCGGAACCCTGTATTTCCTATTTG
...
We don't want that range in our ID as it will make cross-referencing harder.
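Here is a minimal sketch of how you could strip the headers down to just the accession. It assumes the ID is everything before the first space and that any range is appended after a colon, as in the example above. The function names are my own, and it does not try to handle pipe-delimited IDs.

```python
def accession_only(header: str) -> str:
    """Return just the accession from a FASTA header line."""
    # The FASTA ID is everything before the first whitespace.
    parts = header.lstrip(">").split(None, 1)
    if not parts:
        return ""
    # Drop a trailing ":start-end" range, e.g. "NC_006922.1:5177-6724".
    return parts[0].split(":", 1)[0]


def clean_fasta(lines):
    """Yield FASTA lines with each header reduced to '>accession'."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            yield ">" + accession_only(line)
        else:
            yield line
```

One caveat: if your file contains several ranges from the same accession, they will all end up with the same ID after this, so check for duplicates before importing.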
Your sequence IDs might also have pipe characters (|) in them (although I might be misremembering, as I can't find an example at the moment).
Once you confirm that your FASTA file looks good and has the ID you want, you'll be able to import your data like in this example.
Otherwise, the real trick is going to be creating the FeatureData[Taxonomy]. You'll need to cross-reference your GenBank accession IDs to find the associated taxon IDs. From there, you'll need to make a spreadsheet/TSV mapping each accession ID to the taxonomic string you want to see.
I think Entrez is going to be your friend in this endeavour, as you'll need to perform a lot of queries. Here is a command line tool for querying and fetching data from Entrez.
Once you have a mapping of IDs to taxonomy strings, you'll want to import that. The q2-feature-classifier tutorial has an example of importing taxonomy. Pay attention to the source-format; right now your options are:
- TSVTaxonomyFormat, which has this header on the first line: Feature ID<tab>Taxon
- HeaderlessTSVTaxonomyFormat, which has no header.
Otherwise they are both just a TSV file with IDs, a tab, then the taxonomy string (usually delimited by ;).
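As a rough sketch of what writing that file could look like once you have your mapping in hand, assuming it lives in a plain dict of accession ID to taxonomy string. The function name and the taxonomy strings in the usage example are made up for illustration; they are not real Entrez output.

```python
import io


def write_taxonomy_tsv(mapping, handle, header=True):
    """Write accession -> taxonomy string pairs as a two-column TSV.

    With header=True the first line is "Feature ID<tab>Taxon";
    with header=False no header line is written.
    """
    if header:
        handle.write("Feature ID\tTaxon\n")
    for feature_id, taxon in mapping.items():
        handle.write(f"{feature_id}\t{taxon}\n")


# Illustrative mapping only; build yours from your own lookups.
example = {
    "KY676659.1": "Eukaryota;Chordata;Amphibia;Eurycea;Eurycea nerea",
    "NC_006922.1": "Eukaryota;Chordata;Pogona;Pogona vitticeps",
}
buffer = io.StringIO()
write_taxonomy_tsv(example, buffer)
```

To write to disk instead, pass a file opened with `open("taxonomy.tsv", "w")` in place of the `StringIO` buffer. The IDs in the first column need to match the FASTA IDs exactly.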
From there you could continue that tutorial and train a classifier, or perform other analysis as you see fit.
Let me know if that all makes sense and if anyone else has suggestions for streamlining the process, please jump in!