Using genome-sampler for my own dataset

I would like to subsample a large set of genomes with genome-sampler.

I have a few questions about the metadata format because I am using my own database. I would like to know the minimum information needed for the metadata.

At the moment, I can only provide the sampling time and location. Additionally, my sequence sampling only includes the year information. Given this, I am unsure how to set up uniform sampling based on time, especially since the tutorial mentions 7 sequences over 7 days, which does not apply to my situation.

Could you please provide guidance on the necessary metadata format and how I might approach the uniform sampling setup in my case?

Thank you for your assistance.

Hi @zsggq2006,
Thanks for your interest in genome-sampler!

First, just for your information, last week we published the first new genome-sampler release in a few years, so it may be worth updating if you're using an old version. More information on this here.

I would like to know the minimum information needed for the metadata.

You can find the format described here. Technically speaking, there is no metadata required beyond the identifier column.
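
For reference, a metadata file containing just an identifier, a year, and a location could be written along these lines. This is only a minimal sketch: `id` is one of the identifier headers that QIIME 2 metadata accepts, while the `date` and `location` column names and the records themselves are made up for illustration, and the output has not been checked against genome-sampler itself.

```python
import csv
import io

# Hypothetical records: sequence ID, collection year, sampling location.
records = [
    ("seq001", "2019", "Kenya"),
    ("seq002", "2020", "Brazil"),
    ("seq003", "2020", "Kenya"),
]


def build_metadata_tsv(rows):
    """Return tab-separated metadata text with the identifier column first."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    # "date" and "location" are assumed column names, not required ones.
    writer.writerow(["id", "date", "location"])
    for seq_id, year, location in rows:
        # Reject anything that is not a plain four-digit year.
        if not (len(year) == 4 and year.isdigit()):
            raise ValueError(f"expected a four-digit year, got {year!r}")
        writer.writerow([seq_id, year, location])
    return buf.getvalue()


metadata_text = build_metadata_tsv(records)
print(metadata_text)
```

Writing `metadata_text` to a `.tsv` file would give you something to pass as the metadata file, with the year column named as the dates column.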

Additionally, my sequence sampling only includes the year information.

That should technically still work, as long as you have the four-digit year included in the column that you're providing as the dates column to sample-longitudinal. Try providing something like the following to your sample-longitudinal command:

--p-samples-per-interval 10 --p-days-per-interval 365

This should select 10 samples per year, and you can change the 10 to whatever value you'd like.
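
To illustrate the idea behind those two parameters, here is a rough conceptual sketch of picking up to N sequences from each fixed-width time window. It is not genome-sampler's actual implementation: the function name `sample_per_interval`, the example IDs, and the choice to treat a year-only date as 1 January of that year are all assumptions made for illustration.

```python
import random
from collections import defaultdict
from datetime import date


def sample_per_interval(dated_ids, samples_per_interval, days_per_interval, seed=0):
    """Pick up to `samples_per_interval` IDs from each fixed-width time window."""
    rng = random.Random(seed)
    start = min(d for _, d in dated_ids)
    bins = defaultdict(list)
    for seq_id, d in dated_ids:
        # Window index = whole number of intervals elapsed since the earliest date.
        bins[(d - start).days // days_per_interval].append(seq_id)
    selected = []
    for index in sorted(bins):
        ids = bins[index]
        # A window with fewer sequences than requested contributes all of them.
        k = min(samples_per_interval, len(ids))
        selected.extend(rng.sample(ids, k))
    return selected


# Hypothetical year-only records, each mapped to 1 January of its year (an assumption).
years = {"a": 2019, "b": 2019, "c": 2019, "d": 2020, "e": 2021, "f": 2021}
dated = [(seq_id, date(year, 1, 1)) for seq_id, year in years.items()]
picked = sample_per_interval(dated, samples_per_interval=2, days_per_interval=365)
print(picked)
```

With a 365-day window, each year in this toy data lands in its own window, so at most two IDs are drawn per year.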

As far as I know, no one has used genome-sampler this way, but I did some testing and it looks like this will work. Let me know if you run into problems.

Also, note that genome-sampler was designed to work on viral genome sequences (SARS-CoV-2 specifically). If you're working with genomes that are longer than about 23k bases, the sample-diversity command almost certainly won't work for you.