Using genome-sampler for my own dataset

Hi @zsggq2006,
Thanks for your interest in genome-sampler!

First, just for your information, last week we published the first new genome-sampler release in a few years, so it may be worth updating if you're using an older version. More information on this here.

I would like to know the minimum information needed for the metadata.

You can find the format described here. Technically speaking, there is no metadata required beyond the identifier column.
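
To give a rough idea of how small that file can be, here is a sketch in Python that builds a tab-separated metadata file with an identifier column plus one year-only column. The sequence IDs, the `date` column name, and the helper function are made up for illustration. The `#q2:types` row is optional QIIME 2 metadata syntax that marks the column as categorical; I've included it on the assumption that a column of bare years would otherwise be read as numeric, so treat that line as a guess rather than a requirement.

```python
import csv
import io


def build_metadata(rows, date_column="date"):
    """Return QIIME 2-style TSV text: an identifier column plus one year column.

    `rows` is an iterable of (sequence_id, year) pairs. The column name and
    the IDs are hypothetical; only the identifier column is required.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    # Header row: "id" is one of the identifier column names QIIME 2 accepts.
    writer.writerow(["id", date_column])
    # Optional directive row (assumption: keeps bare years from being
    # interpreted as a numeric column).
    writer.writerow(["#q2:types", "categorical"])
    for seq_id, year in rows:
        # Always write the year with four digits.
        writer.writerow([seq_id, f"{int(year):04d}"])
    return buf.getvalue()


example_metadata = build_metadata(
    [("seq-001", 2019), ("seq-002", 2020), ("seq-003", 2021)]
)
print(example_metadata)
```

Writing `example_metadata` to a `.tsv` file gives you something you can pass as the metadata file, as long as the IDs match the IDs of your sequences.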

Additionally, my sequence sampling only includes the year information.

That should technically still work, as long as the column that you're providing as the dates column to sample-longitudinal includes the four-digit year. Try adding something like the following to your sample-longitudinal command:

--p-samples-per-interval 10 --p-days-per-interval 365

This should select 10 samples per year, and you can change the 10 to whatever value you'd like.
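
If it helps to see the idea behind those two parameters, here is a small Python sketch of binning year-only dates into 365-day intervals and picking up to 10 per bin. This is not genome-sampler's actual implementation. It assumes a bare year is treated as January 1 of that year and that sampling within an interval is uniformly random, and the function name and IDs are hypothetical.

```python
import random
from collections import defaultdict
from datetime import date


def sample_per_interval(records, samples_per_interval=10,
                        days_per_interval=365, seed=0):
    """Pick up to `samples_per_interval` IDs from each fixed-width interval.

    `records` is an iterable of (sequence_id, four_digit_year) pairs.
    """
    # Assumption: a year-only value stands for January 1 of that year.
    dated = [(seq_id, date(int(year), 1, 1)) for seq_id, year in records]
    start = min(d for _, d in dated)

    # Group IDs by how many whole intervals separate them from the earliest date.
    bins = defaultdict(list)
    for seq_id, d in dated:
        bins[(d - start).days // days_per_interval].append(seq_id)

    # Draw a random subset from each interval, keeping everything if the
    # interval holds fewer IDs than requested.
    rng = random.Random(seed)
    selected = []
    for index in sorted(bins):
        ids = bins[index]
        selected.extend(rng.sample(ids, min(samples_per_interval, len(ids))))
    return selected


# Hypothetical data: 25 sequences from 2019, 3 from 2020, 12 from 2021.
example_records = (
    [(f"a{i}", 2019) for i in range(25)]
    + [(f"b{i}", 2020) for i in range(3)]
    + [(f"c{i}", 2021) for i in range(12)]
)
example_selection = sample_per_interval(example_records)
print(len(example_selection))  # 10 + 3 + 10 = 23
```

Because leap years are 366 days long, 365-day intervals slowly drift relative to calendar years, but with year-only dates spanning a few decades each year still ends up in its own interval.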

As far as I know, no one has used genome-sampler this way, but I did some testing and it looks like it will work. Let me know if you run into problems.

Also, note that genome-sampler was designed to work on viral genome sequences (SARS-CoV-2 specifically). If you're working with genomes that are longer than about 23k bases, the sample-diversity command almost certainly won't work for you.
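
If you want a quick way to check your sequences against that rough length limit before trying sample-diversity, something like the following Python sketch would do it. The function names and the example records are hypothetical, and the 23,000 threshold is just the approximate figure mentioned above, not a hard cutoff enforced by genome-sampler.

```python
def fasta_lengths(lines):
    """Map each FASTA record ID to its sequence length.

    `lines` is any iterable of text lines, such as an open file handle.
    """
    lengths = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The record ID is the first whitespace-separated token of the header.
            parts = line[1:].split()
            current = parts[0] if parts else ""
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return lengths


def too_long(lengths, max_length=23_000):
    """Return the sorted IDs whose sequences exceed `max_length` bases."""
    return sorted(seq_id for seq_id, n in lengths.items() if n > max_length)


# Hypothetical in-memory FASTA with one short and one long record.
example_fasta = [
    ">short-genome example\n",
    "ACGT" * 100 + "\n",
    ">long-genome example\n",
    "ACGT" * 5000 + "\n",
    "ACGT" * 5000 + "\n",
]
example_lengths = fasta_lengths(example_fasta)
print(too_long(example_lengths))
```

For a real file you would pass an open file handle instead of the list, for example `with open("genomes.fasta") as fh: lengths = fasta_lengths(fh)`, where the file name is a placeholder.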