Sidle- reconstruct-database

I hope @jwdebelius gets to see this post.

If you did: thanks for developing the implementation of Smurf in qiime2. You are amazing!

I have ran the whole Sidle pipeline following the tutorial in q2-sidle.readthedocs.io with my own data and everything went ok.

However, I have some trouble understanding the output of qiime sidle reconstruct-database command.
Can you tell me what the following columns stand for?

num-regions, total-kmers-mapped, mean-kmer-per-region,stdv-kmer-per-region,mapped-asvs

How do you calculate total-kmers-mapped?

One more question: since I defined --p-trim-length 150 for both for database and representative sequences trimming, does it mean that I am only using the first 150 nucleotides for both query and subject?

Thanks a lot for your time!

best,

Belen

1 Like

Hi @mbcarbonetto,

Thanks! It's been kind of fun!

These columns all describe the reconstructed database ID.

Let me see if I can give you a data dictionary

column name data type description
num-regions int The total number of regions covered by the reconstructed region
total-kmers-mapped int The number of database kmers mapped to the final resolved sequence
mean-kmer-per-region int The mean number of kmers per region.
stdv-kmer-per-region int The standard deviation in the number of kmers mapped per region. If only one kmer was mapped, this has a value of 0.
mapped-asvs str All ampicon sequence variant identifiers mapped to the database. (I'm not sure if this is still useful, since you can theoretically reconstruct in a computationally effecient way without needing the ASV maps, but eh, its there if you need it)

Yes, if your trim length is 150, then it's the first 150nt from the end of the primer.

Best,
Justine

3 Likes

Dear @jwdebelius ,

Thanks for your quick reply.

I still have some questions. Let's take into account the following example:

feature-id num-regions total-kmers-mapped mean-kmer-per-region stdv-kmer-per-region mapped-asvs
AAJO01000148.739.2249 6 7 1,16666666666667 0,408248290463863 0653116a0ffcada3677dd35822398b0f

My analysis was based on 6 regions.

If I have understood correctly, the sequence AAJO01000148.739.2249 (subject in the reconstructed database) is composed of 1 kmer. Also, this 1 kmer has been observed in the 6 regions (the sequences were of course different, but the ID remains the same for each region). Given this, how is it possible that the mean-kmer-per-region is 1,6666..7? Or how is it possible that the total-kmers-mapped is higher than 6 since only 1 kmer is present in the subject iD?

Regarding

Does this mean that the shorter the kmer size chosen, the less information you are using for each representative read as query? Could it be possible to trim/cut the representative sequences in consecutive (even shorter) kmers?

Thnaks a lot again.

Best,

Belen

Hi @mbcarbonetto,

There is a flag that lets you count degenerates as unique kmers. If you counted degenerate sequences as unique kmers, then you might have one region with a kmer that contains a degenerate and would resolve to a single database sequence but map two kmers. You could also have a region where 2 database sequences are the same, but once you resolve the full length sequences, they belong to different parents.

That would be a different algorithm, closer to something like kraken for some of the other kmer-based algorithms. I'm less familiar with those, but there are definately publications about them.
But generally yes, if you cut your reads shorter you'll get less information in the alignment. You might have a case where a longer region (say 150nt instead of 100nt) might be able to resolve a taxon or correct your alignment. You also potentially have more accuracy or more error tolerance if you pick a longer read.

Best,
Justine

Thanks a lot for your reply! It is much more clear now.

best,

Belen