One more question: since I defined --p-trim-length 150 for both for database and representative sequences trimming, does it mean that I am only using the first 150 nucleotides for both query and subject?
These columns all describe the reconstructed database ID.
Let me see if I can give you a data dictionary
The total number of regions covered by the reconstructed region
The number of database kmers mapped to the final resolved sequence
The mean number of kmers per region.
The standard deviation in the number of kmers mapped per region. If only one kmer was mapped, this has a value of 0.
All ampicon sequence variant identifiers mapped to the database. (I'm not sure if this is still useful, since you can theoretically reconstruct in a computationally effecient way without needing the ASV maps, but eh, its there if you need it)
Yes, if your trim length is 150, then it's the first 150nt from the end of the primer.
I still have some questions. Let's take into account the following example:
My analysis was based on 6 regions.
If I have understood correctly, the sequence AAJO01000148.739.2249 (subject in the reconstructed database) is composed of 1 kmer. Also, this 1 kmer has been observed in the 6 regions (the sequences were of course different, but the ID remains the same for each region). Given this, how is it possible that the mean-kmer-per-region is 1,6666..7? Or how is it possible that the total-kmers-mapped is higher than 6 since only 1 kmer is present in the subject iD?
Does this mean that the shorter the kmer size chosen, the less information you are using for each representative read as query? Could it be possible to trim/cut the representative sequences in consecutive (even shorter) kmers?
There is a flag that lets you count degenerates as unique kmers. If you counted degenerate sequences as unique kmers, then you might have one region with a kmer that contains a degenerate and would resolve to a single database sequence but map two kmers. You could also have a region where 2 database sequences are the same, but once you resolve the full length sequences, they belong to different parents.
That would be a different algorithm, closer to something like kraken for some of the other kmer-based algorithms. I'm less familiar with those, but there are definately publications about them.
But generally yes, if you cut your reads shorter you'll get less information in the alignment. You might have a case where a longer region (say 150nt instead of 100nt) might be able to resolve a taxon or correct your alignment. You also potentially have more accuracy or more error tolerance if you pick a longer read.