I'm using the Greengenes2 database for sequencing results from the V3-V4 region. After running the greengenes2.actions.non_v4_16s function, the number of features dropped sharply (from 49,583 to 3,936). Is a reduction of this size typical for closed-reference OTU picking?
In addition, the percentage of matching unique query sequences reported by q2-vsearch while running greengenes2.actions.non_v4_16s is very low:
494630940 nt in 331269 seqs, min 416, max 4563, avg 1493
Masking 100%
Counting k-mers 100%
Creating k-mer index 100%
Searching 100%
Matching unique query sequences: 12909 of 49583 (26.04%)
[After]
15259057 nt in 36674 seqs, min 283, max 444, avg 416
Getting sizes 100%
Sorting 100%
Median abundance: 3
Writing output 100%
Could this indicate a potential problem with my data?
If there might be a problem, how can I check whether my data is problematic?
To clarify, is that 49.5k ASVs to ~3.9k features in the backbone? On the surface that doesn't seem surprising. How many total sequences are retained, and what environment are the samples from?
The output above was printed after mapping to the Greengenes2 backbone (2022.10.backbone.full-length.fna.qza), but I don't fully understand it.
In FeatureData[Sequence], the number of sequences changed from 49,583 to 3,936 after mapping.
In FeatureTable[Frequency], total frequency changed from 51,393,839 to 49,595,531.
Sequences were derived from human oral and intestinal samples.
If this isn't the information you were asking for, please let me know.
Thanks.
I believe that output comes from q2-vsearch; offhand, I don't recall whether it can be controlled with --verbose.
It looks like 96.5% of the sequence reads are retained (49,595,531 of 51,393,839). That's not bad for closed reference. The reduction in the number of features stems from similar ASVs being grouped to the same backbone sequence.
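For anyone who wants to redo the arithmetic, here is a minimal sketch in plain Python using only the numbers quoted in this thread. The variable names are made up for illustration. Reading the 36,674 sequences in the "[After]" block as the unmatched ASVs is an assumption, based only on the fact that 49,583 - 12,909 = 36,674.

```python
# Numbers copied from the posts above; variable names are illustrative only.
reads_before = 51_393_839    # total frequency before mapping
reads_after = 49_595_531     # total frequency after mapping
asvs_before = 49_583         # unique query sequences (ASVs)
asvs_matched = 12_909        # "Matching unique query sequences"
features_after = 3_936       # backbone features after mapping
seqs_in_after_block = 36_674 # sequence count in the "[After]" log block


def pct(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


read_retention = pct(reads_after, reads_before)
unique_match_rate = pct(asvs_matched, asvs_before)

# Assumption: ASVs that did not match are what the "[After]" block reports.
asvs_unmatched = asvs_before - asvs_matched

# Average number of matched ASVs collapsed into each backbone feature.
asvs_per_feature = asvs_matched / features_after

print(f"Reads retained:            {read_retention:.1f}%")
print(f"Unique ASVs matched:       {unique_match_rate:.2f}%")
print(f"Unmatched ASVs:            {asvs_unmatched}")
print(f"Matched ASVs per feature:  {asvs_per_feature:.2f}")
```

If that reading is right, most of the unique ASVs go unmatched but they carry only about 3.5% of the reads, which fits the low median abundance (3) reported in the log.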