Great questions, KKN!
Thanks for the detailed post. In the future, you might consider opening a separate post for each unrelated question, just to make them more digestible. (Question 3 in particular should probably stand alone, because it's very distinct from the others.)
There's a great discussion of those semantics here. In short, Greengenes contains features that are annotated like Feature 1, with "empty" annotations at lower taxonomic levels. If your match looks like Feature 1, it means you got a species-level match with a feature that Greengenes has only annotated to the genus level. If your match looks like Feature 2, I believe that means you only got a match to the genus level. Clear as mud? Complicated semantics like this are one challenge with allowing "empty" taxonomic annotations.
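If it helps to see the distinction concretely, here's a minimal sketch of how you might find the deepest non-empty rank in a Greengenes-style taxonomy string. The two example strings are made up for illustration (they are not your actual Feature 1 and Feature 2), and `deepest_annotated_rank` is a hypothetical helper, not part of QIIME 2.

```python
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]

def deepest_annotated_rank(taxonomy):
    """Return (rank, name) for the deepest non-empty level, or None."""
    deepest = None
    for rank, field in zip(RANKS, taxonomy.split(";")):
        # A field looks like "g__Lactobacillus"; a bare "s__" is an empty annotation.
        _, _, name = field.strip().partition("__")
        if name:
            deepest = (rank, name)
    return deepest

# Hypothetical strings: one with a trailing empty species slot, one truncated at genus.
with_empty_species = "k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; f__Lactobacillaceae; g__Lactobacillus; s__"
genus_only = "k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; f__Lactobacillaceae; g__Lactobacillus"

print(deepest_annotated_rank(with_empty_species))
print(deepest_annotated_rank(genus_only))
```

Both strings resolve to the same deepest named rank (genus), so the only thing that distinguishes them is the presence of the empty `s__` slot. That is exactly why the semantics are easy to misread.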
Unless I'm gravely mistaken, QIIME 2's taxonomy barplots plot taxon frequency per sample, and don't directly plot Amplicon Sequence Variants (ASVs). As you mentioned, one taxon may represent multiple ASVs. My neighbor and I are both Homo sapiens sapiens (AFAIK), but we have distinct genomes. A barplot of neighborhoods in my town would group my neighbor and me in ...g__Homo.s__sapiens, because it is plotting the frequency of taxa per sample, not the ASVs per sample.
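To illustrate the idea (this is a toy sketch, not QIIME 2's actual implementation), here's what collapsing ASV counts by taxonomic label looks like. The feature IDs, counts, and taxonomy strings are all invented, and `collapse_by_taxon` is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical per-sample ASV counts: {sample: {feature_id: count}}
asv_counts = {
    "sample_A": {"asv1": 10, "asv2": 5, "asv3": 2},
    "sample_B": {"asv1": 0, "asv2": 7, "asv3": 4},
}

# Hypothetical classifications: asv1 and asv2 are distinct sequences with the same label.
taxonomy = {
    "asv1": "g__Homo; s__sapiens",
    "asv2": "g__Homo; s__sapiens",
    "asv3": "g__Pan; s__troglodytes",
}

def collapse_by_taxon(counts, taxonomy):
    """Sum ASV counts that share a taxonomic label, separately for each sample."""
    collapsed = {}
    for sample, features in counts.items():
        per_taxon = defaultdict(int)
        for feature_id, count in features.items():
            per_taxon[taxonomy[feature_id]] += count
        collapsed[sample] = dict(per_taxon)
    return collapsed

collapsed = collapse_by_taxon(asv_counts, taxonomy)
print(collapsed["sample_A"])
```

After collapsing, `asv1` and `asv2` are no longer distinguishable: only their summed frequency under the shared label remains, which is what a taxon-level bar represents.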
This makes me nervous. Without digging into the source code, I'm not sure exactly what's going on, but I'm uncomfortable with the semantics here. ASV1 != ASV2, even if they happen to have been classified the same way. If your downloaded CSV is actually grouping ASVs by taxonomic annotation but labeling them with FeatureIDs rather than labeling them taxonomically, that might be worth reporting as a bug. Do you have the ability to confirm whether that's what's going on?
Perhaps, but this might be overgeneralizing. If your classified 16S data consistently has good annotation to the species or subspecies level in your database, why not use it? If it doesn't, you'll have to make a judgement call.
Generally, these taxonomic bar plots are intended as high-level diagnostic tools. You probably want to consider dedicated differential abundance tools (ANCOM could be a good starting place) if you're interested in understanding differential abundance across samples.
Are your repseqs a DADA2 result? If so, you probably haven't used a database at that point in the analysis, and your FeatureData[Sequence] contains all unique ASVs produced during denoising, with corresponding FeatureIDs.
CK