Hello Akiriti,
I ran into this issue as well after following the same steps listed in the "Lefse after QIIME2" thread and collapsing the table down to level 6.
If I am not mistaken (I would love to have a moderator, admin, or power user verify or correct my understanding), the problem for us arose when representative sequences could not be resolved with enough specificity. In that case, the collapse step simply groups those rep seqs, determines their relative abundance, labels the group with the most specific level available (the next taxonomic level up), and creates a new row with that label. For the most part this was not an issue for us, as we interpreted those rows the same way in our external analysis.
For example:
1 Bacteria|Actinobacteria|Actinobacteria|Corynebacteriales|Corynebacteriaceae|Corynebacterium
2 Bacteria|Actinobacteria|Actinobacteria|Corynebacteriales|Corynebacteriaceae|Corynebacterium 1
3 Bacteria|Actinobacteria|Actinobacteria|Corynebacteriales|Corynebacteriaceae|Lawsonella
4 Bacteria|Actinobacteria|Actinobacteria|Corynebacteriales|Corynebacteriaceae
In this case, the row ending in “|Corynebacteriaceae” (Row 4) would only include the relative abundances of those rep seqs that could not be resolved past the Corynebacteriaceae family and does not include Rows 1-3. But the issue is that LEfSe does not interpret those rows in the same manner - instead, it assumes that the “|Corynebacteriaceae” row (Row 4) contains the relative abundance values of the entire Corynebacteriaceae family (including Rows 1-3) regardless of further specificity. This is where we ran into issues.
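To make the two readings of Row 4 concrete, here is a small Python sketch. The abundance values are made up purely for illustration (they are not from our dataset), and `family_total` is a hypothetical helper, not part of QIIME 2 or LEfSe.

```python
# Hypothetical relative abundances (in percent) for a single sample.
# These numbers are invented for illustration only.
PREFIX = "Bacteria|Actinobacteria|Actinobacteria|Corynebacteriales|Corynebacteriaceae"
rows = {
    PREFIX + "|Corynebacterium": 10,
    PREFIX + "|Corynebacterium 1": 5,
    PREFIX + "|Lawsonella": 2,
    PREFIX: 3,  # Row 4: only the rep seqs unresolved past the family level
}

def family_total(table, prefix):
    """Sum every row that is the family itself or sits below it."""
    return sum(
        value
        for label, value in table.items()
        if label == prefix or label.startswith(prefix + "|")
    )

# What the collapsed table means: the family total is Rows 1-4 combined.
total_from_all_rows = family_total(rows, PREFIX)

# What Row 4 alone holds, which is the value that gets read as the
# whole-family abundance under the interpretation described above.
row4_only = rows[PREFIX]

print(total_from_all_rows, row4_only)
```

With these invented numbers, the family would be credited with 3% instead of 20%, which is the kind of mismatch we believe was affecting our results.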
However, we also noticed that there were no issues at a given level as long as every taxon was resolved at least one level deeper than that level. Using the example above, even though LEfSe was misinterpreting Row 4 as the total cumulative relative abundance for the Corynebacteriaceae family, it had no trouble identifying which relative abundances corresponded to the Corynebacteriales order, because we had no unresolved rep seqs at that level and thus no row ending with “|Corynebacteriales”. We therefore assumed it was able to infer that all rows containing “|Corynebacteriales|” (Rows 1-4 in this example) belonged to the Corynebacteriales order, and that it ran those values through the LEfSe analysis pipeline correctly.
Thus, our proposed solution was simply to append an additional taxonomic level label of “|Unknown” to every row that was not resolved to the L6 (genus) level. That way, no row would be interpreted by LEfSe as the cumulative relative abundance for a specific taxonomic level; instead, LEfSe would be forced to determine that value itself. Additionally, “|Unknown” would not be reported unless those unresolved rep seqs had a relative abundance large enough to pass the threshold to begin with.
For example:
1 Bacteria|Actinobacteria|Actinobacteria|Corynebacteriales|Corynebacteriaceae|Corynebacterium
2 Bacteria|Actinobacteria|Actinobacteria|Corynebacteriales|Corynebacteriaceae|Corynebacterium 1
3 Bacteria|Actinobacteria|Actinobacteria|Corynebacteriales|Corynebacteriaceae|Lawsonella
4 Bacteria|Actinobacteria|Actinobacteria|Corynebacteriales|Corynebacteriaceae|Unknown
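The relabeling above can be sketched in a few lines of Python. This is only a minimal illustration of the idea, assuming labels are "|"-separated and that genus is the 6th level; `add_unknown` is a hypothetical helper name, and reading or writing the actual LEfSe input file is left out.

```python
def add_unknown(label, genus_depth=6, sep="|", filler="Unknown"):
    """Append one filler level to any label that stops short of genus_depth."""
    levels = label.split(sep)
    if len(levels) < genus_depth:
        # Row was not resolved to genus: add a single extra level so the
        # label no longer ends at a higher taxonomic rank.
        levels.append(filler)
    return sep.join(levels)

labels = [
    "Bacteria|Actinobacteria|Actinobacteria|Corynebacteriales|Corynebacteriaceae|Corynebacterium",
    "Bacteria|Actinobacteria|Actinobacteria|Corynebacteriales|Corynebacteriaceae|Corynebacterium 1",
    "Bacteria|Actinobacteria|Actinobacteria|Corynebacteriales|Corynebacteriaceae|Lawsonella",
    "Bacteria|Actinobacteria|Actinobacteria|Corynebacteriales|Corynebacteriaceae",
]

relabeled = [add_unknown(label) for label in labels]
for label in relabeled:
    print(label)
```

Rows already at genus level are left untouched, and only the abundance labels change, not the abundance values themselves.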
By doing this, we increased the number of discriminative features detected by LEfSe above our LDA threshold from 11 to 14 for our dataset.
Again, this seems to have solved our LEfSe issues when testing for differences at all levels (1-6), but we would like to run it by the community to see whether this solution is logical and free of significant problems.
Thank you,
Daniel