Hi @fulbag!
I saw that @Nicholas_Bokulich sent you a reply just as I was typing one as well.
This is a great question! Technically, you can use these fastq files as input (just to clarify, these are the fastq files that you get directly from your sequencer / server - not out of Ion Reporter). And those are exactly the files that Carli is using as input in her first post in this thread:
However, there are a couple of tricky things going on that complicate data interpretation:
- Because multiple variable regions are amplified from the same sample at the same time, reads from the same bacterium but different variable regions will be interpreted as "different" bacteria even though they come from the same source (I think this is similar to the problem of 16S copy number differences between bacteria, but in a different way). I believe this is what SMURF is trying to overcome...
- The other issue is that comparing sequences from different V regions is difficult.
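To make the first bullet concrete, here is a minimal Python sketch of the counting problem. The sample, taxa, region labels and ASV IDs are all made up for illustration; real data would come from a feature table, not a hand-written list.

```python
from collections import defaultdict

# Hypothetical toy data: (sample, source_bacterium, v_region, asv_id).
# The same bacterium amplified at three regions yields three distinct ASVs.
reads = [
    ("sample1", "E. coli", "V2", "asv_a"),
    ("sample1", "E. coli", "V4", "asv_b"),
    ("sample1", "E. coli", "V8", "asv_c"),
    ("sample1", "B. fragilis", "V4", "asv_d"),
]

def observed_asvs(rows):
    """Naive richness: count unique ASVs, pooling every region together."""
    return len({asv for _, _, _, asv in rows})

def true_taxa(rows):
    """Richness we would want: count unique source bacteria."""
    return len({taxon for _, taxon, _, _ in rows})

def richness_by_region(rows):
    """Count unique ASVs separately within each variable region."""
    per_region = defaultdict(set)
    for _, _, region, asv in rows:
        per_region[region].add(asv)
    return {region: len(asvs) for region, asvs in per_region.items()}

print(observed_asvs(reads))       # pooled ASV count
print(true_taxa(reads))           # number of source bacteria
print(richness_by_region(reads))  # per-region ASV counts
```

In this toy case the pooled count sees four "bacteria" where there are only two, while V2 and V8 on their own each see only one.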
@colinbrislawn recently gave me the following explanation:
Counting is the hardest part of bioinformatics. If you count each region separately, you will be underestimating richness in some regions, and double (or triple!) counting ASVs in others.
Most (all?) stats methods in this field presume that you are measuring a single source of diversity:
So you count the number of unique s in each state
or you count the number of unique s in each national park
or you count the number of unique s in each watershed
All of these metrics make sense on their own.
(Total unique in Montana < Idaho, p=0.02)*
But some national parks are shared by several states! And watersheds often cover multiple states!
So while you have measured three things well, combining them is hard, if not impossible.
…unless…
instead of states and parks and watersheds, you have a geolocation of each point.
Now you have a unified way to compare all s.
And this unified method can scale beyond the geographical constructs of the US!
The states, parks, and watersheds are your different sequenced regions. But what you really need is something to unify all three.
And I think you can bring your s together with a
Therefore, if you use the consensus fastq files as input, we think you are limited to core metrics that rely on phylogenetic analysis (e.g. Faith's PD for alpha diversity and UniFrac for beta diversity).
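As a rough illustration of why a phylogenetic metric may be more forgiving here, this Python sketch computes Faith's PD on a tiny hand-made tree. The tree, tip names and branch lengths are hypothetical, and it assumes that ASVs from different regions of the same bacterium end up close together on a shared tree, which is not guaranteed for real data.

```python
# Hypothetical rooted tree stored as child -> (parent, branch_length).
# Two E. coli ASVs (V2 and V4) are assumed to sit on short sister branches.
tree = {
    "n1": ("root", 0.5),
    "n2": ("root", 0.5),
    "ecoli_V2": ("n1", 0.01),
    "ecoli_V4": ("n1", 0.01),
    "bfrag_V4": ("n2", 0.02),
}

def faith_pd(tree, tips):
    """Sum branch lengths on the paths from the observed tips to the root,
    counting each branch only once."""
    seen = set()
    total = 0.0
    for tip in tips:
        node = tip
        # Walk up until we reach the root or a branch already counted.
        while node in tree and node not in seen:
            seen.add(node)
            parent, length = tree[node]
            total += length
            node = parent
    return total

one_region = ["ecoli_V4", "bfrag_V4"]
all_regions = ["ecoli_V2", "ecoli_V4", "bfrag_V4"]

print(len(one_region), faith_pd(tree, one_region))
print(len(all_regions), faith_pd(tree, all_regions))
```

In this toy tree the extra E. coli ASV raises the ASV count from 2 to 3, but adds only its own short tip branch to Faith's PD, because the rest of its path to the root is already counted.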
Our goal in identifying sequences (and primer sequences) from the different V regions is that if there were some insurmountable problem in performing the consensus analysis pipeline, we would at least be able to pick sequences from one variable region to analyze our data. Also, if we discovered a better way to analyze the data, everyone could start doing it!