I am working with 16S amplicon data from faecal samples for gut microbiome composition analysis.
I have processed my samples using the workflow in the Moving Pictures tutorial:
Demultiplexed
DADA2 - denoising/merging/filtering - ASVs
Classification
Phylogenetic tree construction
I then did the rest of my analysis in R using phyloseq, vegan, etc. I am finding a few outliers, and some abundant sequences that are not classified at genus level are coming up as dominant in some of my treatments.
My supervisor has recommended further clustering my ASVs into OTUs at 98% similarity. I am looking at using vsearch cluster-features-de-novo to cluster these sequences at a 98% similarity threshold.
I can't seem to find any way to specify how the input sequences are processed and how centroids are determined, e.g.:
User supplied order
Pre-sorted based on length
Abundance sorted
From the QIIME 2 documentation it seems the only thing you can adjust is the percent identity threshold (e.g. 0.98).
Are there other ways to adjust the clustering method, such as specifying how the input sequences are processed? I would want the clustering to be seeded by the most abundant sequences, as these are likely to be significant in the population.
I tried to read through the forum and couldn't seem to find any comments/discussion on this.
I have used that tutorial and was able to successfully cluster my ASVs. However, I was curious to know more about how this command determines centroids for clustering (whether it picks based on length, abundance, etc.) and whether there is a way to alter how this is determined.
In the vsearch literature it describes these 3 options:
User supplied order
Pre-sorted based on length
Abundance sorted
I’m just curious to know what the default for determining centroids is within that command in QIIME 2 (i.e. cluster-features-de-novo).
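For what it's worth, the three orderings listed above correspond to vsearch's own commands: --cluster_fast (sorted by length), --cluster_size (sorted by abundance) and --cluster_smallmem (user-supplied order, with --usersort). As far as I can tell, cluster-features-de-novo calls vsearch in the abundance-sorted mode, but that is worth confirming in the q2-vsearch source for your version. To illustrate why the processing order matters, here is a minimal Python sketch of greedy centroid clustering. It is not vsearch's implementation: the `identity` function is a toy position-by-position comparison (no alignment), a sequence joins the first centroid that passes the threshold, and the sequences, abundances and 0.9 threshold are made up for the example.

```python
def identity(a, b):
    """Toy identity: fraction of matching positions, with no alignment.

    This is a simplification; vsearch uses pairwise alignment.
    """
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def greedy_cluster(seqs, order, threshold):
    """Greedy centroid clustering.

    seqs: dict of sequence id -> sequence
    order: list of ids in the order they are processed
    Returns a dict of centroid id -> list of member ids.
    """
    clusters = {}
    for sid in order:
        for cid in clusters:
            if identity(seqs[sid], seqs[cid]) >= threshold:
                clusters[cid].append(sid)  # joins an existing centroid
                break
        else:
            clusters[sid] = [sid]  # no centroid close enough: new centroid
    return clusters


# Hypothetical toy data: B is 1 mismatch from A and from C; A and C differ by 2.
seqs = {"A": "AAAAAAAAAA", "B": "AAAAAAAAAC", "C": "AAAAAAAACC"}
abundance = {"A": 100, "B": 5, "C": 50}

abundance_order = sorted(seqs, key=lambda s: abundance[s], reverse=True)
user_order = ["B", "A", "C"]

by_abundance = greedy_cluster(seqs, abundance_order, threshold=0.9)
by_user_order = greedy_cluster(seqs, user_order, threshold=0.9)

print(by_abundance)   # two clusters, seeded by the abundant A and C
print(by_user_order)  # one cluster, seeded by the rare B
```

With abundance sorting, the two abundant sequences each seed their own cluster; with the arbitrary order, the rare intermediate sequence becomes the centroid and pulls both into a single OTU.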
I also just wanted to see if anybody knows a way to go back to ASV level from your OTUs. Is there a way to track back and know which ASVs were mapped to which OTU?
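One possible route, if you are able to run vsearch directly rather than through the plugin, is to ask it for a cluster membership file with --uc and parse that. I'm not aware of cluster-features-de-novo exposing this map itself, so treat that part as an assumption. The sketch below assumes the standard tab-separated UC format (record type in column 1, query label in column 9, centroid label in column 10); the example lines and the `asv_to_otu_map` helper are made up for illustration.

```python
def asv_to_otu_map(uc_lines):
    """Map each ASV id to the id of its OTU centroid from UC-format lines.

    'S' records are centroids (they map to themselves), 'H' records are
    hits to a centroid, and 'C' records are per-cluster summaries (skipped).
    """
    mapping = {}
    for line in uc_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        rec_type = fields[0]
        # Drop annotations such as ';size=100' from the labels.
        query = fields[8].split(";")[0]
        if rec_type == "S":
            mapping[query] = query
        elif rec_type == "H":
            mapping[query] = fields[9].split(";")[0]
    return mapping


# Hypothetical UC content; in practice, read the lines from the .uc file.
example_uc = [
    "S\t0\t253\t*\t*\t*\t*\t*\tasv1;size=100\t*",
    "H\t0\t253\t98.8\t+\t0\t0\t253M\tasv2;size=5\tasv1;size=100",
    "S\t1\t252\t*\t*\t*\t*\t*\tasv3;size=50\t*",
    "C\t0\t2\t*\t*\t*\t*\t*\tasv1;size=100\t*",
    "C\t1\t1\t*\t*\t*\t*\t*\tasv3;size=50\t*",
]

otu_of = asv_to_otu_map(example_uc)
for asv, otu in sorted(otu_of.items()):
    print(asv, "->", otu)
```

The resulting dictionary can then be joined against the ASV table or taxonomy in R or Python to see which ASVs were collapsed into each OTU.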
I know these aren't massively different, but they are different all the same. Can anybody tell me why they might be different and how I might address this?
Hi! I'm currently attempting to cluster rep seqs from DADA2 output, and I'm wondering whether the rep seqs need to be trimmed to equal length prior to clustering. I read in some of the other QIIME 2 material that sequences should be trimmed to equal length before clustering, so I'm just trying to make sure I'm proceeding appropriately.
In my experience, I've not had issues with variable-length markers (e.g. ITS, trnL) when using DADA2. See the ITS workflow from the DADA2 developers.
From their tutorial:
Please ignore all the “Not all sequences were the same length.” messages in the next couple sections. We know they aren’t, and it’s OK!
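If you want to see how much length variation there actually is before deciding, a quick check on the exported representative sequences may help. This is only a sketch: it assumes you have exported the rep seqs to a FASTA file (e.g. with qiime tools export), and the inline example sequences and the `fasta_lengths` helper are made up for illustration.

```python
from collections import Counter


def fasta_lengths(fasta_text):
    """Return a dict of sequence id -> sequence length from FASTA text."""
    lengths = {}
    current = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].split()[0]  # id is the first word of the header
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)  # sequences may span several lines
    return lengths


# Hypothetical example; in practice, read the text from the exported FASTA file.
example_fasta = """>asv1
ACGTACGTAC
>asv2
ACGTACGT
>asv3
ACGTA
CGTAC
"""

lengths = fasta_lengths(example_fasta)
length_counts = Counter(lengths.values())
for length, n in sorted(length_counts.items()):
    print(f"{length} bp: {n} sequence(s)")
```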