This might be common knowledge already, but I wanted to articulate why I think closed-ref OTU picking could be a good fit for multi-region studies.
Three high-level strategies for defining OTUs… are canonically described as de novo, closed-reference, and open-reference OTU picking… Each of these methods has benefits and drawbacks.
In closed-reference OTU picking, input sequences are aligned to pre-defined cluster centroids in a reference database. If an input sequence does not match any reference sequence at a user-defined percent identity threshold, that sequence is excluded (PeerJ, 2014).
This is essentially ‘counting database hits’, so:
- resulting OTUs are 100% biased by the database
- resulting OTUs are 100% consistent with the database
- resulting OTUs are literally just the ones from the database
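As a toy illustration of that ‘counting database hits’ idea, here is a minimal sketch (made-up sequences, and a naive per-base identity over equal-length strings rather than real alignment, so it is not what vsearch actually does):

```python
# Toy sketch of closed-reference OTU picking: each query is assigned to a
# reference centroid if it matches at >= the identity threshold, otherwise
# it is discarded entirely. Real tools (e.g. vsearch) use heuristic
# alignment, not this naive per-base comparison of equal-length strings.

def percent_identity(a: str, b: str) -> float:
    """Fraction of matching positions (assumes equal-length, aligned seqs)."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)

def closed_reference_pick(queries, reference, threshold=0.97):
    """Map query id -> reference OTU id; drop queries with no good hit."""
    assignments = {}
    for qid, qseq in queries.items():
        best_otu, best_pid = None, 0.0
        for rid, rseq in reference.items():
            pid = percent_identity(qseq, rseq)
            if pid > best_pid:
                best_otu, best_pid = rid, pid
        if best_pid >= threshold:
            assignments[qid] = best_otu   # the feature *is* the reference OTU
        # else: query is excluded entirely
    return assignments

reference = {"OTU_1": "ACGTACGTAC", "OTU_2": "TTTTGGGGCC"}
queries = {"q1": "ACGTACGTAA",   # 90% identical to OTU_1
           "q2": "TTTTGGGGCC",   # exact match to OTU_2
           "q3": "GGGGGGGGGG"}   # matches nothing well
print(closed_reference_pick(queries, reference, threshold=0.90))
# q1 and q2 map to reference OTUs; q3 is dropped
```

The output features are reference OTU IDs, which is exactly why the result is 100% consistent with (and 100% biased by) the database.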
Modern ASV methods aim to be just as consistent without introducing database bias, but for this project we are knowingly using this strong bias to normalize across regions.
Thanks for clarifying @colinbrislawn! I agree, and I also generally avoid closed-ref OTU clustering in the modern era when I can. But this Ion Torrent kit seems like one of the special cases where the strengths (collapsing disparate amplicons) could outweigh the weaknesses (database bias, reduced resolution vs. ASVs), so there are times when I still use and advocate closed-ref OTU clustering.
One thing to note: we have known for a long time that OTU clustering on its own leads to inflated diversity estimates and needs to be paired with other filtering (or denoising!) methods to reduce those errors, and closed-ref OTU clustering on its own suffers from the same issues.
However, in this case you are using closed-ref OTU clustering after denoising. So the contents of that flaming review do not really apply here… erroneous sequences are being filtered/corrected by your denoising method of choice, then closed-ref OTU clustering is being used strictly to “collapse” the ASVs into full-length 16S sequences, not as a pseudo-error-filtering method. This should all still be benchmarked to see how this performs for this ion torrent kit (@cjone228’s mock communities will enable that endeavor!) but that review should not discourage this analysis.
Hello to all,
I’m trying this pipeline on my data. These are the steps (as suggested above):
1- Import data
3- dada2 denoise-pyro
4- qiime vsearch cluster-features-closed-reference
5- qiime fragment-insertion sepp
In the last step (5) I had an error message…
This is the script that I used:
qiime fragment-insertion sepp
@cjone228 and I have another couple of points we would like clarified:
Our fastq files contain single-end mixed-orientation reads (both forward and reverse). We imported our data using SingleEndFastqManifestPhred33V2 according to the QIIME 2 Importing Data document. However, we recently noticed that the Importing Data document states “In this variant of the fastq manifest format, the read directions must all either be forward or reverse.” Is there another way we should be importing our data? Or is the only solution to re-orient our reads, or to split them by direction prior to importing?
In the event that we are able to import our fastq files as-is (i.e., in mixed orientation), we wanted to clarify whether or not DADA2 can handle mixed-orientation reads. (Based on what we read, we don’t think that it can…)
So, overall we are just trying to clarify whether it is inevitable that we will need to split our reads by direction at some point or another.
P.S. @rparadiso - you are actually a step ahead of us, so we don’t have an answer to your question! Hopefully someone else has some insight for you
I am not 100% certain of the semantics there, but I think the point was that reads should not be pre-joined if they are imported in that format…
Your reads are all forward (or all reverse) because in this case F/R refer to the read direction on the sequencing instrument, not the orientation relative to the genome (which is mixed in your case).
So you are doing the right thing, and this is the correct format.
DADA2 can handle mixed-orientation reads (relative to the genome); that is not a problem, technically speaking. But mixed F + R reads and pre-joined reads will cause issues.
So again you are doing things correctly.
The only issue I can think of for mixed-orientation reads and dada2 is that you will get unique ASVs for reads from the same genome that are in reverse orientations. But in theory that is not a dada2 problem, it is an alpha diversity problem! (as I think we’ve discussed above but this topic is so long I can’t remember anymore).
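As a toy illustration of that inflation (made-up sequences): reads from the same template in opposite orientations are different strings, so an orientation-naive pipeline reports two features for one real sequence.

```python
# Toy illustration: reads from the same template in opposite genomic
# orientations are distinct strings, so orientation-naive feature tables
# double-count them; reverse-complementing one orientation collapses them.

COMP = str.maketrans("ACGT", "TGCA")

def revcomp(seq: str) -> str:
    return seq.translate(COMP)[::-1]

template = "ACGGTTCA"
fwd_read = template             # sequenced in forward orientation
rev_read = revcomp(template)    # same molecule, reverse orientation

naive_features = {fwd_read, rev_read}
print(len(naive_features))      # 2 "ASVs" from one real sequence

# Re-orienting (e.g. reverse-complementing the reverse reads) fixes it:
oriented = {fwd_read, revcomp(rev_read)}
print(len(oriented))            # 1
```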
The issue is that you do not need SEPP, and should not use SEPP here. When you use closed-reference OTU clustering, the features are no longer ASVs that need to be aligned/spliced into a new phylogeny. The features are now the matching reference OTUs, and you adopt the reference phylogeny.
See this issue for more details on why you should not use SEPP after closed-reference OTU picking:
So what should you (and everyone else who wants to use this pipeline) do instead? You should use the reference trees that ship with your reference database of choice (e.g., in your case use the greengenes 99% OTU reference tree since you used that same database for clustering with vsearch).
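One way to see why no insertion step is needed: after closed-reference clustering, every feature ID is a reference OTU ID, and those IDs are already tips in the reference tree that ships with the database. A toy sketch (made-up Newick string and IDs):

```python
import re

def newick_tip_names(newick: str) -> set:
    """Extract leaf names from a simple, unquoted Newick string."""
    # A tip name directly follows '(' or ',' and runs until ':', ',' or ')'.
    return set(re.findall(r"[(,]([^():,;]+)", newick))

# Toy reference tree: the tips are reference OTU IDs.
reference_tree = "((4479944:0.1,228054:0.2):0.05,813079:0.3);"

# After closed-reference clustering, feature IDs *are* reference OTU IDs,
# so they are already placed in the tree -- no fragment insertion needed.
closed_ref_features = {"4479944", "813079"}
assert closed_ref_features <= newick_tip_names(reference_tree)

# De novo ASVs, by contrast, are sequence hashes with no tree placement,
# which is the case SEPP exists for.
asv_features = {"a1b2c3d4"}
assert not asv_features <= newick_tip_names(reference_tree)
```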
Thanks everyone! I feel like we’re making a lot of progress!
Thanks Nicholas for your correction.
Now I’m a little confused about how to continue…
I have to download the tree from the Greengenes database, but then how do I continue? Can I run the core metrics as the next step?
You are correct - we did cover this earlier in the post
Thinking a little further ahead: if we import and run DADA2 on our genomically mixed-orientation reads, do you foresee that the reverse reads will have problems aligning to the reference database when doing closed-reference OTU clustering? (i.e., will we lose half of our data?)
We may have just answered our own question about closed-reference OTU clustering: in the QIIME 2 documentation for closed-reference clustering of features, the parameters section has the following option:
```
--p-strand TEXT Choices('plus', 'both')
    Search plus (i.e., forward) or both (i.e., forward
    and reverse complement) strands.  [default: 'plus']
```
We assume that if we choose 'both', mixed-orientation reads would be allowed?
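If our reading of that help text is right, 'both' just means each query gets a second chance as its reverse complement. A toy sketch of that logic (an exact-match stand-in, not the actual vsearch algorithm):

```python
# Toy 'does this query hit the reference?' check. With strand='both',
# the reverse complement of the query is also tried, which is what
# rescues reads sequenced in the opposite orientation.

COMP = str.maketrans("ACGT", "TGCA")

def revcomp(seq: str) -> str:
    return seq.translate(COMP)[::-1]

def hits_reference(query: str, reference: set, strand: str = "plus") -> bool:
    if query in reference:
        return True
    return strand == "both" and revcomp(query) in reference

reference = {"ACGGTTCA"}                      # made-up reference sequence
print(hits_reference("TGAACCGT", reference, strand="plus"))  # dropped
print(hits_reference("TGAACCGT", reference, strand="both"))  # rescued
```

Under this reading, 'plus' would silently discard the half of the data in the opposite orientation, while 'both' keeps it, at the cost of extra search time.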
Next, we went looking for the greengenes reference database for the OTU clustering step. We could not find the 13_8 release on the gg website, but we found this post and downloaded the file. Is this the correct file to use?
Since this is not strictly relevant to this topic (it is an issue with that specific reference tree and, e.g., you could use a different reference tree), do you want to open up a separate topic to solve your q2-fragment-insertion issue? If that solution is relevant to the current discussion, we can link back to that here.
@Lauren and I have been trying to troubleshoot this too, but with no luck so far. We got the same error as you when we repeated what you did, and also when we tried the same code but after importing the XX_otus_unannotated.tree.
We aren't sure how to use Python to fix the branch length issue as shown in the post that @Nicholas_Bokulich linked to. Our computing cluster does have Python installed, but we haven't been able to figure out how to actually use it yet.
We look forward to your thread on this and will plan to chime in there!
Hi everyone, @rparadiso was able to solve the greengenes branch length issue in a separate topic — see here for a few different solutions:
@cjone228 that topic lists some other options — opening and manually modifying the file, or running a python script in the bash shell (command line), so there are a few options to suit whatever you feel most comfortable working with.
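For anyone more comfortable reading than running code, the python route boils down to: read the Newick file, repair it, and write it back out. As an illustration only (the actual fix for the greengenes trees is in the linked topic; this helper and the example trees are made up here), a generic "missing branch length" repair might look like:

```python
# Illustration only: one generic "branch length issue" is a Newick tree in
# which some nodes lack an explicit branch length, which strict parsers
# reject. This toy helper inserts a default length wherever one is missing.
# (The real fix for the greengenes trees lives in the linked forum topic.)

def add_missing_branch_lengths(newick: str, default: str = "0.0") -> str:
    out, token = [], []
    for ch in newick:
        if ch in ",)":
            tok = "".join(token)
            if ":" not in tok:          # node (named or not) with no length
                tok += ":" + default
            out.append(tok + ch)
            token = []
        elif ch == "(":
            out.append(ch)
        elif ch == ";":
            out.append("".join(token) + ch)  # leave the root as-is
            token = []
        else:
            token.append(ch)
    return "".join(out)

print(add_missing_branch_lengths("((A:0.1,B),C);"))
# ((A:0.1,B:0.0):0.0,C:0.0);
```

For the real trees you would read the file contents, pass them through the repair step from the linked topic, and write the result to a new file before importing it into QIIME 2.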
Please post to that topic if you have any follow-up questions or run into any issues with fixing your tree(s).