Hi QIIME2 community,
I'm banging my head against the wall trying to find a way to deal with my non-oriented amplicon sequencing data. The way the libraries were dual indexed, i have reads oriented in both directions (forward and reverse complement) and after running the data through Deblur, I'm left with reverse complement duplicates of ASVs which is inflating the number of ASVs by ~2x.
I've scoured the forums and found two threads dealing with the same issue (DADA2 with non-oriented library, and Any other option for doing the feature table without trimming from PGM data?), however, solutions proposed there just don't work for my case. The data I have is merged paired-end reads (merged previously by colleagues using PEAR) and the primer sequences have been trimmed off already. I don't have access to the original raw fastq, so re-orienting the reads based off of the primers isn't an option.
I have thought about doing an all-by-all comparison of the ASVs output by Deblur in order to determine which forward ASVs match to the reverse complement ASVs, and using this info to merge the frequencies together in the feature table. But it's complicated by the fact that I only have 300 bp of the ~312 base target amplicon region (due to truncating all sequences to a uniform length for Deblur'ing), so I would have to match the ends of the ASVs together. That kind of matching would be a bit fuzzy since even if there was a perfect match, it's not guaranteed that would be the same ASV because I wouldn't be accounting for the first and last 12bp where there is no overlap. Furthermore, I'm assuming that having reads in both directions adversely affects the de-noising process in Deblur.
Although I really wanted to use Deblur for the analysis, I've decided that maybe I can go with OTU clustering using q2-vsearch to solve my problem. As I'm looking through the documentation for vsearch cluster-features-open-reference, I see that there is an option, --p-strand, where I can specify to search both forward and reverse complement in the clustering, which should solve the issue I'm having with the non-oriented libraries. However, I don't see this option available in vsearch cluster-features-de-novo, which is the method I would like to use. After looking at the vsearch documentation, I can see that the default for this option in vsearch is to only search the forward strand in the clustering.
I looked through the codebase for q2-vsearch on Github and in the cluster-features-de-novo() method, no argument for the "--strand" option is passed along to the system call to run vsearch, so vsearch is only searching the forward stand, the default behavior. As such, it looks like I'm out of luck with using cluster-features-de-novo, since it won't work with the sequences that are in the reverse complement direction.
While perusing through the code, I also noticed something I though was odd in the cluster-features-open-reference() method. This method has the "strand" argument for the "--strand" option in vsearch, which the user can set to "both" in order to handle input sequences that are in the "wrong" direction, so to speak. This option is used in the function call to cluster_features_closed_reference() (line 315), however, this option is not passed along to the cluster_features_de_novo() function call (line 336) that is used to cluster the remaining sequences that failed to cluster along with / match the reference database. My concern with this is that if a user wants both strands (forward and reverse complement) to be searched in the open-reference clustering, why wouldn't that same functionality/setting be used in both the closed-reference and the de-novo step of the clustering? It seems like the way the code operates now, it satisfies the use case for those who have an oriented library where all sequences are in the same direction, but that direction is the opposite of the direction of sequences in the reference. However, it doesn't work or provide a solution for those with non-oriented libraries, because the de-novo part of the clustering would still result in reverse complement duplicates of OTU sequences.
I feel like I have exhausted all possible options within QIIME2 that I could use to remedy my issue, but I wanted to gain some insight from the devs and the community to see if my thought process on all of this is correct. I very well may be off in my analysis. Worst case, I could do the OTU clustering outside of QIIME2, convert formatting as necessary and import the table and rep seqs back into QIIME2. I'm also thinking of simply altering the code in q2 vsearch cluster_features_de_novo in my installed version of the software. I think just adding in the option for "--strand" to the function would provide the solution I need, but I wanted to consult with y'all first. I can also fork the Github repo, make the change and submit a PR if this solution is valid.
I thank the QIIME 2 developers and the community for your replies and your help!
All the best,
Taylor