Dealing with non-oriented amplicon libraries

toconnell · October 26, 2018, 11:25pm

Hi QIIME2 community,

I'm banging my head against the wall trying to find a way to deal with my non-oriented amplicon sequencing data. The way the libraries were dual indexed, i have reads oriented in both directions (forward and reverse complement) and after running the data through Deblur, I'm left with reverse complement duplicates of ASVs which is inflating the number of ASVs by ~2x.

I've scoured the forums and found two threads dealing with the same issue (DADA2 with non-oriented library, and Any other option for doing the feature table without trimming from PGM data?), however, solutions proposed there just don't work for my case. The data I have is merged paired-end reads (merged previously by colleagues using PEAR) and the primer sequences have been trimmed off already. I don't have access to the original raw fastq, so re-orienting the reads based off of the primers isn't an option.

I have thought about doing an all-by-all comparison of the ASVs output by Deblur in order to determine which forward ASVs match to the reverse complement ASVs, and using this info to merge the frequencies together in the feature table. But it's complicated by the fact that I only have 300 bp of the ~312 base target amplicon region (due to truncating all sequences to a uniform length for Deblur'ing), so I would have to match the ends of the ASVs together. That kind of matching would be a bit fuzzy since even if there was a perfect match, it's not guaranteed that would be the same ASV because I wouldn't be accounting for the first and last 12bp where there is no overlap. Furthermore, I'm assuming that having reads in both directions adversely affects the de-noising process in Deblur.

Although I really wanted to use Deblur for the analysis, I've decided that maybe I can go with OTU clustering using q2-vsearch to solve my problem. As I'm looking through the documentation for vsearch cluster-features-open-reference, I see that there is an option, --p-strand, where I can specify to search both forward and reverse complement in the clustering, which should solve the issue I'm having with the non-oriented libraries. However, I don't see this option available in vsearch cluster-features-de-novo, which is the method I would like to use. After looking at the vsearch documentation, I can see that the default for this option in vsearch is to only search the forward strand in the clustering.

I looked through the codebase for q2-vsearch on Github and in the cluster-features-de-novo() method, no argument for the "--strand" option is passed along to the system call to run vsearch, so vsearch is only searching the forward stand, the default behavior. As such, it looks like I'm out of luck with using cluster-features-de-novo, since it won't work with the sequences that are in the reverse complement direction.

While perusing through the code, I also noticed something I though was odd in the cluster-features-open-reference() method. This method has the "strand" argument for the "--strand" option in vsearch, which the user can set to "both" in order to handle input sequences that are in the "wrong" direction, so to speak. This option is used in the function call to cluster_features_closed_reference() (line 315), however, this option is not passed along to the cluster_features_de_novo() function call (line 336) that is used to cluster the remaining sequences that failed to cluster along with / match the reference database. My concern with this is that if a user wants both strands (forward and reverse complement) to be searched in the open-reference clustering, why wouldn't that same functionality/setting be used in both the closed-reference and the de-novo step of the clustering? It seems like the way the code operates now, it satisfies the use case for those who have an oriented library where all sequences are in the same direction, but that direction is the opposite of the direction of sequences in the reference. However, it doesn't work or provide a solution for those with non-oriented libraries, because the de-novo part of the clustering would still result in reverse complement duplicates of OTU sequences.

I feel like I have exhausted all possible options within QIIME2 that I could use to remedy my issue, but I wanted to gain some insight from the devs and the community to see if my thought process on all of this is correct. I very well may be off in my analysis. Worst case, I could do the OTU clustering outside of QIIME2, convert formatting as necessary and import the table and rep seqs back into QIIME2. I'm also thinking of simply altering the code in q2 vsearch cluster_features_de_novo in my installed version of the software. I think just adding in the option for "--strand" to the function would provide the solution I need, but I wanted to consult with y'all first. I can also fork the Github repo, make the change and submit a PR if this solution is valid.

I thank the QIIME 2 developers and the community for your replies and your help!

All the best,
Taylor

colinbrislawn · October 27, 2018, 6:59pm

Hello Taylor,

Thank you for your detailed post.

I think you are right on the money: Qiime 2 does not have a perfect way to deal with non-oriented amplicon reads. There are options... but the easy ones are imperfect.

The easiest option is to do some sort of referenced based analysis, like qiime vsearch cluster-features-closed-reference, and use the --p-strand both option to flip around reads that were facing the wrong way. This method will remove any reads that didn't closely match the database.

Like you discuss, open-ref and denovo clustering would solve this problem, but the --strand both option is mysteriously missing! Where has it gone?

Great question: What's wrong with --strand both de novo clustering?

While this method would correctly cluster similar reads in both orientations, without a reference showing which strand is plus, your mixed-orientation reads get clustered into mixed-orientation OTUs. Are you prepared to do taxonomy assignment and treebuilding on mixed-orientation OTUs? I'm not!

I guess I just wanted to let you know that this problem really is hard, and you are not the only one struggling with it.

My suggestion? Try qiime vsearch cluster-features-closed-reference --strand both and see how the results look. If this method captures most of your reads, this might be an OK solution to this tough, tough problem.

Colin

toconnell · October 29, 2018, 8:17pm

Hey Colin,

Thank you for the quick response! I really appreciate your help with this!

I'm going to go ahead and take your advice and do closed-reference OTU picking to see what I get. I'm a bit worried about missing amplicon sequences that don't hit the reference, especially since the data I'm dealing with is going to have a lot of stuff that is not well represented in the database. However, it sounds like this is the best option I have.

Also, thanks for clearing up the issue with using --strand both in de novo clustering! I had not though about how that would affect downstream steps of the analysis. I agree that having mixed-orientation OTUs would be problematic for treebuilding and taxonomy assignment. But, I'm thinking I could handle that issue by aligning the mixed-orientation OTU rep-seqs against a reference database, examining the direction of the alignment, and then using that info to flip around the sequences in the "wrong" direction. That way I'm not losing the sequences that are too divergent to hit the reference. Does that sound like a reasonable option?

Best,
Taylor

colinbrislawn · October 29, 2018, 9:48pm

Hello Taylor,

Yes, that sounds like a good option.

Awhile ago, I wrote a Snakemake script that does this direction correction! The key lines are here. I have not tested this on your data set so use at your own risk!


rule map_to_ref:
    input:
        query="../tip_Step1/all-sequences.fasta", # From hundo
        ref="../ref.fna"
    output:
        fasta= "vsearch-map/matched.both.fasta",
        blast6= "vsearch-map/matched.both.blast6",
        uc= "vsearch-map/matched.both.uc"
        #,sam= "vsearch-map/matched.both.sam"
    shell:
        "vsearch --usearch_global {input.query} -db {input.ref} "
        "--id .7 --strand both --threads {THREADS} --dbmask none --qmask none "
        "--output_no_hits --sizein --sizeout "
        "--matched {output.fasta} --blast6out {output.blast6} --uc {output.uc}"

rule remap_reverse_comp:
    input:
        "vsearch-map/matched.both.fasta"
    output:
        "vsearch-map/matched.both.rc.fasta"
    shell:
        "vsearch --fastx_revcomp {input} --fastaout {output} --sizein --sizeout"


rule remap_forward_reads:
    input:
        matches= rules.map_to_ref.output.fasta,
        ref="../ref.fna"
    output:
        "vsearch-map/matched.F.fasta"
    shell:
        "vsearch --usearch_global {input.matches} -db {input.ref} "
        "--id .7 --strand plus --threads {THREADS} --dbmask none --qmask none "
        "--sizein --sizeout --matched {output}"

rule remap_reverse_reads:
    input:
        matches= rules.remap_reverse_comp.output,
        ref= "../ref.fna"
    output:
        "vsearch-map/matched.R.fasta"
    shell:
        "vsearch --usearch_global {input.matches} -db {input.ref} "
        "--id .7 --strand plus --threads {THREADS} --dbmask none --qmask none "
        "--sizein --sizeout --matched {output}"


rule remap_combine_forward_and_reverse:
    input:
        forward= rules.remap_forward_reads.output,
        reverse= rules.remap_reverse_reads.output
    output:
        "vsearch-map/matched.corrected.fasta"
    shell:
        "cat {input.forward} {input.reverse} > {output}"

Colin

toconnell · November 1, 2018, 7:02pm

Hey Colin,

This is great! I'm sure glad that I can do the de-novo clustering and then use mapping to the reference to re-orient the resulting OTU rep seqs. This will keep me from missing out on any cool new stuff present in the samples that isn't yet well-represented in the database. Thanks so much for the snakemake script! I will use caution in applying it to my data.

Best,
Taylor

system · December 3, 2018, 1:02am

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.