Low Resolution Vsearch Results

Hi,

Could I please have some advice on improving these results? This is an example using a subset of my samples.

I am working on 150 bp paired-end ZBJ COI sequences. My pipeline is as follows:

qiime cutadapt trim-paired \
  --p-cores 5 \
  --i-demultiplexed-sequences ZBJ_dataimported.qza \
  --p-front-f CTGTCTCTTATACACATCTCCGAGCCCACGAGAC \
  --p-front-r CTGTCTCTTATACACATCTGACGCTGCCGACGA \
  --p-no-discard-untrimmed \
  --o-trimmed-sequences cutadapters.qza

qiime cutadapt trim-paired \
  --p-cores 10 \
  --i-demultiplexed-sequences cutadapters.qza \
  --p-front-f AGATATTGGAACWTTATATTTTATTTTTGG \
  --p-front-r WACTAATCAATTWCCAAATCCTCC \
  --p-match-read-wildcards \
  --p-match-adapter-wildcards \
  --p-discard-untrimmed \
  --o-trimmed-sequences miseq.cut.noadapters.qza

cut results miseq.cut.noadapters.qzv (323.4 KB)

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs miseq.cut.noadapters.qza \
  --p-trunc-len-f 105 \
  --p-trunc-len-r 125 \
  --p-n-threads 20 \
  --o-table ZBJ_featuretable-miseq.noadapters.qza \
  --o-representative-sequences ZBJ_rep_seqs-miseq.noadapters.qza \
  --o-denoising-stats ZBJ_stats-miseq.qza

ZBJ_stats-miseq.qzv (1.2 MB)

qiime feature-classifier classify-consensus-vsearch \
  --i-query ZBJ_rep_seqs-miseq.noadapters.qza \
  --i-reference-reads arthropoda-ref-seqs-derep.qza \
  --i-reference-taxonomy arthropoda-ref-tax-derep.qza \
  --p-perc-identity 0.97 \
  --p-threads 5 \
  --o-classification taxonomy.qza \
  --o-search-results results.qza

My vsearch results mostly only go to genus level. The reference database was downloaded from BOLD and dereplicated.

I read on a forum that the minimum overlap needs to be at least 20 bp, although the default setting is 12 bp. The overlap in my sequences is 105 + 125 = 230, and 230 - 211 (ZBJ amplicon length) = 19. Therefore I have also tried Deblur as an alternative approach.
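As a quick check of the overlap arithmetic, here is a small shell sketch. It uses the truncation lengths and the 211 bp amplicon length quoted in this post; note that 105 + 125 is 230. The second calculation rests on an assumption that is not confirmed anywhere in this thread: that the 211 bp figure includes both primers, which cutadapt has already removed by the time the reads reach DADA2. If that assumption holds, the insert is shorter and the expected overlap is much larger.

```shell
# Expected overlap = forward truncation + reverse truncation - amplicon length.
expected_overlap() {
  echo $(( $1 + $2 - $3 ))
}

# Primer sequences and amplicon length as given in this post.
fwd_primer="AGATATTGGAACWTTATATTTTATTTTTGG"
rev_primer="WACTAATCAATTWCCAAATCCTCC"
amplicon_with_primers=211

# ASSUMPTION: the 211 bp figure includes both primers. If so, the
# primer-trimmed insert is 211 minus the two primer lengths.
insert_len=$(( amplicon_with_primers - ${#fwd_primer} - ${#rev_primer} ))

overlap_with_primers=$(expected_overlap 105 125 "$amplicon_with_primers")
overlap_insert_only=$(expected_overlap 105 125 "$insert_len")

echo "Overlap if the amplicon is ${amplicon_with_primers} bp: ${overlap_with_primers} bp"
echo "Overlap if the amplicon is ${insert_len} bp: ${overlap_insert_only} bp"
```

Whichever amplicon length is correct for the trimmed reads is the one to plug in when choosing truncation lengths.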

qiime vsearch merge-pairs \
  --i-demultiplexed-seqs miseq.cut.qza \
  --p-threads 7 \
  --o-merged-sequences miseq.cut.joined.qza

miseq.cut.joined.qzv (301.8 KB)

qiime deblur denoise-other \
  --i-demultiplexed-seqs miseq.cut.joined.qza \
  --i-reference-seqs arthropoda-ref-seqs-derep.qza \
  --p-trim-length 156 \
  --p-sample-stats \
  --p-jobs-to-start 4 \
  --o-representative-sequences miseq.joined.rep-seqs.qza \
  --o-table deblur.joined.table.qza \
  --o-stats deblur-joined.stats.qza

deblur.stats.qzv (229.7 KB)

I ran vsearch again with the same parameters and received the same results.
taxonomy.dada2.tsv (22.4 KB)

Any advice would be greatly appreciated, thank you.

Hello Rach3l,

I'm not super familiar with the COI region and I have never heard of ZBJ.
Would you be willing to share a citation so I can learn more?

While it's up to you to choose methods and settings, I hope I can help fill in some gaps.

Primer removal:
First, I think running cutadapt trim-paired twice is very clever. Anything left should be your real amplicon. And the quality afterwards looks good!

Read joining:
Based on the quality score plots, I would try trimming at 125 in both directions and see if all those extra bases help or hurt read merging. (It may or may not!) 70% joined in most samples is already good!
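A minimal sketch of that rerun, assuming the same input artifact and thread count as the DADA2 command earlier in the thread. The output file names (the "trunc125" suffix) are placeholders I made up, not files from this thread.

```shell
# Rerun DADA2 with both reads truncated at 125 bp.
# Flags mirror the original command; only the truncation lengths change.
TRUNC_LEN_F=125
TRUNC_LEN_R=125

if command -v qiime >/dev/null 2>&1; then
  qiime dada2 denoise-paired \
    --i-demultiplexed-seqs miseq.cut.noadapters.qza \
    --p-trunc-len-f "$TRUNC_LEN_F" \
    --p-trunc-len-r "$TRUNC_LEN_R" \
    --p-n-threads 20 \
    --o-table ZBJ_featuretable-trunc125.qza \
    --o-representative-sequences ZBJ_rep_seqs-trunc125.qza \
    --o-denoising-stats ZBJ_stats-trunc125.qza
else
  # Nothing is run unless a QIIME 2 environment is active.
  echo "qiime not found; activate your QIIME 2 environment first" >&2
fi
```

Comparing the merged counts in the new denoising stats against the original run should show whether the extra bases help or hurt.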

Taxonomy:

I can see how this is frustrating, though I'm not surprised. With <200 bp in length, there's only so much information each ASV can contain. Genus level is pretty good!

If absolutely necessary, the next step is a custom database. This process is intensive, though RESCRIPt can make it much easier for supported sources.

Hi Colin,

Thank you for your detailed response.

The diet DNA was amplified with primers that target insect DNA in the COI gene, designed by Zeale et al. (2011).

It's great that I'm getting genus information back, but my institution has a bioinformatics pipeline that gives results to the species level. That pipeline performs OTU clustering and assigns taxonomy with BLAST. I'd like to improve the QIIME 2 results so they are similar to my institution's pipeline, as I think QIIME 2 is more intuitive.

Could the issue be that the sequences are too short?

It could be, especially because short sequences have less information inside of them.

It would be very helpful to compare these two pipelines, especially how they assign taxonomy to feature sequences.
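One rough way to start that comparison is to count how many features each pipeline classifies to each taxonomic depth. The sketch below assumes a tab-separated taxonomy file with a header row and a semicolon-delimited lineage in the second column, which is how an exported QIIME 2 taxonomy usually looks; the function name is just for illustration, and the other pipeline's output would need to be put into the same layout first. It only measures depth, so it says nothing about whether the deeper labels are correct.

```shell
# Count features by the number of ranks in their assigned lineage.
# ASSUMPTIONS: tab-separated input, one header row, lineage in column 2,
# ranks separated by ";". A label like "Unassigned" counts as depth 1.
rank_depth_counts() {
  awk -F'\t' '
    NR > 1 && $2 != "" {
      n = split($2, ranks, ";")
      # Ignore empty trailing ranks, e.g. a lineage that ends in ";"
      while (n > 0 && ranks[n] ~ /^[ \t]*$/) n--
      depth[n]++
    }
    END { for (d in depth) print d "\t" depth[d] }
  ' "$1" | sort -n
}

# Example: the taxonomy table shared earlier in this thread, if present.
if [ -f taxonomy.dada2.tsv ]; then
  rank_depth_counts taxonomy.dada2.tsv
fi
```

Each output line is a depth followed by the number of features assigned to that depth, so the two pipelines can be lined up side by side.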

I don't want to be critical before I know more, but it's pretty easy to make a pipeline that will confidently (and inaccurately) classify anything to the species level. Overclassification has been a large problem, so I like to have a benchmark that shows when a program is TOO confident and produces false positives.

Is their pipeline open-source?
