Truncate reads before VSEARCH with pyrosequencing/Ion Torrent data

ja.morillo · November 21, 2018, 8:26pm

First, thanks a lot for all the development in qiime2 and the help provided here, you are amazing people!

I need to analyze several 454 and ion-torrent datasets. I would like to use clustering-vsearch-otus first, and it would be great to do the whole analysis in qiime2. I know that there is a straightforward tutorial for this already in qiime2: Clustering sequences into OTUs using q2-vsearch.

BUT the tutorial starts with data that have already been quality-controlled.

Short question: how can I quality-control my 454 data before clustering, in qiime2, including trimming all reads to a fixed length? (I would like to avoid switching between qiime1 and qiime2.)

Same question, expanded:

quality-demux-filter-stats.qzv (1.2 MB)

For the clustering (according to qiime2), the data need to meet these requirements:

non-biological sequences are removed
reads are all trimmed to the same length
low-quality reads are discarded

So far I have been able to do all these steps in qiime2 except TRIMMING to the same length (I want 300 bp, based on a FastQC analysis of the raw data).

I am executing the following commands (following the flowchart here for single-end barcoded data: Overview of QIIME 2 Plugin Workflows — QIIME 2 2018.11.0 documentation):

demultiplex (it works)

qiime cutadapt demux-single --i-seqs raw_data.qza --o-per-sample-sequences demux.qza --o-untrimmed-sequences untrimmed.qza --m-barcodes-file metadata.tsv --m-barcodes-column BarcodeSequence

trim the primers (it works)

qiime cutadapt trim-single --i-demultiplexed-sequences demux.qza --p-front AGAGTTTGATCMTGGCTCAG --p-adapter GCTGCCTCCCGTAGGAGT --o-trimmed-sequences demux-trimmed.qza

filter low-quality reads (it works)

qiime quality-filter q-score --i-demux demux.qza --o-filtered-sequences quality-demux-filtered.qza --o-filter-stats quality-demux-filter-stats.qza --p-min-quality 25

How can I now trim all reads to 300 bp? I don’t understand the purpose of the “--p-min-length-fraction” option here, since it apparently cannot be used to set a fixed read length (say 300 bp).

Thanks in advance for your help!

NOTE: I guess the “denoise-pyro” method (denoise-pyro: Denoise and dereplicate single-end pyrosequences — QIIME 2 2018.11.0 documentation) is not what I should use for this (?), because it applies the DADA2 error-correction algorithm and the output is an ASV table, which to me does not make sense as input for vsearch clustering. Please correct me if I am wrong.

Nicholas_Bokulich · November 22, 2018, 2:03pm

If you use dada2, you can use the --p-trunc-len parameter to trim to the same length.

I would recommend using that — it will apply the length trimming you require, as well as denoising the sequences. The output can still be used as input for OTU clustering. Think of it this way: before OTU clustering, we dereplicate the sequences to generate a list of unique sequence variants. With denoising methods, we produce the same output — unique sequence variants — except that they have been denoised to remove/correct erroneous sequences.

So use dada2 denoise-pyro instead of q-score, then input to q2-vsearch for OTU clustering.
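As a rough sketch of that workflow (not a tested recipe): the commands below truncate to 300 bp during denoising and then optionally cluster at 97%. The input demux-trimmed.qza is the primer-trimmed artifact from the question; all output file names are placeholders, and the `run` helper is a hypothetical convenience that only prints the commands when qiime is not on the PATH.

```shell
# Sketch only: output file names are placeholders, not taken from the thread.
TRUNC_LEN=300   # fixed read length requested in the question

# Hypothetical helper: run the command if qiime is installed, otherwise just print it.
run() {
  if command -v qiime >/dev/null 2>&1; then
    "$@"
  else
    echo "[dry run] $*"
  fi
}

# Denoise and truncate every read to TRUNC_LEN (shorter reads are discarded).
run qiime dada2 denoise-pyro \
  --i-demultiplexed-seqs demux-trimmed.qza \
  --p-trunc-len "$TRUNC_LEN" \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats denoising-stats.qza

# Optional: cluster the denoised sequence variants into 97% OTUs.
run qiime vsearch cluster-features-de-novo \
  --i-table table.qza \
  --i-sequences rep-seqs.qza \
  --p-perc-identity 0.97 \
  --o-clustered-table table-otu-97.qza \
  --o-clustered-sequences rep-seqs-otu-97.qza
```

The denoising stats artifact is worth inspecting afterwards: if most reads are shorter than the chosen truncation length, they will have been dropped.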

Good luck!

ja.morillo · December 10, 2018, 3:30pm

Dear Nicholas:

Thanks a lot for your useful response. Our data set (soil samples) was generated on a 454 pyrosequencing machine. But because this discussion is probably also of interest to people working with Ion Torrent data (both technologies share the same issue with nucleotide homopolymers), we would like to clarify this topic a bit more:

I have followed two different approaches using qiime2 to analyze our 454-data (in this case 16S rRNA amplicons):

denoise-pyro -> “ASV table” -> diversity metrics
denoise-pyro -> VSEARCH (97, 98, 99% thresholds) -> “OTU table” -> diversity metrics
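For reference, the second approach could be scripted roughly as follows. This is only a sketch: table.qza and rep-seqs.qza are assumed names for the denoise-pyro outputs, the output names are placeholders, and `run` is a hypothetical helper that prints the commands instead of running them when qiime is not installed.

```shell
# Sketch only: table.qza and rep-seqs.qza are assumed names for the denoise-pyro outputs.
# Hypothetical helper: run the command if qiime is installed, otherwise just print it.
run() {
  if command -v qiime >/dev/null 2>&1; then "$@"; else echo "[dry run] $*"; fi
}

# Cluster the same denoised features at each identity threshold.
for id in 0.97 0.98 0.99; do
  pct="${id#0.}"   # 0.97 -> 97, used only to label the output files
  run qiime vsearch cluster-features-de-novo \
    --i-table table.qza \
    --i-sequences rep-seqs.qza \
    --p-perc-identity "$id" \
    --o-clustered-table "table-otu-${pct}.qza" \
    --o-clustered-sequences "rep-seqs-otu-${pct}.qza"
done
```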

Which approach would, in principle, be the most appropriate? In terms of beta diversity both options produce similar results. However, as expected, when we look in detail at, for example, differential abundance statistics, the results are very different. Resolution should be better with the first option (that is what we are aiming for by using qiime2), but I don’t know whether some noise is “left” in the DADA2 ASVs, in which case clustering the ASVs into OTUs would still be recommended.

In other words: do the “features” produced by DADA2 in qiime2 from Ion Torrent or 454 data (instead of Illumina) have a direct ecological meaning, or is “denoise-pyro” strictly a denoising procedure that needs to be followed by clustering into OTUs?

DADA2 has its own 454/Ion Torrent protocol. Are the same settings incorporated by qiime2 in this script? Is any benchmarking available?

I know that there are no “magic” answers, but a solid general recommendation would be a big help, because a lot of data from both technologies still needs to be analyzed, and qiime2 is a great tool!

Thanks again!!

Nicholas_Bokulich · December 11, 2018, 1:11pm

dada2 alone is more appropriate, but there is nothing technically wrong with subsequent clustering.

Yes, those same settings are used in the denoise-pyro method. This was written by the dada2 developers themselves — you should check their website for any benchmarking (I have not tested this method personally since I do not use pyro data).

In theory dada2 should provide the same type of output for pyro data, and subsequent clustering should be unnecessary.

I hope that helps!

ja.morillo · December 11, 2018, 1:51pm

Yes, it helps. Thanks!!
