Analyzing paired end reads in QIIME 2

gregcaporaso · November 30, 2017, 2:56pm

This community tutorial has been migrated to our official documentation. Please refer to that tutorial instead.

Click to see original community tutorial

This tutorial covers paired end read analysis in QIIME 2. It was developed and tested with the QIIME 2 2017.11 release. QIIME 2 releases prior to 2017.11 are not compatible with this tutorial.

Note: This tutorial does not cover read-joining and denoising with DADA2. Instead, this tutorial focuses on alternative approaches to analyzing paired end reads in QIIME 2. If you are interested in joining and denoising reads with DADA2, the Atacama soil microbiome tutorial illustrates how to use qiime dada2 denoise-paired for this purpose. If you plan to use DADA2 to join and denoising your paired end data, do not join your reads prior to denoising with DADA2; DADA2 expects reads that have not yet been joined, and will join the reads for you during the denoising process.

In QIIME 2, we use the term single end reads to refer to forward or reverse reads in isolation; we use the term paired end reads to refer to forward and reverse reads that have not yet been joined; and we use the term joined reads to refer to forward and reverse reads that have already been joined (or merged). It is important to understand which of these terms apply to your data, as this will determine what steps are necessary to analyze your paired end data.

It is currently possible to join paired end reads in QIIME 2 using the q2-vsearch plugin, or to import reads that have been joined outside of QIIME 2 (for example, with fastq-join). This tutorial will cover both of these processes.

Joining reads and analyzing joined reads

First, we'll join paired end reads with QIIME 2. Begin by downloading the following SampleData[PairedEndSequencesWithQuality] artifact, which contains the demultiplexed reads from the Atacama soil microbiome tutorial.

demux.qza

Joining reads

Next, use the join-pairs method in the q2-vsearch plugin to join the reads:

qiime vsearch join-pairs \
  --i-demultiplexed-seqs demux.qza \
  --o-joined-sequences demux-joined.qza

Viewing summary of joined data with read quality

You can next generate a summary of the demux-joined.qza artifact.

qiime demux summarize \
  --i-data demux-joined.qza \
  --o-visualization demux-joined.qzv

This summary is particularly useful for determining approximately how long your joined reads are (we'll come back to this when we denoise with Deblur). When looking at the quality plots in this visualization, if you hover over a specific position you'll see how many reads are at least that long (of the reads that were sampled for computing sequence quality). Make note of the highest sequence position where most (say, > 99%) of your reads are at least that long.

For example, when hovering over a black box in this visualization (which is generated from a larger data set than the one used in this tutorial), I see that 10000 out of the 40126 sequences were used to estimate the quality score distribution at this position.

When I hover over position 250, which is illustrated with a red box, I see that some sequences are not this long because only 9994 sequences were used for estimating the quality score distribution at this position. The red box and the red text below tell me that some sequences were not at least this long.

When I hover over position 254, which is also illustrated with a red box, I see that many sequences are not this long because only 845 sequences were used for estimating the quality score distribution at this position.

Based on a comparison of these plots, I will note that most of my sequences are at least 250 bases long. We plan to simplify this process in the near future (see add seven-number summary of read length distribution to `summarize` · Issue #71 · qiime2/q2-demux · GitHub).

Sequence quality control

Next we'll apply quality control to our sequences using quality-filter q-score-joined. This method is identical to quality-filter q-score, except that it operated on joined reads. The parameters to this method have not been extensively benchmarked on joined read data, so we recommend experimenting with different parameter settings.

qiime quality-filter q-score-joined \
  --i-demux demux-joined.qza \
  --o-filtered-sequences demux-joined-filtered.qza \
  --o-filter-stats demux-joined-filter-stats.qza

At this stage you can choose to proceed using Deblur for additional quality control, or you can dereplicate sequences and optionally cluster them into OTUs with q2-vsearch. Deblur should give much higher quality results, so we recommend that procedure and will illustrate that approach in the next steps of this tutorial.

If you are instead interested in experimenting with an analysis workflow that is more like QIIME 1 processing (for example, to compare your Deblur or DADA2 result with a QIIME 1-like pipeline), you should next dereplicate and cluster your sequences. If you try this option, we strongly encourage you to call qiime quality-filter q-score-joined with a higher min-quality threshold - possibly --p-min-quality 20 or --p-min-quality 30. You can then follow the steps in the OTU clustering tutorial. After clustering, you will likely want to filter features that are observed in only one sample using qiime feature-table filter-features --p-min-samples.

Deblur

You're now ready to denoise your sequences with Deblur. You should pass the sequence length value you selected from the quality score plots for --p-trim-length. This will trim all sequences to this length, and discard any sequences which are not at least this long.

Note: We use a trim length of 250 based on the quality score plots generated from the tutorial data set. Do not use 250 with your own data set, as this value will depend on your data set's read lengths. Use the quality score plots to choose an appropriate trim length for your data.

qiime deblur denoise-16S \
  --i-demultiplexed-seqs demux-joined-filtered.qza \
  --p-trim-length 250 \
  --o-representative-sequences rep-seqs.qza \
  --o-table table.qza \
  --p-sample-stats \
  --o-stats deblur-stats.qza

View summary of Deblur feature table

You can next summarize the feature table resulting from q2-deblur. This table and the corresponding representative sequences are now ready to be analyzed with the same methods and visualizers that would be used on single-end read data.

qiime feature-table summarize \
  --i-table table.qza \
  --o-visualization table.qzv

Importing pre-joined reads

If you have joined your reads outside of QIIME 2, for example with fastq-join, this section will illustrate how to import those reads. First, download the following demultiplexed and joined read data, which has been joined on a per-sample basis with fastq-join.

fj-joined.zip

Unzip this file as follows:

unzip fj-joined.zip

Import reads

Next, use qiime tools import to import this data. The format that is currently used here is SingleEndFastqManifestPhred33 - this will likely be updated in the future to a format that clearly describes this as joined read data, but in the meantime you should follow the formatting guidelines for the single-end "Fastq Manifest" formats.

qiime tools import \
  --input-path fj-joined/manifest \
  --output-path fj-joined-demux.qza \
  --type SampleData[JoinedSequencesWithQuality] \
  --source-format SingleEndFastqManifestPhred33

Viewing summary of imported data with read quality

You can generate a summary of the resulting artifact as follows:

qiime demux summarize \
  --i-data fj-joined-demux.qza \
  --o-visualization fj-joined-demux.qzv

You can now continue analyses with your joined reads as described above, e.g. quality filtering with q2-quality-filter, denoising with q2-deblur, or dereplicating and picking OTUs with q2-vsearch.

Happy QIIMEing!

Matilda_H-D · December 1, 2017, 1:52am

This is exactly applicable to the project I just started analysing this week! Argh! I should have known better than to start just before a new release.

My reads had been joined with fastq-join (still multiplexed), and then I imported them into Qiime 2 as EMP single-end fastq, then went through demultiplexing, quality filtering and deblur.

Could you please provide some more details about the difference between this approach and the new commands specific to joined reads described in this tutorial? Should I go back and re-import my data as joined reads instead?

Thanks!

gregcaporaso · December 4, 2017, 8:34pm

Hi @Matilda_H-D,
The only difference between these approaches should be in the read joining step. fastq-join and vsearch do this slightly differently so some reads may get joined with one that are not joined with the other. The determination of quality scores for overlapped bases also differs between fastq-join and vsearch, so that might cause some differences in the results of the quality filter step, which uses those quality scores. My guess is that the differences would be small, and in practice not matter a whole lot, but I can't promise that (the only way to know for sure would be to test both pipelines side-by-side).

Hope this helps!

Matilda_H-D · December 4, 2017, 11:43pm

Thanks @gregcaporaso! I might try them both out on the same dataset sometime and see if I find any differences.

gregcaporaso · December 4, 2017, 11:47pm

Sounds good, please let us know what you find - we'll be interested either way.

thermokarst · December 5, 2017, 1:28pm

An off-topic reply has been split into a new topic: Deblur vs DADA2 Questions

Please keep replies on-topic in the future.

Matilda_H-D · December 19, 2017, 1:29am

Hi @gregcaporaso,

Would it make any sense to quality-filter the forward and reverse reads separately before joining them? I've noticed the R2 tends to be generally lower in quality, so was wondering if I could filter such that I only end up with joined reads where the quality stays relatively high across the whole read. I think that would mean losing out R1 reads that are perfectly fine, though, simply because their corresponding R2 was low-quality, filtered out and then the forward and reverse were not able to be joined.

This stage of analysis is pretty new to me -- I'm more used to the downstream community analysis, so I'm not sure if I'm approaching this in the right way. I would appreciate anyone's thoughts!

gregcaporaso · December 19, 2017, 8:58pm

Hi @Matilda_H-D,
That wouldn't be possible to do with QIIME 2, or with any of the other methods that I'm aware of. The issue is that the reads are joined based on their order in the fastq file (so the first read in the forward read fastq file is joined with the first read in the reverse read fastq file, the second read in the forward read fastq file is joined with the second read in the reverse read fastq file, and so on). If you were to perform quality control on the reads independently, the order between the two files wouldn't match up anymore.

In theory, it would be possible to do what you're describing, but it would require some custom coding. I'm not sure that it would buy you a lot though over the pipeline illustrate in this tutorial. Joining with vsearch requires that the bases in the overlap region match (the maximum number that can mismatch can be adjusted with the --p-maxdiffs parameter). This is a pretty stringent quality control measure, because it requires that base calls in the overlap region be made twice independently. There are also some other parameters you can adjust that would achieve similar quality control to what you're describing - see the --p-truncqual and --p-minlen parameters. You could take a look at the vsearch manual for more detail on how all of these work.

Hope this helps!

Setiawan · January 10, 2018, 5:41am

Hi @gregcaporaso,

It is necessary for me to conducting chimera-filtering after deblur? Because in deblur-stats.qza there is unique-reads-chimeric and reads-chimeric, it is mean there is a chimera in my sequences or the following chimera have been removed from my sequences?
And in Identifying and filtering chimeric feature sequences with vsearch tutorial the table.qza and rep-seqs.qza are results from DADA2, so the artifact from Deblur also work too right?

Thank you

jairideout · January 11, 2018, 3:31am

Hi @Setiawan! Both Deblur and DADA2 perform chimera detection and removal, so you don't need to filter the chimeras in a separate step. The vsearch-based chimera filtering Community Tutorial is only necessary if you're not using Deblur or DADA2 (e.g. if you're performing OTU clustering).