Paired-end or single-end FASTQ sequences?

teleos · February 5, 2018, 3:37pm

Hello,
I am looking into the data from the Earth Microbiome Project (EMP) release 1, and I got the FASTQ files for the study from the ENA database using this code: emp/code/download-sequences/download_ebi_fastq.sh at master · biocore/emp · GitHub. I want to demultiplex, denoise with DADA2, and then merge my sequences. However, I am having some difficulty figuring out whether the reads I got are single-end or paired-end reads.

This is what the sequences for one sample look like:
@1799.L9.L9_0 orig_bc=ATCATCTGGGTT new_bc=ATCATCTGGGTT bc_diffs=0
TACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGCGGCTTGGTCAGTCAGATGTGAAATCCCCGGGCTTAACCTGGGAACTGCATTTGATACTGCCAGGCTAGAGTATGTTAGAGGAATGCGGAATTCCGGGT
+
CCCBCBBDBCCCGGGGGEGGGGHGGGGGHGHHHHGHGGGGHGHGGGGGGGGDHFGGGGGGHHHHHHHHHHHHHHHHHHHHGGGGDCHHHHHHHHGHHGHHHHHHHHHHHHHHHHHHHHHHHHHGHHHHHHHGHBGHHHHGGFCGHHHHGG:
@1799.L9.L9_1 orig_bc=ATCATCTGGGTT new_bc=ATCATCTGGGTT bc_diffs=0
TACGGAGGGGGTTAGCGTTGTTCGGAATTACTGGGCGTAAAGCGTACGTAGGCTGATTAGTCAGTCAGAGGTGAAATCCCAGGGCTCAACCCTGGAACTGCCTTTGATACTGCTAGTCTTGAGTTCGAGAGAGGTGAGTGGAATTCCAAGT
+
BBBABBADBBBBEGGGGGGEFGHGGGGGHGHHGHHHGGGGHGHEFGGGHGGEGFHHFHHHHHHHHGHHFGCFGHHGHHGHHHGGGFGGHHHGHGHGHHHGHCGHHHHGHFHHEHHHHHHHHFHHHGFGGFGEC/GHGHHHGGHHHHFHHG1
@1799.L9.L9_2 orig_bc=ATCATCTGGGTT new_bc=ATCATCTGGGTT bc_diffs=0
TACGGAGGGGGTTAGCGTTGTTCGGAATTACTGGGCGTAAAGCGTACGTAGGCTGATTAGTCAGTCAGAGGTGAAATCCCAGGGCTCAACCCTGGAACTGCCTTTGATACTGCTAGTCTTGAGTTCGAGAGAGGTGAGTGGAATTCCAAGT

Does anyone know if there is a way to generate R1 and R2 from these sequences? Any help will be appreciated!

Lu_Yang · February 5, 2018, 9:38pm

Hi, @teleos,

May I ask you a question? How did you download the file from the ENA database?
Thanks

jairideout · February 5, 2018, 10:04pm

Hi @teleos! I've contacted one of the EMP maintainers and will follow up when I hear back. Thanks!

jairideout · February 6, 2018, 4:22pm

Hi @Lu_Yang! See @teleos's original post for the script used to download the data from ENA.

jairideout · February 6, 2018, 4:40pm

Hi @teleos! I heard back from @Luke_Thompson about the EMP data set. To answer your original question, the reads you downloaded from ENA are forward reads only (i.e. single-end), so qiime dada2 denoise-single would be the appropriate denoising method.

However, before you start denoising these sequences with DADA2, there are some considerations:

DADA2 works best when applied to raw sequence data that has all sequencing artifacts removed (e.g. primers, barcodes, adapters). The data you obtained from ENA has had some sort of quality-filtering performed on it already, as the sequences appear to be trimmed to various lengths (i.e. the sequences are not all the same length, as we'd expect to see from an Illumina sequencer). My hunch is that QIIME 1's split_libraries_fastq.py script was used to demultiplex the data and perform some quality-filtering and trimming prior to ENA submission. @Luke_Thompson is digging into this to find out exactly what the preprocessing steps were, so that we can give you advice on how to denoise the sequences with DADA2. It's possible we may need to back up a step and obtain the raw sequences from somewhere else. See this forum topic for some discussion about quality-filtering prior to DADA2 denoising.
DADA2 operates best when applied to a single Illumina sequencing run at a time. You'll need to figure out which FASTQ files belong to the same sequencing run, and denoise the FASTQ files on a per-run basis. This will result in a feature table for each run, which can be merged with qiime feature-table merge for downstream analyses.
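One quick way to check the "trimmed to various lengths" observation is to tabulate read lengths in a FASTQ file. The following is only a minimal sketch: it assumes plain four-line FASTQ records, and the helper names (`read_fastq`, `length_histogram`) and the example reads are made up for illustration.

```python
from collections import Counter
import io


def read_fastq(handle):
    """Yield (header, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file (or a blank line)
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header[1:], seq, qual  # drop the leading '@'


def length_histogram(handle):
    """Count how many reads there are of each length."""
    return Counter(len(seq) for _, seq, _ in read_fastq(handle))


# Made-up records in the same header style as the post above.
example = io.StringIO(
    "@s1_0 orig_bc=AAAA new_bc=AAAA bc_diffs=0\nACGTACGT\n+\nIIIIIIII\n"
    "@s1_1 orig_bc=AAAA new_bc=AAAA bc_diffs=0\nACGTAC\n+\nIIIIII\n"
    "@s1_2 orig_bc=AAAA new_bc=AAAA bc_diffs=0\nACGTACGT\n+\nIIIIIIII\n"
)
hist = length_histogram(example)
for length, count in sorted(hist.items()):
    print(length, count)
```

If more than one length shows up, the reads have most likely been trimmed or truncated at some point before you received them. For a real file, pass `open("your_file.fastq")` instead of the `io.StringIO` example.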

To figure out which FASTQ files belong to the same sequencing run, here's @Luke_Thompson's suggestion:

"The master EMP mapping file (emp_qiime_mapping_release1.tsv, available from EMP GitHub and FTP site) has run_center and run_date listed. This should give a pretty good approximation of which run is which. But let me see if I can find an actual run ID number that would be more conclusive."
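As a rough sketch of that suggestion, samples could be grouped by the (run_center, run_date) pair from the mapping file. The `run_center` and `run_date` column names come from the quote above. The `#SampleID` column name is an assumption (it is the usual first column in QIIME 1 mapping files), and the rows below are invented purely for illustration.

```python
import csv
import io
from collections import defaultdict


def group_samples_by_run(handle, id_column="#SampleID"):
    """Map (run_center, run_date) -> list of sample IDs from a tab-separated mapping file.

    id_column is an assumption; adjust it if the mapping file names it differently.
    """
    runs = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        runs[(row["run_center"], row["run_date"])].append(row[id_column])
    return dict(runs)


# Invented rows; a real run would pass open("emp_qiime_mapping_release1.tsv").
example = io.StringIO(
    "#SampleID\trun_center\trun_date\n"
    "sampleA\tCenterA\t2012-01-01\n"
    "sampleB\tCenterA\t2012-01-01\n"
    "sampleC\tCenterB\t2012-02-02\n"
)
groups = group_samples_by_run(example)
for run, samples in groups.items():
    print(run, samples)
```

As noted in the quote, this only approximates the true sequencing runs: two runs from the same center on the same date would be lumped together.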

@Luke_Thompson offered to follow up here when he has more details. In the meantime, you could try out Deblur (qiime deblur) to denoise these data. The EMP release 1 used Deblur to denoise the sequences, so if you're interested in trying that out, perhaps @Luke_Thompson or @wasade could help guide you with those analyses.

Thanks!

Lu_Yang · February 6, 2018, 5:58pm

Hi, @jairideout,

I managed to do it. Thanks so much.

Best.

wasade · February 7, 2018, 6:31pm

Optionally, it's also possible to just obtain the already Deblur'd data from the FTP (ftp://ftp.microbio.me/emp/release1/otu_tables/deblur/) if you want to avoid the compute expense.

Best,
Daniel

jairideout · February 7, 2018, 10:48pm

Thanks @wasade! I'm not able to navigate to the FTP link you posted -- could you double-check that please?

Luke_Thompson · February 9, 2018, 12:15am

Hi @jairideout, I checked that FTP link and it works for me.

Luke_Thompson · February 9, 2018, 12:15am

Hi @teleos and @jairideout,

Details about the sequence processing are on GitHub at emp/methods at master · biocore/emp · GitHub. In short, we ran the QIIME 1 command split_libraries_fastq.py with a Phred quality threshold of 3 and default parameters. Then we ran the notebook emp/code/02-sequence-processing/adaptor_cleanup.ipynb at master · biocore/emp · GitHub.

I hope this helps!
Luke

jairideout · February 9, 2018, 6:17pm

Thanks for the details @Luke_Thompson!

@teleos, you may be better off analyzing the ENA/EBI data using q2-deblur, or the precomputed Deblur results @wasade linked to. The QIIME 1 split_libraries_fastq.py step performs quality-score-based filtering and truncates the reads based on a Phred score threshold, among other variables. This quality-filtering step is recommended prior to using Deblur, and is analogous to QIIME 2's qiime quality-filter q-score method.
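To make the idea of Phred-based truncation concrete, here is a deliberately simplified sketch. It is not the actual split_libraries_fastq.py logic, which has additional parameters that are not reproduced here. It assumes Phred+33 quality encoding and treats any score at or below the threshold as low quality; both choices, and the function names, are assumptions of this sketch.

```python
def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality]


def truncate_at_low_quality(seq, quality, threshold=3, offset=33):
    """Cut the read just before the first base scoring at or below threshold.

    Simplified illustration only; real quality filters apply further rules.
    """
    for i, score in enumerate(phred_scores(quality, offset)):
        if score <= threshold:
            return seq[:i], quality[:i]
    return seq, quality


# 'I' decodes to Q40 and '#' to Q2, so the read is cut before the fifth base.
seq, qual = truncate_at_low_quality("ACGTACGT", "IIII#III")
print(seq, qual)
```

Truncation of this kind is one way a set of reads can end up with differing lengths.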

My understanding is that DADA2 works best with raw sequence data that hasn't been previously quality filtered or trimmed, since DADA2 uses the raw quality scores in its error models and performs its own trimming steps. DADA2 expects sequences that have all sequencing artifacts removed (e.g. primers, barcodes, and adapters). It sounds like the ENA/EBI data already have those sequencing artifacts removed, but I'm unsure whether the split_libraries_fastq.py filtering/trimming steps are acceptable prior to denoising with DADA2.

@benjjneb, do you have any advice on whether these preprocessed sequences are okay to use with DADA2? Thanks!

benjjneb · February 9, 2018, 7:12pm

It's OK to filter or trim before the q2-dada2 plugin. It's just usually redundant as the plugin workflows have a built-in filter and trim step.

For single-end data it's totally fine.

For paired-end data, problems arise for technical reasons. QIIME 1 filters the forward and reverse reads independently, so the reads in the filtered files are no longer in matching order as required by the DADA2 plugin, which will probably break the workflow. (The R package has a flag to fix such F/R mismatching, but it's not available to the plugin.)
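For illustration, re-synchronizing independently filtered forward and reverse reads amounts to keeping only the read IDs present in both files, in the same order. The sketch below is a hypothetical stand-alone helper, not the R package's implementation. It assumes the records are already parsed into (read_id, sequence, quality) tuples and that mates share an identical read ID; real Illumina headers may need normalizing first.

```python
def sync_pairs(forward, reverse):
    """Keep only reads whose ID appears in both lists, ordered as in forward.

    forward and reverse are lists of (read_id, sequence, quality) tuples.
    """
    reverse_by_id = {record[0]: record for record in reverse}
    fwd_out, rev_out = [], []
    for record in forward:
        mate = reverse_by_id.get(record[0])
        if mate is not None:
            fwd_out.append(record)
            rev_out.append(mate)
    return fwd_out, rev_out


# Made-up reads: r2 lost its reverse mate and r3 lost its forward mate.
forward = [("r1", "ACGT", "IIII"), ("r2", "ACGA", "IIII"), ("r4", "ACGC", "IIII")]
reverse = [("r1", "TTGA", "IIII"), ("r3", "TTGC", "IIII"), ("r4", "TTGG", "IIII")]
fwd_synced, rev_synced = sync_pairs(forward, reverse)
print([r[0] for r in fwd_synced], [r[0] for r in rev_synced])
```

After syncing, both lists contain the same read IDs in the same order, which is the matching order described above.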