Importing files in fastq.bz2 format

Danilo_Reis · February 17, 2021, 3:01pm

Hi everyone,
I had some contact with Qiime2 before, but only now that I have my NGS data am I trying to go deeper into it. I received demultiplexed paired-end sequences.

I have a basic question regarding importing data. The sequences are in the format "fastq.bz2". Each sample (I have 200 in total) has its own folder with several files, including these three: R1.fastq.bz2, R2.fastq.bz2 and joined-SR.fastq.bz2.

Reading the tutorials, I understood that I will have to create a manifest file to import the sequences because of their format. Is that right?
If so, do I need to have all three files (mentioned above) for every sample located in a single folder? Or is it possible to import from all the sample folders at once?

Thank you so much for your help!

llenzi · February 17, 2021, 5:37pm

Hi @Danilo_Reis,
Welcome to the forum!

To answer your question directly: if you prepare a manifest file for the import step, you don't need to move all the files into the same folder, as long as you specify the path to each file correctly.
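
As a rough illustration of that idea, here is a minimal Python sketch that walks one folder per sample and writes a tab-separated paired-end manifest. It assumes the files have already been recompressed and are literally named `R1.fastq.gz` and `R2.fastq.gz` inside each sample folder, and that the folder name is the sample ID. The function name `build_manifest` and the paths are hypothetical; the three header names are the ones used by the QIIME 2 "V2" paired-end manifest format.

```python
import csv
from pathlib import Path


def build_manifest(root, manifest_path,
                   fwd_name="R1.fastq.gz", rev_name="R2.fastq.gz"):
    """Write a paired-end manifest with one row per sample folder under `root`."""
    root = Path(root).resolve()  # manifests need absolute file paths
    rows = []
    for sample_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        fwd = sample_dir / fwd_name
        rev = sample_dir / rev_name
        # Skip folders that do not contain both reads of the pair.
        if fwd.is_file() and rev.is_file():
            rows.append((sample_dir.name, str(fwd), str(rev)))

    with open(manifest_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["sample-id",
                         "forward-absolute-filepath",
                         "reverse-absolute-filepath"])
        writer.writerows(rows)
    return rows
```

Calling `build_manifest("/path/to/samples", "manifest.tsv")` would then produce a single manifest covering every sample folder, so nothing has to be moved.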

However, there are a few things to note. I don't think qiime2 can work with 'bz2' files, so you may have to decompress them first (and then recompress with gzip if you want to save space, since you can import 'fastq.gz' files).
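
One possible way to do the recompression without leaving the uncompressed FASTQ on disk is sketched below, using only the Python standard library. The helper name `bz2_to_gz` is hypothetical, and it assumes the input really ends in `.bz2`; the data are streamed in chunks, so large files should not need to fit in memory.

```python
import bz2
import gzip
import shutil
from pathlib import Path


def bz2_to_gz(src):
    """Recompress `name.fastq.bz2` to `name.fastq.gz` in the same folder."""
    src = Path(src)
    if src.suffix != ".bz2":
        raise ValueError(f"expected a .bz2 file, got {src.name}")
    dst = src.with_suffix(".gz")  # R1.fastq.bz2 -> R1.fastq.gz
    # Stream from the bz2 reader to the gzip writer chunk by chunk.
    with bz2.open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst
```

Looping this over something like `Path("/path/to/samples").rglob("*.fastq.bz2")` would convert all sample folders in one go; the original `.bz2` files are left untouched.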

As for the files you need, you probably want to import R1.fastq.gz and R2.fastq.gz as paired files, as described in the "Importing data" page of the QIIME 2 2020.11.1 documentation (section "Casava 1.8 paired-end demultiplexed fastq"). What you need to find out is whether these sequences have been preprocessed somehow or are raw (e.g. still containing sequencing adapters and low-quality tails), so that you know from which point of the analysis you need to start.
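
For the manifest route (rather than the Casava layout), the import call could look roughly like the sketch below. It only builds the `qiime tools import` command and prints it, so it runs without QIIME 2 installed. The helper name and both file names are hypothetical, and the `PairedEndFastqManifestPhred33V2` format assumes Phred33 quality scores, which is what recent Illumina software normally writes.

```python
import shlex


def qiime_import_command(manifest_path, output_path="demux-paired-end.qza"):
    """Return the argv list for importing paired-end reads listed in a manifest."""
    return [
        "qiime", "tools", "import",
        "--type", "SampleData[PairedEndSequencesWithQuality]",
        "--input-path", str(manifest_path),
        "--output-path", str(output_path),
        # Assumption: Phred33 scores and a tab-separated "V2" manifest.
        "--input-format", "PairedEndFastqManifestPhred33V2",
    ]


cmd = qiime_import_command("manifest.tsv")
# Print a shell-quoted version to copy into a terminal,
# or pass `cmd` to subprocess.run inside an activated QIIME 2 environment.
print(shlex.join(cmd))
```
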
Hope it helps
Cheers

Danilo_Reis · February 18, 2021, 9:50am

Hi @llenzi ,
Thank you so much for your reply. I am a bit confused about where to start analysing the data, since the sequences have been preprocessed (see below the information I got from the company).

What I suppose is that I should start by importing the "joined sequences" of each sample, after decompressing the bz2 files and recompressing them, like you said.

This is the info I received from the company:

Delivery contents:
• ’RAW’: raw sequencing data after basecalling and demultiplexing in compressed FASTQ format
• ’AdapterClipped’: compressed FASTQ files containing sequencing adapter clipped reads
• ’PrimerClipped’: compressed FASTQ files containing primer sorted reads
• ’Combined’: compressed FASTQ files containing consensus sequences after overlap combination of forward and reverse reads

Data analysis:

Demultiplexing of all libraries for each sequencing lane using the Illumina bcl2fastq v2.20 software [3] (folder ’RAW’):
• 1 or 2 mismatches or Ns were allowed in the barcode read when the barcode distances between all libraries on the lane allowed for it

Sorting of reads by amplicon inline barcodes (folder ’RAW’):
• 1 mismatch was allowed per barcode
• the barcode sequence was clipped from the sequence after sorting
• reads with missing barcodes, one-sided barcodes or conflicting barcode pairs were discarded

Clipping of sequencing adapter remnants from all reads (folder ’AdapterClipped’):
• reads with final length < 100 bases were discarded

Primer detection and clipping (folder ’PrimerClipped’):
• 3 mismatches were allowed per primer
• pairs of primers (Fw-Rev or Rev-Fw) had to be present in the sequence fragments
• if primer-dimers were detected, the outer primer copies were clipped from the sequence
• the sequence fragments were turned into forward-reverse primer orientation after removing the primer sequences

Combination of forward and reverse reads using BBMerge v34.48 [2] (folder ’Combined’):
• the consensus sequences of combinable fragments are named “joined-SR”; uncombinable read pairs end up in the “R1” and “R2” files

Creation of FastQC reports for all FASTQ files

cheers,
Danilo

llenzi · February 18, 2021, 10:37am

Hi @Danilo_Reis,

Good, you have lots of possibilities here. Which marker gene are you working on? What length are the sequences? Did you perform the initial PCR yourself, or did your provider do that (i.e. do you have the primer sequences used for the initial PCR)? These answers may change the approach a bit.

In general I like to start with sequences that are as raw as possible, to have more control over the process. If this fails for some reason, I may work with more processed sequences: trimmed sequences first, then joined sequences as a last resort (but again, this is just my personal choice).
Using reads from different stages may have implications for which tools you can use for the denoising step.
Given that you already have demultiplexed sequences, the pipelines you could use are roughly the following:

Starting from RAW seqs:
Importing demultiplexed seqs -> Adapter/PCR primer trimming -> denoising -> diversity and taxonomy classification

This way you have to repeat all the steps, but you have control over how many reads you lose at the trimming step. You can use either dada2 or deblur for denoising. From what they write, it seems that the raw dataset contains sequences in mixed orientation, which are turned to forward-reverse orientation only after the primers are removed. To cope with this, you may have to preprocess the sequences to re-orient them yourself (the RESCRIPt plugin in qiime2 could help you with this).
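
To make the mixed-orientation issue concrete, here is a plain-Python sketch of the idea (it is not what RESCRIPt or the company pipeline does). It assumes each mate starts directly with its primer, that only substitutions occur (no indels), and that a pair in the "wrong" orientation can be fixed by swapping R1 and R2. The function names and the default of 3 mismatches are illustrative choices.

```python
def starts_with_primer(read, primer, max_mismatches=3):
    """True if `read` begins with `primer`, allowing a few substitutions."""
    if len(read) < len(primer):
        return False
    # zip stops at the end of the primer, so only the read prefix is compared.
    mismatches = sum(1 for r, p in zip(read.upper(), primer.upper()) if r != p)
    return mismatches <= max_mismatches


def orient_pair(r1, r2, fwd_primer, rev_primer, max_mismatches=3):
    """Return (forward_read, reverse_read), or None if no orientation fits.

    The mates are swapped when R1 starts with the reverse primer
    and R2 starts with the forward primer.
    """
    if (starts_with_primer(r1, fwd_primer, max_mismatches)
            and starts_with_primer(r2, rev_primer, max_mismatches)):
        return r1, r2
    if (starts_with_primer(r1, rev_primer, max_mismatches)
            and starts_with_primer(r2, fwd_primer, max_mismatches)):
        return r2, r1
    return None
```

Pairs for which `orient_pair` returns `None` would be the ones to inspect or discard; in a real run the quality strings have to be swapped together with the sequences.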

Starting from Primer Clipped seqs:
denoising -> diversity and taxonomy classification

You import the trimmed sequences and can denoise with dada2 or deblur. This is probably the easiest pipeline for you to start with; however, you may have less choice in the denoising settings (especially for dada2) because the sequences are preprocessed. The length of the sequences is important to give you room for the merging step, so if they trimmed the low-quality tails you may have less flexibility here; if they did not, well, that's good for you!
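
A quick way to see how much room is left for merging is to tabulate read lengths and compare them with the expected amplicon length. The sketch below assumes gzip-compressed, standard four-line FASTQ records; the function names are hypothetical, and the overlap formula is the simple case where neither read is longer than the amplicon.

```python
import gzip
from collections import Counter


def read_length_counts(fastq_gz_path):
    """Count how many reads have each length in a gzip-compressed FASTQ file."""
    counts = Counter()
    with gzip.open(fastq_gz_path, "rt") as handle:
        for line_number, line in enumerate(handle):
            # In four-line FASTQ records the sequence is the second line.
            if line_number % 4 == 1:
                counts[len(line.strip())] += 1
    return counts


def expected_overlap(forward_len, reverse_len, amplicon_len):
    """Bases shared by the two mates after merging (negative means a gap).

    Assumes neither read is longer than the amplicon itself.
    """
    return forward_len + reverse_len - amplicon_len
```

For example, two 200-base reads over a 300-base amplicon would overlap by 100 bases; if trimming shortened both reads to 140 bases, `expected_overlap` goes negative and the pair could not be merged.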

Starting from the joined seqs:
denoising -> diversity and taxonomy classification

It looks the same as the previous one, but given that you are using already merged sequences, you have no choice and have to use deblur for the denoising step. This is probably the quickest pipeline, but it is not my preference because you have no control over the merging step, which I would try to perform within the dada2 denoising step, or with vsearch if you prefer to denoise with deblur (all of this can be done within the qiime2 environment). But again, this is just my personal preference; there is nothing wrong with using BBMerge as they did.

As a general point, qiime2 has a nice way of embedding in each object its provenance and what happened to produce it, so starting to use qiime2 as early as possible may turn out to be very handy in the long term for tracking purposes too.

I hope this clarifies the range of possibilities you have.
Cheers
Luca

Danilo_Reis · February 18, 2021, 11:00am

Hi @llenzi ,

Things are definitely much clearer now. Thanks a lot.
I am working with the ITS region for fungal diversity. The sequences should be around 300 bp long.
I performed two PCRs myself before sending the product to the company: the first to amplify my target region and the second (a short one) to attach the adapters.
These are the primers:
ITS1FKyo2: TAGAGGAAGTAAAAGTCGTAA
ITS86R: TTCAAAGATTCGATGATTCA

Cheers,
Danilo

llenzi · February 18, 2021, 11:40am

Hi @Danilo_Reis,

I see. I am not an ITS expert, but you may look at:

It is a bit old, but it may still be relevant!
To me, you are good to go starting from the raw reads, with the possible caveat of re-orienting them. For the database, I suggest looking at:

Cheers
Luca