How to import paired-end fastq files

nbourre · January 29, 2019, 3:44pm

Hey guys,

I'm totally new to bioinformatics. I have a degree in computer science and very basic knowledge of biology. I have been asked to learn QIIME 2 for a project. So I started the "Moving pictures" tutorial, so far so good, but I need to do something specific.

I have two files which I think are paired-end, because the file names end with xyz_R1.fastq.gz and xyz_R2.fastq.gz. I tried to load the files as in the examples, but I get a wrong format error.

What can I do with these files? Where do I start?

Thanks for helping me out
The lost guy

thermokarst · January 29, 2019, 4:41pm

Welcome @nbourre!

You are off to a good start! Have you had a chance to review any of the other tutorials on the User Docs site? In particular, have you had a chance to look at these two:

If not, then I suggest you make those your next stop on the QIIME 2 express!

system · March 1, 2019, 10:41pm

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.

nbourre · March 26, 2019, 4:10pm

Hey guys,

I'm a software developer who has been put on a project to learn about bioinformatics. I've read the tutorial, but it's like Chinese to me.

So I have two files with these names

MI.M03992_0107.001.BioOHT_1rc.SCI017689_PCR2-1-XXXXX_R1.fastq.gz
MI.M03992_0107.001.BioOHT_1rc.SCI017689_PCR2-1-XXXXX_R2.fastq.gz

I need to create a table of the species found in these files, with the percentage of each species in each sequencing run.

What is the pipeline I need to follow?

Thanks

thermokarst · March 26, 2019, 6:00pm

Hey there @nbourre!

You can import these files and demultiplex using the cutadapt protocol.

This is one of the highlighted outputs in the Moving Pictures tutorial.

nbourre · March 26, 2019, 8:38pm

Hi @thermokarst,

Thanks for giving me a hand. I already tried the Moving Pictures tutorial, but I get stuck when it comes to providing a ".tsv" file. I don't have this kind of file.

Eg. Here and here.

Is there a way to work around this problem?

thermokarst · March 26, 2019, 8:45pm

Yep! The "problem" can be rectified by checking out this tutorial:

https://docs.qiime2.org/2019.1/tutorials/metadata/

nbourre · April 2, 2019, 1:20pm

The content of the forward sequence (fastq) is in this form.

@M03992:107:000000000-AWNDM:1:2106:11512:1300 1:N:0:NTCACGTT
AAACTTAAAGGAATTGACGGGGGCCCGCACAAGCAGCGGAGCATGTGGTTTAATTCGANGCAACGCGAAGAACCTTACCTAGACNTGACATCTCCTGAATTACTCTGTAATGGAGGAAGCCGCTTCGGTGGCAGGAAGACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTCTTGTTANTTGCTACCATTTAGTTGAGCACTCTAGAGAGACTNCCCGGGTTANCAGGGAGGAAGGTGGGGATGACGTCAAATCATCATG
+

I only have R1 and R2 files and no metadata. What can I do?

thermokarst · April 2, 2019, 1:34pm

You need to create a metadata file with information about your sample IDs and your barcodes. Did you check out the cutadapt tutorial I linked to above? Once you have that file, you can proceed to demultiplexing.
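
For illustration only, here is a minimal sketch of what such a metadata file might look like, written from the shell. The sample IDs and barcode sequences are placeholders, not values from this thread, and the column name `barcode-sequence` is an arbitrary choice (the demultiplexing step is told which column to read). The function name `write_example_metadata` is hypothetical.

```shell
# Hypothetical sample metadata: the sample IDs and barcode sequences below
# are placeholders and must be replaced with the real ones for the run.
write_example_metadata() {
    # $1: output path for the tab-separated metadata file
    {
        printf 'sample-id\tbarcode-sequence\n'
        printf 'sample-1\tAGCTGACTAGTC\n'
        printf 'sample-2\tACGATGCGACCA\n'
    } > "$1"
}

write_example_metadata sample-metadata.tsv
```

The first column holds one unique ID per sample and the remaining columns describe that sample. The metadata tutorial linked above covers the exact formatting rules.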

nbourre · April 2, 2019, 1:48pm

I'm currently in the Metadata tutorial and have not done the cutadapt. I'll be doing it later today.

The biologist told me that we don't have a barcode file, but only forward and reverse.

I have done the demux with my files with the help of another tutorial.

thermokarst · April 2, 2019, 1:52pm

I think maybe you misunderstood me --- I said you need to create a metadata file with information about which barcode maps to which sample --- this has nothing to do with a sequencing barcode file. There are at least 2 examples of these kinds of files in the many resources I have linked you to above.

Care to share? It isn't really helpful to any future readers of this thread if you don't provide some details about what worked for you.

nbourre · April 2, 2019, 6:18pm

Ok, I have done the "Moving Pictures" with my data up to the "FeatureTable and FeatureData summaries" part because I did not have the TSV file.

I also tried the "Fecal microbiota transplant (FMT) study: an exercise" tutorial, up to the "Diversity analysis" section, where I have no clue what to do.

I hope I'm clear enough, if not ask me.

thermokarst · April 2, 2019, 7:25pm

How did you demultiplex if you didn't have a barcode sequences file?

nbourre · April 2, 2019, 7:44pm

I think the data is already demuxed. This is my current script

#Import files
qiime tools import \
        --type 'SampleData[PairedEndSequencesWithQuality]' \
        --input-path manifest.csv \
        --output-path paired-end-demux.qza \
        --input-format PairedEndFastqManifestPhred33

#Create a visualization
qiime demux summarize \
        --i-data paired-end-demux.qza \
        --o-visualization paired-end-demux.qzv

I've been on this case for like 8 days (1 day/week). I still don't understand what to do.

nbourre · April 2, 2019, 7:57pm

If I send you sample files, could you give me a hand on what pipeline I could use?

Thanks

nbourre · April 2, 2019, 8:17pm

So here's my full bash script.

# Import files
# manifest.csv content: paths to the FASTQ files
qiime tools import \
        --type 'SampleData[PairedEndSequencesWithQuality]' \
        --input-path manifest.csv \
        --output-path paired-end-demux.qza \
        --input-format PairedEndFastqManifestPhred33

# Create a visualization of the imported files
qiime demux summarize \
        --i-data paired-end-demux.qza \
        --o-visualization paired-end-demux.qzv

# Quality filter
qiime quality-filter q-score \
        --i-demux paired-end-demux.qza \
        --o-filtered-sequences demux-filtered.qza \
        --o-filter-stats demux-filter-stats.qza

# Deblur
# TODO: What filtering should I use?
qiime deblur denoise-16S \
        --i-demultiplexed-seqs demux-filtered.qza \
        --p-trim-length 120 \
        --o-representative-sequences rep-seqs-deblur.qza \
        --o-table table-deblur.qza \
        --p-sample-stats \
        --o-stats deblur-stats.qza

# Renaming
mv rep-seqs-deblur.qza rep-seqs.qza
mv table-deblur.qza table.qza

# FeatureTable and FeatureData summaries
# TODO: Missing TSV

# Tree for phylogenetic diversity analysis
qiime phylogeny align-to-tree-mafft-fasttree \
        --i-sequences rep-seqs.qza \
        --o-alignment aligned-rep-seqs.qza \
        --o-masked-alignment masked-aligned-rep-seqs.qza \
        --o-tree unrooted-tree.qza \
        --o-rooted-tree rooted-tree.qza

# Alpha and beta diversity analysis
# TODO: Missing TSV

nbourre · April 3, 2019, 8:02pm

Is this a valid metadata file that can describe the samples? If so, I feel a bit dumb. I built this file by hand a long time ago. I just did not associate it with metadata... I just called it a manifest.

sample-id,absolute-filepath,direction
sample-1,$HOME/Desktop/XXX/data_test/MI.M03992_0107.001.BioOHT_1rc.SCI017689_PCR2-1-XXX_R1.fastq.gz,forward
sample-1,$HOME/Desktop/CNETE/data_test/MI.M03992_0107.001.BioOHT_1rc.SCI017689_PCR2-1-XXX_R2.fastq.gz,reverse
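
A manifest in the layout above could also be generated instead of typed by hand. The following is a rough sketch, assuming every sample has one `*_R1.fastq.gz` and one `*_R2.fastq.gz` file in the same directory and that the sample ID can simply be the file name without that suffix. The function name `make_manifest` is made up for this example.

```shell
# Sketch: print a manifest (sample-id,absolute-filepath,direction) for every
# *_R1.fastq.gz / *_R2.fastq.gz pair found in one directory.
# Assumption: the sample ID is the R1 file name minus the _R1.fastq.gz suffix.
make_manifest() {
    # $1: directory holding the FASTQ files; the manifest goes to stdout
    dir=$(cd "$1" && pwd) || return 1   # resolve to an absolute path
    printf 'sample-id,absolute-filepath,direction\n'
    for r1 in "$dir"/*_R1.fastq.gz; do
        [ -e "$r1" ] || continue        # nothing matched: skip the literal glob
        r2="${r1%_R1.fastq.gz}_R2.fastq.gz"
        if [ ! -e "$r2" ]; then
            printf 'warning: no R2 mate for %s\n' "$r1" >&2
            continue
        fi
        sample=$(basename "$r1" _R1.fastq.gz)
        printf '%s,%s,forward\n' "$sample" "$r1"
        printf '%s,%s,reverse\n' "$sample" "$r2"
    done
}
```

It would be called with the data directory as its only argument and the output redirected to manifest.csv. Long file names like the ones in this thread make unwieldy sample IDs, so the IDs may be worth shortening by hand afterwards.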

nbourre · April 3, 2019, 8:13pm

I think I'm getting somewhere. I just found out that the file name has a meaning that follows the Illumina conventions, and that the header lines inside the FASTQ file are some kind of descriptor. Maybe I do not need the metadata.
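
To make the descriptor concrete, the header line from the forward file shown earlier can be split into its parts. This is a small sketch, assuming the header follows the usual Illumina (Casava 1.8+) layout of `instrument:run:flowcell:lane:tile:x:y`, a space, then `read:filtered:control:index`. The function name `describe_header` is hypothetical.

```shell
# Sketch: split an Illumina-style FASTQ header into named fields.
describe_header() {
    # $1: one header line, i.e. the first line of a FASTQ record
    printf '%s\n' "$1" | awk '{
        sub(/^@/, "", $1)
        split($1, a, ":")   # instrument:run:flowcell:lane:tile:x:y
        split($2, b, ":")   # read:filtered:control:index
        printf "instrument=%s run=%s flowcell=%s lane=%s tile=%s x=%s y=%s\n",
               a[1], a[2], a[3], a[4], a[5], a[6], a[7]
        printf "read=%s filtered=%s control=%s index=%s\n",
               b[1], b[2], b[3], b[4]
    }'
}

describe_header '@M03992:107:000000000-AWNDM:1:2106:11512:1300 1:N:0:NTCACGTT'
```

None of these fields carry a sample name. The last one is the index (barcode) read recorded by the sequencer, which is not the same thing as sample metadata.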

thermokarst · April 4, 2019, 4:38pm

@nbourre --- I suggest you take a step back and regroup.

Is your source data multiplexed or demultiplexed? You mentioned above that you only had two files. My understanding is that you have two multiplexed files: one for forward reads, one for reverse reads. This kind of data needs a list of which barcodes map to which samples; that is the only way you can demultiplex it. Please note, this isn't specific to QIIME 2, it is a required step with any tool you use. So, you will need to assemble the list of which barcodes belong to which samples. There is no way forward without that, assuming your data is multiplexed. Once you have that, you can follow the q2-cutadapt demultiplexing protocol I linked to above, but let's wait on that and first sort out the confusion around your source data.
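
One rough way to get a hint from the files themselves is to tally the index sequence recorded at the end of each header line. Treat this as a heuristic sketch, not a definitive test. It assumes standard four-line FASTQ records and Illumina-style headers whose last colon-separated field is the index. The function name `count_indexes` is made up for this example.

```shell
# Sketch: count how often each index sequence appears in the header lines
# of a gzipped FASTQ file, most frequent first.
count_indexes() {
    # $1: path to a .fastq.gz file
    gzip -dc "$1" \
        | awk 'NR % 4 == 1 { n = split($0, f, ":"); print f[n] }' \
        | sort | uniq -c | sort -rn
}
```

If a single index (plus a few near-identical variants caused by sequencing errors) accounts for nearly all reads, the file was probably already demultiplexed by the sequencing facility. Many different, well-represented indexes would suggest a multiplexed file. Barcodes that sit inside the reads themselves would not show up this way, so the facility that produced the data is still the best source of truth.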