Multiple data types for import, which one to use?

LotteG · March 29, 2021, 1:10pm

Hi everyone,

After searching some time on the forum, I've come to the conclusion that my question might be worth adding to it!

I have received data from 16S sequencing, and I've used Qiime2 with Casava1.8 before to get some great results by following the Moving Pictures tutorial, but I'm a bit stuck on the importing. I received multiple types of data to work with, which I will list below:

result/
|-- 00.RawData/	[Raw reads and merged reads]
|	|-- Sample_Name/	[Raw data and merged pair-end reads for each sample]
|	|	|--*_1.fq.gz	[Read 1 sequences with barcode and primer removed]
|	|	|--*_2.fq.gz	[Read 2 sequences with barcode and primer removed]
|	|	|--*.raw_1.fq.gz	[Read 1 sequences with barcodes and primers]
|	|	|--*.raw_2.fq.gz	[Read 2 sequences with barcodes and primers]
|	|	`--*.extendedFrags.fastq	[Raw Tags after reads merging]
|	|-- SampleSeq_info.xls	[List of barcodes and primers]
|	`--assembl_stat.xls	[Statistical form for reads merging process of all samples]
|-- 01.CleanData/	[Quality-controlled tags information]
|	|-- Sample_Name/	[Results of quality control for each sample]
|	|	|--*.fastq	[Clean tags(FASTQ format)]
|	|	|--*.fna	[Clean tags(FASTA format)]
|	|	`-- histograms.txt	[Length distribution of clean tags]
|	`-- QCstat.xls	[Statistical table for data pre-processing and quality control]

But I'm quite stuck on which one to use for the taxonomic analysis. These are the options I considered:

- Use the clean FASTA data, but I don't know how to get further than creating the FeatureData[Sequence] artifact. For clustering I would need a FeatureTable[Frequency], and I don't know how to get there.
- Use the complete raw sequences and start from scratch, but I can't work out the import type because the files are just named 'sample name.fq', and I'm not sure whether to use Phred33 or Phred64.
- Use the clean FASTQ data, but skip the trimming step because it has already been cleaned? I would still need the artifacts from the denoising step to proceed with the taxonomic analysis.

I feel like I'm missing something very obvious here, but I've been breaking my head over this for a week, so help would be much appreciated at this point!

Kind regards and thanks in advance for taking the time to read this,

Lotte

jwdebelius · March 29, 2021, 3:49pm

Hi @LotteG,

Welcome to the QIIME 2 forum! Thanks for searching for your answer first.

Unfortunately, I'm not sure I can offer you direct advice on your question, because I don't know how the file set was generated; you need to discuss that with whoever generated the files upstream of you. So, I would ask how the "clean" fastq files were prepared. It's not great to guess what someone else (occasionally past me included) did to their data unless it's documented.

As far as importing goes, once you figure out which fastq to use, I'd import via the manifest format. I'd probably start by trying Phred33; if it's wrong you'll get an error message and then you can just change the import. (Error messages are a totally normal part of an analysis!)
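To make the manifest step a bit more concrete, here is a minimal Python sketch, not an official QIIME 2 tool. It assumes the directory layout from the first post, one read pair per sample folder, and plain four-line FASTQ records. Purely for illustration it picks the `*_1.fq.gz` / `*_2.fq.gz` pairs and skips the `.raw_` files; which files are actually the right ones still depends on how they were generated. The function names `guess_phred_offset` and `write_manifest` are made up for this example. The header line follows the paired-end V2 manifest layout (tab-separated, absolute file paths).

```python
import gzip
from pathlib import Path


def guess_phred_offset(fastq_gz, max_records=10000):
    """Rough guess of the quality offset from the lowest quality character.

    Assumes four-line FASTQ records. A result of 64 is only a guess:
    a Phred33 file containing nothing but high scores looks the same.
    """
    lowest = None
    with gzip.open(fastq_gz, "rt") as handle:
        for i, line in enumerate(handle):
            if i >= 4 * max_records:
                break
            if i % 4 == 3:  # every fourth line is the quality string
                qual = line.rstrip("\n")
                if qual:
                    low = min(map(ord, qual))
                    lowest = low if lowest is None else min(lowest, low)
    if lowest is None:
        return None  # no quality lines found
    # Characters below ';' (ASCII 59) do not occur in offset-64 encodings.
    return 33 if lowest < 59 else 64


def write_manifest(raw_dir, manifest_path):
    """Write a paired-end manifest with one row per sample folder."""
    rows = []
    for sample_dir in sorted(p for p in Path(raw_dir).iterdir() if p.is_dir()):
        for fwd in sorted(sample_dir.glob("*_1.fq.gz")):
            if fwd.name.endswith(".raw_1.fq.gz"):
                continue  # assumption: skip reads that still carry barcodes/primers
            rev = fwd.with_name(fwd.name[: -len("_1.fq.gz")] + "_2.fq.gz")
            if rev.exists():
                # The sample ID is taken from the folder name.
                rows.append((sample_dir.name, fwd.resolve(), rev.resolve()))
    with open(manifest_path, "w") as out:
        out.write("sample-id\tforward-absolute-filepath\treverse-absolute-filepath\n")
        for sample_id, fwd, rev in rows:
            out.write(f"{sample_id}\t{fwd}\t{rev}\n")
    return len(rows)
```

Calling `write_manifest("result/00.RawData", "manifest.tsv")` would produce a file that can be passed to `qiime tools import` with the type `SampleData[PairedEndSequencesWithQuality]` and the input format `PairedEndFastqManifestPhred33V2` (or `PairedEndFastqManifestPhred64V2` if Phred33 turns out to be wrong).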

Once you're there, you can denoise your data. The Atacama Soils tutorial gives an example of how to do that with paired-end sequences, and there's always lots of discussion on the forum about what to do.

Then, you'll be ready to classify taxonomy.

Best,
Justine

LotteG · March 30, 2021, 7:40am

Thank you so much for your quick answer Justine! I'll give the manifest format Phred33 a go and see how far I get with the Atacama Soils tutorial in combination with the discussion mentioned there.

Cheers,
Lotte
