Importing raw data - directory issue

Cybele_C · July 23, 2018, 2:30am

Hello! I've been stuck on importing my Illumina data for a few days, and it probably relates to the pathway. I've had two types of error messages, below:

(qiime2-2018.6) Cybeles-MacBook-Pro:qiime2-skate3 cybelecollins$ qiime tools import \
--type EMPSingleEndSequences \
--input-path /Users/cybelecollins/qiime2-skate3/emp-paired-end-sequences/barcodes.fastq.gz \
--output-path skate.qza
There was a problem importing /Users/cybelecollins/qiime2-skate3/emp-paired-end-sequences/barcodes.fastq.gz:

Importing 'EMPSingleEndDirFmt' requires a directory, not /Users/cybelecollins/qiime2-skate3/emp-paired-end-sequences/barcodes.fastq.gz

And

(qiime2-2018.6) Cybeles-MacBook-Pro:qiime2-skate3 cybelecollins$ qiime tools import \
--type EMPSingleEndSequences \
--input-path qiime2-skate3/emp-paired-end-sequences/barcodes.fastq.gz \
--output-path skate.qza
Usage: qiime tools import [OPTIONS]

Error: Invalid value for "--input-path": Path "qiime2-skate3/emp-paired-end-sequences/barcodes.fastq.gz" does not exist.

I have created the directory, activated it (mkdir and cd commands), and moved my raw data and barcode metadata files into the folder.

Thank you for your help

John_Chase · July 23, 2018, 10:28pm

Hello!

I think you are actually quite close to the correct command; however, I believe you have one small error.

In the first command, the path you are providing appears to be correct, but it points to a specific file, whereas the import type you are passing expects a path to a directory. For example:

qiime tools import \
--type EMPSingleEndSequences \
--input-path /Users/cybelecollins/qiime2-skate3/emp-paired-end-sequences \
--output-path skate.qza

In the above example, the emp-paired-end-sequences directory should contain a file named sequences.fastq.gz and a second file named barcodes.fastq.gz. Notice how I removed the file name from the path.
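If it helps, the directory requirement described above can be checked before running the import. This is only a minimal Python sketch, not part of QIIME 2: the function name `missing_emp_files` is made up for illustration, and the two required file names are taken from the description above.

```python
from pathlib import Path

# File names the import directory is expected to contain (from the post above).
REQUIRED_FILES = ("sequences.fastq.gz", "barcodes.fastq.gz")


def missing_emp_files(input_dir):
    """Return the required file names that are absent from input_dir.

    Hypothetical helper for illustration; raises if input_dir is a file
    or does not exist, mirroring the "requires a directory" complaint.
    """
    directory = Path(input_dir)
    if not directory.is_dir():
        raise NotADirectoryError(f"{directory} is not a directory")
    return [name for name in REQUIRED_FILES if not (directory / name).is_file()]
```

An empty list means both files are in place; passing the path to barcodes.fastq.gz itself raises an error, which is the same mistake the first command made.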

In the second command the error you are getting is letting you know that the file is not found at all. You can check this for yourself by running:
ls qiime2-skate3/emp-paired-end-sequences/barcodes.fastq.gz

If the file is not present, the terminal will output:
ls: cannot access 'qiime2-skate3/emp-paired-end-sequences/barcodes.fastq.gz': No such file or directory

If the file is present it will simply return the path.
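One possible reason the relative path is not found: the shell prompt in the second command suggests the working directory was already qiime2-skate3, in which case the relative path would be looked up inside it. That is an assumption (the thread does not confirm the working directory), and `resolve_from` below is a made-up helper that only illustrates how a relative path is joined onto the current directory.

```python
from pathlib import PurePosixPath


def resolve_from(cwd, path):
    """Return the location a relative `path` refers to when typed inside `cwd`.

    An absolute `path` is returned unchanged, as in the shell.
    """
    return PurePosixPath(cwd) / path


# Assumed working directory, based on the prompt shown in the question.
cwd = "/Users/cybelecollins/qiime2-skate3"

# The path as typed repeats the directory name, so it points one level too deep.
print(resolve_from(cwd, "qiime2-skate3/emp-paired-end-sequences/barcodes.fastq.gz"))

# Dropping the leading directory name gives the intended location.
print(resolve_from(cwd, "emp-paired-end-sequences/barcodes.fastq.gz"))
```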

Hope this helps!

Cybele_C · July 26, 2018, 3:27pm

Thanks very much! I did not have the right kind of file for barcodes, but I might have a more serious issue. The sequencing facility works with BaseSpace and when my PI wrote to them, the response was problematic - enough that I might have to start a different thread and search more for what I can do now:

"BaseSpace doesn't generate an index read file. The data just comes demultiplexed. I've been trying to get my off-instrument computer to generate a index read file for you, but it looks like a necessary file for this may have been corrupted or lost during data transfer. I'll keep working on this, but I'm not sure how successful I'll be. I'm not sure what you're looking at in the Fastq files, but I think that the Illumina software does allow for a 1 bp difference to still assign it to the correct index."

Cybele_C · July 26, 2018, 3:27pm

Perhaps this is similar to these issues, in working with demultiplexed data? Is this a manifest file?

Thanks!

thermokarst · July 26, 2018, 3:29pm

Yep! There is no requirement that you work with multiplexed data - in fact, according to an informal survey we did here, most people seem to jump into QIIME 2 directly with demux data. Sounds like you are on the right track!

Cybele_C · July 30, 2018, 4:24pm

I am working with MiSeq samples from BaseSpace that have R1 and R2, and trying to find a protocol since the BaseSpace data is demultiplexed. I did make a metadata file with indexes, but am not sure where this fits.

These are manifest files, as I understand, and I am entering these commands:

sample-id,absolute-filepath,direction
sample-1,/Users/cybelecollins/qiime2-skate3/paired-end-sequences/CC1_S1_L001_R1_001.fastq.gz,forward
sample-2,/Users/cybelecollins/qiime2-skate3/paired-end-sequences/CC1_S1_L001_R2_001.fastq.gz,reverse

But for each file, I get (qiime2-2018.6) Cybeles-MacBook-Pro:~ cybelecollins$ sample-1,/Users/cybelecollins/qiime2-skate3/paired-end-sequences/CC1_S1_L001_R1_001.fastq.gz,forward
-bash: sample-1,/Users/cybelecollins/qiime2-skate3/paired-end-sequences/CC1_S1_L001_R1_001.fastq.gz,forward: No such file or directory

although the ls function shows the file.

Thanks for your help!

thermokarst · July 30, 2018, 10:56pm

Hi @Cybele_C!

Let's take a step back here and rehash.

The metadata file with indexes is not necessary, since your data is already demultiplexed (meaning, split by sample). The only reason you would need barcodes in your metadata file is if you had multiplexed reads that still needed to be demultiplexed (the barcode is how you tell the computer which sample a particular read belongs to).

The manifest file is a file, not a command. As the docs say, this should be a CSV file (comma-separated values) --- this file just tells QIIME 2 which file belongs to which sample, and that file's read orientation.
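To make the "file, not a command" point concrete, here is a minimal Python sketch that writes such a CSV. A text editor or spreadsheet works just as well; the function name `write_manifest` is made up for illustration, and the header row is the one quoted in the question above.

```python
import csv

# Header row as quoted earlier in this thread.
HEADER = ["sample-id", "absolute-filepath", "direction"]


def write_manifest(manifest_path, rows):
    """Write a manifest CSV with one (sample-id, absolute-filepath, direction) row per read file."""
    with open(manifest_path, "w", newline="") as handle:
        writer = csv.writer(handle, lineterminator="\n")
        writer.writerow(HEADER)
        writer.writerows(rows)
```

Calling `write_manifest("Manifest.csv", rows)` with a list of three-item tuples produces a file on disk; that file's path is what the import command is then pointed at, rather than typing the rows at the prompt.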

Please double-check the documentation.

Keep us posted. :qiime2:

Cybele_C · August 17, 2018, 12:12am

Hello again - I'm afraid that I'm still at the point from two weeks ago. I've attached what I see on my screen. I think this has to do with file paths. Thanks!

thermokarst · August 17, 2018, 1:27pm

Hey there @Cybele_C!

I think you have everything you need right there in that error message!

Path "Manfest1" does not exist.

Based on the screenshot you provided, I agree with that error! I don't see a file called "Manifest" --- but I do see one called "Manifest.csv".

Once you fix that part of your command, you will bump into another problem related to the paths listed in your manifest file:

/Users/cybelecollins/qiime2-skate3/paired-end-sequences/CC1_S1_L001_R1_001.fastq.gz
/Users/cybelecollins/qiime2-skate3/paired-end-sequences/CC1_S1_L001_R2_001.fastq.gz

Those files don't look like they exist to me, either: there are no files sitting in the paired-end-sequences dir in your screenshot! Perhaps you meant to write:

/Users/cybelecollins/qiime2-skate3/paired-end-sequences/CC1-6/...

I assume there are files in that dir, but I don't know what their names are because the screenshot hasn't expanded that folder.

Okay, the last thing about your manifest that jumped out at me --- you listed the forward reads for pair CC1 as sample-1 and the reverse reads as sample-2 --- this will import as two separate samples, but I suspect what you really want is to import the pair as sample CC1. Just update the first column to have the sample name for both rows --- whatever name you give it here is the name the sample will have for the rest of the analysis.
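As a rough illustration of pairing both reads under one sample name, the sketch below builds manifest rows from the files that actually exist in a directory. It is only a sketch under assumptions: `manifest_rows` is a made-up helper, and the file-name pattern is inferred from names like CC1_S1_L001_R1_001.fastq.gz quoted in this thread, so other naming schemes would need a different pattern.

```python
import re
from pathlib import Path

# Assumed naming pattern: <sample>_S<n>_L<lane>_R<1|2>_001.fastq.gz
ILLUMINA_NAME = re.compile(
    r"^(?P<sample>.+)_S\d+_L\d{3}_R(?P<read>[12])_001\.fastq\.gz$"
)
DIRECTION = {"1": "forward", "2": "reverse"}


def manifest_rows(fastq_dir):
    """Return (sample-id, absolute-filepath, direction) tuples for files found in fastq_dir.

    R1 and R2 files from the same pair get the same sample-id, so they
    import as one paired sample instead of two separate samples.
    """
    rows = []
    for path in sorted(Path(fastq_dir).glob("*.fastq.gz")):
        match = ILLUMINA_NAME.match(path.name)
        if match is None:
            continue  # skip files that do not follow the assumed pattern
        rows.append((match["sample"], str(path.resolve()), DIRECTION[match["read"]]))
    return rows
```

Because the rows are built from a directory listing, every path in them points at a file that exists, which also sidesteps the missing-path problem mentioned above.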

Keep us posted! :qiime2:

Cybele_C · August 17, 2018, 3:51pm

Thanks! I wonder now if there's a problem with the files themselves, since even the Illumina analysis showed very low PF scores and the sequencing facility admitted to a possible loading issue. I got this message:

An unexpected error has occurred:
Decoded Phred score is out of range [0, 62].

thermokarst · August 17, 2018, 3:56pm

This error is because you specified PairedEndManifestPhred64 in your import command --- this should only be used if you know that your reads are phred 64 encoded. Most likely they aren't, so you should use PairedEndManifestPhred33. Please take some time to review the docs - this info is listed there.
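For anyone unsure which encoding their reads use, a quick heuristic is to look at the lowest quality character in the file. With an offset of 64, every quality character has an ASCII code of at least 64, so anything lower rules that encoding out. The sketch below is only a heuristic under assumptions: `guess_phred_offset` is a made-up helper (not a QIIME 2 or scikit-bio function), it assumes plain four-line FASTQ records, and a result of 64 is suggestive rather than proof.

```python
import gzip


def guess_phred_offset(fastq_gz_path, max_records=10000):
    """Guess the quality offset (33 or 64) from the lowest quality character seen.

    Assumes each record is exactly four lines, with qualities on the fourth.
    """
    lowest = None
    with gzip.open(fastq_gz_path, "rt") as handle:
        for line_number, line in enumerate(handle):
            if line_number // 4 >= max_records:
                break
            if line_number % 4 == 3:  # quality line of the record
                quality = line.strip()
                if quality:
                    low = min(map(ord, quality))
                    lowest = low if lowest is None else min(lowest, low)
    if lowest is None:
        raise ValueError("no quality lines found")
    # A character below '@' (ASCII 64) would decode to a negative score
    # with offset 64, so the file cannot be Phred64 encoded.
    return 33 if lowest < 64 else 64
```

A result of 33 corresponds to importing with PairedEndManifestPhred33, as suggested above.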

Cybele_C · August 17, 2018, 4:00pm

Thank you - I think what is relevant is this link, not in the main doc but linked: http://scikit-bio.org/docs/latest/generated/skbio.io.format.fastq.html#quality-score-variants

Cybele_C · August 17, 2018, 4:25pm

It did work. Thank you!

With total gratitude, and perhaps to reduce the volume of questions: if you are updating this doc anytime and would like to make it more accessible to those without much experience, a few tweaks might reduce errors. These can be hard to catch when something is new and overwhelming. For example, the tutorial gives "pe-64-manifest" as the example rather than "pe-64-manifest.csv"; that might be obvious to some people, since it is the full file name, but not to someone imitating the example blindly. Also, the main doc does link to a way to determine the format, but only by implication (a highlighted link) rather than as clear direction within the main text.

But yes, for the most part, the information is all there, and hopefully I will be able to self-correct more as familiarity increases. One just feels very blind at first, so this forum has been invaluable.