Importing a single fastq file into QIIME2

Hi,

I have a single fasta file and a qual file from 454 sequencing that I want to import into QIIME2.
Following previous posts, I converted these two files into a single fastq file in qiime1. I see that the next recommendation is to make a manifest file, but my fastq file has the joined reads in one file rather than in separate forward and reverse files. What would be the best approach to get these into qiime2?

I’m assuming the next step is to split my fastq file into a forward.gz file and a reverse.gz and then import these and the mapping file into qiime2.

Here’s a printout of the fastq file I converted in qiime1:

(base) [fastq_files]$ head DS.fastq
@M02542:94:000000000-AH3JY:1:2116:18350:1966
CTTTTAATAGGGTTTGATCATGGCTCAGGATGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGTACGAAGTAGCTTGCTACTTAGTGGCGAACGGGTGAGTAACACGTAGGTAATCTGCCCTTATGACGAGGATAACTATTGGAAACGATAGCTAATACTGGATAGGATAATATTTCGCATGATATATTATTTAAAGATCCGTTTGGATCACGTAAGGAGGAACCTGCGGCGCATTAGCTAGTTGGTAAGGTAACGGCTAACCAAGGCAATGATGCGTAGCCGTACTGAGAGGTTGAACGGCCACATTGGGACTGAGACACGGCCCAAACTCCTACGGGAGGCAGCAGTAGGGAATTTTCGGCAATGGAGGAAACTCTGACCGAGCAACGCCGCGTGAATGAAGAAGTATTTCGGTATGTAAAATTCTTTTATTAGGGAAGAACTGACTTAGTAGGAAATGACTAGGTTTTGACGGTACCTAATGAATAAGCCCCGGCTAACTACGTGCCAGCCGCCGCGGTAAGAC
+
CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG[email protected]FFGGFGGGGDGGGGGGGFEGGGGGCFGGGGGGGGGGGGGGGGGGFGGGGG=EFECGGCFCGGFGGGGGGGGGDGGGG9<F?CE+A7:AEJJJJJJJG?JJ7JJ,[email protected]%&?IJDJJJJJA%:[email protected]JC5E-:JGJ8(88*?A3CGFG?)?,4FFDGFD:F9GFFGGGFFE9AFE:GGGGGDCCFCFEC9FA=<:,[email protected]=BD=,[email protected]GDCGEGGGGGGGGGGGGGGGGGGGGGGDGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCCCCC
@M02542:94:000000000-AH3JY:1:1114:25940:7285
CGTAGATAAGAGTTTGATCATGGCTCAGGACGAACGCTGGCGGCGCGCCTAACACATGCAAGTCGAACGAAGCGAAGGAGCTTGCTCCCTAGCTTAGTGGCGAACGGGTGAGTAACGCGTGAGTAACCTGCCTTAAAGAGGGGGACAACAGTTGGAAACGACTGCTAATACCGCATAAGCCCACGGTGCCGCATGGCACAGAGGGAAAAGGAGCAATCCGCTTTAAGATGGACTCGCGTCTGATTAGCTAGTTGGTGGGGTAATGGCCTACCAAGGCGACGATCAGTAGCCGGACTGAGAGGTTGAACGGCCACATTGGGACTGGGACACGGCCCAGACTCCTACGGGAGGCAGCAGTGGGGGATATTGCACAATGGGGGAAACCCTGATGCAGCGACGCCGCGTGGAGGAAGAAGGTTTTCGGATCGTAAACTCCTGATATAGATGACGAAACAAATGACGGTAATCTATAAGAAAGTGACGGCTAACTACGTGCCGGCAGCCGCTGTAACAC
+
CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG[email protected]JJJ9J-JJJJJJJJJJJJJJJJJJJJJJJ>JJJJJJJJJJDJJJJJJJJJJJJFFF8;:):GFF7C<GFF?CC9CCF=,GFDGEC7
C=,FFD,C>E:DGGGGGGGGFGGGDGGDFF=FGGGGGGGGGGEFGFGGGGGGFEGGGGGGGGFBGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGEGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCCCCC
@M02542:94:000000000-AH3JY:1:2101:19996:11450
CTGGATCCAGGGTTTGATCATGGCTCAGGATGAACGCTAGCGACAGGCTTAACACATGCAAGTCGAGGGTTAACATGATGGTTGCTTGCAACCATTGATGACGACCGGCGCACGGGTGAGTAACGCGTATGTAACCTGCCTTATACAGGGGGATAGCCCATGGAAACGTGGATTAACACCGTATAATACTATGATTAGGCATCTAATTTTAGTTAAATATTTATAGGTATAAGATGGGCATACGTCCTATTAGATAGTTGGTGAGGTAACGGCTCACCAAGTCATCGATAGGTAGGGGTTCTGAGAGGAAGATCCCCCACACTGGTACTGAGACACGGACCAGACTCCTACGGGAGGCAGCAGTGAGGAATATTGGTCAATGGACGCAAGTCTGAACCAGCCAAGTCGCGTGAAGGATGACGGTTCTATGGATTGTAAACTTCTTTTATTTGGGAAGAATAATAACTACGTGTAGTTAGATGCCTGTACCAAATGAATAAGCATCGGCTAACTCCGTGCCAGCCGCCGCTGTAATAC

Any help would be much appreciated!
Sam

Hello Sam,

I think some detective work is in order!

Let’s start here: Are you sure your 454 data is paired reads, and not just forward reads or mixed orientation reads? I didn’t think 454 reads could be paired… but I could be wrong!

Colin

Hi Colin,

Sorry, I misspoke. They are Illumina reads, but I think the company, MRDNA, converted them to joined reads.

Thanks,
Sam

Hello Sam,

Ah OK. That makes sense.

Did MRDNA join the paired reads, or did it ‘interleave’ them? If these are joined, you could request the unjoined reads from the company, or you could process them as single end reads. If these are interleaved, you could convert this back into two, normal, non-interleaved fastq files, then import that like normal.

All good options, but we have to verify the format first.
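One quick way to check is to compare the read IDs of consecutive records: in an interleaved file each forward read is immediately followed by its mate, so IDs repeat in pairs, while a joined or single-end file has a different ID on every record. Here is a minimal Python sketch, assuming plain 4-line FASTQ records. The function names and the toy reads are made up for illustration.

```python
import io
from itertools import islice


def read_ids(handle, limit=1000):
    """Yield the ID (header text before the first space) of up to `limit` records."""
    for i, line in enumerate(islice(handle, limit * 4)):
        if i % 4 == 0:  # header lines are lines 0, 4, 8, ... of a 4-line FASTQ
            yield line[1:].split()[0]


def strip_mate_suffix(read_id):
    """Drop an old-style /1 or /2 mate suffix so both mates compare equal."""
    return read_id[:-2] if read_id.endswith(("/1", "/2")) else read_id


def looks_interleaved(handle, limit=1000):
    """Return True if records come in consecutive pairs sharing one read ID."""
    ids = [strip_mate_suffix(r) for r in read_ids(handle, limit)]
    if len(ids) < 2:
        return False
    return all(a == b for a, b in zip(ids[0::2], ids[1::2]))


# Toy data with hypothetical read names, not taken from the real file.
interleaved = io.StringIO("@r1 1:N:0\nACGT\n+\nIIII\n@r1 2:N:0\nTTTT\n+\nIIII\n")
joined = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n")
print(looks_interleaved(interleaved))
print(looks_interleaved(joined))
```

For a real file, pass an open text handle, for example `open("DS.fastq")`. The three headers pasted at the top of this thread are all different, which is what a joined (or single-end) file would look like.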

Colin

Hi Colin,

The protocol they sent us uses the term “joined”, so I would assume they are. I don’t think we can ask for the original data, as this was years ago. The lines I pasted also suggest they are joined, no?
Thanks,
Sam

They sure look joined to me!

With that in mind, I would import them using the Fastq manifest format. Treat your reads as ‘single-end reads’ as they are basically single end from the perspective of the pipeline.
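For reference, here is a small Python sketch that writes a single-end manifest. It assumes the comma-separated layout used by the 2019.7-era importing tutorial (columns `sample-id`, `absolute-filepath`, `direction`); the sample ID `DS` and the file name are placeholders, and the helper names are made up.

```python
import csv
import os


def manifest_rows(samples):
    """Build manifest rows from a dict of sample ID -> FASTQ path.

    Assumed layout: header row, then one 'forward' row per sample,
    with paths made absolute.
    """
    rows = [["sample-id", "absolute-filepath", "direction"]]
    for sample_id, fastq_path in samples.items():
        rows.append([sample_id, os.path.abspath(fastq_path), "forward"])
    return rows


def write_manifest(samples, manifest_path):
    """Write the manifest rows to a CSV file."""
    with open(manifest_path, "w", newline="") as fh:
        csv.writer(fh).writerows(manifest_rows(samples))


# Placeholder sample ID and file name.
for row in manifest_rows({"DS": "DS.fastq"}):
    print(",".join(row))
```

The resulting file would then be passed to `qiime tools import` with `--type 'SampleData[SequencesWithQuality]'` and `--input-format SingleEndFastqManifestPhred33`. Please check the importing tutorial for your release, since newer releases also accept a tab-separated manifest layout.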

Let me know how well this works for you. Keep in mind that your analysis may be somewhat limited by the fact that you only have a single sample, but you should totally be able to import a single sample.

Colin

Hi Colin,

By “single fasta file/qual file” I meant that I have one file that contains all ~100+ samples from the entire run. Would I still be able to do this with the fastq manifest format? From the tutorial, it looks like I would have to split my file into multiple files, each containing one sample.

Thank you!
Sam

Got it!

OK, ok… How can you tell which reads are from which samples based on the files you do have? Do you also have an index file?

Colin

Hi Colin,

Yes there is a corresponding mapping file with the barcodes which I’ve attached. Could this allow for separating the fasta file into multiple files by sample?

070615JC27F-mapping2.txt (9.0 KB)

That looks like a great mapping file, but I’m not sure how best to use it with your complete, paired, fastq file. Do these primers appear as a subsequence of your paired reads? Or maybe their reverse complement appears somewhere in your reads?

We are back to the mystery, so any clues you find could be very important!

Wish us luck!
Colin

Hi Colin,

Yes, according to MRDNA’s protocol the fasta file contains the barcodes, so they should be in there. I searched for two different barcodes in the fasta file: one showed up 80,000+ times and the other about 100,000+ times. There are a total of 6M reads. Hope this clue helps!
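A plain text search counts a barcode anywhere in a read, so it can also help to check how often each barcode sits at the very start of a read. Here is a minimal Python sketch of such a count, assuming 4-line FASTQ records. The helper name and the toy barcodes below are hypothetical; real barcodes would come from the BarcodeSequence column of the mapping file.

```python
import io
from collections import Counter
from itertools import islice


def count_barcode_hits(handle, barcodes, max_reads=None):
    """Count reads that start with, or merely contain, each barcode.

    `barcodes` maps sample ID -> barcode sequence.
    Returns (total reads scanned, hits at read start, hits anywhere).
    """
    seq_lines = islice(handle, 1, None, 4)  # sequence is line 2 of each record
    if max_reads is not None:
        seq_lines = islice(seq_lines, max_reads)
    total = 0
    at_start, anywhere = Counter(), Counter()
    for line in seq_lines:
        seq = line.strip().upper()
        total += 1
        for sample, barcode in barcodes.items():
            if seq.startswith(barcode):
                at_start[sample] += 1
            if barcode in seq:
                anywhere[sample] += 1
    return total, at_start, anywhere


# Toy reads and made-up barcodes, only to show the call.
toy = io.StringIO(
    "@r1\nAAAACCGT\n+\nIIIIIIII\n"
    "@r2\nGGGGTTAA\n+\nIIIIIIII\n"
    "@r3\nTTAAAACC\n+\nIIIIIIII\n"
)
total, at_start, anywhere = count_barcode_hits(toy, {"s1": "AAAA", "s2": "GGGG"})
print(total, dict(at_start), dict(anywhere))
```

If most hits are at the start of the read, the barcodes are probably still attached to the 5' end; if many hits are only found elsewhere, some of them may be chance matches inside the amplicon or barcodes in reverse orientation.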
Sam

Indeed!

OK, so if the barcodes are in the sequences, try one of these:
https://docs.qiime2.org/2019.7/plugins/available/cutadapt/demux-single/
https://docs.qiime2.org/2019.7/plugins/available/cutadapt/demux-paired/

Good luck! I think we are almost there!
Colin

Hi Colin,

So I first tried to import as paired-end, which ended up not working (I split the fastq file into forward.gz and reverse.gz files by separating every 4 lines). The import worked but the actual cutadapt step did not. I got this error:

(qiime2-2019.7) [fastq_files]$ qiime tools import \
  --type MultiplexedSingleEndBarcodeInSequence \
  --input-path import \
  --output-path multiplexed-seqs-single-fullfastq.qza
Traceback (most recent call last):
  File "/data/anaconda/envs/qiime2-2019.7/lib/python3.6/site-packages/q2cli/builtin/tools.py", line 154, in import_data
    view_type=input_format)
  File "/data/anaconda/envs/qiime2-2019.7/lib/python3.6/site-packages/qiime2/sdk/result.py", line 241, in import_data
    validate_level='max')
  File "/data/anaconda/envs/qiime2-2019.7/lib/python3.6/site-packages/qiime2/sdk/result.py", line 267, in _from_view
    result = transformation(view, validate_level)
  File "/data/anaconda/envs/qiime2-2019.7/lib/python3.6/site-packages/qiime2/core/transform.py", line 68, in transformation
    self.validate(view, validate_level)
  File "/data/anaconda/envs/qiime2-2019.7/lib/python3.6/site-packages/qiime2/core/transform.py", line 143, in validate
    view.validate(level)
  File "/data/anaconda/envs/qiime2-2019.7/lib/python3.6/site-packages/qiime2/plugin/model/directory_format.py", line 171, in validate
    getattr(self, field)._validate_members(collected_paths, level)
  File "/data/anaconda/envs/qiime2-2019.7/lib/python3.6/site-packages/qiime2/plugin/model/directory_format.py", line 101, in _validate_members
    self.format(path, mode='r').validate(level)
  File "/data/anaconda/envs/qiime2-2019.7/lib/python3.6/site-packages/qiime2/plugin/model/file_format.py", line 24, in validate
    self._validate_(level)
  File "/data/anaconda/envs/qiime2-2019.7/lib/python3.6/site-packages/q2_types/per_sample_sequences/_format.py", line 279, in _validate_
    self._check_n_records(record_count_map[level])
  File "/data/anaconda/envs/qiime2-2019.7/lib/python3.6/site-packages/q2_types/per_sample_sequences/_format.py", line 239, in _check_n_records
    for i, record in file_:
  File "/data/anaconda/envs/qiime2-2019.7/lib/python3.6/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0x8b in position 1: ordinal not in range(128)

An unexpected error has occurred:

  'ascii' codec can't decode byte 0x8b in position 1: ordinal not in range(128)

So then I tried importing as single end which worked.

qiime tools import \
  --type MultiplexedSingleEndBarcodeInSequence \
  --input-path import \
  --output-path multiplexed-seqs-single.qza

qiime cutadapt demux-single \
  --i-seqs multiplexed-seqs-single.qza \
  --m-barcodes-file DSmapping2.txt \
  --m-barcodes-column BarcodeSequence \
  --o-per-sample-sequences demultiplexed-seqs-single.qza \
  --o-untrimmed-sequences untrimmed-single.qza \
  --verbose

But then I get this result from the demux.qzv

This implies that the forward and reverse sequences are joined on the same line in the fasta file, right?
Thanks,
Sam

Hello again Sam,

This error is an easy one:

Either your fastq file got messed up during splitting, or the quality score offset needs to be adjusted (say, to 64 instead of 33).
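One more clue on that error: byte 0x8b in position 1 matches the second byte of the gzip magic number (0x1f 0x8b), so it may also be worth checking how the file handed to the importer was compressed (not at all, once, or twice by accident). Here is a small Python sketch of that check; `count_gzip_layers` is a made-up helper name and the record is toy data.

```python
import gzip
import io

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def count_gzip_layers(stream, max_layers=3):
    """Return how many nested gzip layers wrap a seekable binary stream."""
    layers = 0
    while layers < max_layers:
        head = stream.read(2)
        stream.seek(0)  # rewind so the next layer sees the full stream
        if head != GZIP_MAGIC:
            break
        # mode is given explicitly because the wrapped object may itself be a GzipFile
        stream = gzip.GzipFile(fileobj=stream, mode="rb")
        layers += 1
    return layers


record = b"@read1\nACGT\n+\nIIII\n"  # toy record, not real data
print(count_gzip_layers(io.BytesIO(record)))
print(count_gzip_layers(io.BytesIO(gzip.compress(record))))
print(count_gzip_layers(io.BytesIO(gzip.compress(gzip.compress(record)))))
```

For a real file, open it in binary mode (`open(path, "rb")`) and pass the handle. A count of 2 or more would mean the file was gzipped more than once, which could leave gzip bytes where the importer expects ASCII text.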


This one is more interesting:

Looks like it! (Illumina doesn’t have score drop-offs like this unless something went very wrong during the run.) And those super high q-scores around positions 225 to 275 could explain the “not in range” error you got earlier.

As far as I can tell, q2-cutadapt doesn’t let you change the phred offset, so I opened an issue!

Let’s see what the qiime devs recommend!

Colin

Hi Colin,

So just to confirm: is my fasta file in the phred-64 format? I thought it was 33 because of the way it came out on the demux.qzv plot.
If it is phred-64, can I convert it to 33 in qiime1 and import it that way?
Thanks for all the help and I hope I’m not dragging this issue on too long!
Thanks,
Sam

I’m not sure about your file… I was just trying to find a way to address the crazy high q-scores in part of your reads.

You could run vsearch --fastq_convert to convert your files. And of course, you could install vsearch with conda install vsearch.

EDIT: @Mehrbod_Estaki reminded me of the vsearch --fastq_chars command which will guess 33 vs 64 for fastq files. Try that too!
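The same kind of guess can be sketched in a few lines of Python by looking at the range of ASCII codes in the quality lines. This is only a rough heuristic, not what vsearch does internally: codes below 59 can only occur with offset 33, while codes above 74 with nothing below 64 point to offset 64. The function name and thresholds are assumptions for illustration.

```python
import io
from itertools import islice


def guess_phred_offset(handle, max_reads=10000):
    """Guess 33 or 64 from quality characters; return None if ambiguous."""
    lo, hi = 255, 0
    qual_lines = islice(handle, 3, None, 4)  # quality is line 4 of each record
    for line in islice(qual_lines, max_reads):
        codes = line.rstrip("\n").encode("ascii")
        if codes:
            lo = min(lo, min(codes))
            hi = max(hi, max(codes))
    if hi == 0:
        return None  # no quality data seen
    if lo < 59:
        return 33  # characters this low are impossible with offset 64
    if lo >= 64 and hi > 74:
        return 64  # characters this high are unusual with offset 33
    return None


# Toy record; the quality characters are chosen only for the demonstration.
toy = io.StringIO("@r1\nACGT\n+\n%5GJ\n")
print(guess_phred_offset(toy))
```

The quality lines pasted at the top of this thread contain characters such as `%` and `+`, whose ASCII codes are below 59, so by this heuristic they look like offset 33, with `J` corresponding to a score of 41.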

Colin