I'm currently trying to import sequence data into Qiime2 2019.4 via the Virtual Machine. I've already processed the fastq file by removing barcodes and linkers. When I go to import the sequences into Qiime2, it gives me an error:
There was a problem importing /home/qiime2/Documents/pacbio/qiime2/emp-single-end-sequences/:
/home/qiime2/Documents/pacbio/qiime2/emp-single-end-sequences/sequences.fastq.gz is not a(n) FastqGzFormat file:
Quality score length doesn't match sequence length for record beginning on line 5
I ran the normal import command with the type set to EMPSingleEndSequences, and there was no option to run it with the "--verbose" flag.
The first record is actually the problematic one: its quality sequence is indeed shorter than the read. I don't know how this happened.
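If it helps to find every record like this before importing, here is a minimal Python sketch. It assumes the file uses plain four-line FASTQ records (no wrapped sequence lines), and `find_length_mismatches` is just a name made up for illustration, not part of QIIME 2:

```python
import gzip
from itertools import islice


def find_length_mismatches(path):
    """Yield (first_line_number, header, seq_len, qual_len) for each record
    whose quality string length differs from its sequence length.

    Assumes strict four-line FASTQ records; a truncated final record is
    reported with None for both lengths.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        line_no = 1  # 1-based line number of the current record's header
        while True:
            record = [line.rstrip("\r\n") for line in islice(handle, 4)]
            if not record:
                break  # end of file
            if len(record) < 4:
                yield (line_no, record[0], None, None)  # truncated record
                break
            header, seq, _plus, qual = record
            if len(seq) != len(qual):
                yield (line_no, header, len(seq), len(qual))
            line_no += 4
```

Something like `for hit in find_length_mismatches("sequences.fastq.gz"): print(hit)` would then list the starting line of each bad record, which can be compared against the line number in the import error.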
This is from a PacBio instrument; could you describe what kind?
It looks like the orientation (and perhaps even the location) is mixed, but without knowing the instrument, I am uncertain how to interpret the fastq headers.
What kind of data is this? Is it still amplicon, or are we working with shotgun/other?
Oh yeah? You can't just leave us hanging with that! What kind of steps did you end up needing to take? It might explain problem 1 from above.
In the meantime, using DADA2 directly, as discussed in this paper, is probably your best bet!
I'm also very interested in what this process generally looks like from your (@kayleecastle) perspective, as I haven't had a chance to work with this kind of data before. What transformations did you have to do to get a fastq file? It appears that is not the native format for CCS data.
The raw fastqs that come out of the ccs/lima applications from amplicon data seem to typically be in mixed orientation, contain the primers, and have quality scores that range up to 93.
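To make those properties concrete, here is a rough Python sketch of checking a read's maximum quality score and putting reads into a single orientation. It is only an illustration under some loud assumptions: quality strings use the standard Phred+33 encoding, the forward primer is matched exactly (real amplicon primers often contain degenerate bases, which this does not handle), and the function names are hypothetical rather than taken from DADA2 or QIIME 2:

```python
# Translation table for complementing unambiguous DNA bases.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def max_phred(qual, offset=33):
    """Return the highest quality score in a Phred+33 quality string."""
    return max(ord(char) for char in qual) - offset


def orient(seq, qual, fwd_primer):
    """Return (seq, qual) oriented so the read starts with fwd_primer.

    Exact prefix matching only (a simplification). Returns None when the
    primer is found in neither orientation.
    """
    if seq.startswith(fwd_primer):
        return seq, qual
    flipped = reverse_complement(seq)
    if flipped.startswith(fwd_primer):
        # Quality scores are reversed to stay aligned with the flipped bases.
        return flipped, qual[::-1]
    return None
```

A quality character of `~` corresponds to a score of 93 under this encoding, which is the upper end mentioned above and higher than many tools written for Illumina data expect.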
We will add a dedicated denoise-pacbio dada2 workflow at some point, but I might wait until R package version 1.12 propagates to bioconda/qiime2, as there were some PacBio fixes between 1.10 and 1.12 that make implementing such a workflow significantly easier.
Thank you for taking the time to respond! It is PacBio Sequel and is amplicon data. The company that does our sequencing provides a Fastq Processor, which we have used with 3 other sets of MiSeq data. Usually when processing the data, we remove linker primers, barcodes, and reverse primers. This time we just left the reverse primers on, since it was a new type of data set and was giving errors with our normal parameters. We were going to address this further in Qiime2, but obviously no luck lol
Thank you for helping me with this issue!! QIIME2 team is awesome!
Anyways, I used a Fastq Processor that the sequencing company provides (http://www.mrdnafreesoftware.com/). The processor takes the Fastq file, removes linker primers, barcodes, and reverse primers, and zips it. Unfortunately, I have only worked with MiSeq data prior to this, so I'm pretty unfamiliar with PacBio/CCS data.