I am trying to analyze some data from the NCBI database that was sequenced using pyrosequencing. The files posted in the NCBI database are in fastq format. I assumed they would be fine to run as-is, but the quality scores are relatively low (quality bottoms out at around 550 bp, with scores of about 10 for everything longer than that). My PI believes that pyrosequencing would generally give higher scores, and is concerned that the files are not actually in the correct format. I found some posts indicating that 454 data can be converted to the proper format in QIIME1, but I don't know how to check whether this was done. I imported using --type 'SampleData[SequencesWithQuality]' --input-format SingleEndFastqManifestPhred33V2.
I would appreciate any guidance on either using a different import method (if mine is incorrect for these files), or help figuring out whether the files were properly formatted (I don't know what to look for within the fastq file to confirm this, and am having trouble pinpointing this specific info online). If someone knows that this is 'normal', or at least not unusual, for pyrosequencing files in QIIME2, that would also be appreciated.
Sorry for the delay in responding; I have been sick the past week. I appreciate someone helping me look into this, as the few people I know in my lab space who are familiar with 454 data have never used it for 16S/QIIME work. Below is the quality score plot (and the sequence counts summary, in case that is useful). As I mentioned before, I don't really know how to tell if the files are 'properly formatted' beyond the fact that these are fastq files and the paper indicates that the samples were pyrosequenced. Based on the info that was provided, it appears that some samples were prepped specifically for 16S and some for whole gene/metagenomic sequencing; only what was marked as 16S should be included in this plot. Thanks in advance for any help that can be provided!
Hi @Cassie, This actually looks pretty reasonable to me, presuming that the number of sequences longer than about 350 bases is dropping really quickly. There is a sequence length summary at the bottom of the quality score plot page where you can review that. I've attached a quality score summary from a 454 dataset that I've been working with, which you can compare against.
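On the question of what to look for inside the fastq file: one rough check is the range of characters on the quality lines. With a Phred+33 encoding, a score of 10 is written as '+' and a score of 40 as 'I', while characters below ASCII 59 cannot occur in the older +64 encodings. The sketch below is only a heuristic and assumes strict 4-line records (no wrapped sequence lines); the function names are my own, not part of QIIME 2.

```python
import gzip


def open_fastq(path):
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def quality_code_range(lines):
    """Return (lowest, highest) ASCII code seen on the quality lines.

    Assumes strict 4-line FASTQ records, so every 4th line is a
    quality string. Returns None if no quality characters were found.
    """
    lo, hi = None, None
    for i, line in enumerate(lines):
        if i % 4 != 3:
            continue  # not a quality line
        codes = [ord(c) for c in line.strip()]
        if not codes:
            continue
        lo = min(codes) if lo is None else min(lo, min(codes))
        hi = max(codes) if hi is None else max(hi, max(codes))
    return None if lo is None else (lo, hi)


def guess_encoding(code_range):
    """Heuristic guess at the quality offset from the observed code range."""
    if code_range is None:
        return "unknown"
    lo, _ = code_range
    if lo < 59:
        # Characters this low cannot appear in +64 encodings.
        return "phred33"
    if lo >= 64:
        # Could also be Phred+33 data with only very high scores.
        return "probably phred64"
    # 59..63 is possible in old Solexa+64 files; cannot tell from range alone.
    return "ambiguous"


# Hypothetical usage (the file name is a placeholder):
# with open_fastq("sample.fastq.gz") as handle:
#     print(guess_encoding(quality_code_range(handle)))
```

If this reports phred33 for your files, that is at least consistent with the SingleEndFastqManifestPhred33V2 import you used.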
Note that for 16S analysis, you're going to want to trim those sequences pretty aggressively. I would recommend setting trunc_len to around 250 bases when you call denoise_pyro (--p-trunc-len with qiime dada2 denoise-pyro on the command line), assuming that's the workflow you're planning.
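Before settling on a truncation length, it can help to see how many reads would survive it. As I understand the DADA2 plugin documentation, reads shorter than trunc_len are discarded rather than kept at their shorter length, so the fraction of reads at least that long is the number to watch. Here is a minimal sketch, again assuming strict 4-line fastq records; the helper names are hypothetical and this is not how QIIME 2 itself computes its summaries.

```python
def read_lengths(lines):
    """Yield the length of each read, assuming strict 4-line FASTQ records."""
    for i, line in enumerate(lines):
        if i % 4 == 1:  # second line of each record is the sequence
            yield len(line.strip())


def fraction_retained(lengths, trunc_len):
    """Fraction of reads whose length is at least trunc_len."""
    lengths = list(lengths)
    if not lengths:
        return 0.0
    kept = sum(1 for n in lengths if n >= trunc_len)
    return kept / len(lengths)


# Hypothetical usage (the file name is a placeholder):
# with open("sample.fastq") as handle:
#     lengths = list(read_lengths(handle))
# for candidate in (200, 250, 300, 350):
#     print(candidate, fraction_retained(lengths, candidate))
```

Running this per sample for a few candidate values gives a quick sense of how much data each choice of trunc_len would cost.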