NCBI data: weird interactive quality plot

Lacona · April 8, 2021, 2:13pm

Hey all,

For a meta-analysis I had to import data from several studies. Each sample had only one fastq file.
Based on that I imported the files using the fastq manifest format:

qiime tools import \
  --type 'SampleData[SequencesWithQuality]' \
  --input-path se-33-manifest \
  --output-path single-end-demux.qza \
  --input-format SingleEndFastqManifestPhred33V2

The quality plots for two of the studies look weird, as if forward and reverse reads were in the same fastq file:

This is the quality plot of the other datasets:

For the meta-analysis I have to trim and truncate all datasets the same way.
Could you please tell me how to deal with those weird datasets so that I do not mess up my meta-analysis?

Thank you!

SoilRotifer · April 8, 2021, 3:11pm

Hi @Lacona,

Can you provide us more details on how you are obtaining the data from NCBI? Are you pulling from the ftp site, sra-toolkit, fastq-dump, etc.? It appears to me that the files are either interleaved (i.e. forward and reverse reads in the same file), and/or simply joined end-to-end in a couple of ways (i.e. not truly merged... probably output as simply concatenated reads / spots). Usually, when using fastq-dump you have to set options like --split-3, --split-files, etc...
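One quick way to check for end-to-end joined reads is to look at the read lengths in the dumped file. Here is a minimal sketch, assuming a standard four-line-per-record FASTQ; the function name read_length_histogram is just something I made up for illustration. If forward and reverse reads were concatenated, I'd expect lengths of roughly twice what the sequencing run should produce.

```shell
# Print "length count" pairs for the reads in a FASTQ file (plain or gzipped).
# Assumes four lines per record, so the sequence is line 2 of each group of 4.
read_length_histogram() {
    fastq="$1"
    case "$fastq" in
        *.gz) gzip -dc "$fastq" ;;   # decompress gzipped input on the fly
        *)    cat "$fastq" ;;
    esac |
        awk 'NR % 4 == 2 { n[length($0)]++ }
             END { for (len in n) print len, n[len] }' |
        sort -n                      # shortest read length first
}
```

Call it as `read_length_histogram sample.fastq` (the file name is a placeholder). This only hints at concatenation; it cannot tell you whether a file is interleaved, since interleaved reads keep their normal length.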

When I'm not using command-line tools like sra-toolkit & fastq-dump, I often find it easier to download GenBank SRA data from ENA (they often sync with each other). They typically have everything already separated into forward and reverse reads for each sample.

Lacona · April 8, 2021, 3:29pm

Thank you so much for the quick reply!

I used the sra-toolkit with fastq-dump, without any splitting, because from looking at the files I would never have expected anything other than single-end reads.
for ((i = yy; i <= yy; i++)); do fastq-dump --accession SRAxxxx$i; done
Neither the metadata nor the data description mentioned anything about interleaving. How do I know where to split, etc.?

I have one study with data uploaded to ENA. After installing enaBrowserTools, I realised that there is no way to download the samples as a range like HGxxxx-HGxxxx. The study only provides the sample accessions; there is no study accession number or anything similar. That's why I didn't include this study.

SoilRotifer · April 8, 2021, 3:40pm

Yeah, it can be confusing when you do not know what to look for. I've been caught by this myself. The trick is to look at the run selector. So, if you click on the SRA Experiments 66 in the link you provided, it'll take you here. From this, you click on the link Send results to Run selector, and look at the Common Fields... you'll see that the data have been submitted as PAIRED. You can also view all the metadata for each sample on this page too.
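If you'd rather stay with fastq-dump once the run selector says PAIRED, something like the following should work. It's a rough sketch, assuming fastq-dump from the sra-toolkit is on your PATH; dump_split and the FASTQ_DUMP override variable are hypothetical names of mine, not part of the toolkit.

```shell
# Dump one SRA run with forward and reverse reads written to separate files.
# FASTQ_DUMP lets you point at a different binary (hypothetical convenience);
# by default the fastq-dump found on PATH is used.
dump_split() {
    accession="$1"
    outdir="${2:-.}"     # output directory, defaults to the current one
    "${FASTQ_DUMP:-fastq-dump}" --split-files --gzip --outdir "$outdir" "$accession"
}
```

With --split-files, each paired run should come out as <accession>_1.fastq.gz and <accession>_2.fastq.gz. You could call dump_split inside the same kind of loop you already use for your accessions.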

This data is also available on the ENA. You can see they nicely separated the forward and reverse reads for you.

From here I often simply import the data using a Manifest.
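For the manifest itself, here is a small sketch that writes the tab-separated paired-end layout QIIME 2 expects (PairedEndFastqManifestPhred33V2). It assumes the files are named <sample>_1.fastq.gz and <sample>_2.fastq.gz, which is how split downloads are usually named; make_pe_manifest is a made-up helper name, so adjust the pattern to your own files.

```shell
# Write a QIIME 2 paired-end manifest (V2, tab-separated) to stdout for every
# <sample>_1.fastq.gz / <sample>_2.fastq.gz pair found in a directory.
make_pe_manifest() {
    dir=$(cd "$1" && pwd) || return 1          # the manifest needs absolute paths
    printf 'sample-id\tforward-absolute-filepath\treverse-absolute-filepath\n'
    for fwd in "$dir"/*_1.fastq.gz; do
        [ -e "$fwd" ] || continue              # no matches: the glob stays literal
        rev="${fwd%_1.fastq.gz}_2.fastq.gz"
        if [ ! -e "$rev" ]; then
            echo "skipping $fwd: no matching reverse file" >&2
            continue
        fi
        sample=$(basename "$fwd" _1.fastq.gz)  # sample ID = file name prefix
        printf '%s\t%s\t%s\n' "$sample" "$fwd" "$rev"
    done
}
```

Usage would be something like `make_pe_manifest fastq_dir > pe-33-manifest` (both names are placeholders), followed by qiime tools import with --type 'SampleData[PairedEndSequencesWithQuality]' and --input-format PairedEndFastqManifestPhred33V2 instead of the single-end versions.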

-Mike
