Importing mixed datasets: paired and unpaired FASTQs already cleaned up/QC'd

mmelendrez · April 4, 2018, 3:03pm

Hi - so I 'hearted' a few threads on dada2 errors as it looks like 'alas' my data is also too large to process with dada2 so I'm looking at my options.

The easiest option is to be able to import a mix of paired and unpaired data, which is a smaller dataset. I did a mapping out of a tree genome as I am interested in its endophytome (the microbial composition on its leaves). During the mapping out, two of the samples were not equal in terms of FASTQ sequences, so I had to process them in bowtie differently (I don't know how that happens with FASTQ Illumina output, but I suspect their prefiltering QC did it). This means I have two files of paired-end FASTQ reads (after filtering out the tree) for two of the samples and two files of unpaired FASTQ reads from the other two samples. I did not see a manifest format or importing option in the tutorial for importing mixed datasets.

Do I just import them separately? Two as paired end and two as unpaired? And if I do that am I able to combine them as one artifact so all 4 can ultimately be compared or leave them as two?

Thanks!

thermokarst · April 4, 2018, 9:52pm

I don't entirely follow your question here, but if you are able to import the PE reads separately from the SE reads, you can process them using whatever flow you would like, individually (e.g. deblur, dada2, vsearch). Once you have representative sequences and a table for each "set" of data, you should be able to merge the resultant files, then do all of your downstream processing. That is all from a "functional" perspective, but I will ask @Nicholas_Bokulich to "QIIME" in on the "academic" considerations here (for example, I suspect that this could have a negative impact on things like taxonomic classification). Please see this tutorial for merging multiple runs, which is the closest analogy we have to your specific case.
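As a rough illustration of importing the two sets separately, here is a minimal Python sketch that writes one manifest for the paired-end samples and one for the single-end samples. The sample IDs and file paths are hypothetical. The three-column CSV layout (`sample-id,absolute-filepath,direction`) is an assumption based on the "Fastq manifest" formats described in the importing tutorial at the time of this thread; check the documentation for your release, since the manifest layout may differ in other versions.

```python
import csv
import io

# Hypothetical sample IDs and paths; replace with your own absolute paths.
PAIRED = {
    "tree1": ("/data/tree1_R1.fastq.gz", "/data/tree1_R2.fastq.gz"),
    "tree2": ("/data/tree2_R1.fastq.gz", "/data/tree2_R2.fastq.gz"),
}
SINGLE = {
    "tree3": "/data/tree3.fastq.gz",
    "tree4": "/data/tree4.fastq.gz",
}

# Assumed manifest header (CSV manifest layout; verify for your release).
HEADER = ["sample-id", "absolute-filepath", "direction"]


def paired_manifest(samples):
    """Return manifest text with one forward and one reverse row per sample."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(HEADER)
    for sample_id, (fwd, rev) in sorted(samples.items()):
        writer.writerow([sample_id, fwd, "forward"])
        writer.writerow([sample_id, rev, "reverse"])
    return buf.getvalue()


def single_manifest(samples):
    """Return manifest text with a single forward row per sample."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(HEADER)
    for sample_id, path in sorted(samples.items()):
        writer.writerow([sample_id, path, "forward"])
    return buf.getvalue()


paired_text = paired_manifest(PAIRED)
single_text = single_manifest(SINGLE)
print(paired_text)
print(single_text)
```

Each manifest would then be used for its own import, giving two separate artifacts that are processed independently before any merging.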

Nicholas_Bokulich · April 5, 2018, 12:24am

I am also not following your question. This experimental setup is not clear to me but here goes:

Yes, you will want to process these separately as @thermokarst advised.

Ultimately you will want to merge these (as @thermokarst indicated) if you wish to compare these samples. However, importing/merging is the least of your worries. If these samples are processed differently, e.g., with different final read lengths (after joining paired-end reads, that is), then each sample will have 100% unique features and you cannot compare using sequence variants. You will need to:

1. use q2-fragment-insertion to compare samples with discontiguous features.
2. assign taxonomy and use taxonomic assignments as features for comparing samples (e.g., alpha, beta diversity, ancom). The issue with this approach is that sequences of different lengths may well be assigned to different taxa, leading to the same problem of unique features...
3. trim your joined paired-end reads to the same length as the single-end reads. In which case you may as well:
4. process all samples as single-end reads with the same parameters.

I personally prefer #4. Trying to compare samples that have been processed with different pipelines/parameters can be a major challenge (and this is not a problem specific to QIIME2).
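To see why different read lengths produce entirely unique features, here is a small sketch. It assumes that a feature is identified by its exact sequence; an MD5 digest is used as a stand-in for the feature ID, and the sequence itself is made up for illustration.

```python
import hashlib


def feature_id(seq):
    """MD5 hex digest of a sequence, used here as a stand-in feature ID."""
    return hashlib.md5(seq.encode("ascii")).hexdigest()


# Hypothetical amplicon from the same organism, seen at two read lengths.
joined_read = "ACGTTGCAAGCTTAGGCTAACGGTACCTGA"  # e.g. joined paired-end read
single_read = joined_read[:20]                  # e.g. shorter single-end read

# Without trimming, the two reads are different strings, so they get
# different feature IDs even though they come from the same organism.
ids_untrimmed = {feature_id(joined_read), feature_id(single_read)}

# After trimming both to a common length, the sequences are identical
# and collapse into a single feature.
trim_len = 20
ids_trimmed = {feature_id(joined_read[:trim_len]), feature_id(single_read[:trim_len])}

print(len(ids_untrimmed))  # 2
print(len(ids_trimmed))    # 1
```

This is the reasoning behind options 3 and 4: once every sample is processed to the same length with the same parameters, identical organisms can yield identical features.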

Your dataset is not too large, and dada2 is in no way the constraint here (unless I misunderstood). The constraint would be that your computer does not have enough memory.

What does this mean? Shorter than others?

This is not a default of Illumina, as far as I know, but it sounds like perhaps your sequencing center or service is performing some kind of pre-trimming as part of their QC. I would recommend discussing this with them — you should get the rawest form of the data possible and process entirely within QIIME2. In general, pre-processing with other programs only increases the likelihood that some incompatibilities with QIIME2 can be introduced (e.g., if these programs alter the expected formats or trim sequences enthusiastically).

Good luck!

mmelendrez · April 5, 2018, 2:53pm

Ok I will try that - looking at the tutorial link. Essentially I eventually want to compare the microbial compositions between samples (N=4) and I didn't know if I needed to have them all combined into one artifact associated with metadata to do that in QIIME2?

But I am fine uploading/preprocessing them via dada2 or whichever workflow to clean them up as long as I can merge everything for downstream analysis. So I'll look at that.

Nicholas_Bokulich · April 5, 2018, 3:22pm

Yes, you will need to merge these, but that can happen once you have feature tables for each distinct sequencing run (that is when you will want to merge if you are running dada2 on multiple different sequencing runs).

N=4 is probably not a large enough sample size for statistically comparing these samples downstream, so just be warned that some steps may fail. However, this should be fine for building taxa barplots and calculating alpha/beta diversity (without stats), so you will want to merge samples prior to those steps.
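Conceptually, merging per-run feature tables looks something like the sketch below. This is plain Python for illustration only, not the QIIME 2 implementation; the table layout (`{sample: {feature: count}}`), the feature names, and the counts are all hypothetical.

```python
def merge_tables(*tables):
    """Merge {sample: {feature: count}} tables whose sample IDs do not overlap."""
    merged = {}
    for table in tables:
        for sample, counts in table.items():
            if sample in merged:
                raise ValueError(f"duplicate sample ID: {sample}")
            merged[sample] = dict(counts)
    return merged


def to_matrix(table):
    """Return (samples, features, rows), filling unobserved features with 0."""
    features = sorted({f for counts in table.values() for f in counts})
    samples = sorted(table)
    rows = [[table[s].get(f, 0) for f in features] for s in samples]
    return samples, features, rows


# Hypothetical tables: one from the paired-end set, one from the single-end set.
paired_table = {"tree1": {"featA": 10, "featB": 3}, "tree2": {"featA": 7}}
single_table = {"tree3": {"featA": 4, "featC": 2}, "tree4": {"featB": 9}}

merged = merge_tables(paired_table, single_table)
samples, features, rows = to_matrix(merged)

print(features)
for sample, row in zip(samples, rows):
    print(sample, row)
```

The merged table only supports meaningful comparisons if the same organism produces the same feature in both sets, which is why consistent processing matters before this step.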

mmelendrez · April 5, 2018, 4:21pm

@Nicholas_Bokulich @thermokarst fair enough - let me clarify with my colleague about the data files and sequencing design. Nevertheless, it is useful to know how to approach a mixed dataset upload! Thank you.

system · May 6, 2018, 10:21pm

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.