This may be a very easy question to answer, but I am wondering about the best workflow to get the maximum number of OTUs out of my data. (I’ve been following the really helpful “Moving Pictures” tutorial.)
1.) How many fastq files should I combine into one manifest.csv file? (I have around 85 fastq files from 5 sequencing runs.) In order to use DADA2, I have to decide at which length to truncate my sequences, but after exploring my data I noticed that the read quality is not homogeneous:
1. Is it better to look at the quality drop in every single run individually, or to look at the overall quality drop (which means losing some above-average reads)?
2.) Is it important for later analyses that the results of DADA2 are all equal in length (e.g. for analysing sampling depth)?
3.) Bearing in mind my machine’s capacities: Is it legitimate to only use the forward reads? In other words, what would I lose if I discarded my reverse reads? DADA2 on three fastq files took almost 24 h, and I would like to speed this up a bit.
I am sorry for this number of questions, and I hope you are able to understand my problems.
I would combine all your samples into the manifest file, unless you plan to handle each sequencing run in parallel. If your 85 samples are the only samples you ran on those 5 sequencing runs, I would import and process each run separately and then use Deblur. (I find it’s more robust to parallel processing than DADA2, because it applies the same error profile across all your sequencing runs, whereas DADA2 trains its error model on each run.) If the samples are simply spread across five sequencing runs that were multiplexed with other samples, then you should probably just import them all and run them together.
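To make the per-run approach concrete, here is a sketch of a manifest and import command, assuming the older CSV manifest format and Phred 33 quality scores; all file names and paths (`manifest-run1.csv`, `/data/run1/...`, `demux-run1.qza`) are placeholders you would replace with your own:

```
sample-id,absolute-filepath,direction
sample-1,/data/run1/sample1_R1.fastq.gz,forward
sample-1,/data/run1/sample1_R2.fastq.gz,reverse
sample-2,/data/run1/sample2_R1.fastq.gz,forward
sample-2,/data/run1/sample2_R2.fastq.gz,reverse
```

```shell
# Import one sequencing run (repeat once per run, each with its own manifest).
qiime tools import \
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path manifest-run1.csv \
  --input-format PairedEndFastqManifestPhred33 \
  --output-path demux-run1.qza
```

You would then merge the per-run feature tables and representative sequences after denoising.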
I would set your threshold based on the average. Your ability to perform downstream analyses depends on having consistent read lengths. You will lose the above-average reads, but you will be able to compare the features.
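To see where the average quality drops, you can generate the interactive quality plot and read the truncation length off it; the file names here are placeholders matching the per-run sketch above:

```shell
# Summarise read quality per position (view the .qzv at https://view.qiime2.org).
qiime demux summarize \
  --i-data demux-run1.qza \
  --o-visualization demux-run1.qzv
```

A common rule of thumb is to truncate where the median quality score starts to fall off consistently, and to use the same truncation length for every run you intend to compare.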
It is completely legitimate to look only at forward reads. Many workflows use only their forward reads for various and sundry reasons (poor-quality reverse reads, insufficient overlap given the read length, historical reasons™), so you can absolutely do it.
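A forward-reads-only run is just the single-end DADA2 command, which also accepts a paired-end import and ignores the reverse reads. A minimal sketch, assuming the placeholder file names from above and a hypothetical truncation length of 150 (pick yours from the quality plot):

```shell
# Denoise forward reads only; --p-n-threads 0 uses all available cores,
# which should also cut down that 24 h runtime considerably.
qiime dada2 denoise-single \
  --i-demultiplexed-seqs demux-run1.qza \
  --p-trim-left 0 \
  --p-trunc-len 150 \
  --p-n-threads 0 \
  --o-table table-run1.qza \
  --o-representative-sequences rep-seqs-run1.qza \
  --o-denoising-stats stats-run1.qza
```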
However, if computational power is your main limiting factor, I would explore options to get more computational resources. Many academic institutions have servers or computing clusters which are often underutilised and either already paid for or cheap to use. Amazon, Microsoft, and other major corporations also provide cloud-based supercomputers for hire, where, for a few dollars, you can get a huge amount of compute.