I have paired-end sequencing data generated by Novogene, and I have a few options for what to import. They send raw data (ideally I would like to use this): forward and reverse reads with barcode and primer included; clean data: forward and reverse reads with primer and barcode removed; and extended reads: aligned read pairs.
All of these are .fastq files; the primer and barcode information for all samples is contained in an .xls file.
The barcode information is also in a .fastq file for each sample. Do I need to create a fastq file for each sample? Can I somehow use the .xls file provided? I am also a bit confused about when the primers are removed in the workflow. Is that why the tutorials have a barcode.fastq file for each sample: because it includes the barcode and primer information for that sample?
Currently, all of these files for each sample are in a folder named after the sample. I think I need to move all the forward files into one folder and zip it, then do the same in a different folder for the reverse reads. Is this correct?
Sorry I just don't really understand how to organize the files to properly import them. Perhaps I haven't found the proper tutorial?
For each sample, you can indicate the absolute path for forward and reverse files.
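Pointing at each sample's forward and reverse files by absolute path is what a QIIME 2 manifest file does, so the per-sample folders can stay as they are. Here is a minimal sketch that writes such a manifest from one-folder-per-sample data. The function name `build_manifest` and the default suffixes `_1.fastq` / `_2.fastq` are assumptions for illustration, not Novogene's actual naming, so adjust them to match your files.

```python
import csv
from pathlib import Path


def build_manifest(data_dir, manifest_path,
                   fwd_suffix="_1.fastq", rev_suffix="_2.fastq"):
    """Write a tab-separated paired-end manifest, one row per sample folder.

    The suffixes are hypothetical placeholders; change them to whatever
    your forward and reverse read files actually end with.
    """
    data_dir = Path(data_dir).resolve()  # manifest needs absolute paths
    rows = []
    for sample_dir in sorted(p for p in data_dir.iterdir() if p.is_dir()):
        fwd = sorted(sample_dir.glob(f"*{fwd_suffix}"))
        rev = sorted(sample_dir.glob(f"*{rev_suffix}"))
        if len(fwd) != 1 or len(rev) != 1:
            raise ValueError(
                f"{sample_dir.name}: expected exactly one forward and one "
                f"reverse file, found {len(fwd)} and {len(rev)}"
            )
        # The folder name is used as the sample ID.
        rows.append((sample_dir.name, str(fwd[0]), str(rev[0])))

    with open(manifest_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["sample-id",
                         "forward-absolute-filepath",
                         "reverse-absolute-filepath"])
        writer.writerows(rows)
    return rows
```

The resulting file can then be passed to `qiime tools import` with `--type 'SampleData[PairedEndSequencesWithQuality]'` and `--input-format PairedEndFastqManifestPhred33V2` (assuming Phred33 quality scores, which is typical for recent Illumina data).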
My suggestion is:
Import raw data (or clean)
Remove primers with cutadapt. When sequencing centers say that they have already removed primers, it usually means that they removed the adapters they added for sequencing. You can try removing your biological primers while discarding all reads that lack them; this serves as an additional control (very small output files mean that something is wrong with the command or the primers).
Then proceed to dada2/deblur or another method for denoising.
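As a rough sketch of the first two steps, the snippet below only assembles the QIIME 2 command lines so they can be inspected before being run. The helper names, the file names, and the primer placeholders are all hypothetical; the real primer sequences should come from the .xls file supplied with the data.

```python
import shlex


def import_cmd(manifest, demux_qza):
    """Command to import paired-end reads listed in a manifest file."""
    return [
        "qiime", "tools", "import",
        "--type", "SampleData[PairedEndSequencesWithQuality]",
        "--input-path", manifest,
        "--output-path", demux_qza,
        "--input-format", "PairedEndFastqManifestPhred33V2",
    ]


def cutadapt_cmd(demux_qza, trimmed_qza, fwd_primer, rev_primer,
                 discard_untrimmed=True):
    """Command to remove 5' primers from forward and reverse reads."""
    cmd = [
        "qiime", "cutadapt", "trim-paired",
        "--i-demultiplexed-sequences", demux_qza,
        "--p-front-f", fwd_primer,
        "--p-front-r", rev_primer,
        "--o-trimmed-sequences", trimmed_qza,
    ]
    if discard_untrimmed:
        # Drop read pairs in which the primer was not found.
        cmd.append("--p-discard-untrimmed")
    return cmd


def show(cmd):
    """Return the command as a copy-pasteable shell line."""
    return shlex.join(cmd)
```

For example, `print(show(cutadapt_cmd("demux.qza", "trimmed.qza", "FWD_PRIMER_SEQUENCE", "REV_PRIMER_SEQUENCE")))` prints a line that can be pasted into the terminal once the placeholders are replaced. Running `qiime demux summarize` on the output before and after trimming is one way to see how many reads survived.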
Thank you I couldn't find this import tutorial, this is definitely what I needed.
For the sequences with barcodes and primers that are already demultiplexed, how would I remove the barcodes? I can't run demux; would you just trim that number of bases from the left?
I only extract the total DNA from my samples and Novogene does the 16S PCR so I think the clean sequences should have the biological primers removed as well. I will email Novogene to confirm but I think the barcode and primers removed files have had the 16S primers removed.
What would happen if I run cutadapt and the primers have already been removed? Would it throw out all the sequences?
I have been paying for the simple bioinformatics using qiime but I am hoping with a bit of studying I can create a qiime2 pipeline and start using the raw data, saving a bit of money and upgrading in the process.
Trimming a fixed number of bases is not recommended, since it creates an artificial increase in diversity.
I would suggest removing the primers with cutadapt; it should also remove any sequence preceding the primer, including barcodes.
In that case you can use the clean data, but sometimes sequencing centers also run quality-control tools on it, which is not recommended for DADA2. Usually they provide a report file with details about the process; check whether it is there.
Nothing, if you do not provide the "discard untrimmed" flag; if you do provide it, you will get an almost empty file (or maybe even an error?).
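To make the two behaviours concrete, here is a toy illustration of 5' primer removal with and without discarding untrimmed reads. It is not how cutadapt is implemented: it uses exact string matching only, whereas cutadapt tolerates mismatches and handles degenerate bases. The function name and the sequences are made up for the example.

```python
def trim_front(read, primer, discard_untrimmed=False):
    """Remove the primer and everything before it from a read.

    Returns the trimmed read, the unchanged read if the primer is absent,
    or None if the primer is absent and discard_untrimmed is set.
    Exact matching only; a simplification of what cutadapt does.
    """
    pos = read.find(primer)
    if pos == -1:
        return None if discard_untrimmed else read
    # Anything upstream of the primer (e.g. a barcode) goes with it.
    return read[pos + len(primer):]


# Made-up sequences for illustration only.
barcode, primer, insert = "AACCGG", "GTGCCA", "TTTTACGT"

raw_read = barcode + primer + insert   # barcode and primer still present
clean_read = insert                    # primer already removed upstream

print(trim_front(raw_read, primer))                            # TTTTACGT
print(trim_front(clean_read, primer))                          # TTTTACGT
print(trim_front(clean_read, primer, discard_untrimmed=True))  # None
```

The first call shows the barcode disappearing together with the primer, the second shows an already-clean read passing through untouched, and the third shows the same read being thrown out once untrimmed reads are discarded.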
That sounds like the path I would choose as well!
Good luck,