Preparing Data for Import

Sorry if this should be in a different area.

I have paired-end sequencing data generated by Novogene, and I have a few options for what to import. They send:

  • raw data (ideally I would like to use this): forward and reverse reads with barcode and primer included
  • clean data: forward and reverse reads with primer and barcode removed
  • extended reads: aligned read pairs

All of these are .fastq files. The primer and barcode information for all samples is contained in a .xls file.

In the example data provided here: Importing data — QIIME 2 2024.10.1 documentation

The barcode information is also in a .fastq file for each sample. Do I need to create a fastq file for each sample? Can I somehow use the .xls file provided? I am also a bit confused about when the primers are removed in the workflow. Is that why the tutorials have a barcode.fastq file for each sample: because it includes the barcode and primer information for that sample?

Currently all these files for each sample are in a folder named after the sample. I think I need to move all the forward files into 1 folder and zip it, then do the same in a different folder for the reverse reads? Is this correct?

Sorry I just don't really understand how to organize the files to properly import them. Perhaps I haven't found the proper tutorial?

Thank you for your help,
Bonita

Hello!
I can understand your confusion; I felt the same with my first dataset!
Hope that I will be able to clarify it a little bit for you.

You can choose either raw data (my personal choice) or clean data (if you are sure that they only removed primers/barcodes).

You only need a barcodes file if your files are still multiplexed (that is, a single fastq each for forward and reverse reads, with all samples pooled together).

If I understood your post correctly, your files are already demultiplexed, meaning that they are already separated by each sample.

Nope, you can use manifest format: Importing data — QIIME 2 2024.10.1 documentation

For each sample, you can indicate the absolute path for forward and reverse files.
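A manifest does not have to be typed by hand. Here is a minimal sketch that writes one, assuming one folder per sample (the folder name is used as the sample ID) and forward/reverse files ending in `_1.fq.gz` and `_2.fq.gz`. Those suffixes and the function name `build_manifest` are my assumptions, not something Novogene or QIIME 2 guarantees, so adjust them to your file names. The header row follows the paired-end manifest layout from the importing tutorial.

```python
import csv
from pathlib import Path


def build_manifest(root, out_path, fwd_suffix="_1.fq.gz", rev_suffix="_2.fq.gz"):
    """Write a paired-end manifest with one row per sample folder under root.

    The suffixes are hypothetical defaults; change them to match your files.
    Returns the list of sample IDs that were written.
    """
    root = Path(root).resolve()
    rows = []
    for sample_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        fwd = sorted(sample_dir.glob(f"*{fwd_suffix}"))
        rev = sorted(sample_dir.glob(f"*{rev_suffix}"))
        # Skip folders that do not hold exactly one forward and one reverse file.
        if len(fwd) != 1 or len(rev) != 1:
            continue
        rows.append((sample_dir.name, str(fwd[0]), str(rev[0])))

    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(
            ["sample-id", "forward-absolute-filepath", "reverse-absolute-filepath"]
        )
        writer.writerows(rows)
    return [row[0] for row in rows]
```

If a folder holds both raw and clean files with the same ending, it will be skipped, so pick suffixes that match only the set you want to import and compare the returned sample IDs against your sample list.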

My suggestion is:

  • Import raw data (or clean)
  • Remove primers with cutadapt. When sequencing centers say that they have already removed primers, it usually means that they removed the adapters they added for sequencing. You can try to remove your biological primers while discarding all reads that lack them; this serves as an additional control (output files that are too small mean that something is wrong with the command or the primers).
  • Then proceed to DADA2/Deblur or another method for denoising.
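The first two steps above can be sketched as command lines. The snippet only builds and prints the commands; it does not run QIIME 2. The file names (`demux.qza`, `trimmed.qza`), the primer placeholders, and the Phred33 manifest format are assumptions on my side, and the flags are the q2-cutadapt options as I know them, so please check them against `qiime cutadapt trim-paired --help` in your installed version.

```python
import shlex


def qiime_commands(manifest, fwd_primer, rev_primer):
    """Return the import and primer-removal commands as argument lists.

    Nothing is executed here. Denoising is left out because its truncation
    settings depend on the quality profile of the data.
    """
    import_cmd = [
        "qiime", "tools", "import",
        "--type", "SampleData[PairedEndSequencesWithQuality]",
        "--input-path", manifest,
        "--input-format", "PairedEndFastqManifestPhred33V2",  # assumes Phred33
        "--output-path", "demux.qza",  # hypothetical output name
    ]
    trim_cmd = [
        "qiime", "cutadapt", "trim-paired",
        "--i-demultiplexed-sequences", "demux.qza",
        "--p-front-f", fwd_primer,
        "--p-front-r", rev_primer,
        "--p-discard-untrimmed",  # drop read pairs in which no primer is found
        "--o-trimmed-sequences", "trimmed.qza",  # hypothetical output name
    ]
    return [import_cmd, trim_cmd]


if __name__ == "__main__":
    # Placeholders only: take the real primer sequences from the .xls file.
    for cmd in qiime_commands("manifest.tsv", "FWD_PRIMER_SEQ", "REV_PRIMER_SEQ"):
        print(shlex.join(cmd))
```

Keeping the commands as lists makes it easy to pass them to `subprocess.run` later, once the printed versions look right.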

Best,

Thank you so much for the quick reply. I think I understand a lot more but I have a couple follow-up questions.

Nope, you can use manifest format: Importing data — QIIME 2 2024.10.1 documentation


For each sample, you can indicate the absolute path for forward and reverse files.

Thank you I couldn't find this import tutorial, this is definitely what I needed.

  • Remove primers with cutadapt. When sequencing centers say that they have already removed primers, it usually means that they removed the adapters they added for sequencing. You can try to remove your biological primers while discarding all reads that lack them; this serves as an additional control (output files that are too small mean that something is wrong with the command or the primers).
  • Then proceed to DADA2/Deblur or another method for denoising.

For the sequences with barcodes and primers that are already demultiplexed, how would I remove the barcodes? I can't run demux, so would you just trim that number of bases from the left?

I only extract the total DNA from my samples and Novogene does the 16S PCR so I think the clean sequences should have the biological primers removed as well. I will email Novogene to confirm but I think the barcode and primers removed files have had the 16S primers removed.

What would happen if I run cutadapt and the primers have already been removed? Would it throw out all the sequences?

I have been paying for the simple bioinformatics using qiime but I am hoping with a bit of studying I can create a qiime2 pipeline and start using the raw data, saving a bit of money and upgrading in the process.

Thank you very much for your help!
Bonita

Trimming a fixed number of bases is not recommended, since it creates an artificial increase in diversity.
I would suggest removing primers with cutadapt: it should also remove any sequence preceding the primer, including barcodes.
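A toy illustration of the difference, under the assumption that the stretch before the primer is not the same length in every read (check this in your own data). The function `trim_at_primer` only mimics the idea of 5' primer removal with exact matching; it is not how cutadapt is implemented, and all sequences are made up.

```python
def trim_fixed(read, n_bases):
    """Cut a fixed number of bases from the 5' end, whatever they are."""
    return read[n_bases:]


def trim_at_primer(read, primer):
    """Toy 5' primer removal: drop the primer and everything before it.

    Returns None when the primer is absent. Exact matching only; real
    cutadapt also handles mismatches and degenerate bases.
    """
    start = read.find(primer)
    if start == -1:
        return None
    return read[start + len(primer):]


# All sequences below are invented for illustration.
primer = "ACGTAC"
insert = "GGGTTTCCCAAA"
reads = [
    "TTAGCA" + primer + insert,   # 6-base barcode
    "CCGATTA" + primer + insert,  # 7-base barcode
]

fixed = {trim_fixed(r, 6 + len(primer)) for r in reads}
anchored = {trim_at_primer(r, primer) for r in reads}
print(len(fixed), len(anchored))  # prints: 2 1
```

With fixed-length trimming the same biological sequence ends up as two different reads, while primer-based trimming gives one.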

In that case you can use the clean data, but sometimes sequencing centers also run quality control tools on it, which is not recommended for DADA2. Usually they provide a report file with details about the process; check whether it is there.

Nothing will happen if you do not provide the "discard untrimmed" flag. If you do provide it, you will get an almost empty file (or maybe even an error?).
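Before running cutadapt, you can estimate how many reads still carry the primer, which tells you roughly how much would survive with "discard untrimmed". This is a rough sketch, not a replacement for cutadapt: the function name `primer_fraction`, the 40-base search window, and the 10,000-read limit are my own choices, and it assumes standard four-line FASTQ records.

```python
import gzip
import re

# IUPAC nucleotide codes expanded to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def primer_fraction(fastq_path, primer, max_reads=10000, window=40):
    """Fraction of the first max_reads reads with the primer near the 5' end.

    Matching is exact apart from degenerate bases, so this slightly
    undercounts compared with cutadapt, which tolerates some mismatches.
    The window should be longer than barcode plus primer.
    """
    pattern = re.compile("".join(IUPAC[base] for base in primer.upper()))
    opener = gzip.open if str(fastq_path).endswith(".gz") else open
    total = hits = 0
    with opener(fastq_path, "rt") as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 != 1:  # sequence is the 2nd line of each record
                continue
            total += 1
            if pattern.search(line[:window].upper()):
                hits += 1
            if total >= max_reads:
                break
    return hits / total if total else 0.0
```

A value close to 1 on the raw forward reads and close to 0 on the clean ones would suggest that the biological primers really were removed from the clean data.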

That sounds like the path I would choose as well!
Good luck,

Thanks so much!

I would suggest removing primers with cutadapt: it should also remove any sequence preceding the primer, including barcodes.

Thanks, I didn't understand that cutadapt also removed sequences before the primers.

Now to try and train a classifier for the V3-V4 region!
