Starting QIIME 2 with lists of ASVs per sample

I got metabarcoding sequencing samples from Novogene; they filtered and denoised all the data and sent us lists of ASVs in FASTA files, one for each sample. I am struggling with how to start analyzing them in QIIME 2. I have already merged the files and prepared a .qza of the ASVs. Please guide me.
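
For reference, this is roughly how I imported the merged FASTA into a .qza (a minimal sketch; the file names are placeholders):

    # Import a merged FASTA of ASV sequences as a FeatureData[Sequence] artifact.
    # merged-asvs.fasta is a placeholder for the combined per-sample ASV file.
    qiime tools import \
      --type 'FeatureData[Sequence]' \
      --input-path merged-asvs.fasta \
      --output-path asv-sequences.qza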

Hi @momay,

I've not worked with Novogene before, but my recommendation is that if you're not sure about their bioinformatics, you should get your raw data. (My recommendation is always to get the raw data whenever possible.) That way, you can process it yourself and know exactly what was done to generate your results.

Best,
Justine

2 Likes

I understand the data processing; the process is pretty clear.
Now I have a big list of ASVs. To my understanding, I need to generate a feature table, which I am struggling with. Is there any way QIIME 2 can generate a feature table from an ASV list, or does it need raw data?

Hi @momay,

Before I jump in, could you clarify the following:

  1. What kind of file did they send as your per-sample files? FASTA or FASTQ?
  2. What algorithm did they use to generate the per-sample files? (DADA2? UNOISE? Deblur? Something else?)
  3. Were the files split before or after denoising?
  4. If they were split after denoising, which is usually done after dereplication, how did they map the dereplicated sequences back to the samples?

There may be a way to proceed, but it very much requires more details.

Best,
Justine

4 Likes

Hello!
I completely agree with @jwdebelius that it is better to work with raw data. The pipelines that sequencing centers use are designed to handle a broad range of datasets, and are therefore probably not as well tuned as they could be for any particular dataset.

In my experience, Novogene sends multiplexed raw data. Perhaps you ordered an additional paid service for initial processing, while I use the cheaper option, or it is somehow region specific.

In any case, I would write to them and specifically ask for the raw data. If it is impossible to get, then you need to carefully inspect their report in order to create the QIIME 2 files. If there is no report or it is incomplete, request additional information from them.

Best,

4 Likes

We sequenced with Novogene for the first time in our lab a month ago. We received both raw and clean FASTQ files, and I always start from the raw data for the reasons @jwdebelius and @timanix point out.

In your case, @momay, I assume you used the Amplicon service. Reading the service flyer, it is still not clear to me whether raw data is delivered. In our experience, Novogene gave us a 30-day grace period before deleting the data, so I would reach out to them and ask for it.

Sergio

3 Likes

Thank you, everyone. I am now working with the raw reads.

1 Like

Thank you for the reply. Since you know Novogene data, I will ask a follow-up question.
Our samples were sequenced PE250; if we remove the primers and barcodes, that leaves reads ~220 bp long. Our amplicon size is 450 bp (16S V3-V4). The sequence quality is very good. I am confused about how to merge the reads.

1 Like

Hi @momay,

Glad you figured it out!

I'd recommend using DADA2 with paired-end denoising. You can find an example of how the command is structured here. DADA2 will take care of quality filtering, read joining, denoising, and chimera filtering.
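
A minimal sketch of how that command might look (demux.qza and the truncation lengths are placeholders; pick the lengths from your quality plots):

    # Paired-end denoising with DADA2: quality filtering, read joining,
    # denoising, and chimera removal in one step.
    qiime dada2 denoise-paired \
      --i-demultiplexed-seqs demux.qza \
      --p-trunc-len-f 220 \
      --p-trunc-len-r 220 \
      --o-table table.qza \
      --o-representative-sequences rep-seqs.qza \
      --o-denoising-stats denoising-stats.qza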

Best,
Justine

1 Like

Hey,
I am trying to use DADA2, but as I understand it, it needs overlap between the reads (a minimum of ~12 bp). The reads I have are about 220 bp and the amplicon size is 450 bp. How are these going to overlap? Am I misunderstanding? The read quality looks really good; the worst read has a mean quality score of 37.
Please advise.

Hi @momay,

It may not. I'd try trimming your primers with cutadapt first (the barcode should be upstream of the primers) and then looking at your read summary. Based on that, you can make decisions about the overlap and whether you can join your paired-end reads: two ~220 bp reads give at most ~440 bp of combined sequence, which is shorter than your 450 bp amplicon.
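
A minimal sketch of that trimming step, followed by a summary to check the resulting read lengths (the primer sequences shown are the common 341F/805R V3-V4 pair and are an assumption; substitute your own):

    # Remove primers from paired-end reads; discard reads where no primer is found.
    # The 341F/805R primer sequences below are assumptions; use your actual primers.
    qiime cutadapt trim-paired \
      --i-demultiplexed-sequences demux.qza \
      --p-front-f CCTACGGGNGGCWGCAG \
      --p-front-r GACTACHVGGGTATCTAATCC \
      --p-discard-untrimmed \
      --o-trimmed-sequences trimmed.qza

    # Summarize read lengths and quality after trimming.
    qiime demux summarize \
      --i-data trimmed.qza \
      --o-visualization trimmed.qzv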

There are cases where it's not possible and people just work with the forward reads. I'll ask the others (@timanix and @salias) who work with Novogene data, and maybe they can qiime in!

Best,
Justine

2 Likes

Hello!
Usually I don't have any issues with Novogene data, but I work with either the V1-V2 or V4 region. We avoid V3-V4 because of the overlap problem: there are three main groups of bacteria based on the length of the V3-V4 region, so I am afraid that most studies using that region are biased and capture mostly the two groups with shorter amplicons (420-430 bp). Even when sequenced with Illumina 2 × 300, the quality at the read ends is usually so bad that one needs to trim them anyway.
As @jwdebelius suggested, I would remove the primers first with cutadapt. Barcodes, if not already removed, should be located before the primer, so they will also be removed along with the primer sequence. Then I would carefully assess the read lengths and calculate whether it is possible to merge the reads. There is an option to decrease the minimum overlap in the DADA2 settings, so it is not necessarily 12 bp; see the sketch below.
If the reads are too short, the workaround would be to use only the forward or reverse reads as single-end data.
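
A minimal sketch of both options (file names and parameter values are placeholders):

    # Option 1: paired-end denoising with a lowered minimum overlap.
    # --p-min-overlap defaults to 12; reducing it is a trade-off, since
    # very short overlaps make read merging less reliable.
    qiime dada2 denoise-paired \
      --i-demultiplexed-seqs trimmed.qza \
      --p-trunc-len-f 0 \
      --p-trunc-len-r 0 \
      --p-min-overlap 8 \
      --o-table table.qza \
      --o-representative-sequences rep-seqs.qza \
      --o-denoising-stats stats.qza

    # Option 2: use only the forward reads as single-end data.
    qiime dada2 denoise-single \
      --i-demultiplexed-seqs trimmed.qza \
      --p-trunc-len 220 \
      --o-table table-se.qza \
      --o-representative-sequences rep-seqs-se.qza \
      --o-denoising-stats stats-se.qza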

2 Likes

Thank you so much for all of your comments.
Which SILVA database do you recommend for the V3-V4 region?

Hello!
Sorry for the late reply - somehow I didn't get the notification.
I can suggest SILVA 138.2 for taxonomy classification. You can either use the classifier from the Resources page (a full-length classifier), or train your own based on the primers you used (the best option, but it requires extra work); a sketch is below. For the latter there is the great RESCRIPt plugin and its tutorials.
Other choices would be the Greengenes2 or GTDB databases.
I have had positive experiences with all three databases.
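
A minimal sketch of training a primer-specific classifier from the full-length SILVA reference files (the file names are placeholders, and the V3-V4 primer sequences are assumptions; use your own):

    # Extract the region amplified by your primers from the full-length
    # SILVA reference sequences.
    qiime feature-classifier extract-reads \
      --i-sequences silva-138.2-seqs.qza \
      --p-f-primer CCTACGGGNGGCWGCAG \
      --p-r-primer GACTACHVGGGTATCTAATCC \
      --o-reads silva-v3v4-seqs.qza

    # Train a naive Bayes classifier on the extracted region.
    qiime feature-classifier fit-classifier-naive-bayes \
      --i-reference-reads silva-v3v4-seqs.qza \
      --i-reference-taxonomy silva-138.2-tax.qza \
      --o-classifier silva-v3v4-classifier.qza

    # Classify your representative ASV sequences.
    qiime feature-classifier classify-sklearn \
      --i-classifier silva-v3v4-classifier.qza \
      --i-reads rep-seqs.qza \
      --o-classification taxonomy.qza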

Best,

3 Likes