I searched for this topic and couldn't find it, so I hope it is not a duplicate.
I have the following question:
I'm downloading microbiome data from NCBI (SRR accessions). I understand (from the Moving Pictures tutorial) that NCBI data generated by Illumina sequencing includes adapters and primers in the downloaded reads.
I understand that fastp removes those adapters and primers, but I would like to know whether QIIME 2 also does this when running DADA2, whether I need another QIIME 2 command for it, or whether it is not supported?
Yes, sequencing reads often include non-biological sequence, so removing it is a good idea.
You can use the QIIME 2 cutadapt plugin for this.
Here is the twist: amplicon reads often include only the hypervariable region, with no adapters or primers at all!
How is this possible?
Well, we can reuse the PCR primers as sequencing primers, as popularized by the Earth Microbiome Project (EMP) protocol. In this case, sequencing starts right after the primer, so the read contains only the amplified region and we can run DADA2 directly on the raw reads.
I read about cutadapt but, if I understood correctly, it needs the user to supply the exact sequence to cut. What if I don't have that information?
Is there a way to automatically detect that kind of sequence and remove it?
There are tools, such as fastp and FastQC, that can detect such adapters for you. They usually show up as "overrepresented sequences" in the report. Neither of these tools is available through QIIME 2 at the moment, but there are plans to add fastp as a plugin in the next release.
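To give a feel for what "overrepresented" means here, this is a minimal Python sketch that counts how often each k-mer appears at the start of the reads. It is only a toy illustration, not how fastp or FastQC work internally. The function name, the thresholds, and the example reads are all made up for the example.

```python
from collections import Counter


def overrepresented_prefixes(reads, k=8, min_fraction=0.5):
    """Return (prefix, fraction) pairs for k-mers that begin at least
    `min_fraction` of the reads. Hypothetical helper for illustration only."""
    prefixes = Counter(read[:k] for read in reads if len(read) >= k)
    total = sum(prefixes.values())
    if total == 0:
        return []
    return [
        (prefix, count / total)
        for prefix, count in prefixes.most_common()
        if count / total >= min_fraction
    ]


# Toy reads (invented): three begin with the same 8 bases, one does not.
reads = [
    "GTGCCAGCAGCCGCGGTAATACGGAGG",
    "GTGCCAGCCGCCGCGGTAATACGTAGG",
    "GTGCCAGCAGCCGCGGTAAGACGAAGG",
    "TACGGAGGGTGCAAGCGTTAATCGGAA",
]
hits = overrepresented_prefixes(reads, k=8, min_fraction=0.5)
print(hits)
```

If one prefix dominates the start of your reads, that is a hint that a primer or adapter is still present. Keep in mind that degenerate primers will split across several prefixes once k reaches the first ambiguous base.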
I'd like to add that all you need is the PCR primer sequence. You do not need the adapter or other sequences. This is because much of the adapter sequence lies upstream of the PCR primer sequence. Thus, if you find and remove the primer with cutadapt, any sequence upstream of that primer (i.e. towards the 5' end) will be removed along with it.
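As a rough illustration of that idea, here is a small Python sketch that finds a primer in a read and drops it together with everything 5' of it. It assumes exact matching with IUPAC ambiguity codes only, whereas cutadapt also tolerates a configurable number of mismatches. The helper names and the upstream "adapter" stretch are invented for the example; the primer shown is the commonly used 515F sequence, which may not be the one used in your dataset.

```python
import re

# IUPAC nucleotide codes expanded to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def primer_regex(primer):
    """Compile a primer with ambiguity codes into a regular expression."""
    return re.compile("".join(IUPAC[base] for base in primer.upper()))


def trim_front(read, primer):
    """Remove the first primer match and everything 5' of it.
    Returns None when the primer is not found (hypothetical helper)."""
    match = primer_regex(primer).search(read)
    return read[match.end():] if match else None


PRIMER = "GTGCCAGCMGCCGCGGTAA"      # 515F; an assumption for this example
upstream = "ACACGACGCTCTTCCGATCT"   # invented stretch standing in for adapter sequence
biological = "TACGGAGGGTGCAAGC"     # invented amplicon sequence
read = upstream + "GTGCCAGCAGCCGCGGTAA" + biological

trimmed = trim_front(read, PRIMER)
print(trimmed)
```

Only the primer sequence was supplied, yet the upstream stretch is gone as well, because everything before the end of the match is discarded.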
That sounds great, but I don't see that information on NCBI. For example, the following link is a BioProject with microbiome data. If I go to the metadata, I don't see the primer sequence or anything related to it.
I am assuming there is a manuscript associated with this data, and it should detail the primers used. I'll just assume it is likely V4 or V3V4. Also, depending on the approach they used, there may or may not be primer sequence contained within the read.
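If the manuscript is hard to track down, one quick check is to see what fraction of reads begin with a few candidate primers. Below is a minimal Python sketch of that check. The two candidates are commonly used forward primers for V4 (515F) and V3V4 (341F), listed here only as guesses, and the reads are invented. With real data you would pass in the sequences from your FASTQ file instead.

```python
# IUPAC nucleotide codes mapped to the bases each one allows.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Candidate forward primers; these are assumptions, not taken from the dataset.
CANDIDATES = {
    "515F": "GTGCCAGCMGCCGCGGTAA",
    "341F": "CCTACGGGNGGCWGCAG",
}


def starts_with_primer(read, primer):
    """True if the read begins with the primer, honoring ambiguity codes."""
    if len(read) < len(primer):
        return False
    return all(base in IUPAC[code] for base, code in zip(read, primer))


def primer_fractions(reads, candidates):
    """Fraction of reads that start with each candidate primer."""
    return {
        name: sum(starts_with_primer(r, primer) for r in reads) / len(reads)
        for name, primer in candidates.items()
    }


# Toy reads (invented): three start with 515F, one has no primer.
reads = [
    "GTGCCAGCAGCCGCGGTAATACGGAGGGTGC",
    "GTGCCAGCCGCCGCGGTAATACGTAGGGTGC",
    "GTGCCAGCAGCCGCGGTAAGACGAAGGGTGC",
    "TACGGAGGGTGCAAGCGTTAATCGGAATTAC",
]
fractions = primer_fractions(reads, CANDIDATES)
print(fractions)
```

A fraction close to 1 for one candidate suggests the primer is still in the reads and should be trimmed. Fractions near 0 for all candidates suggest either that the primers are absent or that a different primer set was used.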