Hello everyone,
I'm analysing microbial communities in a fjord and am stuck on the taxonomy assignment of fungi: 25-90% of the reads in my samples remain unassigned. The amplicons were generated with the primers ITS3 and ITS4, and the reads are 301 bp long.
I read in the forum about the read-through problem, so I used cutadapt to remove the reverse complement of the reverse primer from the forward reads and vice versa. (Side question: does cutadapt also find partial matches at the 3' end, i.e. something like sequencesequencePRIM instead of sequencesequencePRIMER? The help only says "partial matches allowed" for the 5' end.) The primers at the start of the reads were removed in the next step with the trim-left parameter in DADA2. I thought this would be easier and safer, as the primers are always the first 20 bases, and misread bases don't affect trim-left while they could cause problems for cutadapt.
After that, I ran DADA2 for denoising. I tried several settings to keep as many reads as possible without truncating so much that species with particularly long ITS regions are lost, and managed to get ~80-90% of the reads through quality filtering, denoising, merging and chimera removal. This was surprising, as I couldn't reach such high percentages with my other dataset of 16S sequences.
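To illustrate the read-through trimming described above: the 3' adapter to look for in each read direction is the reverse complement of the opposite primer. This is only a minimal sketch. It assumes the commonly published ITS3 and ITS4 sequences, and the `revcomp` helper and the file name `ITS_demux.qza` are made up for illustration, so it is not necessarily the exact command that was run here.

```shell
# Reverse-complement a DNA sequence (plain A/C/G/T only, no IUPAC codes).
revcomp() {
  printf '%s\n' "$1" | rev | tr 'ACGT' 'TGCA'
}

# ASSUMPTION: commonly published primer sequences; check them against
# the primers that were actually ordered.
ITS3="GCATCGATGAAGAACGCAGC"   # forward primer
ITS4="TCCTCCGCTTATTGATATGC"   # reverse primer

ITS3_RC=$(revcomp "$ITS3")
ITS4_RC=$(revcomp "$ITS4")

echo "3' adapter expected in forward reads (revcomp of ITS4): $ITS4_RC"
echo "3' adapter expected in reverse reads (revcomp of ITS3): $ITS3_RC"

# Hypothetical q2-cutadapt call using these values (file names are placeholders):
# qiime cutadapt trim-paired \
#   --i-demultiplexed-sequences ITS_demux.qza \
#   --p-adapter-f "$ITS4_RC" \
#   --p-adapter-r "$ITS3_RC" \
#   --o-trimmed-sequences ITS_cutadapt.qza
```

Both primers are 20 bases long, which matches the trim-left value of 20 used later for the 5' ends.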
These were the settings I finally used:
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs ITS_cutadapt.qza \
  --p-chimera-method consensus \
  --p-trim-left-f 20 \
  --p-trim-left-r 20 \
  --p-trunc-len-f 300 \
  --p-trunc-len-r 290 \
  --p-max-ee-f 3 \
  --p-max-ee-r 5 \
  --p-n-threads 0 \
  --o-table FeatureTable_Frequency \
  --o-representative-sequences FeatureData_Sequence \
  --o-denoising-stats SampleData_DADA2Stats
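As a rough sanity check on these settings with respect to long ITS regions: after trimming and truncation, the forward and reverse reads still have to overlap to be merged. The arithmetic below is a sketch only. The minimum overlap of 12 bases is an assumption (it should be checked against the default of the installed DADA2 / q2-dada2 version), and the variable names are made up for illustration.

```shell
# Values taken from the denoise-paired command above.
TRIM_LEFT_F=20
TRIM_LEFT_R=20
TRUNC_LEN_F=300
TRUNC_LEN_R=290

# ASSUMPTION: minimum overlap required for merging read pairs.
MIN_OVERLAP=12

# Truncation is applied to the raw read, then the left trim removes the primer,
# so the retained length is trunc-len minus trim-left.
KEPT_F=$(( TRUNC_LEN_F - TRIM_LEFT_F ))
KEPT_R=$(( TRUNC_LEN_R - TRIM_LEFT_R ))

# Longest primer-free amplicon that can still be merged.
MAX_MERGED=$(( KEPT_F + KEPT_R - MIN_OVERLAP ))

echo "Retained forward length: $KEPT_F"
echo "Retained reverse length: $KEPT_R"
echo "Longest mergeable amplicon (without primers): $MAX_MERGED"
```

Under these assumptions, amplicons longer than about 538 bases (without primers) could not be merged and would drop out at the merging step.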
Then, I tried feature-classifier classify-consensus-vsearch with the dynamic dataset of the latest UNITE database (4h on my computer) and created a taxa barplot:
taxa_barplot.qzv (482.5 KB)
As there were a lot of unassigned sequences, I searched this forum for the problem and found two things to improve: using the untrimmed developer dataset of the database, and using the dataset that also includes non-fungal eukaryotes (12h). This resulted in the following:
taxa_barplot2.qzv (528.0 KB)
So it appears that many of my reads are actually non-target taxa, but there are still many unassigned sequences.
I also now get many sequences that are classified as "k_unidentified". Does it make sense to remove the sequences labelled k_unidentified from the reference database to get more informative results?
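To make the idea concrete, here is a minimal sketch of such a filter on a toy taxonomy table. The three reference entries are invented for illustration and are not real UNITE records. It also assumes a tab-separated file with the ID in column 1 and the lineage in column 2, and that the lineage uses the `k__` rank prefix of the QIIME-formatted UNITE release.

```shell
WORKDIR=$(mktemp -d)

# Toy reference taxonomy (made-up IDs and lineages): ID <TAB> lineage
printf 'ref1\tk__Fungi;p__Ascomycota\n'            >  "$WORKDIR/toy_taxonomy.tsv"
printf 'ref2\tk__unidentified;p__unidentified\n'   >> "$WORKDIR/toy_taxonomy.tsv"
printf 'ref3\tk__Viridiplantae;p__Chlorophyta\n'   >> "$WORKDIR/toy_taxonomy.tsv"

# Keep only rows whose lineage does not start with k__unidentified.
awk -F '\t' '$2 !~ /^k__unidentified/' \
  "$WORKDIR/toy_taxonomy.tsv" > "$WORKDIR/toy_taxonomy_filtered.tsv"

# IDs that survive the filter; the same IDs would also have to be kept
# in the reference sequences so that both files stay in sync.
cut -f1 "$WORKDIR/toy_taxonomy_filtered.tsv" > "$WORKDIR/keep_ids.txt"

cat "$WORKDIR/toy_taxonomy_filtered.tsv"
```

Whether dropping these entries actually gives more informative assignments, rather than just turning "k_unidentified" hits into unassigned ones, is exactly what I am unsure about.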
Do you think I could identify more sequences if I used a trained classifier instead of vsearch? Roughly, how much longer would that take to run?
Blasting some of the unidentified sequences against NCBI gave results like "uncultured fungus", but also some more specific results (both fungi and other kingdoms). I guess, however, that combining the UNITE and NCBI databases as a reference would lead to impractical runtimes.
Could the relatively high error tolerance in DADA2 (max-ee of 3 for forward and 5 for reverse reads) cause these problems?
I also stumbled across ITSxpress, but I'm unsure whether it only makes sense for whole metagenomes rather than metabarcoding. As I understand it, it trims sequences to the "interesting" part of the ITS region, correct? So if the sample DNA was amplified using ITS primers, do I still need this step?
This is the first time I'm analysing metabarcoding data, so I'm still trying to piece together how everything works. But as I got very few unassigned sequences (1-2%) in my other dataset of 16S sequences, I don't think I'm doing it totally wrong.
I hope you can point me in the right direction on what I can try with my ITS data to improve the results.
Please tell me if you need more information (e.g. code I used or result files).
Best Regards
Sonja