Good news @Nicholas_Bokulich :
It works now!
Sorry, it should have occurred to me to stop and look inside the observed feature table before coming here just to paste errors. Not very bioinformatic on my part.
The samples did not match because I assumed that standalone DADA2 would derive the sample name by trimming the FASTQ file name. A strange thing to assume, I know, but until now I have always used DADA2 inside QIIME2, so I have always imported my sequences with a manifest file where I explicitly named the samples that way. Again, my mistake.
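For anyone who hits the same mismatch, here is a minimal sketch of how the sample IDs could be normalized before comparing the observed and expected tables. The file names and the "everything before the first underscore" convention are assumptions for illustration only. This is not what DADA2 does internally, and `sample_id_from_fastq` is a hypothetical helper.

```python
from pathlib import PurePosixPath


def sample_id_from_fastq(path: str) -> str:
    """Return the part of a FASTQ file name before the first underscore.

    Assumes Illumina-style names such as 'mock1_S1_L001_R1_001.fastq.gz'.
    """
    name = PurePosixPath(path).name
    # Strip compression and FASTQ extensions, outermost first.
    for suffix in (".gz", ".fastq", ".fq"):
        if name.endswith(suffix):
            name = name[: -len(suffix)]
    return name.split("_", 1)[0]


# Hypothetical column names from the observed table and IDs from the expected one.
observed = ["mock1_S1_L001_R1_001.fastq.gz", "mock2_S2_L001_R1_001.fastq.gz"]
expected = {"mock1", "mock2"}

renamed = {name: sample_id_from_fastq(name) for name in observed}
missing = expected - set(renamed.values())
print(renamed)
print("expected samples still missing:", missing)
```

If `missing` is empty after renaming, the two tables should share sample IDs.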
What I don't quite understand is the taxonomy issue. I mean, I understand the issue itself (the command is not going to crash because the file format is valid, but I'm not going to get species-level matches). What I don't understand is why the DADA2 taxonomy doesn't include the genus in the species label. I have used the same file (I only have one UNITE FASTA file on this machine), and if I search for the species with grep (which is how I made the expected file: searching, copying and pasting), I see this format:
$ cat sh_general_release_dynamic_s_all_04.04.2024.fasta | grep s__Starmerella_apicola | cut -d "|" -f 5 | sort | uniq
Output (with genus in species):
k__Fungi;p__Ascomycota;c__Saccharomycetes;o__Saccharomycetales;f__Trichomonascaceae;g__Starmerella;s__Starmerella_apicola
Maybe DADA2's assignTaxonomy() function removes the genus part of the species field under the hood. I think that is possible, since DADA2 recognizes that my FASTA file is from UNITE (the function prints this message: "UNITE fungal taxonomic reference detected"). Anyway, I corrected the expected file by removing the genus from the s__ part.
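The correction I made by hand could be sketched like this. It only mirrors my manual edit of the expected file; whether assignTaxonomy() does the same thing internally is my guess, not something I have confirmed. `strip_genus_from_species` is a hypothetical helper name.

```python
def strip_genus_from_species(lineage: str) -> str:
    """Turn '...;g__Genus;s__Genus_epithet' into '...;g__Genus;s__epithet'."""
    ranks = lineage.split(";")
    # Find the genus name, if the lineage has a g__ rank.
    genus = next((r[3:] for r in ranks if r.startswith("g__")), "")
    fixed = []
    for rank in ranks:
        label = rank[3:]
        if rank.startswith("s__") and genus and label.startswith(genus + "_"):
            # Drop the leading 'Genus_' from the species label.
            rank = "s__" + label[len(genus) + 1:]
        fixed.append(rank)
    return ";".join(fixed)


unite_lineage = (
    "k__Fungi;p__Ascomycota;c__Saccharomycetes;o__Saccharomycetales;"
    "f__Trichomonascaceae;g__Starmerella;s__Starmerella_apicola"
)
print(strip_genus_from_species(unite_lineage))
```

Lineages without a genus prefix in the species label are returned unchanged.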
I attach the QZV in case you want to see it. It is from a test run where I used the default parameters: KDIST_CUTOFF = 0.42 and BAND_SIZE = 16. I still need to figure out what the output metrics mean, but I'll study them while the rest of the benchmarking is running. My plan is:
- Read the documentation to figure out whether the parameters should be increased or decreased for ITS, and by how much.
- Select, apart from defaults, 3-4 more values for each one.
- Run all possible combinations.
- Repeat all of the above with the maxEE and truncQ parameters changed from their defaults of 2 to 8, since Rolling et al., 2022 state that those values are better for ITS. I want to take the opportunity to see how those parameters interact with the others.
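As a rough sketch of what "all possible combinations" looks like, the grid could be built as below. Only the defaults (KDIST_CUTOFF = 0.42, BAND_SIZE = 16, maxEE = 2, truncQ = 2) and the alternative value of 8 come from this post. The other values are placeholders I have not chosen yet, and I am assuming here that maxEE and truncQ are changed together.

```python
from itertools import product

# First entry of each list is the default; the rest are placeholder values.
kdist_cutoffs = [0.42, 0.30, 0.50]
band_sizes = [16, 32]

# Filtering settings, changed as a pair: defaults vs. the values of 8.
filter_settings = [
    {"maxEE": 2, "truncQ": 2},
    {"maxEE": 8, "truncQ": 8},
]

# One dictionary per run, covering every combination of the lists above.
grid = [
    {"KDIST_CUTOFF": k, "BAND_SIZE": b, **f}
    for k, b, f in product(kdist_cutoffs, band_sizes, filter_settings)
]

print(len(grid), "runs")
for run in grid[:3]:
    print(run)
```

With 3-4 extra values per parameter the number of runs grows quickly, so it is worth counting them before launching anything.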
Finally, possible limitations of my benchmarking based on the expected composition table (so everyone can take this into account when I finish):
- Where the paper says Candida apicola, I say Starmerella apicola (I think this is not going to be an issue)
- Where the paper says Mortierella verticillata (basionym of Podila verticillata), I say Mortierella sp (neither Mortierella verticillata nor Podila verticillata was found in my UNITE file, but Mortierella at least had a Genus sp entry)
- Where the paper says Neosartorya fischeri, I say its homotypic synonym Aspergillus fischeri (in this case my UNITE file has a Neosartorya sp entry but I think the synonym is more accurate)
When I finish the benchmarking I'll come back to post the results.
Again, thank you so much
evaluation.qzv (367.0 KB)