QIIME2-Compatible Database for AMF Taxonomy Assignment – Issues with SILVA Lineage Depth

Hello everyone,

I am currently working on AMF (Arbuscular Mycorrhizal Fungi) community analysis and I’m facing difficulties in selecting or customizing a suitable taxonomy reference database that works well with QIIME2 or external tools like VSEARCH, particularly for AMF taxa.

What I’ve Tried:

I used the SILVA 138.1 and 138.2 databases (SILVA_138.1_SSURef_NR99_tax_silva_fixed.fasta and SILVA_138.2_SSURef_tax_silva_fixed.fasta) for taxonomy assignment after clustering OTUs with VSEARCH (--usearch_global).

The taxonomy string mapping was done using SILVA_138.1_taxonomy_map.tsv and later SILVA_138.2_taxonomy_fixed.txt.

Issue:

The taxonomy strings from these SILVA versions contain a large number of intermediate taxonomic levels (e.g., Supergroup, Subgroup, Subphylum, Strain tags), often exceeding 15 ranks. This makes it challenging to extract only the standard 7-level hierarchy (Kingdom, Phylum, Class, Order, Family, Genus, Species).

Here is an example of a taxonomy string from SILVA:

Eukaryota;Amorphea;Obazoa;Opisthokonta;Nucletmycea;Fungi;Dikarya;Basidiomycota;Pucciniomycotina;Pucciniomycetes;Pucciniales;Pucciniaceae;Gymnosporangium;Gymnosporangium;ellisii

In this case, Fungi is expected as Kingdom, but it appears mid-string.
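To make the difficulty concrete, here is an ad-hoc sketch of the kind of string surgery this requires. The `collapse` helper and its `DROP` set are purely illustrative (not part of QIIME 2 or SILVA), and the drop list would have to enumerate every intermediate clade in the database, which is exactly why a rank-aware parser is the reproducible option:

```python
# Ad-hoc sketch: collapse a SILVA lineage string to 7 ranks.
# CAVEAT: the DROP set is illustrative, not exhaustive; it would need to
# cover every intermediate clade name that SILVA uses.
DROP = {
    "Eukaryota",                                          # Domain (above Kingdom)
    "Amorphea", "Obazoa", "Opisthokonta", "Nucletmycea",  # supergroup clades
    "Dikarya",                                            # subkingdom
    "Pucciniomycotina",                                   # subphylum
}

def collapse(lineage: str) -> str:
    taxa = [t for t in lineage.split(";") if t not in DROP]
    # SILVA stores the species epithet as a trailing field; rebuild the binomial
    species = " ".join(taxa[-2:])
    return ";".join(taxa[:-2] + [species])

silva = ("Eukaryota;Amorphea;Obazoa;Opisthokonta;Nucletmycea;Fungi;Dikarya;"
         "Basidiomycota;Pucciniomycotina;Pucciniomycetes;Pucciniales;"
         "Pucciniaceae;Gymnosporangium;Gymnosporangium;ellisii")
print(collapse(silva))
# Fungi;Basidiomycota;Pucciniomycetes;Pucciniales;Pucciniaceae;Gymnosporangium;Gymnosporangium ellisii
```

In practice this breaks as soon as a lineage lacks a species epithet or contains an unlisted clade, so it is only a demonstration of the problem, not a solution.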

My Question:

Is there any QIIME2-compatible database specifically curated for AMF (18S) that:

  • Uses the standard 7 taxonomic ranks?
  • Works directly with VSEARCH or the QIIME2 feature-classifier plugin?
  • Has been used effectively in recent AMF metabarcoding studies?

Alternatively, is there a recommended way to simplify or map long SILVA taxonomies to the standard 7 levels in a reproducible manner?

Any help, examples, or references would be greatly appreciated.

Thank you!


Salma
PhD Candidate, ANU

Hi @Salma_Sarker ,
Welcome to the forum!

I am not sure about an AMF-specific database for 18S (though some users have posted about using the MaarjAM database with QIIME 2, so that might be one to check out). But I can answer your second question:

Yes, you can use the RESCRIPt plugin to parse the SILVA taxonomies (and choose whichever ranks you wish; default is the standard 7). See this tutorial for more details:

Good luck!

Hi @Salma_Sarker,
I am actually working with a collaborator right now on a tutorial for an AMF workflow with 18S.

The collaborator uses MaarjAM and SILVA with VSEARCH to taxonomically annotate their reads.

I can circle back here with our tutorial link when it is done, if that's of interest.

Hi Nicholas,
Thank you for pointing out the RESCRIPt plugin! I did try to use it to parse the SILVA_138.1_SSURef_NR99_tax_silva_full_align_trunc.fasta taxonomy and extract only the standard 7 taxonomic ranks. However, I found it a bit difficult to configure correctly. The documentation is helpful, but I'm still figuring out how to tailor it for AMF-specific processing.

Hi @cherman2,

That would be great — I’d really appreciate it if you could share the tutorial once it’s ready! Do you have an estimate for when it might be available?

For my current analysis, I’ve used the AMF-specific primer pair AMV4.5NF (AAGCTCGTAGTTGAATTTCG) and AMDGR (CCCAACTATCCCTATTAATCAT) targeting the 18S rRNA region, so I’m especially interested in any workflows or databases that work well with this region.

Do you think your workflow will work well with the AMV4.5NF / AMDGR primer pair? I'd love to know if it has been tested with those or similar AMF-targeted primers. Looking forward to the tutorial — thanks again for your help!

Hi @Salma_Sarker,

It is far easier to make use of qiime rescript get-silva-data .... If you do not require the sequences, you can simply run the following:

qiime rescript get-silva-data \
    --p-version 138.2 \
    --p-target SSURef_NR99 \
    --p-ranks domain phylum class order family genus \
    --p-no-download-sequences \
    --output-dir ./silva-138.2-output \
    --verbose

The --p-no-download-sequences flag skips downloading the sequences, but still produces an (empty) QZA sequence file. Run qiime rescript get-silva-data --help to check out other options. Note this action essentially wraps the SILVA taxonomy parser.

If you need to modify the taxonomy output checkout qiime rescript edit-taxonomy ....

-Hope this helps!

Hi @SoilRotifer

Thanks so much for the guidance!

Yes, I agree that using qiime rescript get-silva-data is simpler for standard taxonomy-only workflows. However, in my case, I do need the sequence data to assign taxonomy to my experimental sequences (AMF OTUs), so I followed the full workflow @Nicholas_Bokulich previously recommended here:
Processing, filtering, and evaluating SILVA data with RESCRIPt

I’ve successfully formatted the SILVA 138.2 database into QIIME 2 .qza artifacts. The only step I did not include was the differential filtering of sequences by length and taxonomy (as outlined in the “Filtering sequences by length and taxonomy” section), because I wasn’t entirely sure if that would impact downstream AMF assignments. I understand the reasoning behind it — avoiding selection bias by preserving shorter Eukaryotic sequences — and I may revisit that step if needed.

Here’s what I’ve done so far:

  • Generated the reference files:
    • silva-138.2-ssu-nr99-seqs-derep-uniq.qza
    • silva-138.2-ssu-nr99-tax-derep-uniq.qza
  • Ran taxonomy assignment using the VSEARCH method (still running)
qiime feature-classifier classify-consensus-vsearch \
  --i-query amf_otu_sequences_97.qza \
  --i-reference-reads silva-138.2-ssu-nr99-seqs-derep-uniq.qza \
  --i-reference-taxonomy silva-138.2-ssu-nr99-tax-derep-uniq.qza \
  --p-perc-identity 0.90 \
  --p-maxaccepts 1 \
  --p-threads 4 \
  --o-classification amf_qiime2taxonomy.qza \
  --o-search-results search_qiime2results.qza
  • Trained a Naive Bayes classifier on the full-length reference sequences:
qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads silva-138.2-ssu-nr99-seqs-derep-uniq.qza \
  --i-reference-taxonomy silva-138.2-ssu-nr99-tax-derep-uniq.qza \
  --o-classifier silva-138.2-ssu-nr99-classifier.qza
  • Then, I attempted taxonomy assignment using this classifier
qiime feature-classifier classify-sklearn \
  --i-classifier silva-138.2-ssu-nr99-classifier.qza \
  --i-reads amf_otu_sequences_97.qza \
  --p-n-jobs 4 \
  --o-classification amf_qiime2_classifier_taxonomy.qza

However, this command failed immediately without running — no job was submitted or processed. I suspect the issue may be related to a mismatch between the classifier (trained on full-length sequences) and my OTU input (derived from the AMV4.5NF and AMDGR primers targeting a specific 18S region). But I'm not entirely sure.

That brings me to my main question:
Would you recommend extracting reads using the primer pair AMV4.5NF (AAGCTCGTAGTTGAATTTCG) and AMDGR (CCCAACTATCCCTATTAATCAT) with qiime feature-classifier extract-reads, and then training an amplicon-specific classifier for this 18S region? Or is the full-length classifier still appropriate in this case?

Thanks again for all your help — the RESCRIPt resources and your input have been incredibly helpful.

Hi @Salma_Sarker ,
Just to make sure we are on the same page: get-silva-data will download and format both the sequences and the taxonomy. The --p-no-download-sequences option will allow you to skip downloading the sequences if you only want the taxonomy, but it's clear that's not the case for you. So in either case I would recommend get-silva-data as a first step. (it sounds like that's what you are doing, but just so others who might read this topic are aware that this action does both).

Now, on to the issue you have now:

So no error was reported? Please share the full traceback/error log if you have one. Without one, I would bet that this is almost certainly a memory error, because that is the most frequent issue with training and classifying with NB classifiers. It depends on the size and complexity of the database; many users report that the full-length SILVA 16S classifier requires between 16 and 32 GB of RAM, and I have seen some report up to 64 GB. For 18S I am not sure, but it is probably similar to 16S. Now I note this:

Reduce this to 1 job. Running classify-sklearn with multiple parallel jobs will dramatically increase the memory demand: in the current implementation, N instances of the classifier are loaded into memory for separate querying, where N = the number of jobs. (This is perhaps one reason why users report different memory demands, as some might be running parallel jobs.)

No, that should be fine. Even if you had a full mismatch (like classifying ITS with an 18S classifier) the job would still run and you would just get strange results or unclassified sequences.

Full-length is fine, but it comes with a higher memory and runtime cost. Extracting reads is worth a shot: it will reduce the size of the classifier significantly, reducing memory demand and runtime. It does run the risk of losing reference sequences that are only fragments or are otherwise missing the primer region, e.g., if it was already trimmed. So if your primers target the terminal ends of the 18S (e.g., similar to the common 27F/1492R 16S primers that sit close to the termini), they might be missing in some of the reference sequences. If they are internal it should be less of an issue, but this can also be checked (see if you are losing many AMF sequences after extraction).
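One way to run that check outside QIIME 2 is a rough pass over the (exported) reference sequences. This is only a sketch under strong assumptions: exact matching only, no degenerate-base or mismatch handling (extract-reads itself tolerates mismatches), and the `would_amplify`/`revcomp` helpers are hypothetical, not QIIME 2 functions:

```python
# Rough pre-check (sketch): would a reference sequence retain both
# primer sites after read extraction? Exact matches only.
def revcomp(seq: str) -> str:
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[b] for b in reversed(seq))

F_PRIMER = "AAGCTCGTAGTTGAATTTCG"                # AMV4.5NF (forward)
R_PRIMER_RC = revcomp("CCCAACTATCCCTATTAATCAT")  # AMDGR, reverse-complemented

def would_amplify(seq: str) -> bool:
    # Both primer sites must be present, forward site before reverse site
    f = seq.find(F_PRIMER)
    r = seq.find(R_PRIMER_RC)
    return f != -1 and r > f

toy = "GG" + F_PRIMER + "ACGTACGT" + R_PRIMER_RC + "GG"
print(would_amplify(toy), would_amplify("ACGTACGT"))  # True False
```

Counting `would_amplify` hits among reference sequences whose taxonomy contains, say, Glomeromycetes would show whether many AMF references lack the primer region.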

RESCRIPt also has the action trim-alignment that can be used to extract regions based on position in an alignment, including the position of a primer in an alignment; you could use this if you want to trim to the subregion of the 18S that your primer pair amplifies, but keep all sequences (including those that do not actually amplify with these primers). This will take more time to prepare but could be an alternative to consider if extract-reads leads to too many lost reads.

Hi @Nicholas_Bokulich,

Thank you so much — the job finally ran successfully using a single core (--p-n-jobs 1), and it completed within just 3 minutes.

As you mentioned:

“Running classify-sklearn with multiple parallel jobs will dramatically influence the memory demand...”

Reducing to one job resolved the memory issue.

However, I’ve encountered a new problem. It seems that none of the sequences were assigned to fungal taxa. Most are either classified as Bacteria, Archaea, or remain Unassigned. No AMF or fungal lineages appear in the output.

I'm currently using the SILVA 138.2 classifier (silva-138.2-ssu-nr99-classifier.qza), but my sequences were generated using the AMV4.5NF and AMDGR primers, which specifically target the 18S rRNA region of arbuscular mycorrhizal fungi (AMF).

Could this misclassification be due to using a general-purpose classifier that doesn’t cover AMF well? Would using a classifier trained on the MaarjAM 18S rDNA database be more appropriate in this case?

Thank you again for your guidance!

Hi @Salma_Sarker ,

No, if anything general purpose is a good thing (to avoid false-positive hits). You can check your taxonomy to be sure, but I am looking at a SILVA 138.2 taxonomy that I have locally and I am seeing, e.g., 433 entries for c__Glomeromycetes

Check your taxonomy to make sure. You can use qiime metadata tabulate to create a searchable visualization from a taxonomy artifact (or other metadata-transformable artifact). So this can be useful for searching both your reference taxonomy, and your classifications.
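As an alternative to a visualization, a FeatureData[Taxonomy] artifact can be exported to a tab-separated taxonomy.tsv (e.g., with qiime tools export) and searched with a few lines of Python. Sketch only; the `count_matches` helpers are hypothetical and assume the standard exported column name `Taxon`:

```python
# Sketch: search an exported taxonomy.tsv for a clade name.
# Assumes the standard tab-separated export with a "Taxon" column.
import csv

def count_matches(rows, term: str) -> int:
    """Count rows whose Taxon string contains `term` (case-insensitive)."""
    return sum(term.lower() in row["Taxon"].lower() for row in rows)

def count_matches_in_file(tsv_path: str, term: str) -> int:
    with open(tsv_path, newline="") as fh:
        return count_matches(csv.DictReader(fh, delimiter="\t"), term)

# e.g. count_matches_in_file("taxonomy.tsv", "Glomeromycetes")
```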

Also check your primers — maybe they are not so specific? But I don't think that's the problem here, the snippet of classification results you shared looks rather suspicious.

Another possibility is issues with read orientation. q2-feature-classifier automatically infers the read orientation by checking the first few sequences in your query set; but this can lead to errors if there are some noisy sequences or if the reads are in mixed orientations. You can set the read orientation explicitly to prevent auto-detection.
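A crude way to gauge orientation before classifying is to count primer hits in each direction. Sketch only, under the assumptions that the AMV4.5NF primer is still present in the reads (i.e., not already trimmed) and that exact matching suffices; `orientation_counts` and `revcomp` are hypothetical helpers, not QIIME 2 functions:

```python
# Sketch: detect mixed read orientations by counting forward-primer
# matches in both directions across a set of reads.
def revcomp(seq: str) -> str:
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[b] for b in reversed(seq))

F_PRIMER = "AAGCTCGTAGTTGAATTTCG"  # AMV4.5NF

def orientation_counts(reads):
    """Return (forward-oriented, reverse-oriented) read counts by primer match."""
    fwd = sum(F_PRIMER in r for r in reads)
    rev = sum(revcomp(F_PRIMER) in r for r in reads)
    return fwd, rev
```

If both counts are substantial, fixing the orientation explicitly (rather than relying on auto-detection) is likely worthwhile.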

The other possibility is that something went wrong upstream, e.g., during denoising. 3 minutes of classification time is rather fast, so I assume you have very few ASVs, which is also a little unusual (though this could be because you are using rather specific targeted primers so the effective diversity could just be that low). And I notice that your ASV IDs are arbitrary numbers, not the hash IDs typically used by different QIIME 2 plugins. I suggest checking your upstream steps and quality control to make sure nothing went wrong upstream.

Good luck!

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.