V3-V4 silva classifier

I’m training a Naive Bayes classifier for V3–V4 (341F/806R) 16S data. I’m using the QIIME2 pre-formatted SILVA 138 SSURef NR99 full-length references (from the Data resources page of the QIIME 2 2024.10.1 documentation).

My question: Since these references are already processed and QIIME2-compatible, do I still need to run additional RESCRIPt steps like cull-seqs, filter-seqs-length-by-taxon, and dereplicate before training, or is it sufficient to directly extract the 341F/806R region from the pre-formatted full-length sequences and then train the classifier?

Concretely, is the following workflow appropriate (extract V3–V4 region first, then train), without additional RESCRIPt cleaning?

qiime feature-classifier extract-reads \
  --i-sequences silva-138-ssuref-nr99-seqs.qza \
  --p-f-primer CCTAYGGGDBGCWGCAG \
  --p-r-primer GACTACNVGGGTMTCTAATCC \
  --p-read-orientation both \
  --p-min-length 300 \
  --p-max-length 600 \
  --o-reads ref-seqs-341f-806r.qza

If additional preprocessing is recommended even when starting from the pre-formatted references, could you clarify which steps are still necessary (if any) and why?

Thanks a lot!

Hello @Wenxuan_Dong

Welcome to the forums! :qiime2:

Good question!

These are a .fasta file and a taxonomy.txt file imported into QIIME 2 archives, and that's it. No other processing has been performed, so yes, using those RESCRIPt tools is a good idea!

If you are looking for databases that are pretrained and ready to use with the QIIME 2 feature classifier, you can check out the new Data Resources page.

@Wenxuan_Dong,

If you download those files, you can look at the provenance information by dragging and dropping them onto QIIME 2 View.

You will see that these files have had some basic filtering performed:

  • cull-seqs on the full length data
  • filter-seqs-length-by-taxon
  • dereplication

I will say that if you are interested in making an amplicon-region-specific classifier, you probably do not need to run filter-seqs-length-by-taxon. Just dereplicate, extract the region, then cull / dereplicate those sequences. See this example post.
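For illustration, the order described above (extract the region, cull, dereplicate, then train) could be sketched as the shell function below. This is only a sketch under assumptions: the output file names, the `run` helper, and the `DRY_RUN` switch are hypothetical names introduced here, the primers and length limits are copied from the question, and all RESCRIPt parameters are left at their defaults rather than tuned. It also assumes the full-length input was already dereplicated, as the provenance listed above suggests.

```shell
# Sketch only: file names, run(), and DRY_RUN are illustrative assumptions.

# Print each command instead of running it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Usage: build_v3v4_classifier <full-length-seqs.qza> <taxonomy.qza>
build_v3v4_classifier() {
  local seqs="$1" tax="$2"

  # 1. Extract the 341F/806R amplicon region from the full-length references.
  run qiime feature-classifier extract-reads \
    --i-sequences "$seqs" \
    --p-f-primer CCTAYGGGDBGCWGCAG \
    --p-r-primer GACTACNVGGGTMTCTAATCC \
    --p-min-length 300 \
    --p-max-length 600 \
    --o-reads v3v4-seqs.qza || return 1

  # 2. Cull low-quality extracted sequences (default thresholds).
  run qiime rescript cull-seqs \
    --i-sequences v3v4-seqs.qza \
    --o-clean-sequences v3v4-seqs-culled.qza || return 1

  # 3. Dereplicate the extracted region together with its taxonomy.
  run qiime rescript dereplicate \
    --i-sequences v3v4-seqs-culled.qza \
    --i-taxa "$tax" \
    --p-mode uniq \
    --o-dereplicated-sequences v3v4-seqs-derep.qza \
    --o-dereplicated-taxa v3v4-tax-derep.qza || return 1

  # 4. Train the Naive Bayes classifier on the region-specific references.
  run qiime feature-classifier fit-classifier-naive-bayes \
    --i-reference-reads v3v4-seqs-derep.qza \
    --i-reference-taxonomy v3v4-tax-derep.qza \
    --o-classifier v3v4-classifier.qza
}
```

Calling the function with DRY_RUN=1 set prints the four commands so they can be reviewed before anything is run. Whether these steps and settings suit a given dataset is something to benchmark rather than assume.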

@Wenxuan_Dong the SILVA team is providing pre-calculated QIIME 2 classifiers for 138.2 for various regions and habitats. More information can be found here. Unfortunately, we have not yet created classifiers for the V3V4 region. But we had a request through our helpdesk for the same region and will start creating the classifiers later this week. Since we create not only the uniform but also the weighted classifiers, it might take a while until our pipeline is done. We’ll let you know when the classifiers are available for download.

Hi @SoilRotifer and @colinbrislawn

thanks again for the clarifications!

I checked the artifacts in QIIME2 View, but I want to make sure I’m interpreting the provenance correctly.

When I drag silva-138-99-seqs.qza into QIIME2 View and navigate to the Provenance tab, I do not see any explicit RESCRIPt actions (e.g., rescript cull-seqs, filter-seqs-length-by-taxon, or dereplicate). The provenance appears minimal and does not clearly indicate whether those steps were applied. Please see below:

[Screenshot: QIIME 2 View Provenance tab for silva-138-99-seqs.qza]

This made me wonder whether there may be differences between:

  • SILVA resources that are simply imported into QIIME2 format, versus

  • SILVA resources that are fully processed with RESCRIPt and then distributed

Could you help clarify:

  1. For the current SILVA 138 SSURef NR99 full-length .qza files on the QIIME2 Data Resources page, should we assume RESCRIPt preprocessing has already been applied only if it appears explicitly in provenance?

  2. If provenance does not show those RESCRIPt steps, is the recommended approach to treat these as minimally processed imports and run cull-seqs, filter-seqs-length-by-taxon, and dereplicate ourselves before training a region-specific classifier?

I just want to avoid either under-processing or accidentally duplicating filtering steps.

Thanks very much for your help!

Yes! That is the difference.

Yes! If it's not listed in the provenance, then it was not run on that file!

Yes. You will have to do the benchmark yourself to see what works best for your data.


This is why data provenance is built into QIIME 2 artifacts! It's a record of all edits made to a file using QIIME 2, so you can check what was run and what settings were used.

@Wenxuan_Dong If you are referring specifically to silva-138-99-seqs.qza, then yes, there has been some minimal processing done, as highlighted by the provenance. See the screenshot below:

If you click on one of the items in the graph, you'll see information on the right. For example, for the item selected in the graph, you can see that the plugin q2-rescript was used to run cull-seqs.

Otherwise, you are correct: if you do not see anything pointing to RESCRIPt functions, then RESCRIPt was not used.
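As a rough command-line cross-check of what QIIME 2 View shows, the sketch below counts the plugin and action names recorded in an artifact's provenance. It rests on assumptions rather than a documented interface: that the .qza is a zip archive holding one action.yaml per provenance step, and that each of those files has an indented "plugin:" line followed by an indented "action:" line. The function name summarize_actions is made up for this example.

```shell
# Sketch only: assumes provenance action.yaml files contain indented
# "plugin:" and "action:" lines, with "plugin:" appearing first.
# Reads YAML on stdin and prints a count per "plugin: action" pair.
summarize_actions() {
  awk '
    /^[[:space:]]+plugin:/ {
      # e.g. plugin: !ref "environment:plugins:rescript" -> rescript
      n = split($NF, parts, ":")
      plugin = parts[n]
      gsub(/[^A-Za-z0-9_-]/, "", plugin)
    }
    /^[[:space:]]+action:/ {
      print plugin ": " $NF
      plugin = ""
    }
  ' | sort | uniq -c
}

# Possible usage (not run here):
#   unzip -p silva-138-99-seqs.qza '*/provenance/*action.yaml' | summarize_actions
```

If a line such as "rescript: cull_seqs" shows up in the output, that step was recorded for the artifact or one of its ancestors. Import steps would not be listed by this sketch, since it only reports entries that carry both a plugin and an action name.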
