Processing, filtering, and evaluating the SILVA database (and other reference sequence data) with RESCRIPt

This is just perfect. Thank you all!

A couple of quick observations, @SoilRotifer @Nicholas_Bokulich, as I'm following this tutorial for COI work. The dereplication code block needs a couple of tiny adjustments:

As it currently is shown:

qiime rescript dereplicate \
    --i-sequences silva-138-ssu-nr99-seqs-filt.qza \
    --i-taxa silva-138-ssu-nr99-tax.qza \
    --i--p-perc-identity 97 \
...

It's just a typo in the last line of the code block above.

Replace --i--p-perc-identity with --p-perc-identity like this:

qiime rescript dereplicate \
    --i-sequences silva-138-ssu-nr99-seqs-filt.qza \
    --i-taxa silva-138-ssu-nr99-tax.qza \
    --p-perc-identity 97 \
...

Along the same lines, --p-perc-identity now expects a value between 0 and 1 instead of an integer from 0 to 100. So the code I'd actually need to run is:

qiime rescript dereplicate \
    --i-sequences silva-138-ssu-nr99-seqs-filt.qza \
    --i-taxa silva-138-ssu-nr99-tax.qza \
    --p-perc-identity 0.97 \
...
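
For anyone scripting around this change, here is a minimal sketch of converting an old-style percentage into the fraction that --p-perc-identity now expects. The helper name normalize_perc_identity is hypothetical and not part of RESCRIPt. It assumes that anything in (0, 1] is already a fraction and anything in (1, 100] is a percentage.

```python
def normalize_perc_identity(value):
    """Return a percent identity as a fraction in (0, 1].

    Hypothetical helper, not part of RESCRIPt or QIIME 2.
    Assumption: values in (0, 1] are already fractions, and
    values in (1, 100] are percentages that need dividing by 100.
    """
    value = float(value)
    if 0 < value <= 1:
        return value
    if 1 < value <= 100:
        return value / 100
    raise ValueError(f"perc-identity out of range: {value}")


print(normalize_perc_identity(97))    # 0.97
print(normalize_perc_identity(0.97))  # 0.97
```

The returned value is what you would pass to --p-perc-identity on the command line.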

Thanks for the great tool - more to come on my end soon with the COI tests.

Nice catch @devonorourke!

5 posts were split to a new topic: Importing BOLD COI sequence data into QIIME 2 / RESCRIPt

Update: if working with 1.8 million COI sequences, be prepared to wait about 12 days...
:sleeping:

A post was split to a new topic: RESCRIPt get-ncbi-data timeout error

Hello, great job.
Did you distinguish between 'Consensus' and 'Majority' taxonomies, as described in the QIIME 2-formatted SILVA 132 release notes?

Best regards,

Hi @arwqiime,

If you run qiime rescript dereplicate --help, you can read the descriptions of how each of the dereplication modes works on the sequences and taxonomy.

--p-mode TEXT Choices('uniq', 'lca', 'majority', 'super')
                          How to handle dereplication when sequences map to
                          distinct taxonomies. "uniq" will retain all
                          sequences with unique taxonomic affiliations. "lca"
                          will find the least common ancestor among all taxa
                          sharing a sequence. "majority" will find the most
                          common taxonomic label associated with that
                          sequence; note that in the event of a tie,
                          "majority" will pick the winner arbitrarily. "super"
                          finds the LCA consensus while giving preference to
                          majority labels and collapsing substrings into
                          superstrings. For example, when a more specific
                          taxonomy does not contradict a less specific
                          taxonomy, the more specific is chosen. That is,
                          "g__Faecalibacterium; s__prausnitzii", will be
                          preferred over "g__Faecalibacterium; s__"

For pre-made classifiers we used the uniq setting.
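
To make the difference between the modes concrete, here is a toy sketch of how "uniq", "lca", and "majority" could resolve the labels attached to one set of identical sequences. This is an illustration written from the help text above, not RESCRIPt's actual implementation. The function names are hypothetical, ranks are assumed to be separated by "; ", and the "super" mode is left out.

```python
from collections import Counter


def uniq_labels(taxonomies):
    """'uniq'-style: keep every distinct taxonomy label."""
    return sorted(set(taxonomies))


def lca_label(taxonomies):
    """'lca'-style: keep only the leading ranks shared by all labels."""
    split = [t.split("; ") for t in taxonomies]
    shared = []
    for ranks in zip(*split):
        if len(set(ranks)) != 1:
            break  # first rank where the labels disagree
        shared.append(ranks[0])
    return "; ".join(shared)


def majority_label(taxonomies):
    """'majority'-style: most common label; ties are broken arbitrarily
    (here, by whichever label was seen first)."""
    return Counter(taxonomies).most_common(1)[0][0]


# Toy labels for three identical sequences (made-up example data).
labels = [
    "g__Faecalibacterium; s__prausnitzii",
    "g__Faecalibacterium; s__prausnitzii",
    "g__Faecalibacterium; s__",
]

print(uniq_labels(labels))     # both distinct labels are kept
print(lca_label(labels))       # g__Faecalibacterium
print(majority_label(labels))  # g__Faecalibacterium; s__prausnitzii
```

The real plugin may format the LCA result differently (for example, by padding empty ranks), so treat the output strings here as illustrative only.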

-Mike

A post was split to a new topic: how to convert FeatureData[RNASequence] to FeatureData[Sequence]

3 posts were split to a new topic: creating a 12-rank SILVA taxonomy with RESCRIPt

A post was split to a new topic: Creating a V3 classifier killed.

An off-topic reply has been split into a new topic: Testing feature classifier accuracy

Please keep replies on-topic in the future.

2 off-topic replies have been split into a new topic: Should I cluster my reference sequences to 97% for my classifier?

An off-topic reply has been split into a new topic: Is there a way to parallelize evaluate-fit-classifier?

An off-topic reply has been split into a new topic: Modifying taxonomic annotation from RESCRIPt

An off-topic reply has been split into a new topic: Slightly different taxa with regional and full length taxonomy classifiers

An off-topic reply has been split into a new topic: rescript reverse-transcribe error: file not found

There is a typo in the command: it should be --p-replacement-strings, not --p-replacementS-strings.

:man_facepalming:

Thank you for pointing this out @lxsteiner ! :pray:

An off-topic reply has been split into a new topic: how to train my classifier for V3-V4 16SRNA gene region
