Taxonomy assignment using Silva 138

emmlemore · March 29, 2023, 2:49pm

Hi everyone, new QIIME user here. I'm not even sure if this is the right category to post in and I apologize for my limited understanding of this pipeline.

I've been able to follow the pipeline along using the Moving Pictures tutorial and now I'm at the taxonomy assignment step and I'm a little confused. So we sequenced the V4-V5 region of the 16S rRNA gene with primers 515F-Y (5-GTGYCAGCMGCCGCGGTAA-3) and 926R (5-CCGYCAATTYMTTTRAGTTT-3). The product before primer removal was 250 nts and after primer removal was 230 nts. Here is my command for the denoising step where I truncated at 228 for the forward reads and 198 for the reverse reads (quality score plot attached).

qiime dada2 denoise-paired
--i-demultiplexed-seqs /home/Rocks/outputs/mission_rocks_16S_trimmed.qza
--p-trunc-len-f 228
--p-trunc-len-r 198
--o-table rocks16S_table.qza
--o-representative-sequences rocks16S_rep_seqs.qza
--o-denoising-stats rocks16S_denoising_stats.qza

Now I want to assign taxonomy using the Silva 138 database. I downloaded 2 artifacts from here: Data resources — QIIME 2 2023.2.0 documentation under "Silva (16S/18S rRNA)" and the files were silva-138-99-tax.qza and silva-138-99-seq.qza

So if I understand correctly, I do NOT need to perform the 'import' steps as the files I downloaded are already QIIME artifacts:
qiime tools import
--type 'FeatureData[Sequence]'
--input-path otus.fasta
--output-path otus.qza

qiime tools import
--type 'FeatureData[Taxonomy]'
--input-format HeaderlessTSVTaxonomyFormat
--input-path otu_taxonomy.txt
--output-path ref-taxonomy.qza

I believe the next steps are 'extract reference reads' and 'train the classifier'? This is the script I have prepared but I am not clear on the parameters --p-trunc-len --p-min-length --p-max-length.

qiime feature-classifier extract-reads
--i-sequences /home/Downloads/silva-138-99-seqs.qza ###this is my downloaded Silva file sequence right?
--p-f-primer GTGYCAGCMGCCGCGGTAA
--p-r-primer CCGYCAATTYMTTTRAGTTT
--p-trunc-len 230 ###should this be set to my amplified product? which is 230 after primer removal
--p-min-length ###I'm not sure what to set this to to avoid losing valuable reads?
--p-max-length ###also not sure what to set this to
--o-reads ref_seqs_silva138.qza

qiime feature-classifier fit-classifier-naive-bayes
--i-reference-reads ref_seqs_silva138.qza ##output from previous step
--i-reference-taxonomy /home/Downloads/silva-138-99-tax.qza ###this is my downloaded Silva taxonomy file right?
--o-classifier classifier.qza

I just want to make sure the steps above are correct. Thank you!

SoilRotifer · March 29, 2023, 4:17pm

Hi @emmlemore, welcome to :qiime2:!

Correct. Note: the files towards the top of that page are pre-trained classifier artifacts for the full-length SSU data, and the V4 region. If you'd like to create the classifier yourself, then you'd use the other files further down the page. Again, you do not need to import these as they are already QIIME 2 artifacts.

You do not need to use these parameters. But they are helpful in removing sequences that might be too long or too short for your needs. I often do not use the --p-trunc-len unless I am dealing with highly variable sequences, e.g. ITS, or if I am only using R1 (forward) reads. Remember, if a sequence is shorter than the --p-trunc-len value, it will be discarded, longer reads will be trimmed to this length. For the other two parameters, if I use them, I'll often add / subtract 10-30% of my expected amplicon length so I have some breathing room. That is, I might set the following: --p-min-length 180 --p-max-length 280.

Yep!

Just an FYI, we have a very detailed tutorial for you here.

emmlemore · March 29, 2023, 5:08pm

Hi Mike, thanks so much for your response. I have another question. You said

So the files at the top of that page are 'pre-trained' and the files I downloaded (towards bottom of page) are un-trained and I would have to train myself. Is it better to train it myself? If I downloaded the first file (/silva-138-99-nb-classifier.qza) that means I can directly perform the follow command right?

qiime feature-classifier classify-sklearn
--i-classifier path/silva-138-99-nb-classifier.qza

This might be a dumb question but that same file at the top of the page, Silva 138 99% OTUs full-length sequences (silva-138-99-nb-classifier.qza), it says OTUs and I was wanting ASVs. Is that relevant?

Thanks!!!

SoilRotifer · March 29, 2023, 5:18pm

Correct.

These are not "untrained", but simply the separate sequence and taxonomy files that were used to train the pre-made classifier at the top of the page.

Correct!

If you work through the linked tutorial we describe why we use the 99% OTU as the basis of the SILVA classifier. You can use RESCRIPt to download and curate the full non-clustered SILVA databse if you'd like.

Re OTU vs ASV: tl;dr, it does not matter. Technically speaking an ASV is an OTU. That is, the "Taxonomic Unit" is "Operational" given your current definition. That is, if you want to define your taxonomic unit of measure as a denoised sequence "clustered" at 100 % identity, then that is your OTU for the given study. In other words, ASV is a short hand for this definition.

emmlemore · March 30, 2023, 5:34pm

Hi again, I am having issues when running the feature-classifier classify-sklearn step. The job gets killed and digging through the forum, it looks to be a memory issue.

To get the RAM of the linux box I am using, I used the command grep MemTotal /proc/meminfo and the output is MemTotal: 16303708 kB

I have changed my code to add the options --p-reads-per-batch --p-n-jobs but I still received an error and the job was killed:

qiime feature-classifier classify-sklearn
--i-classifier /home/Downloads/silva-138-99-nb-classifier.qza
--i-reads /home/Rocks/outputs/rocks16S_rep_seqs.qza
--o-classification rocks16S_taxonomy.qza
--p-reads-per-batch 5000
--p-n-jobs 1
--p-confidence 1
--verbose &> 16S_classify_verbose.log & disown

--verbose: command not/home/ortmannac/miniconda3/envs/qiime2-2023.2/lib/python3.8/site-packages/joblib/externals/loky/backend/resource_tracker.py:310: UserWarning: resource_tracker: There appear to be 3 leaked file objects to clean up at shutdown warnings.warn( /home/ortmannac/miniconda3/envs/qiime2-2023.2/lib/python3.8/site-packages/joblib/externals/loky/backend/resource_tracker.py:310: UserWarning: resource_tracker: There appear to be 26 leaked semlock objects to clean up at shutdown warnings.warn( /home/ortmannac/miniconda3/envs/qiime2-2023.2/lib/python3.8/site-packages/joblib/externals/loky/backend/resource_tracker.py:310: UserWarning: resource_tracker: There appear to be 2 leaked folder objects to clean up at shutdown warnings.warn(

I don't know how to make it run as I've already reduced the -p-n-jobs and -p-reads-per-batch.

Below are the specs of the linux box as determined by the command lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 12
On-line CPU(s) list: 0-11
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz
CPU family: 6
Model: 63
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 1
Stepping: 2
CPU max MHz: 3800.0000
CPU min MHz: 1200.0000
BogoMIPS: 6984.49

Thank you for your time.

SoilRotifer · March 30, 2023, 6:20pm

Hi @emmlemore,

That indeed looks like a memory issue. I'd set --p-reads-per-batch to an even lower value, ~ 1000. Is this your amplicon specific classifier? If not, I'd suggest extracting your amplicon region and training your own classifier. This will drastically reduce the memory resources required to classify your sequences. It is not uncommon to require 24 - 64 GB of RAM with full-length SSU classifiers.

On another note, I would not recommend setting --p-confidence 1. If you do not want to use 0.7, then 0.8 - 0.9 is likely okay. But 0.7 works well for the majority of cases. Otherwise, the higher you set this value, the less likely you'll be able classify most of your data beyond family or genus level.

emmlemore · March 30, 2023, 7:07pm

Hi there,

Thanks again for your response. I have lowered --p-reads-per-batch to 1000 and the job was still killed. I was using the full length silva-138-99-nb-classifier.qza file. So I have tried to train the classifier to my region of interest V4-V5 but this job was also killed due to memory (see code below)

qiime feature-classifier fit-classifier-naive-bayes
--i-reference-reads /home/Rocks/outputs/ref_seqs_16S_silva138.qza
--i-reference-taxonomy /home/Documents/RawSequences/Missions-Rocks/outputs/taxonomy_silva138.qza
--o-classifier classifier16S_v4v5.qza
--verbose &> classifier16S_training.log

I guess my next option would be the RESCRIPt option?

Thank you.

SoilRotifer · March 30, 2023, 7:33pm

Hi @emmlemore,

Potentially. After extracting your amplicon region did you dereplicate the data as outlined here? If not, this might be partly responsible the excessive memory despite using the extracted amplicon region. That is, many reference sequences will become identical over the smaller span of the amplicon region. Dereplication will remove any redundant sequences with identical taxonomy, which can be substantial. You can also perform other optional filtering of your reference data too.

Alternatively, you may want to consider using feature-classifier classify-consensus-vsearch, just simply use the base sequence and taxonomy files that you'd train your classifier with, as your reference data. Otherwise you may need to find a machine with more RAM.

emmlemore · March 31, 2023, 4:25pm

Hi Mike,

I tried the dereplication step but the job was also killed. I started over using RESCRIPt and it seems to have worked. Although I am suspicious as training the classifier (feature-classifier fit-classifier-naive-bayes) and assigning taxonomy (feature-classifier classify-sklearn) both only took about 10-15 minutes to finish and I was expecting it to take a couple of hours...

I also received this warning when training the classifier but I assume this is referring to the "Danger" message at the top of the data resources page:

Saved TaxonomicClassifier to: classifier_16S_v4v5.qza
/home/ortmannac/miniconda3/envs/qiime2-2023.2/lib/python3.8/site-packages/q2_feature_classifier/classifier.py:102: UserWarning: The TaxonomicClassifier artifact that results from this method was trained using scikit-learn version 0.24.1. 
It cannot be used with other versions of scikit-learn. (While the classifier may complete successfully, the results will be unreliable.)
  warnings.warn(warning, UserWarning)

Thanks for your help.

SoilRotifer · March 31, 2023, 4:55pm

With the extracted amplicon region? This is very strange. I assume no other memory intensive applications were running at the time? What system / OS are you running these commands on?

This is typical. These are quite different approaches. One thing to keep in mind, is the trade off between speed an accuracy, and the algorithm used among the various classifiers. Sometimes they can differ quite a bit, other times, not so much. But if the results appear reasonable to you, then you should be fine.

Yep. Sometimes, that just let's you know that the classifier you are using was made with a different version of QIIME 2 compared to the version of QIIME 2 you are running (often the versions of scikit-learn are tied to the version of QIIME 2). I'm not necessarily saying that is the case here.

emmlemore · March 31, 2023, 5:11pm

With the extracted amplicon region? This is very strange. I assume no other memory intensive applications were running at the time? What system / OS are you running these commands on?

Yes with the extracted amplicon region (V4-V5 515F and 926R). And no, I was not running any other memory intensive processes. I am using a Linux OS.

System:
  Kernel: 5.15.0-67-generic x86_64 bits: 64 compiler: gcc v: 11.3.0 Desktop: Cinnamon 5.6.8
    tk: GTK 3.24.33 wm: muffin dm: LightDM Distro: Linux Mint 21.1 Vera base: Ubuntu 22.04 jammy
Machine:
  Type: Desktop System: Dell product: Precision Tower 5810 v: N/A serial: <superuser required>
    Chassis: type: 7 serial: <superuser required>
  Mobo: Dell model: 0K240Y v: A01 serial: <superuser required> UEFI: Dell v: A19
    date: 05/08/2017
CPU:
  Info: 6-core model: Intel Xeon E5-1650 v3 bits: 64 type: MT MCP arch: Haswell rev: 2 cache:
    L1: 384 KiB L2: 1.5 MiB L3: 15 MiB
  Speed (MHz): avg: 1414 high: 2376 min/max: 1200/3800 cores: 1: 1420 2: 1197 3: 1197 4: 1952
    5: 1197 6: 1198 7: 2376 8: 1197 9: 1198 10: 1650 11: 1197 12: 1197 bogomips: 83801
  Flags: avx avx2 ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx

Nicholas_Bokulich · April 3, 2023, 10:21am

Hi @emmlemore ,
Would you mind sharing the inputs to the dereplicate command so that we can inspect and troubleshoot?

It is rather strange for the dereplication step to fail due to a memory error, as I do not think that is a particularly memory intensive step (compared to, e.g., training a classifier).

Yes this is suspicious, it sounds like the primer extraction step may have only hit a few reads, leading to a tiny database.

emmlemore · April 3, 2023, 12:47pm

Hi Nicholas,

I apologize for my lack of clarity. I meant that the dereplication step was successful, but after dereplication, training the classifier still failed and the job was killed, likely due to memory.

However, once I got my own Silva data using the RESCRIPt pipeline (following all the way through steps 1a through 1e). After dereplication I applied an additional taxa filter-seq step as recommended here before training the classifier (feature-classifier fit-classifier-naive-bayes) and taxonomy assignment (feature-classifier classify-sklearn). It was after this, I was able to finally carry out these last 2 memory intensive steps, but as mentioned. these last 2 steps only took 10-15 minutes to finish... so maybe something went wrong after all.

###dereplication of Silva full length sequences:
qiime rescript dereplicate
--i-sequences silva-138.1-ssu-nr99-seqs-filt.qza
--i-taxa silva-138.1-ssu-nr99-tax.qza
--p-mode 'uniq'
--o-dereplicated-sequences silva-138.1-ssu-nr99-seqs-derep-uniq.qza
--o-dereplicated-taxa silva-138.1-ssu-nr99-tax-derep-uniq.qza

###making amplicon specific classifier
qiime feature-classifier extract-reads \ ##16s
--i-sequences silva-138.1-ssu-nr99-seqs-derep-uniq.qza
--p-f-primer GTGYCAGCMGCCGCGGTAA \ #515F
--p-r-primer CCGYCAATTYMTTTRAGTTT \ ###926R
--p-n-jobs 2
--p-read-orientation 'forward'
--o-reads silva-138.1-ssu-nr99-seqs-v4v5.qza

###dereplication of amplicon specific region
qiime rescript dereplicate \ ##16s
--i-sequences silva-138.1-ssu-nr99-seqs-v4v5.qza
--i-taxa silva-138.1-ssu-nr99-tax-derep-uniq.qza
--p-mode 'uniq'
--o-dereplicated-sequences silva-138.1-ssu-nr99-seqs-v4v5_derep.qza
--o-dereplicated-taxa silva-138.1-ssu-nr99-tax-v4v5_derep.qza

###filter seq step
qiime taxa filter-seqs
--i-sequences /home/Rocks/outputs/rescript/silva-138.1-ssu-nr99-seqs-v4v5_derep.qza
--i-taxonomy /home/Rocks/outputs/rescript/silva-138.1-ssu-nr99-tax-v4v5_derep.qza
--p-exclude Eukaryota,Mitochondria,Chloroplast,Unassigned
--o-filtered-sequences 16S_v4v5_derep_filt.qza

###train amplicon specific classifier
qiime feature-classifier fit-classifier-naive-bayes
--i-reference-reads /home/Rocks/outputs/rescript/16S_v4v5_derep_filt.qza
--i-reference-taxonomy /home/Rocks/outputs/rescript/silva-138.1-ssu-nr99-tax-v4v5_derep.qza
--o-classifier classifier_16S_v4v5.qza --verbose &> classifier16S_training.log & disown

###assign taxonomy
qiime feature-classifier classify-sklearn
--i-classifier /home/Rocks/outputs/classifier_16S_v4v5.qza
--i-reads /home/Rocks/outputs/qza_intermediates/rocks16S_rep_seqs.qza
--o-classification rocks16S_taxonomy.qza
--p-reads-per-batch 10000
--p-n-jobs 1
--verbose &> classify16S_verbose.log & disown

Thank you.

Nicholas_Bokulich · April 3, 2023, 1:04pm

Thanks for clarifying.

You could use the RESCRIPt evaluate-seqs action to check the sequences before and after trimming, dereplication, and filtering to see how many sequences remain and if anything looks suspicious (e.g., absolute number of sequences at the end). You can do the same on evaluate-taxonomy (after filtering the taxonomy at each step) to see if you are losing any taxa.

Do you want to give that a try and share the QZVs here?

emmlemore · April 5, 2023, 1:08pm

Hi @Nicholas_Bokulich

Thanks for your advice. Here are the outputs of these evaluate-seqs and evaluate-taxonomy commands. I am not sure how to interpret them, especially the evaluate-seqs output... The bars do not match up... does this mean sequences were largely lost in my samples (rocks_predicted_seq)?

Thanks.

Nicholas_Bokulich · April 5, 2023, 1:54pm

Hi @emmlemore ,
Could you please share the QZV? It would make it easier to read

But the short answer is that yes you are vastly reducing the number of sequences (exactly what you want) but only minimally reducing the number of taxa represented (which is probably expected as well, as the primers might not hit all clades)

Thanks!

emmlemore · April 5, 2023, 1:57pm

Hi @Nicholas_Bokulich

Here are the .qzv files. So does this mean everything worked as it should?

seq-evaluation.qzv (388.8 KB)
taxonomy-evaluation.qzv (488.1 KB)

Thank you

Nicholas_Bokulich · April 5, 2023, 3:04pm

Hi @emmlemore ,

Yes everything looks like it worked as intended... but I see that you compared the trimmed reference database against the classified query sequences, instead of the trimmed versus untrimmed reference sequences. This is not a meaningful comparison. But it shows that your reference database looks fine, i.e., lot's of reference sequences were used to train the database. So unless if you see something strange in the taxonomic classifications, I think it sounds like your problem is solved!