Hi everyone, new QIIME user here. I'm not even sure if this is the right category to post in and I apologize for my limited understanding of this pipeline.
I've been able to follow along using the Moving Pictures tutorial, and now I'm at the taxonomy assignment step and a little confused. We sequenced the V4-V5 region of the 16S rRNA gene with primers 515F-Y (5'-GTGYCAGCMGCCGCGGTAA-3') and 926R (5'-CCGYCAATTYMTTTRAGTTT-3'). The reads before primer removal were 250 nt, and after primer removal 230 nt. Here is my command for the denoising step, where I truncated the forward reads at 228 and the reverse reads at 198 (quality score plot attached).
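For context, a denoising command matching the truncation values described above might look like the following sketch (the artifact names such as demux.qza are placeholders, not taken from the original post):

```shell
# Hypothetical reconstruction of the denoising step described above;
# only the --p-trunc-len values are from the post, the rest are placeholders.
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux.qza \
  --p-trunc-len-f 228 \
  --p-trunc-len-r 198 \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats denoising-stats.qza
```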
Now I want to assign taxonomy using the Silva 138 database. I downloaded two artifacts from here: Data resources — QIIME 2 2023.2.0 documentation, under "Silva (16S/18S rRNA)"; the files were silva-138-99-tax.qza and silva-138-99-seqs.qza.
So if I understand correctly, I do NOT need to perform the 'import' step, as the files I downloaded are already QIIME 2 artifacts:
qiime tools import
I believe the next steps are 'extract reference reads' and 'train the classifier'? This is the command I have prepared, but I am not clear on the parameters --p-trunc-len, --p-min-length, and --p-max-length.
qiime feature-classifier extract-reads
--i-sequences /home/Downloads/silva-138-99-seqs.qza ###this is my downloaded Silva sequence file, right?
--p-f-primer GTGYCAGCMGCCGCGGTAA ###our forward primer (515F-Y)
--p-r-primer CCGYCAATTYMTTTRAGTTT ###our reverse primer (926R)
--p-trunc-len 230 ###should this be set to my amplified product, which is 230 after primer removal?
--p-min-length ###I'm not sure what to set this to, to avoid losing valuable reads
--p-max-length ###also not sure what to set this to
--o-reads ref_seqs_silva138.qza
qiime feature-classifier fit-classifier-naive-bayes
--i-reference-reads ref_seqs_silva138.qza ###output from the previous step
--i-reference-taxonomy /home/Downloads/silva-138-99-tax.qza ###this is my downloaded Silva taxonomy file, right?
--o-classifier classifier_silva138.qza
I just want to make sure the steps above are correct. Thank you!
Correct. Note: the files towards the top of that page are pre-trained classifier artifacts for full-length SSU data and for the V4 region. If you'd like to create the classifier yourself, then you'd use the other files further down the page. Again, you do not need to import these, as they are already QIIME 2 artifacts.
You do not need to use these parameters, but they are helpful in removing sequences that might be too long or too short for your needs. I often do not use --p-trunc-len unless I am dealing with highly variable sequences, e.g. ITS, or if I am only using R1 (forward) reads. Remember: if a sequence is shorter than the --p-trunc-len value it will be discarded; longer reads will be trimmed to this length. For the other two parameters, if I use them, I'll often add/subtract 10-30% of my expected amplicon length so I have some breathing room. That is, I might set the following: --p-min-length 180 --p-max-length 280.
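For a ~230 nt V4-V5 amplicon, the advice above might translate into something like the following sketch (the output artifact name is a placeholder, and the 180/280 bounds are judgment calls, not fixed requirements):

```shell
# Sketch of extract-reads using the length bounds suggested above.
# Sequences outside 180-280 nt after in-silico PCR are discarded.
qiime feature-classifier extract-reads \
  --i-sequences silva-138-99-seqs.qza \
  --p-f-primer GTGYCAGCMGCCGCGGTAA \
  --p-r-primer CCGYCAATTYMTTTRAGTTT \
  --p-min-length 180 \
  --p-max-length 280 \
  --o-reads ref-seqs-v4v5.qza
```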
Just an FYI, we have a very detailed tutorial for you here.
Hi Mike, thanks so much for your response. I have another question. You said
So the files at the top of that page are 'pre-trained' and the files I downloaded (towards the bottom of the page) are un-trained, and I would have to train them myself. Is it better to train it myself? If I downloaded the first file (silva-138-99-nb-classifier.qza), that means I can directly perform the following command, right?
This might be a dumb question but that same file at the top of the page, Silva 138 99% OTUs full-length sequences (silva-138-99-nb-classifier.qza), it says OTUs and I was wanting ASVs. Is that relevant?
These are not "untrained", but simply the separate sequence and taxonomy files that were used to train the pre-made classifier at the top of the page.
If you work through the linked tutorial, we describe why we use the 99% OTUs as the basis of the SILVA classifier. You can use RESCRIPt to download and curate the full non-clustered SILVA database if you'd like.
Re OTU vs ASV: tl;dr, it does not matter. Technically speaking, an ASV is an OTU. That is, the "Taxonomic Unit" is "Operational" given your current definition. If you want to define your taxonomic unit of measure as a denoised sequence "clustered" at 100% identity, then that is your OTU for the given study. In other words, ASV is shorthand for this definition.
--verbose: command not
/home/ortmannac/miniconda3/envs/qiime2-2023.2/lib/python3.8/site-packages/joblib/externals/loky/backend/resource_tracker.py:310: UserWarning: resource_tracker: There appear to be 3 leaked file objects to clean up at shutdown
  warnings.warn(
/home/ortmannac/miniconda3/envs/qiime2-2023.2/lib/python3.8/site-packages/joblib/externals/loky/backend/resource_tracker.py:310: UserWarning: resource_tracker: There appear to be 26 leaked semlock objects to clean up at shutdown
  warnings.warn(
/home/ortmannac/miniconda3/envs/qiime2-2023.2/lib/python3.8/site-packages/joblib/externals/loky/backend/resource_tracker.py:310: UserWarning: resource_tracker: There appear to be 2 leaked folder objects to clean up at shutdown
  warnings.warn(
I don't know how to make it run, as I've already reduced --p-n-jobs and --p-reads-per-batch.
Below are the specs of the linux box as determined by the command lscpu
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
On-line CPU(s) list: 0-11
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz
CPU family: 6
Thread(s) per core: 2
Core(s) per socket: 6
CPU max MHz: 3800.0000
CPU min MHz: 1200.0000
That indeed looks like a memory issue. I'd set --p-reads-per-batch to an even lower value, ~ 1000. Is this your amplicon specific classifier? If not, I'd suggest extracting your amplicon region and training your own classifier. This will drastically reduce the memory resources required to classify your sequences. It is not uncommon to require 24 - 64 GB of RAM with full-length SSU classifiers.
On another note, I would not recommend setting --p-confidence 1. If you do not want to use 0.7, then 0.8-0.9 is likely okay. But 0.7 works well for the majority of cases. Otherwise, the higher you set this value, the less likely you'll be able to classify most of your data beyond family or genus level.
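To make the two parameters discussed here concrete, a classify-sklearn call incorporating this advice might look like the following sketch (the classifier and read artifact names are placeholders):

```shell
# Hypothetical classify-sklearn invocation: a low --p-reads-per-batch
# reduces peak memory use, and --p-confidence 0.7 is the default threshold.
qiime feature-classifier classify-sklearn \
  --i-classifier classifier_16S_v4v5.qza \
  --i-reads rep-seqs.qza \
  --p-reads-per-batch 1000 \
  --p-n-jobs 1 \
  --p-confidence 0.7 \
  --o-classification taxonomy.qza
```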
Thanks again for your response. I have lowered --p-reads-per-batch to 1000 and the job was still killed. I was using the full-length silva-138-99-nb-classifier.qza file. So I have tried to train the classifier on my region of interest, V4-V5, but this job was also killed due to memory (see code below).
Potentially. After extracting your amplicon region, did you dereplicate the data as outlined here? If not, this might be partly responsible for the excessive memory use despite using the extracted amplicon region. That is, many reference sequences will become identical over the smaller span of the amplicon region. Dereplication will remove any redundant sequences with identical taxonomy, which can be substantial. You can perform other optional filtering of your reference data as well.
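The dereplication step referred to above can be sketched as follows with RESCRIPt (input names assume the extracted amplicon reads and the downloaded taxonomy file; output names are placeholders):

```shell
# Sketch: dereplicate the extracted reference reads together with their taxonomy.
# In 'uniq' mode, identical sequences with identical taxonomy collapse to one record.
qiime rescript dereplicate \
  --i-sequences ref_seqs_silva138.qza \
  --i-taxa silva-138-99-tax.qza \
  --p-mode 'uniq' \
  --o-dereplicated-sequences ref-seqs-derep.qza \
  --o-dereplicated-taxa ref-taxa-derep.qza
```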
Alternatively, you may want to consider using feature-classifier classify-consensus-vsearch: simply use the base sequence and taxonomy files that you would have trained your classifier with as your reference data. Otherwise, you may need to find a machine with more RAM.
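A classify-consensus-vsearch call along those lines might look like this sketch (it takes the reference sequences and taxonomy directly, so no training step is needed; the query and output artifact names are placeholders):

```shell
# Sketch: vsearch-based consensus classification, using the untrained
# reference sequence and taxonomy artifacts directly as the reference.
qiime feature-classifier classify-consensus-vsearch \
  --i-query rep-seqs.qza \
  --i-reference-reads ref_seqs_silva138.qza \
  --i-reference-taxonomy silva-138-99-tax.qza \
  --o-classification taxonomy-vsearch.qza \
  --o-search-results vsearch-hits.qza
```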
I tried the dereplication step, but the job was also killed. I started over using RESCRIPt and it seems to have worked. Although I am suspicious, as training the classifier (feature-classifier fit-classifier-naive-bayes) and assigning taxonomy (feature-classifier classify-sklearn) both only took about 10-15 minutes to finish, and I was expecting them to take a couple of hours...
I also received this warning when training the classifier but I assume this is referring to the "Danger" message at the top of the data resources page:
Saved TaxonomicClassifier to: classifier_16S_v4v5.qza
/home/ortmannac/miniconda3/envs/qiime2-2023.2/lib/python3.8/site-packages/q2_feature_classifier/classifier.py:102: UserWarning: The TaxonomicClassifier artifact that results from this method was trained using scikit-learn version 0.24.1.
It cannot be used with other versions of scikit-learn. (While the classifier may complete successfully, the results will be unreliable.)
With the extracted amplicon region? This is very strange. I assume no other memory intensive applications were running at the time? What system / OS are you running these commands on?
This is typical. These are quite different approaches. One thing to keep in mind is the trade-off between speed and accuracy, and the algorithm used, among the various classifiers. Sometimes they can differ quite a bit; other times, not so much. But if the results appear reasonable to you, then you should be fine.
Yep. Sometimes that just lets you know that the classifier you are using was made with a different version of QIIME 2 compared to the version of QIIME 2 you are running (often the version of scikit-learn is tied to the version of QIIME 2). I'm not necessarily saying that is the case here.
I apologize for my lack of clarity. I meant that the dereplication step was successful, but after dereplication, training the classifier still failed and the job was killed, likely due to memory.
However, I then obtained my own Silva data using the RESCRIPt pipeline (following all the way through steps 1a to 1e). After dereplication, I applied an additional taxa filter-seqs step, as recommended here, before training the classifier (feature-classifier fit-classifier-naive-bayes) and assigning taxonomy (feature-classifier classify-sklearn). It was after this that I was finally able to carry out these last two memory-intensive steps, but, as mentioned, they only took 10-15 minutes to finish... so maybe something went wrong after all.
###dereplication of Silva full length sequences:
qiime rescript dereplicate
You could use the RESCRIPt evaluate-seqs action to check the sequences before and after trimming, dereplication, and filtering, to see how many sequences remain and whether anything looks suspicious (e.g., the absolute number of sequences at the end). You can do the same with evaluate-taxonomy (after filtering the taxonomy at each step) to see if you are losing any taxa.
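The evaluation described above might be sketched as follows (the input artifact names stand in for your pre- and post-processing files and are assumptions, not taken from the thread):

```shell
# Sketch: compare reference sequences before and after
# trimming/dereplication/filtering, side by side in one visualization.
qiime rescript evaluate-seqs \
  --i-sequences silva-138-99-seqs.qza ref-seqs-derep.qza \
  --o-visualization eval-seqs.qzv

# Likewise for the taxonomy at each step.
qiime rescript evaluate-taxonomy \
  --i-taxonomies silva-138-99-tax.qza ref-taxa-derep.qza \
  --o-taxonomy-stats eval-taxa.qzv
```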
Do you want to give that a try and share the QZVs here?
Thanks for your advice. Here are the outputs of the evaluate-seqs and evaluate-taxonomy commands. I am not sure how to interpret them, especially the evaluate-seqs output... The bars do not match up... does this mean sequences were largely lost in my samples (rocks_predicted_seq)?
Hi @emmlemore ,
Could you please share the QZV? It would make it easier to read
But the short answer is that yes, you are vastly reducing the number of sequences (exactly what you want) but only minimally reducing the number of taxa represented (which is probably expected as well, as the primers might not hit all clades).
Yes, everything looks like it worked as intended... but I see that you compared the trimmed reference database against the classified query sequences, instead of the trimmed versus untrimmed reference sequences. This is not a meaningful comparison. But it does show that your reference database looks fine, i.e., lots of reference sequences were used to train the classifier. So unless you see something strange in the taxonomic classifications, it sounds like your problem is solved!