Plugin error from feature classifier

ADL · February 22, 2018, 6:47pm

Hello,
I have a dataset with only 24 samples, 16S Illumina MiSeq 2x250 paired-end. I am getting this error without much information on why, and nothing extra comes out when I use --verbose:

qiime feature-classifier classify-sklearn \
  --i-classifier /media/sf_QIIME2_shared/Kale/silva-119-99-515-806-nb-classifier.qza \
  --i-reads /media/sf_QIIME2_shared/Kale/kale_rep-seqs.qza \
  --o-classification kale-taxonomy.qza \
  --verbose

qiime2-q2cli-err-jlntyh0_.txt (4.3 KB)

Thank you

colinbrislawn · February 22, 2018, 7:42pm

Hello Audrey,

Thanks for posting your full command and error. When I looked at the log file, I saw this line at the very bottom:

MemoryError

I think you ran out of memory! How much RAM does your computer or VM have?

Colin
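One way to answer the RAM question from inside the Linux VM is to read /proc/meminfo. The sketch below is only an illustration: the helper name mem_mb is made up for this example, and it assumes a Linux guest where /proc/meminfo exists.

```shell
#!/bin/sh
# Print one field of a meminfo-style file in MB (the file reports kB).
# Usage: mem_mb FIELD [FILE]; FILE defaults to /proc/meminfo (Linux only).
# mem_mb is a hypothetical helper name, not part of QIIME 2.
mem_mb() {
    awk -v key="$1:" '$1 == key { printf "%d\n", $2 / 1024 }' "${2:-/proc/meminfo}"
}

# Only report when the file is readable, so the script is harmless elsewhere.
if [ -r /proc/meminfo ]; then
    echo "Total RAM visible to this machine: $(mem_mb MemTotal) MB"
    echo "RAM currently available:           $(mem_mb MemAvailable) MB"
fi
```

Inside a VM, MemTotal reflects what the hypervisor has allocated to the guest, not what the host has installed.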

ADL · February 22, 2018, 8:14pm

Thank you for getting back so quickly. I realized this after fishing around some more. While my computer has plenty of RAM available, it looks like the most the VM will let me use is 16,384 MB, which is actually in the "red" zone of the RAM slider; the green zone stops at about 11,000 MB. This seems woefully low, and from what I read in other threads you need over 4 GB for the SILVA file to work correctly. Could I change this by reinstalling the QIIME 2 virtual machine and changing a default?

I'm using 16S V4 2x250 MiSeq reads. The source is plant material, so I know I have mitochondria/plastid sequences I need to remove, and I read that SILVA is better for that.

colinbrislawn · February 22, 2018, 10:03pm

Hello Audrey,

Having 11,000 MB is actually very good. But were you using that much when you got the out of memory error?

I say shut down the VM, open up the settings and move that slider up to 11 or 12 GB, then try running this plugin again. You only have to shut down the VM to change memory (no need to reinstall).

Colin
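For anyone who prefers the command line to the settings slider, VirtualBox also exposes the memory setting through VBoxManage on the host. This is a sketch under assumptions: the VM name below is a placeholder (VBoxManage list vms shows the real one), the helper set_vm_memory is made up for this example, and the VM has to be powered off first. It prints the command by default rather than running it.

```shell
#!/bin/sh
# Run on the HOST (not inside the VM), with the VM powered off.
# ASSUMPTION: "QIIME 2 Core" is a placeholder VM name; check the real
# name with `VBoxManage list vms`.
VM_NAME="${VM_NAME:-QIIME 2 Core}"
MEMORY_MB="${MEMORY_MB:-11264}"   # 11 GB expressed in MB

# Hypothetical helper: prints the command unless DRY_RUN=0 is set.
set_vm_memory() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "VBoxManage modifyvm \"$1\" --memory $2"
    else
        VBoxManage modifyvm "$1" --memory "$2"
    fi
}

set_vm_memory "$VM_NAME" "$MEMORY_MB"
```

The value cannot exceed the RAM physically installed in the host, which is the limit the slider shows.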

ADL · February 23, 2018, 12:15am

I should also add that I've been keeping all input/output files in the shared folder on my host computer, perhaps mistakenly believing that I would not use up VM RAM with my fastq.gz files and such.

jairideout · February 23, 2018, 1:04am

Hi @ADL!

Using a shared folder to store your data should be fine -- that's using hard disk space, which is different than RAM (memory). Did @colinbrislawn's suggestions resolve the issue for you?

ADL · February 23, 2018, 4:39pm

The slider only goes up to 16,384 MB, and I can't figure out how to change the upper limit, even by reinstalling. There is an option to change the default when you reinstall, but when I go to settings, the upper limit of RAM doesn't change. I wonder if Windows 10 is thwarting me; it wouldn't be the first time. The PC host has over 300 GB free at the moment, and we got it specifically for this kind of data analysis.

ADL · February 23, 2018, 4:39pm

I was using max RAM, and it still errors out. I found this thread

that suggested up to 30 GB is needed for the SILVA classifier. I have that and more free on our computer, but I can't seem to increase the amount of RAM allocated to the VM. Makes me wish we had bought a Mac.

colinbrislawn · February 23, 2018, 5:02pm

Hello Audrey,

Ah, retraining could take more memory than classifying, but I'm still surprised that it takes 30 GB. Oh well

Can you use the pre-trained SILVA classifier? Or have you considered using a different taxonomy assignment method that does not have these massive requirements? I'm a big fan of search + LCA methods like classify-consensus-vsearch.

Colin

ADL · February 23, 2018, 5:19pm

It is the pre-trained one downloaded from the website (silva-119-99-515-806-nb-classifier.qza), and I believe it is the same one that was used in the thread I linked. The thread title was mistaken: they were actually using the pre-trained SILVA file in the same way I am attempting to.

My samples are endophytic bacteria extracted from plant leaves, so I know I will need to filter out plastid/mitochondrial sequences from plant DNA co-amplification. I read that the SILVA classifier has those sequences but Greengenes does not.

It is disappointing because I have plenty of RAM on my computer, but it doesn't seem that I can increase the VirtualBox maximum allocation, unless there is a setting on my host PC somewhere that can be changed. All I seem to find online is directions to the settings bar in the VM, which I have set at the max (about 11 GB). I can see in the task manager that a few minutes into the command the memory goes up to 95%, and then it goes kaput.

ADL · February 23, 2018, 5:27pm

Can I use the latest RDP reference file for classify-consensus-vsearch?

colinbrislawn · February 23, 2018, 5:43pm

Sure! After you get the database, you can run that plugin and pass the RDP database into these two inputs:

  --i-reference-reads ARTIFACT PATH FeatureData[Sequence]
                                  reference sequences.  [required]
  --i-reference-taxonomy ARTIFACT PATH FeatureData[Taxonomy]
                                  reference taxonomy labels.  [required]

QIIME 2 provides pretrained databases that have already been formatted; they are all listed on the "Data resources" page of the QIIME 2 2018.2.0 documentation.

I think SILVA + classify-consensus-vsearch is probably a good bet because it's included with qiime and should 'just work,' but I'll leave this decision up to you.

Colin

jairideout · February 23, 2018, 7:05pm

Hi @ADL! I second @colinbrislawn's suggestion to try out classify-consensus-vsearch or classify-consensus-blast with either Greengenes, SILVA, or RDP reference sequences.

You're mistaking hard disk space (i.e. storage) for RAM (memory). I think you have 300GB of storage space, but only ~16GB of RAM, which is why you are only able to use ~11GB RAM for the virtual machine.
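The storage side of that distinction can be checked with df, which reports free disk space and says nothing about RAM. The following is a minimal sketch; the function name disk_free_mb is made up for this example.

```shell
#!/bin/sh
# Free disk space, in MB, on the filesystem that holds DIR (default: ".").
# df -P gives one portable line per filesystem; column 4 is "Available"
# and -k reports it in kB. disk_free_mb is a hypothetical helper name.
disk_free_mb() {
    df -Pk "${1:-.}" | awk 'NR == 2 { printf "%d\n", $4 / 1024 }'
}

echo "Free disk space here: $(disk_free_mb .) MB (storage, not memory)"
```

A large number here, such as the 300 GB mentioned above, does not help with a MemoryError; only the RAM allocated to the VM matters for that.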

If you want to try out the SILVA pre-trained classifier (or train your own classifier), you could try using the QIIME 2 Amazon EC2 image with an instance type that has more than 30GB RAM. After you're done with this memory-intensive step, you can download your data and continue analyses locally.

colinbrislawn · February 23, 2018, 7:16pm

Qiime on EC2 works great, once you get it set up.

You can rent a supercomputer for a few dollars an hour. Here are some machines and their price per hour:

ADL · February 23, 2018, 7:18pm

Okay, I'm giving it a try but I'm confused as to what files to use for reference-reads and reference-taxonomy

I've got the 128 silva files from

I tried the rep-set 16S only for the reads file (99_otus_16S.fasta ) and consensus_taxonomy_7_levels.txt (16S only) for the taxonomy, it tells me

ValueError: 99_otus_16S.fasta is not a QIIME archive.

qiime feature-classifier classify-consensus-vsearch \
  --i-query kale_rep-seqs.qza \
  --i-reference-reads 99_otus_16S.fasta \
  --i-reference-taxonomy consensus_taxonomy_7_levels.txt \
  --o-classification kale_taxonomy.qza

I may be using it all wrong, there isn't much info on this command in the Qiime2 docs and I'm still feeling my way through

ADL · February 23, 2018, 7:22pm

Thank you, I'll check out both of those suggestions, we have access to a supercomputer at our institution as well.

colinbrislawn · February 23, 2018, 9:01pm

Hello Audrey,

Oh, I should have mentioned that you will have to import them as QIIME artifacts first, specifically a FeatureData[Sequence] artifact and a FeatureData[Taxonomy] artifact.

The import page suggests this:

qiime tools import \
  --input-path sequences.fna \
  --output-path sequences.qza \
  --type 'FeatureData[Sequence]'

Once you have the files as QIIME artifacts, you can pass them into the feature-classifier plugin.

Colin
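Putting the steps from this exchange together, the whole sequence might look like the sketch below. It is hedged in several ways: the output names silva-ref-seqs.qza and silva-ref-taxonomy.qza are made up, the run helper is hypothetical, and the taxonomy import assumes the file is a headerless two-column TSV. The format flag is an assumption too: newer QIIME 2 releases call it --input-format and older ones --source-format, so check qiime tools import --help for your version. The script prints the commands by default instead of running them.

```shell
#!/bin/sh
# Hypothetical helper: print each command unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then echo "$*"; else "$@"; fi
}

# Input file names come from this thread; output names are made up.
run qiime tools import \
    --type 'FeatureData[Sequence]' \
    --input-path 99_otus_16S.fasta \
    --output-path silva-ref-seqs.qza

# ASSUMPTION: the taxonomy file has no header line, so a format is named.
# The flag is --input-format in newer releases, --source-format in older ones.
run qiime tools import \
    --type 'FeatureData[Taxonomy]' \
    --input-format HeaderlessTSVTaxonomyFormat \
    --input-path consensus_taxonomy_7_levels.txt \
    --output-path silva-ref-taxonomy.qza

# Same command as the failing one, but with .qza artifacts as references.
run qiime feature-classifier classify-consensus-vsearch \
    --i-query kale_rep-seqs.qza \
    --i-reference-reads silva-ref-seqs.qza \
    --i-reference-taxonomy silva-ref-taxonomy.qza \
    --o-classification kale_taxonomy.qza
```

Setting DRY_RUN=0 in the environment makes the same script execute the three commands in order.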

Nicholas_Bokulich · February 24, 2018, 2:53pm

You were reading bad advice: Greengenes does contain plastid and mitochondrial sequences. (Just to be clear, I'm not partial to any of these databases, but I used Greengenes in the past with plant samples in which I had the very same problem with non-target DNA, so I know that it works.) Let's just take a look at the database to be sure:

$ grep 'mitochondria' gg_13_8_otus/taxonomy/99_otu_taxonomy.txt | wc -l
     221
$ grep 'Chloroplast' gg_13_8_otus/taxonomy/99_otu_taxonomy.txt | wc -l
    1546

That command is counting the number of entries for 'mitochondria' and 'Chloroplast' in the reference taxonomy file. As you can see, there are many (and possibly more that do not match my search terms exactly).
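To catch labels that differ only in capitalization, the same count can be made case-insensitive with grep -i. Below is a small self-contained sketch; the helper name count_taxa and the four-line demo file are made up for illustration and are not taken from the real Greengenes release.

```shell
#!/bin/sh
# Count lines of FILE that mention TERM, ignoring upper/lower case.
# count_taxa is a hypothetical helper name.
count_taxa() {
    grep -c -i -- "$1" "$2"
}

# Tiny made-up taxonomy file, only to demonstrate the difference.
demo=$(mktemp)
printf '1\tk__Bacteria; p__Proteobacteria; f__mitochondria\n'  > "$demo"
printf '2\tk__Bacteria; p__Cyanobacteria; c__Chloroplast\n'   >> "$demo"
printf '3\tk__Bacteria; p__Cyanobacteria; c__chloroplast\n'   >> "$demo"
printf '4\tk__Bacteria; p__Firmicutes; c__Bacilli\n'          >> "$demo"

echo "exact-case 'Chloroplast': $(grep -c 'Chloroplast' "$demo")"
echo "any-case 'chloroplast':   $(count_taxa chloroplast "$demo")"
rm -f "$demo"
```

On a real reference file, pass the path used above (gg_13_8_otus/taxonomy/99_otu_taxonomy.txt) as the second argument.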

Perhaps the report you read suggested that SILVA has more plastid seqs, or sequences specific to your host organism (I don't know these specifics), but that probably doesn't matter too much here: chances are the query plastid seqs will be assigned to some plastid reference sequence, and since you are removing them it doesn't matter which one.

So Greengenes should work for your needs (and has much, much lower memory requirements than SILVA, since it's around 1/4 the size).

But if you want to go with SILVA and just can't get past these memory issues, I agree with @colinbrislawn and @jairideout: use classify-consensus-blast or classify-consensus-vsearch. These methods perform quite well (not quite as good as classify-sklearn, but in the same ballpark) and can be a lot easier to work with for users who are familiar with the underlying alignment algorithms.

Good luck!
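Once a taxonomy artifact exists, the removal step itself can be sketched with the taxa plugin's filter-table action. This is an illustration under assumptions: kale_table.qza is a hypothetical feature table name (no table file is named in this thread), the output name and the run helper are made up, and the availability of this action should be checked against your QIIME 2 release. The command is printed by default rather than executed.

```shell
#!/bin/sh
# Hypothetical helper: print the command unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then echo "$*"; else "$@"; fi
}

# ASSUMPTION: kale_table.qza is a placeholder for your feature table;
# kale_taxonomy.qza is the classification output named in this thread.
run qiime taxa filter-table \
    --i-table kale_table.qza \
    --i-taxonomy kale_taxonomy.qza \
    --p-exclude mitochondria,chloroplast \
    --o-filtered-table kale_table-no-mito-no-chloro.qza
```

The excluded terms should be checked against the labels your reference database actually uses for organelle sequences.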

ADL · February 26, 2018, 1:08pm

Thank you all very much for the help!
