q2-sidle running out of memory at reconstruct-counts

Hi everyone,

In the last couple of weeks I've been playing with the great q2-sidle, trying to combine information from six amplicons spanning all nine hypervariable regions (V1V2, V2V3, V3V4, V4V5, V5V7 and V7V9). I went smoothly through all the steps of the recommended pipeline up until qiime sidle reconstruct-counts.

Here is the command:

qiime sidle reconstruct-counts \
 --p-region V1V2 \
 --i-kmer-map Results/Sidle/db_V1V2_300nt-map.qza \
 --i-regional-alignment Results/Sidle/V1V2_align-map.qza \
 --i-regional-table Results/Sidle/V1V2_table-300nt.qza \
 --p-region V2V3 \
 --i-kmer-map Results/Sidle/db_V2V3_300nt-map.qza \
 --i-regional-alignment Results/Sidle/V2V3_align-map.qza \
 --i-regional-table Results/Sidle/V2V3_table-300nt.qza \
 --p-region V3V4 \
 --i-kmer-map Results/Sidle/db_V3V4_300nt-map.qza \
 --i-regional-alignment Results/Sidle/V3V4_align-map.qza \
 --i-regional-table Results/Sidle/V3V4_table-300nt.qza \
 --p-region V4V5 \
 --i-kmer-map Results/Sidle/db_V4V5_300nt-map.qza \
 --i-regional-alignment Results/Sidle/V4V5_align-map.qza \
 --i-regional-table Results/Sidle/V4V5_table-300nt.qza \
 --p-region V5V7 \
 --i-kmer-map Results/Sidle/db_V5V7_300nt-map.qza \
 --i-regional-alignment Results/Sidle/V5V7_align-map.qza \
 --i-regional-table Results/Sidle/V5V7_table-300nt.qza \
 --p-region V7V9 \
 --i-kmer-map Results/Sidle/db_V7V9_300nt-map.qza \
 --i-regional-alignment Results/Sidle/V7V9_align-map.qza \
 --i-regional-table Results/Sidle/V7V9_table-300nt.qza \
 --p-min-counts 0 \
 --p-block-size 10000 \
 --o-reconstructed-table Results/Sidle/full_table.qza \
 --o-reconstruction-summary Results/Sidle/full_summary.qza \
 --o-reconstruction-map Results/Sidle/full_map.qza

In every attempt, the process is Killed after 15-20h. I am running this on a machine with 512GB of RAM, but even so the problem is insufficient memory, as shown by the output of dmesg:

Out of memory: Killed process 45808 (qiime) total-vm:522386016kB, anon-rss:513560532kB, file-rss:0kB, shmem-rss:8kB 

I see this is a very common issue for anyone working with the SILVA database, and a common fix is to reduce the number of reads per batch by lowering the default --p-block-size of 10000. I tried this (2000, 1000 and 100) and nothing really changed.

Examining the log file (pasted below) and the script available on GitHub, the reason --p-block-size doesn't help becomes obvious: the script runs out of memory well before --p-block-size is even considered (at line 127, within _untangle_database_ids(), whereas --p-block-size is only used at line 151). Since the issue seems to be caused by a chunk of code I can't really modify, and since I don't have access to a machine with more than 512GB of RAM, I don't know what to try next.

Regional Alignments Loaded
Regional Kmers Loaded
UserWarning: resource_tracker: There appear to be 6 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
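
For anyone else debugging this, peak memory can also be measured directly by wrapping the command in GNU time (a sketch; it assumes /usr/bin/time is GNU time rather than the shell builtin, and reconstruct-counts.time.log is just an arbitrary file name):

/usr/bin/time -v qiime sidle reconstruct-counts \
 <same arguments as above> \
 2> reconstruct-counts.time.log

The "Maximum resident set size" line in the resulting log is the peak RSS in kB.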

Any suggestions? Am I missing something obvious? Any hint will be greatly appreciated.

P.S.: when running this command with fewer regions (for instance, V1V2 and V5V7 only), everything works great and the process ends within 5h.

Thank you very much,

Vitor

@vheidrich,

Can you provide more details about your SILVA database? How did you make it? Are you using the pre-made sequence and taxonomy files provided on the Data resources page, or making it yourself using RESCRIPt?

Given that the raw SILVA db is large, we'd recommend at least dereplicating the SILVA data as outlined in the RESCRIPt tutorial, to avoid memory issues like the ones you are observing.

Sure! I preprocessed the SILVA database with RESCRIPt, pretty much following this. More specifically, after getting the SILVA data (NR99, version 138) with qiime rescript get-silva-data, I:

  1. Removed low-quality seqs with qiime rescript cull-seqs
  2. Dereplicated identical seqs with qiime rescript dereplicate
  3. Extracted the specific region of the ref-seqs for each amplicon, based on the primer sequences, with qiime feature-classifier extract-reads

After these steps I have a high-quality, dereplicated SILVA database for each amplicon*. Running these preprocessed databases in parallel through qiime sidle prepare-extracted-region gives me the --i-kmer-maps referenced in the qiime sidle reconstruct-counts command above.

*Because I am running other analyses with these very same preprocessed databases, I am confident they are fine.
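
For completeness, the preprocessing looked roughly like this (a sketch: cull-seqs ran at its defaults in this first pass, the primer sequences in angle brackets are placeholders, the taxonomy file names and collapsed-kmers output name are illustrative, and the extract-reads / prepare-extracted-region pair is repeated for each of the six amplicons):

qiime rescript cull-seqs \
 --i-sequences silva-138-ssu-nr99-seqs.qza \
 --o-clean-sequences silva-138-ssu-nr99-seqs-cleaned.qza

qiime rescript dereplicate \
 --i-sequences silva-138-ssu-nr99-seqs-cleaned.qza \
 --i-taxa silva-138-ssu-nr99-tax.qza \
 --p-mode 'uniq' \
 --o-dereplicated-sequences silva-138-ssu-nr99-seqs-derep-uniq.qza \
 --o-dereplicated-taxa silva-138-ssu-nr99-tax-derep-uniq.qza

qiime feature-classifier extract-reads \
 --i-sequences silva-138-ssu-nr99-seqs-derep-uniq.qza \
 --p-f-primer <V1V2-forward-primer> \
 --p-r-primer <V1V2-reverse-primer> \
 --p-min-length 100 \
 --p-max-length 600 \
 --o-reads silva-138-ssu-nr99-seqs_V1V2.qza

qiime sidle prepare-extracted-region \
 --i-sequences silva-138-ssu-nr99-seqs_V1V2.qza \
 --p-region V1V2 \
 --p-fwd-primer <V1V2-forward-primer> \
 --p-rev-primer <V1V2-reverse-primer> \
 --p-trim-length 300 \
 --o-collapsed-kmers db_V1V2_300nt-seqs.qza \
 --o-kmer-map db_V1V2_300nt-map.qza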

I hope this clarifies the situation. Thank you for your help!

That all sounds good.

Of note, q2-sidle recommends retaining sequences that have no more than 3 ambiguous bases, whereas RESCRIPt's default is to remove sequences containing 5 or more ambiguous bases. That is, RESCRIPt allows sequences with 4 or fewer ambiguous bases.

So, you can either add the following command after your rescript cull-seqs command:
sidle filter-degenerate-sequences --p-max-degen 3 ...

or simply rerun the following RESCRIPt command, which is equivalent to the above q2-sidle command:
rescript cull-seqs --p-num-degenerates 4 ...
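
With the file names from this thread, and with the input/output option names assumed from QIIME 2's usual conventions, these would look something like:

qiime sidle filter-degenerate-sequences \
 --i-sequences silva-138-ssu-nr99-seqs-cleaned.qza \
 --p-max-degen 3 \
 --o-filtered-sequences silva-138-ssu-nr99-seqs-degen3.qza

qiime rescript cull-seqs \
 --i-sequences silva-138-ssu-nr99-seqs.qza \
 --p-num-degenerates 4 \
 --o-clean-sequences silva-138-ssu-nr99-seqs-cleaned.qza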

Yes, the 4 is correct, as we are removing sequences with 4 or more ambiguous bases (i.e. allowing a maximum of 3).

This may not solve the memory issue though.

It doesn’t hurt to try! I will let you know if filtering out sequences with 4 or more ambiguous bases (instead of 5 or more) helps.

Hi @SoilRotifer,

After further filtering the database as you suggested, the memory issue persists. This is no surprise, given that allowing fewer ambiguous bases only reduces the output of rescript cull-seqs (...seqs-cleaned.qza) from 161 MB (default --p-num-degenerates) to 160 MB (--p-num-degenerates 4).

Anyway, I totally missed that q2-sidle recommendation, so thank you for the tip.

Hi @vheidrich,

It sounds like shrinking the database hasn't helped. Can you give me a better sense of its dimensions?
My test sets were relatively small, so it may be that the code needs to be refactored to better handle data of this size. There's also a slower solution with potentially better memory handling that could be explored. I'm sorry it's having memory issues, and I'm thankful that people are using the tool and finding its limits.

Best,
Justine

Hi @jwdebelius,

I am running some additional tests right now. I will get back to you next week with a detailed response. Thanks

Hi again @jwdebelius,

After getting SILVA data (NR99, version 138) with rescript get-silva-data:

189M silva-138-ssu-nr99-seqs.qza

After removing low-quality seqs with rescript cull-seqs with --p-num-degenerates 4 (as pointed out by @SoilRotifer):

160M silva-138-ssu-nr99-seqs-cleaned.qza

After dereplicating identical seqs with rescript dereplicate:

148M silva-138-ssu-nr99-seqs-derep-uniq.qza

After extracting specific regions for each amplicon with feature-classifier extract-reads (length 100-600nt):

180M (total)
15M silva-138-ssu-nr99-seqs_V1V2.qza
37M silva-138-ssu-nr99-seqs_V2V3.qza
41M silva-138-ssu-nr99-seqs_V3V4.qza
43M silva-138-ssu-nr99-seqs_V4V5.qza
31M silva-138-ssu-nr99-seqs_V5V7.qza
17M silva-138-ssu-nr99-seqs_V7V9.qza

When using this set of reference databases to build the --i-kmer-maps, sidle reconstruct-counts runs out of memory as already detailed.

I then tried to progressively shrink the slices of the SILVA database by playing with the feature-classifier extract-reads parameters. As you can see below, each attempt only slightly reduced the databases, which was not enough to prevent the memory issue.

First, I kept only reads of length 200-500nt in feature-classifier extract-reads (note that previously I was allowing 100-600nt):

170M (total)
15M silva-138-ssu-nr99-seqs_V1V2.qza
36M silva-138-ssu-nr99-seqs_V2V3.qza
38M silva-138-ssu-nr99-seqs_V3V4.qza
35M silva-138-ssu-nr99-seqs_V4V5.qza
31M silva-138-ssu-nr99-seqs_V5V7.qza
17M silva-138-ssu-nr99-seqs_V7V9.qza

Then, in addition to keeping only reads of length 200-500nt, I increased the extract-reads --p-identity from the default of 0.8 to 0.9:

159M (total)
13M silva-138-ssu-nr99-seqs_V1V2.qza
33M silva-138-ssu-nr99-seqs_V2V3.qza
38M silva-138-ssu-nr99-seqs_V3V4.qza
35M silva-138-ssu-nr99-seqs_V4V5.qza
30M silva-138-ssu-nr99-seqs_V5V7.qza
12M silva-138-ssu-nr99-seqs_V7V9.qza
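
For reference, each per-amplicon extraction in this last attempt looked roughly like this (primer sequences again as placeholders; V3V4 shown as an example):

qiime feature-classifier extract-reads \
 --i-sequences silva-138-ssu-nr99-seqs-derep-uniq.qza \
 --p-f-primer <V3V4-forward-primer> \
 --p-r-primer <V3V4-reverse-primer> \
 --p-min-length 200 \
 --p-max-length 500 \
 --p-identity 0.9 \
 --o-reads silva-138-ssu-nr99-seqs_V3V4.qza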

In summary, I ran sidle reconstruct-counts with these 3 different sets of regional databases, and the process ran out of memory after 20-24h (not correlated with the databases' total size), at the same line of code every time.

I hope this explanation is useful, and that I can somehow manage to use q2-sidle in the near future.

Thank you!

Thanks @vheidrich!

Would you be willing to send me your regional alignments, kmer maps, and maybe either your rep seqs or a single sample from your feature table, so I can work against the larger data set? I think you can DM them to me.

Thanks,
Justine
