Monday, 9 September 2013

SGE h_vmem vs java -Xmx -Xms

We have a couple of SGE clusters running various versions of RHEL at my
work, and we're testing a new one with a newer Red Hat release. On the old
cluster ("Centos release 5.4"), I'm able to submit a job like the
following one and it runs fine:
echo "java -Xms8G -Xmx8G -jar blah.jar ..." | qsub ... -l h_vmem=10G,virtual_free=10G ...
On the new cluster ("CentOS release 6.2 (Final)"), a job with those
parameters fails because it runs out of memory, and I have to raise the
limit to h_vmem=17G for it to succeed. The new nodes have about
3x the RAM of the old nodes, and in testing I'm only submitting a couple of
jobs at a time.
On the old cluster, if I set -Xms/-Xmx to N, I could use N+1 or so
for h_vmem. On the new cluster, the job seems to crash unless I set
h_vmem to 2N+1.
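To make the arithmetic concrete, here is a small Python sketch of that rule of thumb. The function name and the "old"/"new" labels are made up for illustration, and the formulas are only what I observed, not anything SGE or the JVM documents.

```python
# Observed rule of thumb only; the cluster labels and the function name
# are hypothetical and exist just for this sketch.
def observed_h_vmem_gb(heap_gb, cluster):
    """Return the approximate h_vmem (in GB) that worked for a heap of N GB."""
    if cluster == "old":
        return heap_gb + 1        # N+1 or so was enough on the old cluster
    if cluster == "new":
        return 2 * heap_gb + 1    # about 2N+1 was needed on the new cluster
    raise ValueError("unknown cluster: %r" % cluster)


for name in ("old", "new"):
    print("%s cluster: -Xms8G -Xmx8G -> h_vmem=%dG"
          % (name, observed_h_vmem_gb(8, name)))
```

For an 8G heap this gives 17G on the new cluster, which matches the value that finally worked there.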
I wrote a tiny Perl script that does nothing but progressively consume
more memory, periodically printing the memory used, until it crashes or
reaches a limit. With that script, the h_vmem parameter makes it crash at
the expected memory usage.
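A rough Python sketch of that kind of test follows. It is not the Perl script itself, which isn't shown here, so the function name, chunk size, and reporting interval are all assumptions.

```python
import sys


def eat_memory(limit_mb, chunk_mb=10, report_every=5, out=sys.stdout):
    """Allocate chunk_mb at a time, reporting progress, until limit_mb or failure.

    Returns the number of megabytes allocated when it stopped.
    """
    chunks = []          # keep references so nothing is garbage-collected
    allocated_mb = 0
    try:
        while allocated_mb + chunk_mb <= limit_mb:
            # each chunk is a zero-filled buffer of chunk_mb megabytes
            chunks.append(bytearray(chunk_mb * 1024 * 1024))
            allocated_mb += chunk_mb
            if len(chunks) % report_every == 0:
                out.write("allocated %d MB\n" % allocated_mb)
                out.flush()
    except MemoryError:
        out.write("allocation failed after %d MB\n" % allocated_mb)
    return allocated_mb


# 50 MB is an arbitrary small limit so the sketch is safe to run anywhere.
eat_memory(limit_mb=50)
```

Submitted through qsub with a low h_vmem and a higher limit_mb, I would expect something like this to stop near the h_vmem value, either with the allocation-failure message or by being killed, but I have only confirmed that behaviour with the Perl version.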
I've tried multiple versions of the JVM (1.6 and 1.7). If I omit the
h_vmem, it works, but then things are riskier to run.
I have searched Google and found others who have seen similar issues, but
no resolutions.
