At Datacratic, part of our infrastructure runs in the cloud. Our elastic cluster is managed by StarCluster and job dispatch is managed by Open Grid Scheduler (OGS), a fork of Sun Grid Engine (SGE).
While I was looking at StarCluster’s output to help a new user, I noticed a lot of timeouts. I dug a bit and found that the qacct command was too slow. Following that lead, I learned that qacct parses a text file each time it is executed.
After a few years of operation, our main cluster has dispatched over 5 000 000 jobs. The log file parsed by qacct was about 2 GB in size. Tada! A couple of search queries taught me that OGS ships with a script in its install directory, <path to ogs>/util/logchecker.sh, ready to be configured, that rotates its logs. I configured and launched it. The timeouts are gone.
Lesson learned for StarCluster/OGS operators: configure that script and schedule it to run regularly if you want to keep your operations stable.
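To catch this before the timeouts show up, a small check on the size of the accounting file can help. The following is only a sketch under a few assumptions: that the file qacct reads sits at the conventional location $SGE_ROOT/$SGE_CELL/common/accounting, that /opt/sge is an acceptable fallback when SGE_ROOT is unset (a placeholder, adjust it for your install), and that 500 MB is a reasonable warning threshold (an arbitrary value, not something we measured). The function names are made up for illustration.

```python
import os
from pathlib import Path


def accounting_file_path(sge_root=None, sge_cell=None):
    """Build the assumed accounting file path: <root>/<cell>/common/accounting."""
    # "/opt/sge" is a hypothetical fallback; "default" is the usual cell name.
    root = sge_root or os.environ.get("SGE_ROOT", "/opt/sge")
    cell = sge_cell or os.environ.get("SGE_CELL", "default")
    return Path(root) / cell / "common" / "accounting"


def needs_rotation(path, max_bytes=500 * 1024 * 1024):
    """Return True when the file exists and is larger than max_bytes."""
    path = Path(path)
    if not path.exists():
        return False
    return path.stat().st_size > max_bytes


if __name__ == "__main__":
    path = accounting_file_path()
    if needs_rotation(path):
        print(f"{path} exceeds the threshold; time to rotate the logs")
    else:
        print(f"{path} is under the threshold (or does not exist)")
```

Run from cron or a monitoring job on the master node, a check like this only tells you when the file is getting large. The rotation itself is still done by the script that ships with OGS.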