[rancid] Dealing with rancid dying under heavy load

Alan McKinnon alan.mckinnon at gmail.com
Sat Feb 22 07:02:16 UTC 2014


Hi,

Recently I had a moment of over-zealous enthusiasm and turned PAR_COUNT
up (from 30 to 50) to get rancid-run to complete quicker. Sadly,
about 1 in 10 instances of rancid started failing mysteriously. The logs
mostly just give the dreaded "End of run not found" message; a few show
strange password errors for a device that is configured correctly.
Running rancid on these devices manually when the host is idle always
completes correctly.

In each case I find that the *rancid parser did run and at least
launched *clogin, so I suspect memory or IO issues under load are causing
the scripts to fail. I want to patch the code to trap and report these
errors if possible.
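
To make "trap and report" concrete, here is a rough sketch of the kind of
check I have in mind. It is written in Python purely for illustration
(rancid itself is Perl and Expect), and the names describe_exit and
run_and_report are made up for this example; nothing here is existing
rancid code. The idea is to tell apart a child that could not be started,
one that was killed by a signal, and one that exited non-zero.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a subprocess return code into a human-readable reason."""
    if returncode == 0:
        return "exited normally"
    if returncode < 0:
        # On POSIX, subprocess reports death-by-signal as a negative code.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = "unknown signal"
        return "killed by signal %d (%s)" % (-returncode, name)
    return "exited with status %d" % returncode


def run_and_report(argv, timeout=None):
    """Run argv and return (ok, reason, stderr_text).

    Hypothetical helper: a real patch would hook into the existing
    Perl wrappers instead of spawning commands this way.
    """
    try:
        proc = subprocess.run(
            argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout,
        )
    except OSError as exc:
        # The child could not be started at all, e.g. the binary is
        # missing, or fork failed because the host is out of resources.
        return False, "could not start: %s" % exc, ""
    except subprocess.TimeoutExpired:
        return False, "timed out after %s seconds" % timeout, ""
    reason = describe_exit(proc.returncode)
    stderr_text = proc.stderr.decode(errors="replace")
    return proc.returncode == 0, reason, stderr_text


if __name__ == "__main__":
    # Demo with a child that deliberately exits with status 3.
    ok, reason, err = run_and_report(
        [sys.executable, "-c", "import sys; sys.exit(3)"]
    )
    print(ok, reason)
```

If the failures under load turn out to be fork or memory problems, I would
expect them to show up in the "could not start" or "killed by signal"
branches rather than as an ordinary non-zero exit, which is exactly the
distinction the current logs do not give me.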

We all know how tricky this can get. Is anyone in a position to discuss
how best to proceed? I'll do the heavy lifting of coding and testing, but
I do want to pick others' brains first :-)



Background info:

FreeBSD 8.0-STABLE
perl-5.8.9
expect-5.44.1.15
rancid-2.3.8
rancid hosts are VMs on ESXi-4-something: 1 CPU, 1 NIC, 1 GB RAM
the NIC is gigabit full-duplex.

"load" tends to run rather high, easily getting to 10+ and frightening
newbie sysadmins, but this has never been a problem in the past as
rancid is IO-bound anyway and scripts tend to spin along till they complete.
4M Total, 28K Used, 1024M Free


-- 
Alan McKinnon
alan.mckinnon at gmail.com


