[rancid] Dealing with rancid dying under heavy load

Sat Feb 22 07:35:23 UTC 2014

Sat, Feb 22, 2014 at 09:02:16AM +0200, Alan McKinnon:
> Hi,
> 
> Recently I had a moment of over-zealous enthusiasm and turned PAR_COUNT
> up higher (from 30 to 50) to get rancid-run to complete quicker. Sadly,
> about 1 in 10 instances of rancid started failing mysteriously. The logs
> mostly just give the dreaded "End of run not found" message, a few have
> strange password errors for a device that is configured correctly.
> Running rancid on these manually where the host is idle always completes
> correctly.
> 
> In each case I find that the *rancid parser did run and minimally
> launched *clogin, so I suspect memory or IO issues under load causing
> scripts to fail. I want to patch the code to trap and report these
> errors if possible.
> 
> We all know how tricky this can get, anyone in a position to discuss how
> best to proceed? I'll do the heavy lifting of coding and testing, I do
> want to pick others' brains first :-)
> 
> 
> 
> Background info:
> 
> FreeBSD 8.0-STABLE
> perl-5.8.9
> expect-5.44.1.15
> rancid-2.3.8
> rancid hosts are VMs on ESXi-4-something, 1 cpu, 1 nic, 1g RAM
> the NIC is gigaBit full-duplex.
> 
> "load" tends to run rather high, easily getting to 10+ and frightening
> newbie sysadmins, but this has never been a problem in the past as
> rancid is IO-bound anyway and scripts tend to spin along till they complete.
> 4M Total, 28K Used, 1024M Free

are the targets VMs, or is the rancid host a VM?

unless the host can't keep up with the expect processes well enough to stay
within the login script's timeout period (or what you've set in cloginrc),
it should not fail - but i haven't tried 30 or 40; i usually set PAR_COUNT
to about kern.smp.cpus.
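
for reference, a rough sketch of the two knobs mentioned above. the host
pattern and the timeout value are made-up placeholders, not recommendations,
and the exact syntax should be checked against rancid.conf(5) and
cloginrc(5) for the installed version:

```
# rancid.conf: how many devices are collected in parallel
# (30 is the value reported to work before the change)
PAR_COUNT=30; export PAR_COUNT

# .cloginrc: seconds the login script waits for input from the device
# (*.example.net and 90 are placeholders)
add timeout *.example.net 90
```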

if the rancid host is a VM and it appears to be timing out despite plenty
of spare cpu and net, figure out whether it's actually timing out on wall
clock time or because of something else, missed interrupts for example.
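
one way to start on the wall clock side is to time a single login while
the full run is in progress and compare it with the configured timeout.
the python sketch below is only an illustration: time_command is a
hypothetical helper, not part of rancid, and it measures the whole command
rather than each individual wait inside the login script, so treat the
number as a rough upper bound. it does not detect missed interrupts by
itself; that would need a comparison against a clock outside the VM.

```python
import resource
import subprocess
import sys
import time


def time_command(argv, timeout_s):
    """Run argv once; return (wall_seconds, child_cpu_seconds, exceeded)."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.monotonic()
    # Output is discarded; only the timing matters here.
    subprocess.run(argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    wall = time.monotonic() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    # CPU time actually consumed by the child (user + system).
    cpu = (after.ru_utime - before.ru_utime) + (after.ru_stime - before.ru_stime)
    return wall, cpu, wall > timeout_s


if __name__ == "__main__":
    # usage (hypothetical script name): timecmd.py <timeout-seconds> <command> [args...]
    limit = float(sys.argv[1])
    wall, cpu, exceeded = time_command(sys.argv[2:], limit)
    print("wall=%.2fs cpu=%.2fs exceeded_timeout=%s" % (wall, cpu, exceeded))
```

a large wall time with very little cpu time suggests the process was mostly
waiting (on the device, the network, or the scheduler) rather than
computing.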