[rancid] Dealing with rancid dying under heavy load

Alan McKinnon alan.mckinnon at gmail.com
Sat Feb 22 08:44:40 UTC 2014


On 22/02/2014 09:35, heasley wrote:
> Sat, Feb 22, 2014 at 09:02:16AM +0200, Alan McKinnon:
>> Hi,
>>
>> Recently I had a moment of over-zealous enthusiasm and turned PAR_COUNT
>> up higher (from 30 to 50) to get rancid-run to complete quicker. Sadly,
>> about 1 in 10 instances of rancid started failing mysteriously. The logs
>> mostly just give the dreaded "End of run not found" message, a few have
>> strange password errors for a device that is configured correctly.
>> Running rancid on these manually where the host is idle always completes
>> correctly.
>>
>> In each case I find that the *rancid parser did run and minimally
>> launched *clogin, so I suspect memory or IO issues under load causing
>> scripts to fail. I want to patch the code to trap and report these
>> errors if possible.
>>
>> We all know how tricky this can get, anyone in a position to discuss how
>> best to proceed? I'll do the heavy lifting of coding and testing, I do
>> want to pick other's brains first :-)
>>
>>
>>
>> Background info:
>>
>> FreeBSD 8.0-STABLE
>> perl-5.8.9
>> expect-5.44.1.15
>> rancid-2.3.8
>> rancid hosts are VMs on ESXi-4-something, 1 cpu, 1 nic, 1g RAM
the NIC is gigabit full-duplex.
>>
>> "load" tends to run rather high, easily getting to 10+ and frightening
>> newbie sysadmins, but this has never been a problem in the past as
>> rancid is IO-bound anyway and scripts tend to spin along till they complete.
>> 4M Total, 28K Used, 1024M Free
> 
> the targets are VMs or the rancid host is a VM?

The rancid host is a VM. The targets are Cisco kit.

> unless the host can't keep up with the expect processes well enough to stay
> within the login script's timeout period (or what you've set in cloginrc),
> it should not fail - but I haven't tried 30 or 40, usually kern.smp.cpus

That was my thought too - it shouldn't fail and timeouts shouldn't
happen. These targets are all on fast networks (usually gigabit, some
100M) and all respond quickly.
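For anyone following along: if timeouts under load do turn out to be the
cause, the expect timeout can be raised globally in .cloginrc rather than
patching the scripts. A minimal sketch (the 120-second value is an
arbitrary example, not a recommendation):

```
# ~/.cloginrc -- raise clogin's expect timeout (in seconds) for all
# devices; the default is quite short and can be exceeded when the
# rancid host itself is starved for CPU or IO
add timeout * 120
```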

I might have hit a threshold on the ESXi host; I don't have visibility
into that environment and can't see what else it's hosting.

> if the rancid host is a VM and assuming it is timing out, but plenty of
> spare cpu and net; figure out if it's actually timing out due to wall clock
> time, vs. missing interrupts for example.

I'll look into that. I'd discounted simple timeouts because another
rancid system, one that deals with kit out in the field, runs with
PAR_COUNT=50 and has been tested as high as 100. Some of that kit is
painfully slow; I've seen show runs take 10 minutes to complete, and the
system deals with it as expected.
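For reference, the parallelism being discussed is set in rancid.conf,
which rancid-run sources at startup. A sketch of the relevant line (30
is the value this host was running before the change):

```
# /usr/local/etc/rancid.conf -- number of devices rancid-run collects
# in parallel; raising this increases throughput but also peak load,
# memory, and IO on the rancid host
PAR_COUNT=30; export PAR_COUNT
```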

Both systems have the same OS config and both run as VMs in the same
environment. I think step one is to get monitoring graphs out of the
VMWare team.

-- 
Alan McKinnon
alan.mckinnon at gmail.com
