[rancid] rancid-run repeating to test device on errors that are not recoverable

heasley heas at shrubbery.net
Fri Jan 13 18:17:37 UTC 2017


Fri, Jan 13, 2017 at 06:38:42PM +0200, Alan McKinnon:
> On 13/01/2017 16:16, Mischa Diehm wrote:
> > Hi,
> > 
> > trying to get the logs from rancid into our monitoring system I noticed
> > that rancid would try to login to systems $ROUND times even though the
> > error is clear in terms of being unrecoverable during a rancid-run e.g.:
> > 
> > rancid@noc-XXX:~/logs$ grep 'Update the SSH known_hosts file accordingly.' RZ-ROUTER.20170113.054748 | grep routerXYZ
> > routerXYZ-fa-0-1.urz.p.unibas.ch clogin error: Error: The host key for
> > routerXYZ-fa-0-1.urz.p.unibas.ch has changed.  Update the SSH
> > known_hosts file accordingly.
> > routerXYZ-fa-0-1.urz.p.unibas.ch clogin error: Error: The host key for
> > routerXYZ-fa-0-1.urz.p.unibas.ch has changed.  Update the SSH
> > known_hosts file accordingly.
> > routerXYZ-fa-0-1.urz.p.unibas.ch clogin error: Error: The host key for
> > routerXYZ-fa-0-1.urz.p.unibas.ch has changed.  Update the SSH
> > known_hosts file accordingly.
> > routerXYZ-fa-0-1.urz.p.unibas.ch clogin error: Error: The host key for
> > routerXYZ-fa-0-1.urz.p.unibas.ch has changed.  Update the SSH
> > known_hosts file accordingly.
> > routerXYZ-fa-0-1.urz.p.unibas.ch clogin error: Error: The host key for
> > routerXYZ-fa-0-1.urz.p.unibas.ch has changed.  Update the SSH
> > known_hosts file accordingly.
> > 
> > In our case MAX_ROUNDS=4… I checked but couldn’t find a fast, easy way
> > to fix this. The same goes for „check your password“ et al. What do you
> > think? Is there an easy way to prevent retrying in case of unrecoverable
> > errors?
> 
> I don't see the retries as being especially problematic. *login will try
> and fail the known_hosts test many tens of times in the time it takes to
> retrieve one router's config. The extra processing effort is very little
> indeed, almost below the noise floor.

I suggest that it is not worth any effort to catch all possible failures in
all possible environments.  Just let it fail.  If it's wall-clock time that is
an issue for you, reduce MAX_ROUNDS to 1 or raise PAR_COUNT, or both.
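
As a rough sketch of that change, assuming your rancid.conf uses the usual
"VAR=value; export VAR" shell style, it might look like the fragment below.
The PAR_COUNT value of 10 is only an illustration, not a recommendation;
pick something higher than what you run now and that your host can sustain.

```shell
# rancid.conf fragment (the file is sourced as shell); values are illustrative
MAX_ROUNDS=1; export MAX_ROUNDS    # fewer retry rounds for devices that fail
PAR_COUNT=10; export PAR_COUNT     # assumed value: more devices collected in parallel
```

Devices that fail for a permanent reason (changed host key, bad password)
will still show up in the logs, just fewer times per run.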


