Re: sentinel

Kate Scholberg (kate@riscgs1.lngs.infn.it)
Mon, 29 May 1995 10:29:46 +0200 (DFT)

Hi folks,

> I just killed SENTINEL et al. on VXMACB. For some reason, there were
> two batch entries for uVax3, one for uVax2 and none for uVax1. However
> there were more than 100 batch entries pending for uVax1. After checking
> with the run (to make sure that all 3 uVax were indeed in ACQ) I killed
> all executing and pending jobs and restarted the SENTINEL at
> May 27, 1995 23:34 (VXMACB time). The run number was 10073 and it was
> ~2 hours old.
> --Erik
>

This is a known problem. SENTINEL uses the Vax system service routine
GET$JPIW every 15 minutes or so to check for the presence of the RCD
buffer-collecting processes. However, occasionally GET$JPIW fails and
SENTINEL thinks that an RCDGC process is absent even though it isn't,
and relaunches the RCDGC process. The RCDGC process then sits pending
on the VXMACB_TESTQUEUE queue. Normally this is harmless, but
occasionally something like this can happen: RCDGC1, 2, and 3 are
sitting happily on the queue. Say RCDGC2 gets spuriously relaunched
and is pending. Now if, say, RCDGC3 dies for legitimate reasons (the
microvax got rebooted, for instance), then the pending RCDGC2 moves up
to take its place in the queue before SENTINEL gets a chance to
relaunch RCDGC3. We end up with two RCDGC2's and no RCDGC3 process,
and lose sensitivity to uVax3 events.
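
The sequence above can be sketched as a toy model. This is only an
illustration in Python, not SENTINEL's actual code: the BatchQueue
class, its method names, and the job limit of 3 are all assumptions
made up for the example.

```python
from collections import deque


class BatchQueue:
    """Toy batch queue: a fixed number of execution slots plus a FIFO
    of pending entries. (Hypothetical model, not the real queue.)"""

    def __init__(self, job_limit):
        self.job_limit = job_limit   # assumed number of execution slots
        self.executing = []          # names of jobs currently running
        self.pending = deque()       # names of jobs waiting for a slot

    def submit(self, name):
        # A new entry runs at once if a slot is free, otherwise it waits.
        if len(self.executing) < self.job_limit:
            self.executing.append(name)
        else:
            self.pending.append(name)

    def job_dies(self, name):
        # The dead job frees a slot; the oldest pending entry takes it.
        self.executing.remove(name)
        if self.pending:
            self.executing.append(self.pending.popleft())


def simulate():
    queue = BatchQueue(job_limit=3)
    for name in ("RCDGC1", "RCDGC2", "RCDGC3"):
        queue.submit(name)
    # The presence check wrongly reports RCDGC2 missing, so a second
    # copy is submitted and sits pending.
    queue.submit("RCDGC2")
    # RCDGC3 dies for a legitimate reason before the next check.
    queue.job_dies("RCDGC3")
    return queue


if __name__ == "__main__":
    q = simulate()
    print("executing:", q.executing)
    print("pending:  ", list(q.pending))
```

In this model the run ends with RCDGC1 and two copies of RCDGC2
executing and no RCDGC3, which is the state described above.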

I'm not sure exactly why GET$JPIW doesn't always return the correct
answer as to whether or not a process is there -- Alec's theory is
that GET$JPIW will miss the process if the process is swapped out,
which will happen when VXMACB is having memory problems, and this
sounds plausible enough to me. In any case, we should try to find a
more reliable system service for this function to avoid this
problem.

In the past this problem has happened infrequently enough that fixing
it hasn't been very high up on the priority list, but it's happened
twice in the past month, so fixing it gets notched up there. I'll try
to get on it soon. Meanwhile, I'll keep an eye on the queue and kill
any pending processes I find (in fact I check the queue several times
a day anyway).
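
The manual queue check could be sketched roughly as below. This is a
hypothetical Python helper, not an existing tool: audit_queue, the
(name, state) listing format, and the sample entries are assumptions
for illustration, and turning a real queue listing into such pairs is
left out.

```python
from collections import Counter


def audit_queue(entries, expected):
    """Report problems in a queue listing.

    entries  -- list of (job_name, state) pairs, where state is
                "executing" or "pending" (assumed format)
    expected -- job names that should each be executing exactly once
    """
    running = Counter(name for name, state in entries if state == "executing")
    return {
        # expected jobs with no executing entry at all
        "missing": [name for name in expected if running[name] == 0],
        # expected jobs with more than one executing entry
        "duplicated": [name for name in expected if running[name] > 1],
        # every pending entry, candidates for being killed
        "pending": [name for name, state in entries if state == "pending"],
    }


# Sample listing loosely modelled on the state reported in the quoted
# message: two entries for uVax3, one for uVax2, none for uVax1.
entries = [
    ("RCDGC2", "executing"),
    ("RCDGC3", "executing"),
    ("RCDGC3", "executing"),
    ("RCDGC1", "pending"),
]
expected = ["RCDGC1", "RCDGC2", "RCDGC3"]

if __name__ == "__main__":
    print(audit_queue(entries, expected))
```

For the sample listing this flags RCDGC1 as missing, RCDGC3 as
duplicated, and one pending RCDGC1 entry.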

Kate.