ERP monitor upgrades

Kate Scholberg (kate@cithe501.cithep.caltech.edu)
Thu, 11 Jul 1996 21:30:47 -0700 (PDT)

Hi all,

Looking back over the macro-erpgc archives I realized that I've been
remiss in documenting code modifications. So, for those interested,
here's the latest ERP monitor software update news:

-- I just updated the db on VXMACB used by the monitor to the latest
constants available, run 11881. These aren't so terribly recent, and
it's not clear that these will tamp down the hot boxes (55,203) as
much as could be desired. But I'll install some truly up-to-date
non-standard ones (the adjusted ones used for my offline analysis)
soon (next couple of weeks).

There have been several minor technical changes to the monitor code
over the last several months. Here are the more significant of them
(in semi-random order):

-- I added end-of-run GC buffer collection and decoding; the
end-of-run buffers are reported as having event number 1.

-- Buffer collection efficiency declined from about November '95 to
March '96, with typical efficiencies falling to about 60%. The cause
turned out to be VXMACB being loaded down with monitoring jobs:
because the machine was so busy, the RCD collection processes were
missing buffers. Sandro Marini fixed this in early March by
increasing the base priority of the dedicated ERP GC monitor queues
from 3 to 4. Since then efficiencies have been back up to >95%.

-- An additional intermittent problem (mentioned in the last memo) was
made worse by the queue priority increase: sometimes (maybe once or
twice per month) the RCD buffer collection processes go out of control
and start eating CPU. With the queue priority increase, a
high-priority runaway RCD process was enough to grind VXMACB to a
halt. I still do not understand what causes this to happen; the
problem is rare and irreproducible and so hard to track down.
However, I have fixed the problem (cured the symptoms anyway) by
adding code to SENTINEL which continually monitors the CPU consumption
of the RCD processes (using calls to the $GETJPI system service). If
it finds any one of them using more than 0.1% of the CPU (averaged over
several minutes), it kills the offending process. The RCD process
then gets relaunched as part of SENTINEL's normal operation; I also
get email when this happens. Since I implemented this fix in April,
I've gotten about 5 mail messages telling me that SENTINEL killed a
runaway RCDGC process.
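
For anyone curious how the runaway check works, here is a minimal
sketch of the idea in Python. It is not the SENTINEL code itself: the
class name, the sampling interval and the five-minute window are all
made up for illustration, and the caller is assumed to supply each
process's cumulative CPU time (which SENTINEL obtains through
$GETJPI).

```python
from collections import deque


class RunawayDetector:
    """Flag a process whose average CPU share over a sliding window
    exceeds a limit. Hypothetical helper, not the SENTINEL code."""

    def __init__(self, limit_fraction, window_seconds):
        self.limit = limit_fraction      # e.g. 0.001 for 0.1% of the CPU
        self.window = window_seconds     # averaging period in seconds
        self.samples = deque()           # (wall_time_s, cumulative_cpu_s)

    def update(self, wall_time, cpu_time):
        """Record one sample; return True if the process looks runaway."""
        self.samples.append((wall_time, cpu_time))
        # Drop old samples, keeping as baseline the newest sample that is
        # still at least one full window older than the current one.
        while (len(self.samples) >= 2
               and wall_time - self.samples[1][0] >= self.window):
            self.samples.popleft()
        base_time, base_cpu = self.samples[0]
        elapsed = wall_time - base_time
        if elapsed < self.window:
            return False                 # not enough history to average yet
        return (cpu_time - base_cpu) / elapsed > self.limit


# Example: sample once a minute, average over five minutes.
detector = RunawayDetector(limit_fraction=0.001, window_seconds=300)
cpu = 0.0
flags = []
for minute in range(6):                  # quiet process: 0.01 CPU s/minute
    cpu = 0.01 * minute
    flags.append(detector.update(60 * minute, cpu))
flags.append(detector.update(360, cpu + 30.0))  # burns 30 CPU s in a minute
```

Averaging over a window rather than acting on a single sample keeps a
short, legitimate burst of activity from triggering a kill.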

-- As noted a couple of months ago, Alec fixed the spark checker
to veto bursts with >= 0.5 of the hits in a given box.
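
The veto condition amounts to a simple fraction test. The following
Python sketch is only an illustration of that test, not Alec's code;
the function name and the representation of a burst as a list of box
numbers (one entry per hit) are assumptions.

```python
from collections import Counter


def is_spark_like(hit_boxes, max_fraction=0.5):
    """Return True if any single box holds >= max_fraction of the hits.

    hit_boxes: list of box numbers, one entry per hit in the burst
    (hypothetical input format).
    """
    if not hit_boxes:
        return False                      # an empty burst is never vetoed
    _, top_count = Counter(hit_boxes).most_common(1)[0]
    return top_count / len(hit_boxes) >= max_fraction


# Two of four hits in box 55 is exactly 0.5 of the hits, so this is vetoed.
vetoed = is_spark_like([55, 55, 203, 17])
# Hits spread evenly over three boxes are kept.
kept = not is_spark_like([55, 203, 17])
```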

-- The monitor now keeps track of the average rates from each SM over
approximately the last day, by looking at the output files
from previous runs (this code still needs a couple of checks). This
info is written out for large bursts, in order to be able to judge
whether background rates are atypical.
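
As a rough illustration of the averaging, here is a Python sketch. It
assumes each previous run has already been reduced to a small record
(end time, live time, counts per SM); the record layout, field names
and function name are hypothetical and do not reflect the monitor's
actual output file format.

```python
def average_sm_rates(runs, now, window_seconds=86400.0):
    """Average rate (counts per second) for each SM over the recent window.

    runs: iterable of dicts with keys
        "end_time"     -- run end time in seconds (same clock as `now`)
        "live_seconds" -- live time of the run
        "counts"       -- dict mapping SM number to event count
    Only runs that ended within `window_seconds` before `now` are used.
    """
    counts = {}
    live = {}
    for run in runs:
        if not (now - window_seconds <= run["end_time"] <= now):
            continue                      # run is too old (or in the future)
        for sm, n in run["counts"].items():
            counts[sm] = counts.get(sm, 0) + n
            live[sm] = live.get(sm, 0.0) + run["live_seconds"]
    # Weight by live time: total counts divided by total live time.
    return {sm: counts[sm] / live[sm] for sm in counts if live[sm] > 0}


example_runs = [
    {"end_time": 1000, "live_seconds": 100.0, "counts": {1: 999}},   # too old
    {"end_time": 50000, "live_seconds": 100.0, "counts": {1: 50, 2: 10}},
    {"end_time": 90000, "live_seconds": 300.0, "counts": {1: 150}},
]
rates = average_sm_rates(example_runs, now=100000)
```

Dividing summed counts by summed live time, rather than averaging the
per-run rates, keeps short runs from being over-weighted.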

-- Dan Levin has a beeper now too. He has set up code at Michigan to
look for burst mail messages and forward them to his alphanumeric beeper.

That's all I can think of right now.

Still on the to-do list:

-- Add in automated position uniformity parameters (separate postscript
note on this whole issue coming very soon).

-- Add in "probability in 10 years" values for large bursts found.
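
One possible way to compute such a value is sketched below in Python.
This is an assumption about what the number would mean, not a
description of anything implemented: it takes "probability in 10
years" to be the chance that Poisson background fluctuations alone
give at least one burst of n or more events in some window of the
given length during 10 years, treating successive non-overlapping
windows as independent trials. All function and parameter names are
hypothetical.

```python
import math

SECONDS_PER_YEAR = 365.25 * 86400.0


def poisson_pmf(k, mu):
    """Poisson probability of exactly k events for mean mu > 0."""
    return math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))


def poisson_tail(n, mu):
    """Poisson probability of n or more events for mean mu."""
    if n <= 0:
        return 1.0
    if mu <= 0:
        return 0.0
    if n <= mu:
        # Tail is large here, so the complement is accurate enough.
        return 1.0 - sum(poisson_pmf(k, mu) for k in range(n))
    # For n > mu the terms decrease from the start; sum them directly
    # so that very small tails are not lost to rounding.
    term = poisson_pmf(n, mu)
    total = 0.0
    k = n
    while term > 1e-17 * total:
        total += term
        k += 1
        term *= mu / k
    return total


def burst_probability_in_years(n, rate_hz, window_s, years=10.0):
    """Chance of at least one background burst of >= n events in `years`."""
    p = poisson_tail(n, rate_hz * window_s)       # chance per window
    if p >= 1.0:
        return 1.0
    trials = years * SECONDS_PER_YEAR / window_s  # non-overlapping windows
    # 1 - (1 - p)**trials, written to stay accurate when p is tiny.
    return -math.expm1(trials * math.log1p(-p))


# Assumed example numbers: 0.01 Hz background, 10 s window.
p5 = burst_probability_in_years(5, rate_hz=0.01, window_s=10.0)
p10 = burst_probability_in_years(10, rate_hz=0.01, window_s=10.0)
```

Counting only non-overlapping windows understates the number of
trials a sliding-window search makes, so this gives a rough scale
rather than a precise figure.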

-- The monitor analysis process still takes up a large fraction (20%)
of available CPU on VXMACB. I have some ideas for improving this.

Kate.