Diagnosis of DD033 processing


Subject: Diagnosis of DD033 processing
From: Erik Katsavounidis (Erik.Katsavounidis@lngs.infn.it)
Date: Sat Feb 12 2000 - 09:31:43 EST


This is a report on the processing of DD033.

From: VAXGS::MACROUSA "LNGS US MACRO group" 11-FEB-2000 19:34:15.02
Subj: Start COPY of FIRST (#1) file of DD033 (RUN011750)
From: VAXGS::MACROUSA "LNGS US MACRO group" 11-FEB-2000 22:45:14.05
Subj: Done COPY of LAST (#53) file of DD033 (RUN011797)

The job downloading the (ZEBRA) tape to disk lasted 3+ hours. I made a rough
estimate of the total volume of data transferred: DD033 had ~24 full-run
equivalents of data. Given that RD027 had ~17 full-run equivalents and lasted
~2 hours, things are not seriously off. However, we had better keep an eye on
this, with two things in mind:

1) RDs are in RAW format, DDs are in ZEBRA format. The tape unit most likely
   mounts them in a different way, and the I/O might be different.
2) DD033 was using DISK$SCRAUSA7 as output device. There was NO traffic other
   than the ZEBRA TAPE download; however, SCRAUSA7 has huge directories, which
   might be slowing down the whole process. This remains to be seen.
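
As a rough cross-check of the comparison above, the download rate can be
expressed in full-run equivalents per hour. The following is only a sketch
using the approximate figures quoted in this report; the function names are
hypothetical, and "3+ hours" is taken as ~3.2 h, which is an assumption based
on the COPY mail timestamps.

```python
def runs_per_hour(full_runs, hours):
    """Average download rate in full-run equivalents per hour."""
    return full_runs / hours


def expected_hours(full_runs, reference_runs, reference_hours):
    """Scale a reference download time linearly with the data volume."""
    return reference_hours * full_runs / reference_runs


# Approximate figures quoted in the report.
rd027_rate = runs_per_hour(17, 2.0)   # RAW tape: ~17 full runs in ~2 hours
dd033_rate = runs_per_hour(24, 3.2)   # ZEBRA tape: ~24 full runs, "3+ hours" assumed ~3.2 h
dd033_expected = expected_hours(24, 17, 2.0)

print(f"RD027: {rd027_rate:.1f} runs/h, DD033: {dd033_rate:.1f} runs/h")
print(f"DD033 download time expected from RD027 scaling: {dd033_expected:.1f} h")
```

With linear scaling the expectation is a bit under 3 hours, so the observed
3+ hours is only mildly slower than the RD027 rate.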

From: VAXGS::MACROUSA "LNGS US MACRO group" 11-FEB-2000 22:45:16.81
Subj: Submitted FIRST (#1) ZEB2RAW job of DD033 (RUN011750)
From: VAXGS::MACROUSA "LNGS US MACRO group" 12-FEB-2000 01:43:38.99
Subj: Submitted LAST (#53) ZEB2RAW job of DD033 (RUN011797)

Conversion of ZEBRA to RAW lasted 3 hours for the entire tape. This scales
OK given the volume of data processed.

The ZEB2RAW conversion yielded errors in 7 files, for which the ZEBRA file
was NOT deleted. In 5 of them, the error was traced to the DIFFERENCE
between the "reference" size expected for a RAW file (as read in the LGB)
and the actual size of the file produced. The tolerance (TOL 1) was set to
*one* VMS block; the reference-minus-actual difference in blocks (R-A...) was
between 44 and 81, i.e. of the order of 100. THIS WILL REQUIRE A BIT MORE
DEBUGGING FROM MY SIDE. EVENTUALLY THE THRESHOLD (TOLERANCE) WILL be raised
to 100-200 blocks.

SIZE RUN11774 DD033 RAW REF 645774 ACT 645730 ZEB ACT 663947 R-A 44 TOL 1-NO
SIZE RUN11780 DD033 RAW REF 645776 ACT 645709 ZEB ACT 663821 R-A 67 TOL 1-NO
SIZE RUN11784 DD033 RAW REF 645899 ACT 645818 ZEB ACT 663947 R-A 81 TOL 1-NO
SIZE RUN11787 DD033 RAW REF 645840 ACT 645766 ZEB ACT 664158 R-A 74 TOL 1-NO
SIZE RUN11789 DD033 RAW REF 645805 ACT 645739 ZEB ACT 664074 R-A 66 TOL 1-NO
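
The size check behind these lines can be illustrated with a short sketch. It
parses the SIZE lines exactly as printed above and compares REF minus ACT with
a tolerance in VMS blocks. The regular expression and the function name are
assumptions made for illustration; this is not the code used by the robot.

```python
import re

# Matches the "SIZE RUNnnnnn DDxxx RAW REF n ACT n ..." lines shown in the report.
SIZE_LINE = re.compile(
    r"SIZE\s+RUN(?P<run>\d+)\s+(?P<tape>\S+)\s+RAW\s+"
    r"REF\s+(?P<ref>\d+)\s+ACT\s+(?P<act>\d+)"
)


def check_size(line, tolerance_blocks):
    """Return (run, ref - act, within_tolerance) for one SIZE log line."""
    m = SIZE_LINE.search(line)
    if m is None:
        raise ValueError(f"not a SIZE line: {line!r}")
    diff = int(m["ref"]) - int(m["act"])
    return m["run"], diff, abs(diff) <= tolerance_blocks


lines = [
    "SIZE RUN11774 DD033 RAW REF 645774 ACT 645730 ZEB ACT 663947 R-A 44 TOL 1-NO",
    "SIZE RUN11780 DD033 RAW REF 645776 ACT 645709 ZEB ACT 663821 R-A 67 TOL 1-NO",
    "SIZE RUN11784 DD033 RAW REF 645899 ACT 645818 ZEB ACT 663947 R-A 81 TOL 1-NO",
    "SIZE RUN11787 DD033 RAW REF 645840 ACT 645766 ZEB ACT 664158 R-A 74 TOL 1-NO",
    "SIZE RUN11789 DD033 RAW REF 645805 ACT 645739 ZEB ACT 664074 R-A 66 TOL 1-NO",
]

# Compare the current tolerance (1 block) with the lower end of the proposed range.
for tol in (1, 100):
    failed = [run for run, _, ok in (check_size(l, tol) for l in lines) if not ok]
    print(f"tolerance {tol:3d} blocks: {len(failed)} of {len(lines)} runs fail")
```

With a tolerance of 1 block all five runs fail; with 100 blocks all five pass.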

In the following two runs, the processing found the string "system error"
in the LOG file, which corresponded to the dump of a MACRO alarm record
rather than to a FORTRAN error. I HAVE CHANGED THE SANITY CHECK LOGIC SO
THAT THIS WILL NEVER HAPPEN AGAIN, i.e., only the FORTRAN system errors
will be searched for. Since the errors were spurious, I deleted the ZEBRA
files of the following two runs from disk.

ZEB2RAW LOG ERRORS RUN11776 DD033 SIZE= 1-NO NO-ZEB-DELETE
ZEB2RAW LOG ERRORS RUN11779 DD033 SIZE= 3-NO NO-ZEB-DELETE
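
The idea of the revised sanity check can be sketched as follows: instead of
flagging any occurrence of "system error", only lines that look like FORTRAN
run-time messages are flagged. This is a hypothetical illustration, not the
actual robot logic. The "%FOR-" message prefix is an assumption about how
FORTRAN errors appear in the LOG files, and both sample log fragments are
invented.

```python
import re

# ASSUMPTION: FORTRAN run-time errors show up with the VMS "%FOR-" facility
# prefix at error (E) or fatal (F) severity. The real check may differ.
FORTRAN_ERROR = re.compile(r"^%FOR-[EF]-", re.MULTILINE)


def log_has_fortran_error(log_text, pattern=FORTRAN_ERROR):
    """True if the log contains a line matching the FORTRAN error pattern."""
    return pattern.search(log_text) is not None


# Invented log fragments, for illustration only.
alarm_dump_log = "ALARM RECORD DUMP: system error reported by acquisition\n"
fortran_log = "processing event buffer\n%FOR-F-OPEFAI, open failure\n"

print(log_has_fortran_error(alarm_dump_log))  # alarm dump only
print(log_has_fortran_error(fortran_log))     # genuine FORTRAN error line
```

A bare search for "system error" would flag the first fragment; the
prefix-based search does not.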

As soon as ZEB2RAW was completed for ALL 53 run files, the processing
of the RAW RUN files started. It lasted 12 hours. This is rather high
and I'll need more time to understand its origin. If there were 50%
more data on the tape (with respect to RD027, which took 6 hours), I would
expect ~9 hours to be required for processing.

From: VAXGS::MACROUSA "LNGS US MACRO group" 12-FEB-2000 01:43:46.13
Subj: Submitted FIRST (#1) MR MONITOR job of DD033 (RUN011750)
From: VAXGS::MACROUSA "LNGS US MACRO group" 12-FEB-2000 13:36:43.64
Subj: Submitted LAST (#53) MR MONITOR job of DD033 (RUN011797)
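
The durations quoted in this report can be cross-checked against the mail
timestamps. The sketch below parses the VMS-style timestamps as they appear in
the "From:" lines; the helper names are hypothetical. Note that for ZEB2RAW
and MR MONITOR the mails mark job submission, so the intervals are only an
approximation of the actual execution time.

```python
from datetime import datetime

MONTHS = {"JAN": 1, "FEB": 2, "MAR": 3, "APR": 4, "MAY": 5, "JUN": 6,
          "JUL": 7, "AUG": 8, "SEP": 9, "OCT": 10, "NOV": 11, "DEC": 12}


def parse_vms_time(stamp):
    """Parse a timestamp such as '11-FEB-2000 19:34:15.02'."""
    date_part, time_part = stamp.split()
    day, mon, year = date_part.split("-")
    hh, mm, ss = time_part.split(":")
    sec, _, frac = ss.partition(".")
    micro = int(frac.ljust(6, "0")[:6]) if frac else 0
    return datetime(int(year), MONTHS[mon.upper()], int(day),
                    int(hh), int(mm), int(sec), micro)


def hours_between(start, end):
    """Elapsed time between two VMS timestamps, in hours."""
    return (parse_vms_time(end) - parse_vms_time(start)).total_seconds() / 3600.0


# First and last mail of each stage, copied from this report.
stages = {
    "tape COPY": ("11-FEB-2000 19:34:15.02", "11-FEB-2000 22:45:14.05"),
    "ZEB2RAW": ("11-FEB-2000 22:45:16.81", "12-FEB-2000 01:43:38.99"),
    "MR MONITOR": ("12-FEB-2000 01:43:46.13", "12-FEB-2000 13:36:43.64"),
}

for name, (start, end) in stages.items():
    print(f"{name}: {hours_between(start, end):.2f} h")
```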

The job identified the following calibration RUNs. I verified that all
of them do exist in the expected area (DISK$SCRAUSA7:[MACRODATA.RAW.CAL]).

CALIB: DD033 RUN090134 CAL1 (HIST)/1 (STAT)
CALIB: DD033 RUN090133 CAL1 (HIST)/1 (STAT)
CALIB: DD033 RUN090136 CAL1 (HIST)/1 (STAT)
CALIB: DD033 RUN090137 CAL1 (HIST)/1 (STAT)
CALIB: DD033 RUN011753 CAL0 (HIST)/2 (STAT)
CALIB: DD033 RUN011754 CAL0 (HIST)/2 (STAT)
CALIB: DD033 RUN011755 CAL0 (HIST)/2 (STAT)
CALIB: DD033 RUN011756 CAL0 (HIST)/2 (STAT)
CALIB: DD033 RUN011757 CAL0 (HIST)/2 (STAT)
CALIB: DD033 RUN011758 CAL0 (HIST)/2 (STAT)
CALIB: DD033 RUN011759 CAL0 (HIST)/2 (STAT)
CALIB: DD033 RUN011760 CAL0 (HIST)/2 (STAT)
CALIB: DD033 RUN011761 CAL0 (HIST)/2 (STAT)
CALIB: DD033 RUN011762 CAL0 (HIST)/2 (STAT)
CALIB: DD033 RUN011763 CAL0 (HIST)/1 (STAT)
CALIB: DD033 RUN011764 CAL0 (HIST)/1 (STAT)
CALIB: DD033 RUN011765 CAL0 (HIST)/1 (STAT)
CALIB: DD033 RUN011766 CAL0 (HIST)/2 (STAT)
CALIB: DD033 RUN011767 CAL0 (HIST)/1 (STAT)
CALIB: DD033 RUN011768 CAL0 (HIST)/1 (STAT)
CALIB: DD033 RUN011769 CAL0 (HIST)/1 (STAT)
CALIB: DD033 RUN011770 CAL0 (HIST)/1 (STAT)
CALIB: DD033 RUN011771 CAL0 (HIST)/1 (STAT)
CALIB: DD033 RUN011772 CAL0 (HIST)/2 (STAT)

As you can see in the above list, there were 20 RUNs that were NOT found in
the calibration history list: "0 (HIST)". I have checked the STATISTICS file
using the script DISK$SCRAUSA2:[MACROUSA0.ROBOT.COM]CHECK_CALIB.COM
and verified that these are indeed calibration runs, with the possible
exception of 11756, 11757 and 11767, which were very short runs with
calibration CAMAC lists, taken in between calibration runs. I chose to flag
ALL of them as calibrations in DISK$MACRO:[MACROCAL.DOCS]CAL_RUN_LIST.TXT,
which is now updated.

Notice that "/2 (STAT)" for a calibration run means that two conditions
were met that classify it as a calibration run.
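
The CALIB lines can be read mechanically. The sketch below is illustrative
only: it assumes the digit after "CAL" is the history flag ("0 (HIST)" meaning
not found in the calibration history list) and the digit before "(STAT)" is
the number of statistics conditions met, as described in this report. The
function and field names are hypothetical.

```python
import re

CALIB_LINE = re.compile(
    r"CALIB:\s+(?P<tape>\S+)\s+RUN(?P<run>\d+)\s+"
    r"CAL(?P<hist>\d+)\s+\(HIST\)/(?P<stat>\d+)\s+\(STAT\)"
)


def parse_calib(line):
    """Turn one CALIB log line into a small record."""
    m = CALIB_LINE.match(line.strip())
    if m is None:
        raise ValueError(f"not a CALIB line: {line!r}")
    return {
        "run": int(m["run"]),
        "in_history": m["hist"] != "0",       # "0 (HIST)": not in the history list
        "stat_conditions": int(m["stat"]),    # conditions met in the STATISTICS file
    }


# A few lines copied from the list in this report.
sample = [
    "CALIB: DD033 RUN090134 CAL1 (HIST)/1 (STAT)",
    "CALIB: DD033 RUN011753 CAL0 (HIST)/2 (STAT)",
    "CALIB: DD033 RUN011763 CAL0 (HIST)/1 (STAT)",
]

records = [parse_calib(line) for line in sample]
missing = [r["run"] for r in records if not r["in_history"]]
print("runs missing from the calibration history list:", missing)
```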

As for the standard processing of RUNs, there were 19 RUNs that
had some sort of error in the MY_MONITOR, as follows:

MYSTATS LOG ERRORS RUN11750 DD033 SIZE= 62-NO
Error in SCINTILLATOR BOX NUMBER - Problem w/ data - nothing to do about it now.
MYSTATS LOG ERRORS RUN90133 DD033 SIZE= 64-NO
Error in SCINTILLATOR BOX NUMBER - Problem w/ data - nothing to do about it now.
MYSTATS LOG ERRORS RUN90134 DD033 SIZE= 68-NO
Error in SCINTILLATOR BOX NUMBER - Problem w/ data - nothing to do about it now.
MYSTATS LOG ERRORS RUN90136 DD033 SIZE= 408-NO
Error in SCINTILLATOR BOX NUMBER - Problem w/ data - nothing to do about it now.
MYSTATS LOG ERRORS RUN90137 DD033 SIZE= 325-NO
Error in SCINTILLATOR BOX NUMBER - Problem w/ data - nothing to do about it now.
MYSTATS LOG ERRORS RUN11761 DD033 SIZE= 371-NO
Corrupted ERP buffer (EQP26) - nothing to do about it now.
MYSTATS LOG ERRORS RUN11764 DD033 SIZE= 371-NO
Corrupted ERP buffer (EQP26) - nothing to do about it now.
MYSTATS LOG ERRORS RUN11776 DD033 SIZE= 13-NO
Error in SCINTILLATOR BOX NUMBER - Problem w/ data - nothing to do about it now.
MYSTATS LOG ERRORS RUN11777 DD033 SIZE= 191-NO
Error in SCINTILLATOR BOX NUMBER - Problem w/ data - nothing to do about it now.
MYSTATS LOG ERRORS RUN11778 DD033 SIZE= 192-NO
Error in SCINTILLATOR BOX NUMBER - Problem w/ data - nothing to do about it now.
MYSTATS LOG ERRORS RUN11779 DD033 SIZE= 1-NO
Error in SCINTILLATOR BOX NUMBER - Problem w/ data - nothing to do about it now.
MYSTATS LOG ERRORS RUN11783 DD033 SIZE= 93-NO
Corrupted ERP buffer (EQP21) - nothing to do about it now.
MYSTATS LOG ERRORS RUN11784 DD033 SIZE= 1-NO
CPU time expired
MYSTATS LOG ERRORS RUN11785 DD033 SIZE= 1-NO
CPU time expired
MYSTATS LOG ERRORS RUN11788 DD033 SIZE= 1-NO
CPU time expired
MYSTATS LOG ERRORS RUN11789 DD033 SIZE= 1-NO
CPU time expired
MYSTATS LOG ERRORS RUN11790 DD033 SIZE= 1-NO
CPU time expired
MYSTATS LOG ERRORS RUN11792 DD033 SIZE= 1-NO
CPU time expired
MYSTATS LOG ERRORS RUN11797 DD033 SIZE= 1-NO
CPU time expired

Of the above 19 problems, 12 represent problems with the data, and there's
nothing we can do about them besides flagging the RUNs. For these 12 runs
I deleted the RAW MACRO RUN file that was kept. For the remaining 7 runs,
which had a CPU time limit problem, I relaunched the jobs.
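
The split into 12 data problems and 7 relaunches can be reproduced by
tallying the error reasons listed above. This is a minimal sketch; the
category names and the classify function are hypothetical, and the reasons
are shortened versions of the log comments in this report.

```python
from collections import Counter

BOX = "Error in SCINTILLATOR BOX NUMBER"
CPU = "CPU time expired"

# (run, reason) pairs copied from the MYSTATS error list in this report.
errors = [
    (11750, BOX), (90133, BOX), (90134, BOX), (90136, BOX), (90137, BOX),
    (11761, "Corrupted ERP buffer (EQP26)"),
    (11764, "Corrupted ERP buffer (EQP26)"),
    (11776, BOX), (11777, BOX), (11778, BOX), (11779, BOX),
    (11783, "Corrupted ERP buffer (EQP21)"),
    (11784, CPU), (11785, CPU), (11788, CPU), (11789, CPU),
    (11790, CPU), (11792, CPU), (11797, CPU),
]


def classify(reason):
    """CPU-limit failures can be relaunched; everything else is a data problem."""
    return "relaunch" if reason.startswith(CPU) else "data problem"


tally = Counter(classify(reason) for _, reason in errors)
to_relaunch = [run for run, reason in errors if classify(reason) == "relaunch"]

print(dict(tally))
print("relaunch:", to_relaunch)
```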

DD033 is now completed. It took 3+3+12=18 hours of batch job execution,
plus an hour or so to do the bookkeeping and relaunch the CPU-expired
jobs.

--Erik



This archive was generated by hypermail 2a24 : Sat Feb 12 2000 - 09:31:45 EST