| Home | News | Contacts | Guided Tour | User Information | Organization | RHIC | BNL |
 

Monitoring the Central Reconstruction Server

Most control of the Central Reconstruction Server (CRS) machines rests with the experiments. They control the submission of jobs to the CRS Linux Farm, can kill jobs, and can make Farm nodes available or unavailable.
RCF monitoring of the CRS entails two tasks: making sure that the software on the central IBM AIX machine, which interfaces with HPSS, is functioning; and watching the Farm nodes for a "comm. lost" status, which means that either the machine has crashed or the local server can no longer communicate with the master machine.

Central Machine

The central machine runs the master controlling servers for the management software. There is a separate server for each of the four experiments, and the status of these servers can be checked. One can also check the current allocation and usage of HPSS resources for the CRS batch software. Changes to HPSS resources should be coordinated with the HPSS team and relayed to the Linux Farm team by the RCF Liaisons via CTS tickets.

The central machine is also where files are staged from HPSS. If the reconstruction users start to report problems with staging, one can get a list of currently staging files to check on progress, and should then check for staging failures. Staging failures are assumed to be transitory, the result of HPSS resources being temporarily unavailable. The most recent staging failures for each of the experiments can be obtained using the following links. If there are recent staging failures, check HPSS to see whether tape resources are busy or are having problems.

Recent HPSS Staging Failures
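
The triage described above (look only at recent failures, then see which experiments are affected) can be sketched in Python. This is a minimal illustration, not the actual CRS tooling: the record layout (timestamp, experiment label, file name), the one-hour window, the function names, and the sample data are all assumptions made for the example.

```python
from datetime import datetime, timedelta


def recent_failures(failures, now, window=timedelta(hours=1)):
    """Return the failure records whose timestamp lies within `window` before `now`.

    Each record is a hypothetical (timestamp, experiment, filename) tuple.
    """
    cutoff = now - window
    return [record for record in failures if cutoff <= record[0] <= now]


def failures_by_experiment(failures):
    """Count failure records per experiment label."""
    counts = {}
    for _, experiment, _ in failures:
        counts[experiment] = counts.get(experiment, 0) + 1
    return counts


# Made-up sample data; "expA" and "expB" stand in for real experiment names.
now = datetime(2004, 8, 17, 12, 0)
failures = [
    (datetime(2004, 8, 17, 11, 45), "expA", "run1.dat"),
    (datetime(2004, 8, 17, 9, 0), "expB", "run2.dat"),
    (datetime(2004, 8, 17, 11, 10), "expA", "run3.dat"),
]

recent = recent_failures(failures, now)
per_experiment = failures_by_experiment(recent)
```

If the recent list is non-empty, the next step is still the manual one described above: check HPSS to see whether tape resources are busy or having problems.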

CRS Farm Machines

Reconstruction jobs are under the control of the reconstruction users. The status of the CRS machines and of the jobs running on the machines can be obtained using the links below.
Machines become "unavailable" either when the user marks them so or when the queuing system fails to queue a job to the node (because of a temporary network problem, or because the machine has just crashed). The appropriate reconstruction user can mark the machine as "available" again.
If a machine crashes, or the local server crashes and cannot be restarted, the machine will go into a "comm. lost" state. If an excessive number of machines enter this state, please contact the CRS administrator.

Machine and Job Status
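
The "comm. lost" check can be illustrated with a short Python sketch. It assumes the node states have already been collected into a dictionary of node name to status string; that data shape, the node names, and the 10% threshold are hypothetical, since this page does not define what counts as an "excessive" number of machines.

```python
from collections import Counter

COMM_LOST = "comm. lost"


def summarize_states(node_states):
    """Count how many nodes are in each state."""
    return Counter(node_states.values())


def should_contact_admin(node_states, max_lost_fraction=0.1):
    """Return True when the share of 'comm. lost' nodes exceeds the threshold.

    The default threshold is an illustrative assumption, not a site policy.
    """
    if not node_states:
        return False
    lost = sum(1 for state in node_states.values() if state == COMM_LOST)
    return lost / len(node_states) > max_lost_fraction


# Made-up sample data with hypothetical node names.
states = {
    "node01": "available",
    "node02": "comm. lost",
    "node03": "unavailable",
    "node04": "available",
}

summary = summarize_states(states)
alert = should_contact_admin(states)
```

A fraction is used rather than an absolute count so that the same check scales with the size of the farm; a site may well prefer a fixed number instead.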

Current Machine Loads

Plots of summed CRS machine loads for today are available through this form.

CPU load for individual machines can be viewed with Terry's Java applet (need link?).

Historical Information

Much of the above information, plus historical archives, can be obtained from the CRS Farm Resource Utilization page.



U.S. Department of Energy Brookhaven National Laboratory

Report problems or send comments to RCF Webmaster.
Maintained by Tom Throwe.
This document last modified Tuesday August 17, 2004

