|
|
Displays that indicate the general health of the subsystem as a
proactive diagnostic
- Continual display of the
RCF Gigabit Snapshot to show overall traffic within RCF and activity level of
switches.
- Alerts (Message with noise) on rcfmon02 and/or monitoring of associated pages:
- As viewed from OUTSIDE
the RCF firewall (rolly.rhic.bnl.gov on .80 subnet)
- As viewed from INSIDE
the RCF firewall (from ru10.rcf.bnl.gov on .6 subnet)
- All of the links available from the
RCF Network Page
should be used to target specific components or groups of systems. This page refers to
numerous links that are useful for monitoring and diagnostics. All monitoring personnel
should review them to be aware of what is available.
Methods to validate or refute indications that a piece of the subsystem
is having trouble
- Many of the links available from the
RCF Network Page.
- Have all or several of the subsystem monitoring systems failed?
- Yes: Try to determine whether the monitoring room systems are
isolated due to non-RCF problems by pinging 130.199.80.24 (the router the monitor
room uses) and 130.199.128.31 (the primary BNL nameserver). If these cannot be reached
or many pings are lost, notify ITD (see contact information below).
- No: Is either of the ping scripts reporting errors? (#2 above)
- No: It is unlikely that there is anything seriously wrong
within the RCF network. Check the specific machine(s) in question.
- Yes, DNS errors: If there are continual or frequent DNS
slow/no responses reported, contact the DNS system administrators for the
RCF systems or ITD for BNL nameservers.
- Yes, other: If the system reported unreachable is a SWITCH
(sw1 through sw9 .rcf.bnl.gov), ping -s the switch manually. If it is not
responding after 1 minute, go to ru10.rhic (which is behind the RCF firewall)
and try to ping the same switch. If this ping fails, notify RCF Networking
(See contact information below).
- If this ping is OK, check that all the switches (sw1-9) are pingable. Also try
to ping 130.199.6.124 (switch 5 main interface address, RCF side) and 130.199.27.24
(the BNL router interface address for the firewall input). If either of these fails,
notify ITD.
- If a single RCF system has failed, use RcfLookup to get more
information (see Diagnostic Tools below).
- Alternatively, use WS_PING (see Diagnostic Tools below) to execute pings, basic
throughput measurements, etc.
- If this is a single farm node, create a CTS ticket for service during normal working
hours. If it is a critical system not monitored by the ping scripts (rminexxx for example),
contact the system administrator.
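The first branches of the checklist above can be sketched as a small Python helper. This is only an illustration, not an existing RCF tool: the function names, the "dns"/"other" labels, and the returned strings are hypothetical, and `ping_ok` assumes a Linux-style `ping -c` (the `ping -s` mentioned above is the form used on other systems). The two addresses are the ones given in the text.

```python
import subprocess

MONITOR_ROUTER = "130.199.80.24"   # router the monitor room uses (from the text)
BNL_NAMESERVER = "130.199.128.31"  # primary BNL nameserver (from the text)


def ping_ok(host, count=3, timeout_s=10):
    """Return True if `host` answers at least one ping.

    Assumption: a Linux-style `ping -c COUNT`; flags differ on other platforms.
    """
    try:
        result = subprocess.run(
            ["ping", "-c", str(count), host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


def triage(many_monitors_failed, uplinks_ok, ping_script_error):
    """Map checklist observations to the suggested next step.

    ping_script_error is None, "dns", or "other" (hypothetical labels).
    """
    if ping_script_error not in (None, "dns", "other"):
        raise ValueError("unknown ping_script_error: %r" % (ping_script_error,))
    if many_monitors_failed:
        if not uplinks_ok:
            return "notify ITD"
        # The checklist does not say what to do when the uplinks respond.
        return "not covered by the checklist: monitor room uplinks respond"
    if ping_script_error is None:
        return "check the specific machine(s)"
    if ping_script_error == "dns":
        return "contact DNS administrators (RCF) or ITD (BNL nameservers)"
    return "ping the system manually, then from inside the firewall"
```

In use, `uplinks_ok` would be something like `ping_ok(MONITOR_ROUTER) and ping_ok(BNL_NAMESERVER)`, evaluated from a monitoring room system.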
Diagnostic Tools
- RcfLookup (on rcfmon01) for system and network information. For each of
the sub-panes (not all are applicable to all systems):
- Network: - Does the system appear to be sending and receiving?
- Are there any recent dramatic changes in usage?
- Port: - Confirm that the link is up, at 100 Mb/s, in full duplex, and enabled.
- In and Out errors should be low and stable.
- CPU: - Is the system running at a very high load?
- Disk: - Is the system able to write data?
- Users: - Are there an excessive number of users logged in?
- WS_PING (on rcfmon01) can be used to execute pings, basic
throughput measurements, traceroute (for BNL routing concerns), and scans of complete
RCF subnets as a quick network health check.
- QCHECK (on rcfmon01) is a small version of Chariot that can be
used for UDP or TCP throughput measurements between RCF systems. All RCF systems
should have the necessary endpoints loaded; for testing with remote sites, have the
user download, install, and run the endpoint on their system. A link to the endpoint
download page is on the RCF Network page.
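The Port checks listed under RcfLookup can be written down as a short Python sketch. It does not read anything from RcfLookup. The dictionary layout, the field names, and the two thresholds are assumptions made for illustration, as is reading "100" as 100 Mb/s. Each error counter is given as a (previous, current) pair so that both "low" and "stable" can be tested.

```python
def port_problems(port, max_errors=100, max_error_growth=10):
    """Return a list of problems found in a port status record.

    `port` is a hypothetical dict, for example:
        {"link_up": True, "speed_mbps": 100, "duplex": "full",
         "enabled": True, "in_errors": (3, 3), "out_errors": (0, 0)}
    The thresholds are placeholders, not values from the RCF documentation.
    """
    problems = []
    if not port["link_up"]:
        problems.append("link down")
    if port["speed_mbps"] != 100:
        problems.append("speed is not 100")
    if port["duplex"] != "full":
        problems.append("not full duplex")
    if not port["enabled"]:
        problems.append("port disabled")
    for direction in ("in_errors", "out_errors"):
        previous, current = port[direction]
        if current > max_errors:
            problems.append(direction + " high")
        if current - previous > max_error_growth:
            problems.append(direction + " rising")
    return problems
```

An empty list means the port passes every check in the Port sub-pane item. Anything else names the item that needs a closer look.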
Discussion of dependencies
- Most of the network reporting pages and applications rely heavily on ru10. Therefore,
if it is down or isolated, many displays could show old data.
- DNS failures or slow response can manifest as a wide range of problems, the symptoms of
which may vary because different systems use different primary nameservers.
- Failure to read from NFS / AFS home directories will prevent most logins from succeeding.
In these cases the system will be pingable, but the users will be unable to log in. Normally
this would affect multiple systems.
- Nearly all WAN (Wide Area Network, i.e., not at BNL) issues are beyond the influence
of RCF. All WAN issues should be reported to ITD networking. Beyond that point it is in
the hands of ESnet and others. ITD can report problems to ESnet if it is clear that the
failure is within ESnet and not beyond their domain.
- The Gigabit monitoring page, the "outside" ping script, and its associated web page run
on rolly.rhic and will stop updating if that system is isolated. This may give the false
impression that everything is normal.
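Because several displays can silently stop updating, it helps to compare a page's last update time with the current time before trusting it. A minimal Python sketch, assuming the update time is available as a `datetime`. The function name and the ten-minute default are hypothetical and not taken from the RCF pages:

```python
from datetime import datetime, timedelta


def is_stale(last_update, now, max_age=timedelta(minutes=10)):
    """Return True if a display was last updated more than `max_age` ago.

    The ten-minute default is an illustrative placeholder.
    """
    return now - last_update > max_age


# Example: a page last refreshed 45 minutes before the check is stale.
checked_at = datetime(2000, 1, 1, 12, 45)
print(is_stale(datetime(2000, 1, 1, 12, 0), checked_at))
```

A stale page should be treated as "no information" rather than "all normal".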
General Notes
- Always be conscious of what systems you are testing FROM and TO, and be aware that there
may be a number of normally "transparent" switches, routers, or firewalls between points
A and B.
|
|