This site is now deprecated.
Please visit the New Facility Site.



| Home | News | Contacts | Guided Tour | User Information | Organization | RHIC | BNL |
 

Network Problem Diagnosis

Displays that indicate the general health of the subsystem as a proactive diagnostic

  • Continual display of the RCF Gigabit Snapshot to show overall traffic within RCF and activity level of switches.
  • Alerts (Message with noise) on rcfmon02 and/or monitoring of associated pages:
    • As viewed from OUTSIDE the RCF firewall (rolly.rhic.bnl.gov on .80 subnet)
    • As viewed from INSIDE the RCF firewall (from ru10.rcf.bnl.gov on .6 subnet)
  • All of the links available from the RCF Network Page should be used to target specific components or groups of systems. This page refers to numerous links that are useful for monitoring and diagnostics. All monitoring personnel should review them to be aware of what is available.

Methods to validate or refute indications that a piece of the subsystem is having trouble

  • Many of the links available from the RCF Network Page.
  • Have all or several of the subsystem monitoring systems failed?
    • Yes: Try to determine whether the monitoring room systems are isolated due to non-RCF problems by pinging 130.199.80.24 (the router the monitor room uses) and 130.199.128.31 (the primary BNL nameserver). If these cannot be reached, or if many pings are lost, notify ITD (see contact information below).
    • No: Is either of the ping scripts reporting errors? (See item 2 in the list above.)
      • No: It is unlikely that there is anything seriously wrong within the RCF network. Check the specific machine(s) in question.
      • Yes, DNS errors: If slow or missing DNS responses are reported continually or frequently, contact the DNS system administrators for the RCF systems, or ITD for the BNL nameservers.
      • Yes, other: If the system reported unreachable is a SWITCH (sw1.rcf.bnl.gov through sw9.rcf.bnl.gov), ping -s the switch manually. If it is not responding after 1 minute, go to ru10.rhic (which is behind the RCF firewall) and try to ping the same switch. If this ping fails, notify RCF Networking (see contact information below).
        • If this ping is OK, check that all the switches (sw1-9) are pingable. Also try to ping 130.199.6.124 (the switch 5 main interface address, RCF side) and 130.199.27.24 (the BNL router interface address for the firewall input). If either of these fails, notify ITD.
    • If a single RCF system has failed, use RcfLookup to get more information (see Diagnostic Tools below).
    • Alternatively, use WS_PING (see Diagnostic Tools below) to execute pings, basic throughput measurements, etc.
    • If this is a single farm node, create a CTS ticket for service during normal working hours; if it is a critical system not monitored by the ping scripts (rminexxx for example), contact the system administrator.
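The switch branch of the procedure above can be sketched in Python as a small decision function. This is only an illustration under stated assumptions: the probe functions (`ping_outside` for the monitor room, `ping_inside` for a probe run on ru10), the returned strings, and the helper names are hypothetical, and the 1-minute wait is not modeled. Only the addresses and switch names come from the procedure. The `system_ping` helper assumes a ping that accepts `-c`, which is not true of every platform.

```python
import subprocess
from typing import Callable, List

# Addresses and host names quoted in the procedure above.
MONITOR_ROUTER = "130.199.80.24"    # router the monitor room uses
BNL_NAMESERVER = "130.199.128.31"   # primary BNL nameserver
SWITCH5_RCF_SIDE = "130.199.6.124"  # switch 5 main interface, RCF side
FIREWALL_INPUT = "130.199.27.24"    # BNL router interface for the firewall input
SWITCHES: List[str] = [f"sw{n}.rcf.bnl.gov" for n in range(1, 10)]  # sw1..sw9

# Hypothetical probe type: takes a host, returns True if it answers.
Ping = Callable[[str], bool]


def system_ping(host: str, count: int = 3) -> bool:
    """Assumption: a ping that accepts -c (Linux, BSD, macOS)."""
    result = subprocess.run(
        ["ping", "-c", str(count), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def monitor_room_isolated(ping: Ping) -> bool:
    """True if the monitor room cannot reach its router or the BNL nameserver."""
    return not (ping(MONITOR_ROUTER) and ping(BNL_NAMESERVER))


def triage_switch(switch: str, ping_outside: Ping, ping_inside: Ping) -> str:
    """Follow the switch branch of the procedure and name the next step."""
    if ping_outside(switch):
        return "ok"
    if not ping_inside(switch):
        # Unreachable even from behind the firewall (ru10).
        return "notify RCF Networking"
    # Reachable from inside only: test the path into the RCF.
    if not (ping_outside(SWITCH5_RCF_SIDE) and ping_outside(FIREWALL_INPUT)):
        return "notify ITD"
    return "no escalation rule matched"


def unreachable_switches(ping: Ping) -> List[str]:
    """Return the switches among sw1..sw9 that do not answer."""
    return [s for s in SWITCHES if not ping(s)]
```

Passing the probes in as arguments keeps the decision logic separate from how a ping is actually issued, so the same function can be exercised with stand-in probes before it is pointed at real hosts.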

Diagnostic Tools

  • RcfLookup (on rcfmon01) for system and network information. Check each of the sub-panes (not all are applicable to all systems):
    • Network:
      - Does the system appear to be sending and receiving?
      - Are there any recent dramatic changes in usage?
    • Port:
      - Confirm that the link is up, 100, in full duplex, and enabled.
      - In and Out errors should be low and stable.
    • CPU:
      - Is the system running at a very high load?
    • Disk:
      - Is the system able to write data?
    • Users:
      - Are there an excessive number of users logged in?
  • WS_PING (on rcfmon01) can be used to execute pings, basic throughput measurements, traceroute (for BNL routing concerns), and scans of complete RCF subnets as a quick network health check.
  • QCHECK (on rcfmon01) is a small version of Chariot that can be used for UDP or TCP throughput measurements between RCF systems. All RCF systems should have the necessary endpoints loaded; for testing with remote sites, have the user download, install, and run the endpoint on their system. A link to the endpoint download page is on the RCF Network Page.
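The Port checks listed for RcfLookup can be written down as a simple rule set. The sketch below does not use RcfLookup itself; the `PortSample` fields, the `port_problems` function, and the error-growth threshold are all hypothetical, and "low and stable" is approximated by comparing two readings of the cumulative error counters.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PortSample:
    """One reading of a switch port; field names are hypothetical."""
    link_up: bool
    speed: int         # as displayed in the pane, e.g. 100
    full_duplex: bool
    enabled: bool
    in_errors: int     # cumulative counter
    out_errors: int    # cumulative counter


def port_problems(prev: PortSample, curr: PortSample,
                  expected_speed: int = 100,
                  max_new_errors: int = 10) -> List[str]:
    """List the Port checks that the current reading fails."""
    problems: List[str] = []
    if not curr.link_up:
        problems.append("link is down")
    if curr.speed != expected_speed:
        problems.append(f"speed is {curr.speed}, expected {expected_speed}")
    if not curr.full_duplex:
        problems.append("not full duplex")
    if not curr.enabled:
        problems.append("port is disabled")
    # "Low and stable" is approximated by the growth between two readings;
    # the threshold is an arbitrary placeholder.
    if curr.in_errors - prev.in_errors > max_new_errors:
        problems.append("input errors are climbing")
    if curr.out_errors - prev.out_errors > max_new_errors:
        problems.append("output errors are climbing")
    return problems
```

An empty list means every check passed; a suitable value for `max_new_errors` depends on the interval between readings and would need to be chosen locally.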

Discussion of dependencies

  • Most of the network reporting pages and applications rely heavily on ru10. Therefore, if it is down or isolated, many displays could show old data.
  • DNS failures or slow response can manifest as a wide range of problems, the symptoms of which may vary because different systems use different primary nameservers.
  • Failure to read from NFS / AFS home directories will prevent most logins from succeeding. In these cases the system will be pingable, but the users will be unable to log in. Normally this would affect multiple systems.
  • Nearly all WAN (Wide Area Network, i.e., not at BNL) issues are beyond the influence of RCF. All WAN issues should be reported to ITD networking. Beyond that point it is in the hands of ESnet and others. ITD can call in troubles to ESnet if it is clear that the failure is within ESnet and not beyond their domain.
  • The Gigabit monitoring page and the "outside" ping script, along with its associated web page, run on rolly.rhic and will stop updating if that system is isolated. A display that has stopped updating may give the false impression that everything is normal.
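Because several of these dependencies show up as displays that quietly stop updating, one way to guard against stale data is to compare a page's last update time with the current time. The sketch below assumes the update time is available as an HTTP-style date string in GMT (for example from a `Last-Modified` header); the `page_is_stale` function and the 10-minute limit are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def page_is_stale(last_modified: str, now: datetime,
                  max_age: timedelta = timedelta(minutes=10)) -> bool:
    """True if the page was last updated more than max_age before now.

    Assumes last_modified carries a time zone (such as GMT) and that
    now is a timezone-aware datetime.
    """
    updated = parsedate_to_datetime(last_modified)
    return now - updated > max_age


if __name__ == "__main__":
    # Example reading; in practice the string would come from the monitored page.
    stamp = "Thu, 23 May 2002 12:00:00 GMT"
    print(page_is_stale(stamp, datetime.now(timezone.utc)))
```

A stale result says only that the display is old; it does not distinguish between a failed host, an isolated host, and a stopped script.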

General Notes

  • Always be conscious of what systems you are testing FROM and TO, and be aware that there may be a number of normally "transparent" switches, routers, or firewalls between points A and B.



U.S. Department of Energy Brookhaven National Laboratory

Report problems or send comments to RCF Webmaster.
Maintained by Terry Healy.
This document last modified Thursday May 23, 2002

