July 2, 2008
RHIC/ATLAS Computing Facility 
What's New at the RCF/ACF
Week of
Mon, Jun 30

  • US Atlas: Moving Dcache pool node    7/2/2008
    Wed Jul 2 16:06:35 EDT 2008

    This item has been posted to usatlas-users-l@lists.bnl.gov, usatlas-ddm-l@lists.bnl.gov, usatlas-grid-l@lists.bnl.gov, usatlas-prodsys-l@lists.bnl.gov

    Summary: 12 Dcache pool nodes moving to a new network switch

    Duration: 4:15PM EDT to 4:30PM EDT

    Group Responsible: BNL Dcache

    Affected Area: BNL Dcache service

    Expected User Impact: None

    Maintenance Type: "Transparent"

    Submitted By: Shigeki Misawa misawa@bnl.gov

    Description: Additional Dcache pool nodes will be moved to a new network switch. Nodes will be moved one at a time and each node will be offline for about 10-30 seconds. No service glitches or interruptions are expected.

  • US Atlas: BNL Dcache system offline Tuesday July 8    7/2/2008
    Wed Jul 2 15:53:27 EDT 2008

    This item has been posted to usatlas-users-l@lists.bnl.gov, usatlas-computing-l@lists.bnl.gov, usatlas-ddm-l@lists.bnl.gov, usatlas-grid-l@lists.bnl.gov, usatlas-prodsys-l@lists.bnl.gov, atlas-project-adc-operations@cern.ch

    Summary: BNL Dcache will be offline (unavailable) all day on Tuesday July 8

    Duration: 9:00AM EDT - 5:00PM EDT on Tuesday July 8

    Group Responsible: BNL Dcache Group

    Affected Area: All BNL Dcache service

    Expected User Impact: Stop of production, data transfer, user analysis at BNL

    Maintenance Type: Downtime

    Submitted By: Shigeki Misawa (misawa@bnl.gov)

    Description: BNL Dcache will be unavailable all day Tuesday for maintenance work. We will be doing the following:

    1) Re-establishing backups of Dcache metadata

    2) Upgrading the PNFS server hardware: more memory, more CPU cores, faster cores

    3) Moving PNFS backend Postgres database to external RAID storage

    4) Upgrading Dcache to the latest version

    5) Upgrading the backend Postgres database to Postgres 8.3.3

    6) Changing the Postgres backup mechanism

  • US Atlas SSH, Samba, Interactive, and Web servers to be updated    7/2/2008
    Wed Jul 2 15:44:15 EDT 2008

    This item has been posted to usatlas-users-l@lists.bnl.gov

    To all,

    The US Atlas SSH gateways, Samba, interactive, and web servers will be updated with the latest kernel and system software security patches next week. Since the kernel is being updated, each system will need to be rebooted. Each system will be unavailable for the time it takes to perform a reboot. The list of systems to be updated and their scheduled reboot times is:

    Monday, July 7, 2008, shortly after 08:00 EDT: atlasgw01.bnl.gov (SSH gateway), asmb00.bnl.gov (Samba Server)

    Tuesday, July 8, 2008, shortly after 08:00 EDT: atlasgw00.bnl.gov (SSH gateway)

    Wednesday, July 9, 2008, shortly after 08:00 EDT: All US Atlas publicly accessible web sites

    Thursday, July 10, 2008, shortly after 08:00 EDT: atlas00.usatlas.bnl.gov (US Atlas interactive server), rt.racf.bnl.gov (RT ticket system)

    John M. (mccarthy@bnl.gov)

  • RHIC SSH, Samba, Interactive, Web, and Mail servers to be updated    7/2/2008
    Wed Jul 2 15:39:35 EDT 2008

    This item has been posted to rhic-rcf-l@lists.bnl.gov

    To all,

    The RHIC SSH gateways, Samba, interactive, web, and mail servers will be updated with the latest kernel and system software security patches next week. Since the kernel is being updated, each system will need to be rebooted. Each system will be unavailable for the time it takes to perform a reboot. The list of systems to be updated and their scheduled reboot times is:

    Monday, July 7, 2008, shortly after 08:00 EDT: rssh04.rhic.bnl.gov (SSH gateway), rssh03.rhic.bnl.gov (SSH gateway), rsmb00.rhic.bnl.gov (Samba server), www4.rcf.bnl.gov (RHIC user web server)

    Tuesday, July 8, 2008, shortly after 08:00 EDT: rssh02.rhic.bnl.gov (SSH gateway), rssh01.rhic.bnl.gov (SSH gateway), rcf.rhic.bnl.gov (RHIC mail server), webmail.rhic.bnl.gov (RHIC web-mail server)

    Wednesday, July 9, 2008, shortly after 08:00 EDT: All RHIC publicly accessible web sites, www.phenix.bnl.gov (Phenix web server)

    Thursday, July 10, 2008, shortly after 08:00 EDT: rcf2.rhic.bnl.gov (RHIC interactive server), rt.racf.bnl.gov (RT ticket system)

    John M. (mccarthy@bnl.gov)

  • US Atlas: Moving Dcache pool node    7/2/2008
    Wed Jul 2 11:53:14 EDT 2008

    This item has been posted to usatlas-users-l@lists.bnl.gov, usatlas-ddm-l@lists.bnl.gov, usatlas-grid-l@lists.bnl.gov, usatlas-prodsys-l@lists.bnl.gov

    Summary: 12 Dcache pool nodes moving to a new network switch

    Duration: 12:00PM EDT to 12:10PM EDT

    Group Responsible: Dcache

    Affected Area: Dcache service

    Expected User Impact: None

    Maintenance Type: "Transparent"

    Submitted By: Shigeki Misawa misawa@bnl.gov

    Description: 12 Dcache pool nodes will be moved to a new network switch. Nodes will be moved one at a time and each node will be offline for about 10-30 seconds. No service glitches or interruptions are expected.

  • US Atlas: Atlasnfs02 back online    7/1/2008
    Tue Jul 1 13:15:36 EDT 2008

    This item has been posted to usatlas-users-l@lists.bnl.gov, usatlas-prodsys-l@lists.bnl.gov

    Summary: Atlasnfs02 is back online.

    Duration: 12:50PM EDT

    Group Responsible: Atlas NFS

    Affected Area: Selected Atlas NFS file systems.

    Expected User Impact: Access restored to selected Atlas NFS file systems

    Maintenance Type: Service Interruption/Maintenance

    Submitted By: Shigeki Misawa misawa@bnl.gov

    Description: Atlasnfs02 is back on line after a system hang. File systems that were affected were:

    /usatlas/groups/sm, /usatlas/groups/susy, /usatlas/groups/tracking, /usatlas/groups/calo, /usatlas/dial, /usatlas/ada_sw, /usatlas/scratch2, /usatlas/groups/higgs, /usatlas/groups/exotics, /usatlas/workarea, /usatlas/OSG

    We believe that the problem was caused by a bug in the OS triggered by NFSv3 readdirplus calls. The system has been patched with the latest AIX Service Pack, which fixes this problem.

    We took the opportunity to increase system memory from 4GB to 20GB to allow more space for file system cache.

  • Rebooting rafs11 hoping to fix volume problem    7/1/2008
    Tue Jul 1 12:55:51 EDT 2008

    This item has been posted to rhic-rcf-l@lists.bnl.gov

    Summary: Rebooting rafs11 hoping to fix volume problem. There is a problem with star.cvs

    Group Responsible: GCE

    Affected Area: AFS services

    Submitted By: Morris Strongson

  • US Atlas: Atlasnfs02 off line    7/1/2008
    Tue Jul 1 10:36:31 EDT 2008

    This item has been posted to usatlas-users-l@lists.bnl.gov

    Summary: Atlasnfs02 is off line

    Duration: 10:34am

    Group Responsible: Atlas NFS

    Affected Area: Selected Atlas NFS directories

    Expected User Impact: /usatlas/groups and other NFS directories are unavailable. (Note: Atlas home directories are on a separate server.)

    Maintenance Type: "Service Interruption"

    Submitted By: Shigeki Misawa misawa@bnl.gov

    Description: Loss of NFS service from atlasnfs02. We are investigating. There is no timeline for resolution yet.

  • Update of PHENIX dCache SRM server COMPLETE    7/1/2008
    Tue Jul 1 10:34:31 EDT 2008

    This item has been posted to rhic-rcf-l@lists.bnl.gov

    The PHENIX dCache SRM has been updated and restarted. Please report any problems via the RT StorageManagement queue.

    Thanks, Ofer

  • Update of PHENIX dCache SRM server    6/30/2008
    Mon Jun 30 16:35:26 EDT 2008

    This item has been posted to rhic-rcf-l@lists.bnl.gov

    Summary: phnxsrm will be rebooted to apply OS security updates and bug fixes

    Duration: 7/1 10am-11am EDT

    Group Responsible: Storage Management

    Affected Area: PHENIX dCache SRM services only

    Expected User Impact: Connections to PHENIX dCache SRM will be temporarily unavailable.

    Maintenance Type: Service interruption

    Submitted By: Ofer Rind, rind@bnl.gov



  • US Atlas lxr web server update completed    6/30/2008
    Mon Jun 30 14:08:50 EDT 2008

    This item has been posted to usatlas-computing-l@lists.bnl.gov

    To all,

    The update of the US Atlas lxr web server, alxr.usatlas.bnl.gov (reserve02.usatlas.bnl.gov), has been completed successfully. The system is now available.

    John M.

  • US Atlas: Dcache maintenance cancelled    6/30/2008
    Mon Jun 30 11:35:35 EDT 2008

    This item has been posted to racf-wlcg-announce-l@lists.bnl.gov, usatlas-users-l@lists.bnl.gov, usatlas-ddm-l@lists.bnl.gov, usatlas-grid-l@lists.bnl.gov, usatlas-prodsys-l@lists.bnl.gov, atlas-project-adc-operations@cern.ch

    Summary: Cancellation of Dcache maintenance on July 1.

    Group Responsible: BNL Dcache

    Affected Area: Dcache

    Expected User Impact: None

    Maintenance Type:

    Submitted By: Shigeki Misawa misawa@bnl.gov

    Description: Dcache backup resynchronization and version upgrade scheduled for July 1 has been cancelled because of technical problems. Maintenance will be rescheduled for a later date.

  • US Atlas lxr web server to be updated    6/30/2008
    Mon Jun 30 11:23:19 EDT 2008

    This item has been posted to usatlas-computing-l@lists.bnl.gov

    To all,

    The US Atlas lxr web server, alxr.usatlas.bnl.gov (reserve02.usatlas.bnl.gov), will be updated with the latest kernel and system software security patches today, June 30, 2008, at 14:00 EDT. Since the kernel is being updated, the system will need to be rebooted. The system will be unavailable for the time it takes to perform a reboot.

    John M. (mccarthy@bnl.gov)

Week of
Mon, Jun 23

  • BNL dCache storage downtime    6/27/2008
    Fri Jun 27 15:39:17 EDT 2008

    This item has been posted to racf-wlcg-announce-l@lists.bnl.gov, usatlas-users-l@lists.bnl.gov, usatlas-ddm-l@lists.bnl.gov, usatlas-prodsys-l@lists.bnl.gov, atlas-project-adc-operations@cern.ch

    Summary: Perform dCache catalog backup re-synchronization. Update the dCache version.

    Duration: Tuesday, July 1 at 9AM, estimated downtime of 4-6 hours

    Group Responsible: BNL Storage

    Affected Area: BNL dCache

    Expected User Impact: Stop of production, data transfer, user analysis at BNL

    Submitted By: Armen Vartapetian, vartap@uta.edu

  • rmine217 is back online.    6/27/2008
    Fri Jun 27 12:12:58 EDT 2008

    This item has been posted to rhic-rcf-l@lists.bnl.gov

    Summary: rmine217 is online as of 11:45.

    Group Responsible: NFS

    Affected Area: /phenix/adata04

    Submitted By: Dave Free dfree@bnl.gov

    Description: The failing system drive was replaced, but the system would not power on afterward. The power supply and memory were replaced, but this did not solve the problem. The server was then swapped with a spare, as the motherboard must have failed.

  • Shutdown of rmine217 at 10AM tomorrow for 15 to 30 minutes    6/26/2008
    Thu Jun 26 15:58:34 EDT 2008

    This item has been posted to rhic-rcf-l@lists.bnl.gov

    Summary: Shutdown of rmine217 to replace a failing system drive.

    Duration: Tomorrow 6/27/08 at 10AM for 15 to 30 minutes.

    Group Responsible: NFS

    Affected Area: /phenix/adata04

    Expected User Impact: /phenix/adata04 will be down during this maintenance.

    Submitted By: Dave Free dfree@rcf.rhic.bnl.gov

  • Scheduled MySQL maintenance at BNL completed    6/25/2008
    Wed Jun 25 09:36:37 EDT 2008

    This item has been posted to usatlas-users-l@lists.bnl.gov, usatlas-computing-l@lists.bnl.gov, usatlas-ddm-l@lists.bnl.gov, usatlas-grid-l@lists.bnl.gov, usatlas-prodsys-l@lists.bnl.gov, atlas-project-adc-operations@cern.ch

    MySQL maintenance/upgrade to 5.0.51a on dbarch3/4 at BNL is completed.



  • RHIC: NFS problems     6/24/2008
    Tue Jun 24 12:17:34 EDT 2008

    This item has been posted to rhic-rcf-l@lists.bnl.gov

    Summary: Problems with NFS service to Star/Phobos file systems.

    Duration: 12:15PM -

    Group Responsible: NFS

    Affected Area: Star/Phobos NFS file systems

    Expected User Impact: Loss of access to Star/Phobos NFS file systems, which in turn results in login/interactive problems

    Maintenance Type: Service Interruption

    Submitted By: Shigeki Misawa misawa@bnl.gov

    Description: We are experiencing problems with access to Star and Phobos NFS file systems. NFS administrators are looking into the problem.



  • AFS file server rebooted; let us know if things are working for you    6/23/2008
    Mon Jun 23 15:03:09 EDT 2008

    This item has been posted to rhic-rcf-l@lists.bnl.gov

    Summary: Rebooted rafs03. I can now access user winter's files; hoping other issues have also been resolved.

    Please send email to mms@bnl.gov and let me know if all AFS problems observed today are resolved.

    Thanks.

    Morris Strongson (mms@bnl.gov)

  • Rebooting rafs03 to try to fix AFS troubles    6/23/2008
    Mon Jun 23 14:20:22 EDT 2008

    This item has been posted to rhic-rcf-l@lists.bnl.gov

    Summary: rafs03 appears very sick, and several users (including myself) are having access issues even with an appropriate token.

    Group Responsible: GCE

    Affected Area: AFS services, etc.

    Submitted By: Morris Strongson

  • END US ATLAS Conditions database maintenance at BNL     6/23/2008
    Mon Jun 23 11:53:03 EDT 2008

    This item has been posted to usatlas-users-l@lists.bnl.gov, usatlas-computing-l@lists.bnl.gov

    The BNL ATLAS Conditions Oracle database upgrade from 10.2.0.3 to 10.2.0.4 has been successfully completed.

    Database services are available and operating normally.

    All the best,

    Submitted By: Carlos Fernando Gamboa, cgamboa@bnl.gov

Week of
Mon, Jun 16

  • Scheduled MySQL maintenance on 2 BNL servers next Wed. June 25    6/19/2008
    Thu Jun 19 15:34:55 EDT 2008

    This item has been posted to usatlas-users-l@lists.bnl.gov, usatlas-computing-l@lists.bnl.gov, usatlas-ddm-l@lists.bnl.gov, usatlas-grid-l@lists.bnl.gov, usatlas-prodsys-l@lists.bnl.gov, atlas-project-adc-operations@cern.ch

    Summary: We plan to implement MySQL security patches and bug-fixes on two BNL MySQL production servers (dbarch3/4).

    Duration: Wed. June 25, ~9:00am-9:30am EDT.

    Group Responsible: Grid

    Affected Area: MySQL BNL databases

    Expected User Impact: During the short time the servers/databases are offline, pilot submission, updates, and access to the task-definition and job-archive databases will be unavailable. Because the main PandaDB will keep running without interruption, the short outage should be transparent to the production system and users.

    Maintenance Type: Service interruption

    Submitted By: Yuri Smirnov and Tomasz Wlodek ysmirnov@bnl.gov

    Description: We plan to upgrade the remaining two BNL MySQL production servers (dbarch3/4), which support the ATLAS MyISAM databases (PandaArchive/Meta/Log DBs, DataSetDB), and to apply MySQL security patches and bug fixes. We have already successfully upgraded the other 4 MySQL production and fail-over servers at BNL. We will install the MySQL-5.0.51a RPMs.

Week of
Mon, Jun 9

  • US Atlas: dc010 is off line    6/12/2008
    Thu Jun 12 20:45:43 EDT 2008

    This item has been posted to usatlas-users-l@lists.bnl.gov, usatlas-prodsys-l@lists.bnl.gov

    Summary: dc010, one of the servers in the Dcache read pool, is offline

    Duration: No timeline to repair at this time.

    Group Responsible: dCache

    Affected Area: Files stored in the 4 storage pools hosted on the system.

    Expected User Impact: Any file that is stored on a storage pool hosted by this server will be unavailable.

    Maintenance Type: Downtime

    Submitted By: Shigeki Misawa (misawa@bnl.gov)

    Description: We are experiencing hardware problems on dc010 which are preventing access to disks on the system. A service call has been placed with the vendor, and we are waiting for an analysis of the problem and a resolution.

    Best guess at this point is that we may need to swap out the entire server chassis or the processor module in the system.

Last Modified July 2, 2008
RACF Staff