Document fins/I0736-1

FIN #: I0736-1

SYNOPSIS: Current replacement procedures for an A3500FC controller in a
          clustered environment could result in the controller going offline.

DATE: Oct/26/01

KEYWORDS: Current replacement procedures for an A3500FC controller in a
          clustered environment could result in the controller going offline.


---------------------------------------------------------------------
- Sun Proprietary/Confidential: Internal Use Only -
---------------------------------------------------------------------  
                            FIELD INFORMATION NOTICE
                  (For Authorized Distribution by SunService)

                                    

SYNOPSIS: Current replacement procedures for an A3500FC controller in a
          clustered environment could result in the controller going offline.
      

Sun Alert:          No

TOP FIN/FCO REPORT: No 
 
PRODUCT_REFERENCE:  A3500FC Controller  
 
PRODUCT CATEGORY:   Storage / Service


PRODUCTS AFFECTED:  
  
Systems Affected  
----------------
Mkt_ID   Platform      Model    Description                 Serial Number	
------   --------      -----    -----------                 -------------
  -      ANYSYS          -      System Platform Independent       -


X-Options Affected
------------------
Mkt_ID      Platform   Model   Description                 Serial Number
------      --------   -----   -----------                 -------------
X6532A       A3000       -     A3000 15*4.2GB/7200 FWSCSI        - 
X6533A         -         -     A3000 35*4.2GB/7200 FWSCSI        -     
X6534A         -         -     A3000 15*9.1GB/7200 FWSCSI        -    
X6535A         -         -     A3000 35*9.1GB/7200 FWSCSI        -
X6536A         -         -     A3000 StorEdge Controller         -
X6537A       A3500       -     A3500 SCSI controller             -
X6538A       A3500FC     -     A3500FC StorEdge Controller       -
SG-XARY3*      -         -     A3500 7200/10K Controller         -    


PART NUMBERS AFFECTED: 

Part Number   Description   Model
-----------   -----------   -----
     -             -          -  


REFERENCES:

BugId:    4476951 - A3500FC intermittent controller offlines during 
                    normal operations.

ESC:      531240

MANUAL:   805-7854-11 A3x00 Controller Replacement Guide.
          805-6887-10 Sun StorEdge RAID Manager 6.2 User's Guide.
          806-7073-10 Sun Cluster 3.0 U1 System Administration Guide.
          805-7076-10 Sun Cluster 3.0 U1 Error Messages Manual.
          806-5343-10 Sun Cluster 2.2 System Administration Guide.
          805-4202-10 Sun Cluster 2.2 Error Messages Manual.
          805-3991-10 Sun Cluster 2.1 System Administration Guide.
          805-4106-10 Sun Cluster 2.1 Error Messages Manual.
   

PROBLEM DESCRIPTION:

If the proper steps are not followed when replacing an A3500FC controller
in a clustered environment, one of the nodes will not recognize the new
controller and will force it offline, leaving a single point of
failure.  In addition, the WWN (World Wide Name) of the new controller
will not be updated on one of the nodes in the cluster.

This problem can occur in the following configuration:

  Any host hardware that supports the A3500FC and clustering
  StorEdge A3500FC
  RAID Manager 6.22 or higher
  Solaris 2.6 or higher

The array type can be confirmed with:

  raidutil -c cXtXdXs2 -i | grep StorEdgeA3500FC

Within 5 minutes of the controller being replaced, the other node will
attempt to communicate using the WWN of the old controller.  Since it
will not be able to do so, it will take the controller offline.

The WWN is part of the device path for an A3500FC RAID Module and it is
unique to a specific controller.  When you replace that controller
using Recovery Guru from only one node, the other node will not be
updated.  This will result in that node forcing that data path
offline.

Here's an example of an A3500FC data path:

  # cd /dev/osa/dev/dsk
  # ls -l c3t5d*s2
  lrwxrwxrwx   1 root     root          70 Jun 25 12:58 c3t5d0s2 -> 
  ../../devices/sbus@3,0/SUNW,socal@2,0/sf@1,0/ssd@w200600a0b8078c3b,0:c
  lrwxrwxrwx   1 root     root          70 Jun 25 12:58 c3t5d1s2 -> 
  ../../devices/sbus@3,0/SUNW,socal@2,0/sf@1,0/ssd@w200600a0b8078c3b,1:c
  lrwxrwxrwx   1 root     root          70 Jun 25 12:58 c3t5d2s2 -> 
  ../../devices/sbus@3,0/SUNW,socal@2,0/sf@1,0/ssd@w200600a0b8078c3b,2:c
  lrwxrwxrwx   1 root     root          70 Jun 25 12:58 c3t5d3s2 -> 
  ../../devices/sbus@3,0/SUNW,socal@2,0/sf@1,0/ssd@w200600a0b8078c3b,3:c
  # 

The WWN is the w200600a0b8078c3b portion of the path.  If this changes,
the RAID Manager software will not be able to communicate with the
controller.  This is why it is necessary to run Recovery Guru from BOTH
nodes.  This ensures that the device trees are updated on both nodes.
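
The WWN portion of each link target can be pulled out with standard
tools, which makes it easier to see which controller WWN a node currently
references.  The following is a minimal sketch only.  The function name
extract_wwn is hypothetical and is not part of RM6, and the sketch assumes
link targets of the form shown in the example above (the hex string
following "ssd@w", up to the next comma).

```shell
#!/bin/sh
# Hypothetical helper (not an RM6 command): read "ls -l" output on
# standard input and print each distinct WWN found in the link targets.
extract_wwn() {
    # The WWN is the hex string between "ssd@w" and the following comma.
    sed -n 's/.*ssd@w\([0-9a-fA-F]*\),.*/\1/p' | sort -u
}

# Example use on a node (c3t5 is the device from the example above and
# will differ on other systems):
#   ls -l /dev/osa/dev/dsk/c3t5d*s2 | extract_wwn
```

Lines that do not contain an "ssd@w" component produce no output, so the
helper can be fed a whole directory listing.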

The most likely cause of a controller going offline is a combination of
a hardware failure and the customer running RM6 commands from two hosts
at the same time (which could also compound an initial hardware
failure).  A controller can experience a PCI error and then continue to
function with no further hardware errors; such errors are transient.
Therefore, do not swap hardware when a second error occurs within a set
period of time.


IMPLEMENTATION: 

         ---
        |   |   MANDATORY (Fully Proactive)
         ---    
         
  
         ---
        |   |   CONTROLLED PROACTIVE (per Sun Geo Plan) 
         --- 
         
                                
         ---
        | X |   REACTIVE (As Required)
         ---


CORRECTIVE ACTION:

The following recommendations are provided as guidelines for authorized
Enterprise Services Field Representatives to avoid the above-mentioned
problem.

Adhere to the following guidelines to prevent controllers from going
offline:

. Recovery Guru must be run from both nodes in the cluster (one at a
  time) after the controller is replaced.  Otherwise, the node on which
  Recovery Guru was not run will have a device tree (WWN) that is not
  in sync with the new controller.

. Do not run RM6 commands (including explorer) from multiple hosts at
  the same time.
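
One way to check that both nodes reference the same controller WWNs after
Recovery Guru has been run on each is to save an "ls -l" listing of
/dev/osa/dev/dsk on each node and compare the WWNs found in the two files.
The following is a sketch under stated assumptions, not a supported
procedure.  The function name compare_wwns and the file names are
hypothetical, and the sketch assumes device paths containing an
"ssd@w<WWN>," component.

```shell
#!/bin/sh
# Hypothetical check (not an RM6 command): compare the WWNs referenced in
# two saved "ls -l /dev/osa/dev/dsk" listings, one captured on each node.
compare_wwns() {
    # $1 and $2 are the files holding the saved listing from each node.
    wwns_a=$(sed -n 's/.*ssd@w\([0-9a-fA-F]*\),.*/\1/p' "$1" | sort -u)
    wwns_b=$(sed -n 's/.*ssd@w\([0-9a-fA-F]*\),.*/\1/p' "$2" | sort -u)
    if [ "$wwns_a" = "$wwns_b" ]; then
        echo "WWNs match"
    else
        echo "WWN mismatch"
        return 1
    fi
}

# Example use (file names are placeholders):
#   compare_wwns node1.listing node2.listing
```

A mismatch suggests that one node still references the old controller and
that Recovery Guru should be run on that node.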


COMMENTS:

None

============================================================================

Implementation Footnote:

i)   In case of MANDATORY FINs, Enterprise Services will attempt to    
     contact all affected customers to recommend implementation of 
     the FIN. 
   
ii)  For CONTROLLED PROACTIVE FINs, Enterprise Services mission critical    
     support teams will recommend implementation of the FIN  (to their  
     respective accounts), at the convenience of the customer. 

iii) For REACTIVE FINs, Enterprise Services will implement the FIN as the   
     need arises.
----------------------------------------------------------------------------
All released FINs and FCOs can be accessed using your favorite network 
browser as follows:
 
SunWeb Access:
-------------- 
* Access the top level URL of http://sdpsweb.ebay/FIN_FCO/

* From there, select the appropriate link to query or browse the FIN and
  FCO Homepage collections.
 
SunSolve Online Access:
-----------------------
* Access the SunSolve Online URL at http://sunsolve.Corp/

* From there, select the appropriate link to browse the FIN or FCO index.

Supporting Documents:
---------------------
* Supporting documents for FIN/FCOs can be found on Edist.  Edist can be 
  accessed internally at the following URL: http://edist.corp/.
  
* From there, follow the hyperlink path of "Enterprise Services
  Documentation" and click on "FIN & FCO attachments", then choose the
  appropriate folder, FIN or FCO.  This will display supporting
  directories/files for FINs or FCOs.
   
Internet Access:
----------------
* Access the top level URL of https://infoserver.Sun.COM
--------------------------------------------------------------------------
General:
--------
* Send questions or comments to finfco-manager@Sun.COM
---------------------------------------------------------------------------