

FIN #: I0643-1

SYNOPSIS: StorEdge A3x00 configuration with RAID Manager

DATE: Jun/29/01

KEYWORDS: StorEdge A3x00 configuration with RAID Manager


---------------------------------------------------------------------
- Sun Proprietary/Confidential: Internal Use Only -
---------------------------------------------------------------------  
                            FIELD INFORMATION NOTICE
                  (For Authorized Distribution by SunService)



SYNOPSIS: Sun StorEdge A3x00 configurations with RAID Manager 6.1.x
          may be susceptible to controller "deadlocks".


Sun Alert:          Yes

TOP FIN/FCO REPORT: Yes

PRODUCT_REFERENCE:  StorEdge A3x00 & RAID Manager 6.22 Upgrade

PRODUCT CATEGORY:   Storage /SW Admin

PRODUCTS AFFECTED:

Mkt_ID   Platform   Model   Description                     Serial Number
------   --------   -----   -----------                     -------------

Systems Affected
----------------

  -      ANYSYS       -     System Platform Independent           -

X-Options Affected
------------------

SG-XARY351A-180G     -   -   A3500 1 CONT MOD/5 TRAYS/18GB        -
SG-XARY353A-1008G    -   -   A3500 2 CONT/7 TRAYS/18GB            -
SG-XARY353A-360G     -   -   A3500 2 CONT/7 TRAYS/18GB            -
SG-XARY355A-2160G    -   -   A3500 3 CONT/15 TRAYS/18GB           -
SG-XARY360A-545G     -   -   545-GB A3500 (1X5X9-GB)              -
SG-XARY360A-90G      -   -   A3500 1 CONT/5 TRAYS/9GB(10K)        -
SG-XARY362A-180G     -   -   A3500 2 CONT/7 TRAYS/9GB(10K)        -
SG-XARY362A-763G     -   -   A3500 2 CONT/7 TRAYS/9GB(10K)        -
SG-XARY364A-1635G    -   -   A3500 3 CONT/15 TRAYS/9GB(10K)       -
SG-XARY366A-72G      -   -   A3500 1 CONT/2 TRAYS/9GB(10K)        -
SG-XARY380A-1092G    -   -   1092-GB A3500 (1x5x18-GB)            -
SG-XARY360B-90G      -   -   ASSY,TOP OPT,1X5X9,MIN,9GB,10K       -
SG-XARY360B-545G     -   -   ASSY,TOP OPT,1X5X9,MAX,9GB,10K       -
SG-XARY362B-180G     -   -   X-OPT,2X7X9,MIN,FCAL,9G10K           -
SG-XARY374B-273G     -   -   ASSY,TOP OPT,3X15X9,MIN,9GB,10K      -

PART NUMBERS AFFECTED:

Part Number   Description                             Model
-----------   -----------                             -----

798-0522-01   RAID Manager 6.1.1                        -
798-0522-02   RAID Manager 6.1.1 Update 1               -
798-0522-03   RAID Manager 6.1.1 Update 2               -

REFERENCES:

BugId:       4374789 4365488

ESC:         527456 527588 527271

SunAlert ID: SA-24267

DOC:         806-7792-12: Sun StorEdge RAID Manager 6.22 Upgrade
             Procedure.

URL:         http://acts.ebay.sun.com/storage/A3500/RM622


PROBLEM DESCRIPTION:

It is possible for A3000 and A3500 (hereafter referred to as A3x00)
controllers running RAID Manager 6.1.x (hereafter referred to as
RM6.1.x) to deadlock, interrupting data availability.  The deadlock
condition is most likely to occur during controller state changes,
including failover, failback, a mode change from passive to active (or
vice versa), or an upgrade from RM6.1.x to RM6.22.  The deadlock is
most likely to occur on A3x00 modules running firmware 02.05.06.39,
but it has also occurred with firmware 02.05.06.32.

This bug can result in an inability to fail LUNs over to an alternate
controller in the case of a data path or controller failure.  It can
also result in an inability to upgrade controller firmware from
02.05.06.xx to 03.xx through the graphical user interface (GUI) or
command line interface (CLI), which is a required part of the RM6.22
software upgrade procedure.  This bug can also cause controller
deadlocks during controller mode changes.

All of the above scenarios are critical aspects of A3x00 care and
maintenance.

The 02.05.xx.xx firmware contains a bug that can cause the A3x00
controllers to hang.  These problems can occur on Sun StorEdge A3x00
configurations running RM6.1.1, RM6.1.1 Update 1, or RM6.1.1 Update 2.
Systems running RM6.22 on A3x00 and A3500FC arrays, and systems with
A1000s, are not affected.

Controller Deadlock Determination
=================================

You may be experiencing a controller deadlock condition if any of the
following symptoms occur:

  . Controller hang while attempting an upgrade from RM6.1.x
    to RM6.22.
  . Cessation of I/O to a controller pair after controller/path
    failover.
  . Cessation of I/O to a controller pair after controller/path
    failback.
  . Cessation of I/O to a controller pair after a controller
    state change either from active to passive or vice versa.
  . Cessation of I/O to a controller pair while attempting
    an online upgrade from RM6.1.x to RM6.22.

    Note: Cessation of I/O to a controller can be indicated in
          various ways which are configuration dependent.  Clues
          can be obtained from console messages, log files,
          and disk or disk tray LEDs.

You are NOT experiencing a controller deadlock condition if the
following is true:

  . Upon examination of the /var/adm/messages file, you see
    I/O's being re-routed from the failed controller to the active
    controller.
  . The disk drives in the array are servicing I/O requests from
    the host. A good indication would be blinking LEDs on the
    disk drives (in the case of an A3000) or disk drive trays.
  . Unfailing the controller via the RM6 GUI or CLI is successful.
    This will be indicated by an ASC/ASCQ combination of 95/02
    in the status log (GUI) or /usr/lib/osa/rmlog.log file.
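
The log check above can be condensed into a small shell sketch.  This
is illustrative only: the sample log line written below is hypothetical
and does not reproduce the exact rmlog.log format.  On a live system,
point LOG at /usr/lib/osa/rmlog.log instead of building a sample file.

```shell
#!/bin/sh
# Sketch: look for the ASC/ASCQ 95/02 entry that indicates a
# successful controller unfail (i.e. the controller is NOT
# deadlocked).  The sample log built here is hypothetical; on a
# live system use LOG=/usr/lib/osa/rmlog.log instead.
LOG=/tmp/rmlog.sample.$$
cat > "$LOG" <<'EOF'
Fri Jun 29 10:14:02 2001  c1t5d0  ASC/ASCQ: 95/02  controller unfailed
EOF

if grep 'ASC/ASCQ: 95/02' "$LOG" > /dev/null; then
    STATUS=unfailed
    echo "unfail entry found - controller is not deadlocked"
else
    STATUS=unknown
    echo "no unfail entry - investigate further"
fi
rm -f "$LOG"
```

Absence of the 95/02 entry does not by itself prove a deadlock; use
the serial-port procedure below for a definitive determination.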

The ONLY sure way to determine if the array controllers are being held
in deadlock is to access the controller shell via the serial port.  Use
the following procedure to tell if the array controllers are held in
deadlock:

WARNING: It is VERY important that this NOT be attempted unless
         you are familiar with using the diagnostic port of the Series 3
         RAID Controller!

  . Log in to an active or unfailed controller.
  . Type "I" <enter> and look for a task called "iopathTask"
    under the section "ALL TASKS".
  . Scroll down until you see the section "TASK INFORMATION" that
    corresponds to the "iopathTask".
  . Look for the operations "mode_select" or "write_buffer" in the
    "TASK INFORMATION" section. If you see either of these, you
    have encountered the controller deadlock condition.
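
If the controller shell session is captured to a file (for example, by
logging a tip session on the service host), the check can be scripted.
The transcript below is hypothetical and only approximates the layout
of the "I" command output:

```shell
#!/bin/sh
# Sketch: scan a captured controller shell transcript (output of
# the "I" command) for the iopathTask deadlock signature.  The
# transcript written here is hypothetical; capture the real one
# over the controller's serial port.
CAP=/tmp/ctlr_shell.capture.$$
cat > "$CAP" <<'EOF'
ALL TASKS
  tName: iopathTask
TASK INFORMATION
  task:      iopathTask
  operation: mode_select
EOF

DEADLOCK=no
if grep 'iopathTask' "$CAP" > /dev/null &&
   grep -E 'mode_select|write_buffer' "$CAP" > /dev/null; then
    DEADLOCK=yes
fi
echo "controller deadlock: $DEADLOCK"
rm -f "$CAP"
```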


Bug Description
===============

When controller B receives a request sense command from the host, it
will allocate a SCSI_Op (a piece of controller memory) from a
pre-allocated memory pool. It will set a flag in SCSI_Op to indicate
the request sense command is a high priority I/O and needs to be
processed immediately.

NOTE: Controller B processes R/W I/Os and high priority I/Os using
      "I/O" paths, and non R/W I/Os using "nonRW" paths.

When controller B completes processing the request sense command, it
does not clear the flag properly.  Once set, the flag remains set in
the SCSI_Op permanently.  (This is the firmware bug.)

When the SCSI_Op with the priority flag set is allocated to process a
mode select page 2C command (a non-RW I/O used to fail controller A),
the controller mistakenly considers the mode select command to be a
high priority I/O and processes it through the "I/O" path.  Mode
select is not a high priority I/O and should be processed through the
"nonRW" path.

Since mode select was requested to fail the alternate controller
(controller A), controller B would first suspend all the LUNs (stop R/W
I/Os), and then reconfigure the cache for active/fail mode.

If the cache is dirty, the cache data is required to be flushed through
the "I/O" path. The cache manager would construct I/Os and queue them
to the "I/O" path handler.

However, the "I/O" path is now processing the mode select command,
which will not complete until the cache manager has flushed the dirty
cache.  On the other hand, the cache manager cannot flush the cache
data because the "I/O" path is busy with the mode select request
(which should not have been processed through the "I/O" path in the
first place).

As a result, the controller is trapped in a deadlock.  Eventually, the
host will time out the mode select and retry.  If the controller is
not released from the deadlock, it will only receive I/Os but not
process them.

More I/Os will be timed out by the host, and the host will send a BDR
(bus device reset) to the controller.  Since the controller is set to
handle a BDR using a soft reset (determined by an NVSRAM bit - offset
28, bit 5), the controller will only abort all outstanding I/Os, free
all allocated resources, and clear any SCSI reservations and CA
conditions without rebooting the controller. This will not release the
controller from deadlock.

NOTE: The controller will stay in this situation until a bus reset
occurs or the controllers are power-cycled.
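
The routing error described above can be condensed into a minimal
sketch.  All names here (PRIORITY_FLAG, PATH_USED, and the two shell
functions) are illustrative stand-ins for the firmware's internal
state, not real RM6 or controller firmware interfaces:

```shell
#!/bin/sh
# Minimal sketch of the firmware bug: the priority flag set on a
# pooled SCSI_Op by a request sense is never cleared, so a later
# mode select that reuses the SCSI_Op is routed down the "I/O"
# path instead of the "nonRW" path.
PRIORITY_FLAG=0            # flag inside the one pooled SCSI_Op

request_sense() {
    PRIORITY_FLAG=1        # request sense is a high priority I/O
    # BUG: the flag is not cleared when the SCSI_Op is returned
    # to the pool
}

route_mode_select() {
    # mode select page 2C should always take the "nonRW" path
    if [ "$PRIORITY_FLAG" -eq 1 ]; then
        PATH_USED="I/O"    # wrong path: sets up the deadlock
    else
        PATH_USED="nonRW"  # correct path
    fi
    echo "mode select routed via $PATH_USED path"
}

route_mode_select          # fresh SCSI_Op: correct routing
request_sense              # leaves the stale flag behind
route_mode_select          # recycled SCSI_Op: wrong routing
```

The second route_mode_select call takes the wrong path because the
stale flag was never cleared, which is exactly the condition that
leads to the cache-flush deadlock described above.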


IMPLEMENTATION:

         ---
        |   |   MANDATORY (Fully Pro-Active)
         ---


         ---
        | X |   CONTROLLED PRO-ACTIVE (per Sun Geo Plan)
         ---


         ---
        |   |   REACTIVE (As Required)
         ---


CORRECTIVE ACTION:

An Authorized Enterprise Field Service Representative may avoid the
above-mentioned problems for A3x00 controllers running RAID Manager
6.1.x with firmware 02.05.06.39 or lower by following the
recommendations shown below.

A. To prevent the occurrence of the above failure
=================================================

To avoid the possibility of this problem occurring in the future, an
upgrade to the latest version of the RAID Manager software is
required.  For instructions on upgrading to RM6.22, refer to the
document "Sun StorEdge RAID Manager 6.22 Upgrade Procedure",
806-7792-12, which is located at:

       http://acts.ebay.sun.com/storage/A3500/RM622

NOTE: The "Sun StorEdge RAID Manager 6.22 Upgrade Procedure" is
      intended for single-node, direct-attach systems, and dual-node,
      direct-attach systems running Sun Cluster 2.2 or VCS software.
      This procedure is not for use with clusters of more than two
      nodes.

B. Once the above failure has occurred
======================================

First, determine that a controller deadlock exists by using the
information provided in the Problem Description section above.

When the controllers get into this deadlock situation, there are three
reliable ways to return them to a sane state and release them from the
deadlock condition.  For Solaris 2.6 and 7 configurations, the first
option is recommended.  For Solaris 2.5.1 configurations, only the
second and third options are available.

1) For Solaris 2.6, 7 configurations there is a utility located at

       http://acts.ebay.sun.com/storage/A3500/RM622 

This utility causes a SCSI bus reset on selected SCSI buses.  The
script should be run at the time of the controller deadlock as an
immediate remedy.  Refer to the README file that is included with the
utility for usage instructions.

  OR

2) Physically power-cycle both of the A3x00 controllers.

  OR

3) Reboot the affected host.


COMMENTS:

----------------------------------------------------------------------------

Implementation Footnote:

i)   In case of MANDATORY FINs, Enterprise Services will attempt to
     contact all affected customers to recommend implementation of
     the FIN.

ii)  For CONTROLLED PROACTIVE FINs, Enterprise Services mission critical
     support teams will recommend implementation of the FIN  (to their
     respective accounts), at the convenience of the customer.

iii) For REACTIVE FINs, Enterprise Services will implement the FIN as
     the need arises.
----------------------------------------------------------------------------
All released FINs and FCOs can be accessed using your favorite network
browser as follows:

SunWeb Access:
--------------
* Access the top level URL of http://sdpsweb.ebay/FIN_FCO/

* From there, select the appropriate link to query or browse the FIN and
  FCO Homepage collections.

SunSolve Online Access:
-----------------------
* Access the SunSolve Online URL at http://sunsolve.Corp/

* From there, select the appropriate link to browse the FIN or FCO
  index.

Supporting Documents:
---------------------
* Supporting documents for FIN/FCOs can be found on Edist.  Edist can be
  accessed internally at the following URL: http://edist.corp/.

* From there, follow the hyperlink path of "Enterprise Services
  Documentation" and click on "FIN & FCO attachments", then choose the
  appropriate folder, FIN or FCO.  This will display supporting
  directories/files for FINs or FCOs.

Internet Access:
----------------
* Access the top level URL of https://infoserver.Sun.COM
--------------------------------------------------------------------------
General:
--------
* Send questions or comments to finfco-manager@Sun.COM
---------------------------------------------------------------------------



Copyright (c) 1997-2003 Sun Microsystems, Inc.