SRDB ID   Synopsis   Date
48493   Sun Fire[TM] 12K/15K: Dstop: CDC indicates an owner outside the domain   1 Nov 2002

Status Issued

Description
- Problem Statement:

    Dstop: CDC indicates an owner outside the domain

- Symptoms:

   'wfail' output reports something similar to the following:

       01  redxl> dumpf load dsmd.dstop.020510.0947.08
       02  Created Fri May 10 09:47:10 2002
       03  By hpost v. 1.2 Generic 112488-04 Mar 18 2002 14:43:00  executing as pid=6825
       04  On ssc name =  rasputin-sc0.SD_RASCAL.West.Sun.COM
       05  Domain =  0=A    Platform = rasputin
       06  Boards in dump: master SC    CPs/CSBs[1:0]: 3
       07            EXB[17:0]: 12100
       08          Slot0[17:0]: 12100
       09          Slot1[17:0]: 12100
       10  -D option, -d
       11  "DSMD DomainStop Dump"
       12  0 errors occurred while creating this dump.
       13  redxl> wfail
       14  SDI EX08/S0  Master_Stop_Status0[31:0] = E004000F
       15          MStop0[3:0]: All SDI logic is DStopped + Recordstopped.
       16  SDI EX08/S0  Dstop0[31:0] = 04218400
       17          Dstop0[16]: D    DARB texp requests all Dstop (M)
       18          Dstop0[21]: D    SDI internal STB port requested Dstop
       19          Dstop0[26]: D 1E AXQ requests Slot0 Dstop (M)
       20  SDI EX08/S0  Recordstop0[31:0]  = 00818080
       21          Rstop0[16]: R    DARB texp request Recordstop (M)
       22          Rstop0[23]: R 1E AXQ requests all Recordstop (M)
       23  AXQ EX08 ( 8) Error_Flag_07[31:0] = 020B8200  Mask = 63FF7D24
       24          Err7[16]: R    CDC0 correctable error
       25          Err7[17]: R    CDC0 address parity error
       26          Err7[19]: R    CDC1 correctable error
       27          Err7[25]: R 1E CDC uncorrectable error
       28  AXQ EX08 ( 8) Error_Flag_08[31:0] = 20002000  Mask = 0000FFFF
       29          Err8[29]: D    CDC indicates an owner outside the domain
       30  FAIL CDC Dimm EX8:  Dstop/Rstop detected by AXQ.
       31  Primary service FRU is EXB EX8.
       32  SDI EX13/S0: All SDI is DStopped and RStopped,         requested by DARB.
       33  SDI EX16/S0: All SDI is DStopped and RStopped,         requested by DARB.
       34  DARB C0: enabled ports (expanders)          [17:0]: 16100
       35  DARB C0: exps request Dstop+Rstop           [17:0]: 00100
       36  DARB C0: other darb req Dstop+Rstop for exps[17:0]: 00100
       37  DARB C1: enabled ports (expanders)          [17:0]: 16100
       38  DARB C1: exps request Dstop+Rstop           [17:0]: 00100
       39  DARB C1: other darb req Dstop+Rstop for exps[17:0]: 00100      

SOLUTION SUMMARY:
- Troubleshooting:

    The dump header tells us that this Dstop was generated by dsmd (lines 10,11)
    while a domain was active. This is also evident from the dumpf file name:
    dsmd.dstop files are created by dsmd as part of an ASR. Walking the
    error chain:

     - Master SDI on EX8 is directed to Dstop by AXQ8 (line 19)
     - Master SDI on EX8 is directed to Rstop by AXQ8 (line 22)
     - AXQ8 reports several CDC-related errors, all indicating Rstop (lines 24-27)
     - AXQ8 reports a fatal error in the CDC (line 29)
     - The CDC DIMM is FAILed from the configuration (line 30)
     - EX8 is named as the FRU (line 31)

    The CDC DIMM is divided into 3 SRAMs, read in parallel, forming a 3-way
    set associative cache. CDC entries contain information about lines of
    memory recently referenced by SSM logic.

    Any error (correctable or uncorrectable) in the CDC is recorded and
    logged, but does not by itself cause a Dstop. Entries with correctable
    errors are written back with the corrected data. Uncorrectable errors are
    treated as cache misses. Notice that the errors recorded in AXQ8's Err7
    register (lines 24-27) are all Recordstop events ('R' precedes the
    error description).

    However, from the name of the dump file (line 01) and the dsmd action
    (line 11), we know this is a Dstop. The Dstop is triggered because the
    data in the CDC indicates the owner of a cache line is a board that is
    not in the resources comprising this domain. Either AXQ8 wrote a bad
    entry, or the CDC entry has been trashed. In either scenario,
    this fault is deemed serious (coherency within the domain is 
    in question), thus the Dstop. So, although the first error is a 
    Recordstop (line 27), because another error requiring Dstop occurs, 
    the stop acted upon is a Dstop.
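
    The precedence described above (any Dstop-class bit outranks the
    Recordstop-class bits, even when the first error was a Recordstop) can be
    sketched in Python. This is an illustration only: parse_bit_lines and
    effective_stop are hypothetical helpers, not part of redx, and the regular
    expression assumes the annotated-bit line layout seen in the wfail output
    above ('D' or 'R', then an optional '1E' first-error marker).

```python
import re

# Matches annotated bit lines such as "Err7[25]: R 1E CDC uncorrectable error".
# Register header lines ("Error_Flag_07[31:0] = ...") do not match.
BIT_LINE = re.compile(r"(\w+)\[(\d+)\]:\s+([DR])\s+(1E\s+)?(.*\S)")

def parse_bit_lines(text):
    """Extract (register, bit, stop class, first-error flag, description)."""
    events = []
    for line in text.splitlines():
        m = BIT_LINE.search(line)
        if m:
            reg, bit, kind, first, desc = m.groups()
            events.append({"reg": reg, "bit": int(bit), "kind": kind,
                           "first": first is not None, "desc": desc})
    return events

def effective_stop(events):
    """A single 'D' event makes the whole stop a Dstop."""
    kinds = {e["kind"] for e in events}
    if "D" in kinds:
        return "Dstop"
    if "R" in kinds:
        return "Recordstop"
    return None

SAMPLE = """\
AXQ EX08 ( 8) Error_Flag_07[31:0] = 020B8200  Mask = 63FF7D24
        Err7[16]: R    CDC0 correctable error
        Err7[17]: R    CDC0 address parity error
        Err7[19]: R    CDC1 correctable error
        Err7[25]: R 1E CDC uncorrectable error
AXQ EX08 ( 8) Error_Flag_08[31:0] = 20002000  Mask = 0000FFFF
        Err8[29]: D    CDC indicates an owner outside the domain
"""

events = parse_bit_lines(SAMPLE)
first = [(e["reg"], e["bit"], e["kind"]) for e in events if e["first"]]
print(first)                   # [('Err7', 25, 'R')]
print(effective_stop(events))  # Dstop
```

    The first error (Err7[25]) is a Recordstop, yet the presence of Err8[29]
    makes the overall result a Dstop, matching the narrative above.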

    In this case, because of the sheer number of CDC-related errors, it is
    clear that the CDC is in dire straits. That's why the CDC DIMM is FAILed
    from the configuration (line 30). The CDC is not a FRU, so the expander
    must be replaced (line 31).

    Also note the blacklisting suggestion made by wfail:

       40  redxl> wfail -B
       41  membrd           SB8                     # redx wfail of dump 020510.0947.10

    With memory on SB8 blacklisted, there is no home memory within EX8, so
    the CDC DIMM on EX8 is not used.
     
- Resolution:

    Repair/replace EX8.       

- Summary of part number and patch IDs

    http://infoserver.central.sun.com/data/syshbk/Systems/SunFire15K/component.centerplane.html       
        
- References and bug IDs

    SunSolve Article 48122 
    SDI ASIC Specification
    Starcat Architecture, 11/07/2000
  

- Additional background information:

    The details of the CDC DIMM entries in error are available in the AXQ data
    capture. First, understand the format of a CDC entry:

     SHARED ENTRY
     ============

       |3|                  2|                  1|                   |
       |0|9|8|7|6|5|4|3|2|1|0|9|8|7|6|5|4|3|2|1|0|9|8|7|6|5|4|3|2|1|0|
       +-+-----------------------------------+-----------------------+
       |1|       Bitmask of sharers          |          Tag          |
       +-+-----------------------------------+-----------------------+

     OWNED ENTRY
     ===========

       |3|                  2|                  1|                   |
       |0|9|8|7|6|5|4|3|2|1|0|9|8|7|6|5|4|3|2|1|0|9|8|7|6|5|4|3|2|1|0|
       +-+---------------------+-+-+---------+-----------------------+
       |0|        Unused       |V|R|  Owner  |          Tag          |
       +-+---------------------+-+-+---------+-----------------------+

         V = Valid Entry (1 = valid, 0 = invalid)
         R = Retention priority

    Bit [30] indicates whether the entry is Shared (1) or Owned (0). In Owned
    entries, bits [16:12] indicate the boardset that owns the line. The owner
    field is only valid if bit [18] (V) is set. In Shared entries, bits [29:12]
    indicate which boardsets contain a copy of the cache line. A Shared entry
    has no separate valid bit: bit [30] being set implies a valid entry.
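
    The two layouts above can be decoded mechanically. The following Python
    sketch is illustrative only: decode_cdc_entry is a hypothetical helper,
    not a redx function, and it assumes that any bits above [30] in a captured
    word are not part of the entry format and can be ignored.

```python
def decode_cdc_entry(word):
    """Decode bits [30:0] of a CDC entry per the Shared/Owned layouts."""
    tag = word & 0xFFF                          # Tag, bits [11:0]
    if (word >> 30) & 1:                        # bit [30] = 1: Shared entry
        return {
            "kind": "shared",
            "valid": True,                      # Shared implies valid
            "sharers": (word >> 12) & 0x3FFFF,  # bits [29:12], one per boardset
            "tag": tag,
        }
    return {                                    # bit [30] = 0: Owned entry
        "kind": "owned",
        "valid": bool((word >> 18) & 1),        # V, bit [18]
        "retention": (word >> 17) & 1,          # R, bit [17]
        "owner": (word >> 12) & 0x1F,           # Owner, bits [16:12]
        "tag": tag,
    }

entry = decode_cdc_entry(0xD0000C9E)
print(entry["kind"], hex(entry["sharers"]), hex(entry["tag"]))
# prints: shared 0x10000 0xc9e
```

    The value 0xD0000C9E is the low 32 bits of the CDC 0 sram data from the
    shaxq output later in this article (line 55). The result agrees with the
    decode printed by redx on line 56 (Shared, Mask = 10000, Tag = C9E).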

    A 3-wide CDC entry spanning the 3 CDC SRAMs is further protected by
    8-bit ECC. A 3-wide entry also uses 3 bits of LRM (Least Recently
    Modified) to help select an entry during victimization.

    Examine this dump example:

       42  redxl> shaxq -e 8
       43  Note: Data is displayed from the currently loaded dump file.
       44  AXQ  EX8 (8)   Component ID = C4312049   Rev 6.0
       45                Error_Flag_00[31:0] = 00000000  Mask = 0000FFFF
       46                Error_Flag_01[31:0] = 00000000  Mask = 4000FFFF
       47                Error_Flag_02[31:0] = 00000000  Mask = 0000FFFF
       48                Error_Flag_03[31:0] = 00000000  Mask = 21005EFF
       49                Error_Flag_04[31:0] = 00000000  Mask = 01FEFFFF
       50                Error_Flag_05[31:0] = 00000000  Mask = 1024FFFF
       51                Error_Flag_06[31:0] = 00000000  Mask = 7E00FFFF
       52                Error_Flag_07[31:0] = 020B8200  Mask = 63FF7D24
       53          Err7[16]: R    CDC0 correctable error
       54              CDC error count[3:0] = A  Read Addr[18:0] = 19172 (GoodApar= 0)
       55              CDC 0 sram data[35:0] = E.D0000C9E
       56              CDC0 entry: Shared, Mask = 10000, Tag = C9E
       57              CDC 1 sram data[35:0] = F.50000E1E
       58              CDC1 entry: Shared, Mask = 10000, Tag = E1E
       59              CDC 2 sram data[35:0] = A.50000D9E
       60              CDC2 entry: Shared, Mask = 10000, Tag = D9E
       61              ECC Syndrome[7:0] = 88: Uncorrectable Error
       62              cdc_errsave1[19]: Capture is for Outside Domain Error
       63              LRU[3:0] = A
       64              cdc_errsave0[3:0][31:0] = 6D0000 C9EF5000 0E1EA500 00D9E880
       65              cdc_errsave1[31:0] = 0A099172
       66          Err7[17]: R    CDC0 address parity error
       67              CDC error save data is displayed above.
       68          Err7[19]: R    CDC1 correctable error
       69              CDC error save data is displayed above.
       70          Err7[25]: R 1E CDC uncorrectable error
       71              CDC error save data is displayed above.
       72                Error_Flag_08[31:0] = 20002000  Mask = 0000FFFF
       73          Err8[29]: D    CDC indicates an owner outside the domain
       74              CDC error save data is displayed above.
       75                Error_Flag_09[31:0] = 00000000  Mask = 7E00FFFF
       76                Error_Flag_10[31:0] = 00000000  Mask = 7C00FFFF
       77                Error_Flag_11[31:0] = 00000000  Mask = 7FF0FFFF
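
    The summary fields redx prints on lines 54 and 62 can be related back to
    the raw cdc_errsave1 word on line 65. The Python sketch below is a guess
    at that mapping, not a documented register layout: bit [19] is taken from
    the redx label on line 62, while the positions assumed for Read Addr
    (bits [18:0]) and the error count (bits [27:24]) are inferred only from
    the fact that they reproduce the values in this dump.
    decode_cdc_errsave1 is a hypothetical helper.

```python
def decode_cdc_errsave1(value):
    """Split cdc_errsave1 into the fields redx summarizes (positions partly assumed)."""
    return {
        "read_addr": value & 0x7FFFF,                # ASSUMED: Read Addr[18:0] in bits [18:0]
        "outside_domain": bool((value >> 19) & 1),   # cdc_errsave1[19], per redx label
        "error_count": (value >> 24) & 0xF,          # ASSUMED: count[3:0] in bits [27:24]
    }

fields = decode_cdc_errsave1(0x0A099172)
print(hex(fields["read_addr"]), fields["outside_domain"], hex(fields["error_count"]))
# prints: 0x19172 True 0xa
```

    Under these assumptions the decode matches line 54 (error count = A,
    Read Addr = 19172) and line 62 (capture is for Outside Domain Error).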

    Let's focus on the CDC0 entry (lines 55,56). The CDC0 entry is decoded by
    redx. We see the line is shared and the Mask is 10000, so SB16 is the
    only sharer for this cache line. According to the dump header, SB16 is in
    the domain (line 08). So the data capture gives no indication of an owner
    outside the domain resources. Since the first error was an uncorrectable
    ECC error, the data in the capture is likely from that event. Subsequent
    CDC errors are not captured until after the dump is collected and the
    ASICs are rearmed.
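
    The membership check in the preceding paragraph is simple mask arithmetic.
    A minimal Python sketch follows, assuming that bit n of both the
    Slot0[17:0] header mask and the CDC sharer mask refers to board/expander n
    (as the SB16 reading above implies). boards_in_mask and
    sharers_outside_domain are hypothetical helper names.

```python
def boards_in_mask(mask, width=18):
    """Return the board numbers whose bits are set in a [17:0] mask."""
    return [n for n in range(width) if (mask >> n) & 1]

def sharers_outside_domain(sharers, domain):
    """Boards named in the sharer mask that are absent from the domain mask."""
    return boards_in_mask(sharers & ~domain)

DOMAIN_SLOT0 = 0x12100   # dump header, line 08: Slot0[17:0]: 12100
CDC0_SHARERS = 0x10000   # shaxq output, line 56: Mask = 10000

print(boards_in_mask(DOMAIN_SLOT0))                         # [8, 13, 16]
print(boards_in_mask(CDC0_SHARERS))                         # [16]
print(sharers_outside_domain(CDC0_SHARERS, DOMAIN_SLOT0))   # []
```

    The empty result confirms that the captured CDC0 entry names no board
    outside the domain, which is why the capture is attributed to the earlier
    uncorrectable ECC event rather than to the outside-domain error.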

    Note that this fault was injected by grounding part of the pathway between
    the CDC SRAM and AXQ. Fault injection aside, if a "CDC indicates an owner
    outside the domain" error occurs, it implies one of two things:
        
       o The AXQ is writing faulty ownership/sharer data to the CDC entries
       o Multiple bit flips occurred in a CDC entry

    In any case, the expander is the FRU.
        
        
- Meta-Data/Problem categorization:

Product/Platform: SF12K/SF15K
Category:

- Keywords

15K, 12K, SF15K, SF12K, Sun Fire 15K, Enterprise, Server, Sun Fire 12K,
starcat, dstop, CDC indicates an owner outside the domain

         

INTERNAL SUMMARY:

SUBMITTER: Scott Davenport
APPLIES TO: Hardware/Sun Fire /15000, Hardware/Sun Fire /12000
ATTACHMENTS:


Copyright (c) 1997-2003 Sun Microsystems, Inc.