SRDB ID:   17479
Synopsis:  Resolving Hardhang problems on Ultra[TM] Servers
Date:      13 Dec 2002

Status:    Issued

Description
Problem Definition
------------------
None of the terminals respond, the console does not respond, ping/telnet 
gets no response, Stop-A does not break to the OBP, and "send break" from a 
tip line does not break into the OBP.  If all of the above have been tried 
and fail to break out of the hang, the system is hard hung.  It is almost 
impossible for Engineering to determine the cause of the hang if there is 
no core file to analyze.
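
The checks in the problem definition can be kept as a simple checklist.  The
following Python sketch is illustrative only: the function name
is_hard_hang and the probe names are hypothetical and are not part of any
Sun tool.  It reports a hard hang only when every listed probe has been
tried and has failed.

```python
# Hypothetical checklist helper; the names below are illustrative only.
PROBES = ("terminals", "console", "ping", "telnet", "stop_a", "send_break")


def is_hard_hang(responded):
    """Return True only if every probe was tried and none got a response.

    `responded` maps a probe name to True (the system responded) or
    False (no response).  A probe missing from the mapping has not been
    tried yet, so the result is False until all of them are recorded.
    """
    return all(name in responded and not responded[name] for name in PROBES)
```

A system for which any probe still responds, or for which a probe has not
yet been attempted, is not classified as hard hung by this sketch.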


Keywords: kernel, hang, hard, core, obp, ok
                       
SOLUTION SUMMARY:
Solutions
---------
These steps neither provide a final solution nor detect the cause of the
hardhang, but they will help in getting a core file to analyze the problem.
In all the cases listed below, once you are at the OBP, type "sync" to get
the core dump.  If the system was booted with kadb, do some initial
analysis first and then type $q to enter the OBP.

NOTE: 
This document was written specifically for sun4u architecture systems.  
While many of these instructions will be applicable to other architectures, 
some will not.  XIR is only available on Ultra Enterprise[TM] systems.


Options
-------

 1. Enable Deadman
 2. Set Breakpoint
 3. Install Hardhang Kernel
 4. XIR


1. Enable Deadman
-----------------
Deadman code is enabled by setting snooping in /etc/system.  Make the 
following entries in the /etc/system file:
    set snooping = 1 
    set snoop_interval = 9000000                       

The snooping=1 entry enables the deadman code.

The snoop_interval=9000000 entry will trigger the deadman after 90 seconds
(against the default of 500 seconds) of system inactivity (no clock
interrupts).
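
Before rebooting, it can be useful to confirm that the entries are actually
present and not commented out.  The following Python sketch is one way to
check; the function names parse_system_settings and deadman_settings are
hypothetical, and the parser only handles simple "set name = value" lines,
treating lines that begin with "*" as comments.

```python
def parse_system_settings(text):
    """Collect simple `set name = value` entries from /etc/system text."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines (which begin with '*').
        if not line or line.startswith("*"):
            continue
        parts = line.split(None, 1)
        if len(parts) != 2 or parts[0] != "set":
            continue
        name, sep, value = parts[1].partition("=")
        if sep:
            settings[name.strip()] = value.strip()
    return settings


def deadman_settings(text):
    """Return (snooping enabled?, snoop_interval value or None)."""
    settings = parse_system_settings(text)
    enabled = settings.get("snooping") == "1"
    return enabled, settings.get("snoop_interval")
```

Pass the function the contents of the file, for example the string read
from /etc/system.  The interval is returned as written in the file; the
sketch does not interpret its units.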

Reboot the system with kadb:
    ok boot kadb

When the next hang occurs, the deadman timer should trigger and the system 
should drop into kadb:
        # ~stopped      at      0xfbd01028:     ta      0x7d
        kadb[0]: 

At this point, any specific debugger commands can be run to examine the 
current state of the system.  Of particular interest are:
        $r              dump the registers
        $c              dump the current stack backtrace
        freemem/D       see how much memory is free

When kadb debugging is complete, attempt to take a core dump by doing:
        kadb[0]: $q
                
        ok sync
                 

As of Solaris[TM] 8, the system no longer drops to the ok prompt.  Instead
it initiates a panic sequence that sets the panic string to "deadman: timed
out after %d seconds of clock inactivity", creates the core image, and
resets the system.

Pros:   Easy to enable.
Cons:   Requires a system reboot.
        Cannot break to kadb/OBP if the level 14 interrupt is blocked.
        Cannot break a hang caused by a device other than the CPU seizing a
        system bus.


2. Set Breakpoint
-----------------
The system must have been booted with kadb.  After the system comes up, enter 
kadb (Stop-A/"send break") and set a breakpoint in system_high_handler().
This function is invoked only on level 15 interrupts, which are associated
with fan failures and system board insertion.

To set the breakpoint in kadb:
    kadb: (type return)
    kadb[0]: system_high_handler:b
    kadb[0]: :c


When the system hardhangs again, follow the procedure described in the 
section "Generating a Level 15 Interrupt".


Pros:   Will succeed in some instances where 'snooping' does not.
Cons:   Requires a reboot if kadb is not enabled.
        Requires a free system board slot.
        Cannot break a hang caused by a device other than the CPU seizing a
        system bus.
        Will fail if level 15 interrupts have been masked out.


3. Install Hardhang Kernel
--------------------------
A special kernel needs to be built and installed at the customer site.
Additionally, the breakpoint in system_high_handler() should be set through
kadb (see the above section "Set Breakpoint").

The system is now set up to break out of the hang.  Should the system 
hardhang, follow the procedure described in the section "Generating a Level 
15 Interrupt".


Pros:   Will succeed even if all the interrupts are masked.
Cons:   Requires a custom kernel.
        Will fail if all the CPUs have PSTATE_IE = 0.
        Requires a free system board slot.
        Cannot break a hang caused by a device other than the CPU seizing a
        system bus.

4. XIR
------
This is the last resort in case the interrupts have been disabled.  XIR is
a non-maskable interrupt and will definitely break the system out of the
hang.  Unfortunately, this method also clears memory, so a core dump
cannot be taken.  It does, however, provide some information about the CPU
state at the time of the hang.

The remote Externally Initiated Reset (XIR) command, although limited in 
its current form, can be used to aid software debugging of hung systems.  
Currently XIR stores the following information for each CPU:

    TL       (Trap Level)
    TT       (Trap Type)
    TPC      (Program Counter)
    TNPC     (Next Program Counter)
    TSTATE   (Trap State Register)

This information is then gathered by typing .xir-state-all at the OBP.
(You may need to Stop-A/"send break" to stop the machine from rebooting in
order to issue this command.)

There are 2 methods for initiating the XIR:

 Method 1:
 
 Press the XIR switch on the clock board, which is at the rear of the E4000 
 (the FE handbook notes the location of the XIR switch).  To the right of 
 the XIR switch is the POR switch; DO NOT press it, as it will cycle power.
 When XIR is pressed, the system will come to the "ok" prompt (wait until 
 it does).  This method is easier than entering the key sequence noted in 
 Method 2.

 Method 2:

 Press Return key (twice)
 Press ~ key (once, possibly twice)
 Press Control-Shift-X keys (together)
  
 This key sequence should reboot the system.  At this point, you will need 
 to do a Stop-A/"send break" to get to the ok prompt.  


Once the system is at the OBP prompt, get the CPU state info:

    ok .xir-state-all

NOTE:   This information must be manually copied.
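
Because the values are transcribed by hand, it is easy to miss one.  The
following Python sketch checks a per-CPU record for completeness.  The
function name missing_xir_fields and the record layout (a dictionary keyed
by the register names listed above) are assumptions for illustration; the
sketch does not parse or reproduce the actual .xir-state-all output.

```python
# Register names taken from the list of information XIR stores per CPU.
XIR_FIELDS = ("TL", "TT", "TPC", "TNPC", "TSTATE")


def missing_xir_fields(record):
    """Return the XIR fields not yet copied down for one CPU.

    `record` maps a field name to the value transcribed from the console.
    Empty or absent values count as missing.
    """
    return [field for field in XIR_FIELDS if not record.get(field)]
```

An empty list means all five values have been recorded for that CPU.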

Then go to the following website: http://otis.uk/cgi-bin/xir-cgi.tcl to
get the details on what to do with this information.


Pros:   Will break out of the hang. 
Cons:   Will not be able to get a useful core file.
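
The trade-offs among the four options can be summarized in a small decision
sketch.  The Python function below is illustrative only: viable_options and
its parameter names are hypothetical, and it encodes just the limitations
listed in the Pros/Cons entries of this document.  In practice the state of
a hung system is usually unknown, so treat this as a summary of those
entries rather than a diagnostic tool.

```python
def viable_options(*, level14_blocked, level15_masked, all_cpus_ie_off,
                   free_slot, bus_seized_by_device):
    """List the options whose documented limitations do not apply."""
    options = []
    # Options 1-3 cannot break a hang caused by a device other than
    # the CPU seizing a system bus.
    if not bus_seized_by_device:
        if not level14_blocked:
            options.append("1. Enable Deadman")
        if free_slot and not level15_masked:
            options.append("2. Set Breakpoint")
        if free_slot and not all_cpus_ie_off:
            options.append("3. Install Hardhang Kernel")
    # XIR breaks out of the hang, but no useful core file is obtained.
    options.append("4. XIR")
    return options
```

For example, with level 14 and level 15 interrupts both masked but a free
slot available, only the hardhang kernel and XIR remain in the list.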


Generating a Level 15 Interrupt
-------------------------------
On a sun4u architecture system, a level 15 interrupt is generated when a
system board is inserted.  This interrupt is also generated by a fan failure,
on both the sun4u and sun4d architectures, but since the fans are not easily
accessible, board insertion is the method described here.  If, however, the
system in question is a sun4d, then disconnecting a fan will be the only
method available for generating a Level 15 interrupt.

When the system hangs, insert a system board into a free slot.  This will 
generate a level 15 interrupt, which should trigger the breakpoint in kadb.

Once in kadb, debugger commands can be run to examine the current state of 
the system.  Of particular interest are:
        $r              dump the registers
        $c              dump the current stack backtrace
        freemem/D       see how much memory is free

When kadb debugging is complete, attempt to take a core dump by doing:
        kadb[0]: $q
                
        ok sync



WARNING:
If a non-forced level 15 interrupt should occur on the system while the 
breakpoint is set or the debug kernel is in place, then the system will 
break to the OBP/kadb prompt.  The system cannot be used until control is
returned to the kernel, by typing "go" at the OBP, or :c at the kadb prompt.


                       
SUBMITTER: Nancy A LeBlanc
APPLIES TO: Hardware/Ultra Enterprise/Servers, Operating Systems/Solaris/Solaris 2.5.1, AFO Vertical Team Docs, AFO Vertical Team Docs/Kernel
ATTACHMENTS:


Copyright (c) 1997-2003 Sun Microsystems, Inc.