SRDB ID   Synopsis   Date
18300   Troubleshooting hung E10K domains   23 Jan 1999

Status Issued

Description
I think my domain is hung. How can I tell?
Are there any commands I should run that will help collect information
on the reason for the hang?
SOLUTION SUMMARY:
Hung Domains
------------------------------------------------------------

To recover from a hung domain, you must have two log-in sessions on the SSP,
both as user ssp. Both sessions must have their environment pointed
at the correct domain. Use the 'domain_status' and 'domain_switch' commands
to set up the environment. Make one session the system console for that
domain with the 'netcon' command. Use the other for all SSP commands.

To Determine if a Domain is Hung

1. Verify that 'netcon' will respond to a carriage return and that
you can ping the domain from the SSP.

If you cannot perform these functions, you have either a system problem
or a hung domain.

  a. System problems can be confirmed by checking the power status
  and 'hostview' warnings.

  b. A hung domain can be confirmed by issuing a 'telnet' to the
  domain.

2. Use Unix commands to determine the cause of sluggish behavior.

Use 'ps -elf' to look for slow processes, 'df -lk' to check file system
usage, and 'who' to determine who the current users are and what processes
they are running.


Useful 'netcon' toggles

~? = Show status and communication path

~= = toggle the connection: switch to JTAG if using network, or to network if using JTAG

~. = exit out of netcon

~# = L1-A or Stop-A (i.e., drop to OBP)

~@ = get write permission

~* = kill all netcon sessions but yours



To Recover from a Hung Domain
------------------------------------------------------------

1. Issue the following SSP commands and save the output:

   'domain_status '
   'check_host -v '
   'hostinfo -h '
   'hostinfo -S '

Run the last command three times, waiting a few seconds between
runs.

NOTES: hostinfo -h tells you what boards the platform has.
       hostinfo -S shows you the "heartbeat" from each processor's
       "Signature Block". Looking only at the boards in your domain
       and only at the processors currently configured, you should see
       the "heartbeat" number increment over time. If it does not
       change, that domain is dead.
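
The heartbeat comparison described in the notes can be sketched in Python.
This is only an illustration: the output format of 'hostinfo -S' is not
shown in this document, so the sketch does not parse it. It assumes you have
already copied the heartbeat number for each processor in your domain from
each of the three runs. The function name and the sample values are
hypothetical.

```python
def stalled_processors(samples):
    """Return the processor IDs whose heartbeat never changed.

    samples: one dict per 'hostinfo -S' run, mapping processor ID to the
    heartbeat number read from that run. Every run must list the same
    processors.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to compare")
    first = samples[0]
    return sorted(
        proc for proc in first
        if all(later[proc] == first[proc] for later in samples[1:])
    )


# Hypothetical heartbeat numbers for processors 0-3, read from three runs
# taken a few seconds apart. None of them moves between runs.
samples = [
    {0: 1041, 1: 977, 2: 512, 3: 2048},
    {0: 1041, 1: 977, 2: 512, 3: 2048},
    {0: 1041, 1: 977, 2: 512, 3: 2048},
]

stalled = stalled_processors(samples)
if len(stalled) == len(samples[0]):
    print("No heartbeat is incrementing; the domain looks dead.")
else:
    print("Processors with no heartbeat change:", stalled)
```

If every configured processor in the domain shows up as stalled, that
matches the "no change" condition described in the notes.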
  
2. Attempt to force the domain into OBP by typing:

    'sigbcmd -f -p[processor id] obp '

See NOTE 1 below for how to use the -p option.

Observe the netcon session activity. Allow a few minutes for the command to
run. If you see the OBP ok> prompt, then the sigbcmd was successful. If it
does not work, keep repeating step 2 with other processors from
that hung domain.

   a. If the 'sigbcmd' worked, issue the following sequence of commands.
   When finished, save the contents of the window buffer to a file, or
   cut and paste them to a file. This data is useful in analyzing the cause
   of the hang condition.


   'ctrace'

   This gives you a trace of what was running before the drop to OBP. Symbols
   will not be available if the kernel is a non-debug kernel.


   '.registers '


   This command gives you a global register dump at the time of entering OBP.


   '.locals '

   This command gives you a local register dump at the time of entering OBP.


   'sync '

   sync issues a callback to the kernel to get a core dump. The system
   should dump core and reboot after you issue this OBP command.

   
   b. If the 'sigbcmd' command did not work, attempt to force a system
   panic with the 'hostint' command on the SSP.

   c. If the 'hostint' command does not work, try 'sigbcmd panic'.
   This is a more forceful version of the 'hostint' command.


3. When all else fails, issue a 'bringup' command to restore the domain to
operation.


==========
NOTE 1:     The -p option selects a processor. You must choose a processor from
            the hung domain, NOT from an active domain. It is also best to try a
            processor other than the boot processor.

            To get the board numbers, run the 'domain_status' command:

DOMAIN          TYPE                     PLATFORM       OS     SYSBDS
sun             Ultra-Enterprise-10000   test           2.6    0 2 4


            Example of processor numbers for a domain with boards 0 2 4:

            Processors for board 0:  0  1  2  3
            Processors for board 2:  8  9 10 11
            Processors for board 4: 16 17 18 19

            board# x 4 = starting proc for that board
==========
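
The "board# x 4" rule in NOTE 1 can be worked out with a few lines of
Python. This is a minimal sketch that assumes every board in the domain
carries all four processors; the function names are made up for this
example and are not SSP commands.

```python
PROCS_PER_BOARD = 4  # from NOTE 1: board# x 4 = starting proc for that board


def processors_for_board(board):
    """Return the processor numbers on one system board."""
    if board < 0:
        raise ValueError("board number must be non-negative")
    start = board * PROCS_PER_BOARD
    return list(range(start, start + PROCS_PER_BOARD))


def processors_for_domain(boards):
    """Map each board number in a domain to its processor numbers."""
    return {board: processors_for_board(board) for board in boards}


# Boards 0 2 4, as listed under SYSBDS in the 'domain_status' example.
for board, procs in processors_for_domain([0, 2, 4]).items():
    print("Processors for board", board, ":", procs)
```

For boards 0, 2, and 4 this reproduces the processor numbers listed in
NOTE 1. Any of those numbers other than the boot processor is a candidate
for the -p option.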

==========
ADDITIONAL NOTES: 

All commands are enclosed by ' '

It is recommended that you read the man pages for the commands noted here.
==========
INTERNAL SUMMARY:
This information was taken from the "Enterprise 10000 Advanced System Service
Manual", May, 1998. Some parts were expanded to better explain what is 
presented here.
SUBMITTER: Stephen Taylor
APPLIES TO: Hardware/Ultra Enterprise/Servers/Enterprise 10000, AFO Vertical Team Docs, AFO Vertical Team Docs/Kernel
ATTACHMENTS:


Copyright (c) 1997-2003 Sun Microsystems, Inc.