CHAPTER 7

Domain Status

Status functions return measured values that characterize the state of the server hardware or software. As such, these functions are used both to provide values for status displays and input to monitoring software that periodically polls status functions and verifies that the values returned are within normal operational limits. Monitoring and event detection functions that use the status functions are described in this chapter.
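The polling pattern described above (read status values, then verify that each is within normal operational limits) can be sketched in a few lines of Python. This is an illustration only, not SMS code: the function name out_of_limits, the sensor names, and the limit values are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical operational limits: sensor name -> (low, high), inclusive.
Limits = Dict[str, Tuple[float, float]]


def out_of_limits(readings: Dict[str, float], limits: Limits) -> List[str]:
    """Return the names of sensors whose reading falls outside its limits.

    Sensors that have no configured limit are ignored.
    """
    violations = []
    for name, value in sorted(readings.items()):
        if name not in limits:
            continue
        low, high = limits[name]
        if not (low <= value <= high):
            violations.append(name)
    return violations


# Made-up values for illustration; these are not real Sun Fire 15K limits.
sample_limits = {"cpu_temp_c": (0.0, 85.0), "core_voltage_v": (1.4, 1.6)}
sample_readings = {"cpu_temp_c": 91.5, "core_voltage_v": 1.5}
print(out_of_limits(sample_readings, sample_limits))
```

A monitoring loop would call a check like this on each poll and raise an event for every name returned.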

Software Status

The software state consists of status information provided by the software running in a domain. The identity of the software component currently running (for example, POST, OpenBoot PROM, or Solaris software) is available. Additional status information is available (booting, running, panicking).

SMS software provides the following command(s) to display the status of the software, if any, currently running in a domain.

Status Commands

showboards Command

showboards(1M) displays the assignment information and status of the DCUs. This information includes the following: Location, Power, Type of board, Board status, Test status, and Domain.

If no options are specified, showboards displays, for the platform administrator, all DCUs, including those that are assigned or available. For the domain administrator or configurator, showboards displays only the DCUs for those domains for which the user has privileges, including boards that are assigned, or that are available and in the domain's available component list.

If domain_id | domain_tag is specified, this command displays which DCUs are assigned or available to the given domain. If the -a option is used, showboards displays all boards including DCUs.

For examples and more information, see To Obtain Board Status and refer to the showboards manpage.

showdevices Command

showdevices(1M) displays configured physical devices on system boards and the resources made available by these devices. Usage information is provided by applications and subsystems that are actively managing system resources. The predicted impact of a system board DR operation can optionally be displayed by performing an offline query of managed resources.

showdevices gathers device information from one or more Sun Fire 15K domains. The command uses dca(1M) as a proxy to gather the information from the domains.

For examples and more information, see To Obtain Device Status and refer to the showdevices manpage.

showenvironment Command

showenvironment(1M) displays environmental data including: Location, Device, Sensor, Value, Unit, Age, and Status. For fan trays, Power, Speed, and Fan Number are displayed. For bulk power, Power, Value, Unit, and Status are shown.

If a domain (domain_id | domain_tag) is specified, environmental data relating to that domain is displayed, provided that the user has domain privileges for that domain. If a domain is not specified, all domain data that the user is permitted to see is displayed.

DCUs (for example, CPU, I/O) belong to a domain and you must have domain privileges to view their status. Environmental data relating to such things as fan trays, bulk power, or other boards are displayed without domain permissions. You can also specify individual reports for: temperatures, voltages, currents, faults, bulk power status, and fan tray status with the -p option. If the -p option is not present, all reports will be shown.
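The visibility rule for environmental data can be expressed as a simple filter. The following Python sketch is an assumption-laden model, not the showenvironment implementation: the Reading type, its field names, and the location strings are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class Reading:
    location: str            # illustrative label for a board, fan tray, etc.
    domain: Optional[str]    # owning domain for a DCU; None for shared hardware


def visible_readings(readings: Iterable[Reading],
                     user_domains: Set[str]) -> List[Reading]:
    """Keep shared-hardware readings plus DCU readings for permitted domains."""
    return [r for r in readings
            if r.domain is None or r.domain in user_domains]


# Hypothetical data: two DCUs in different domains and one shared fan tray.
all_readings = [
    Reading("cpu-board-0", "A"),
    Reading("io-board-3", "B"),
    Reading("fan-tray-1", None),
]
for reading in visible_readings(all_readings, {"A"}):
    print(reading.location)
```

A user with privileges only for domain A sees the domain A board and the fan tray, but not the domain B board.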

For examples and more information, see Environmental Status and refer to the showenvironment man page.

showobpparams Command

showobpparams(1M) displays OpenBoot PROM bringup parameters. showobpparams allows a domain administrator to display the virtual NVRAM and REBOOT parameters passed to OpenBoot PROM by setkeyswitch(1M).

For examples and more information, see Setting the OpenBoot PROM Variables and refer to the showobpparams man page.

showplatform Command

showplatform(1M) displays the available component list and domain state of each domain.

A domain is identified by its domain_tag if one exists. Otherwise, it is identified by its domain_id, a letter in the set A-R. The letter is case insensitive. The Solaris hostname is displayed if one exists. If a hostname has not been assigned to a domain, Unknown is printed.
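The identification rules above can be sketched as follows. This is a minimal illustration in Python, not showplatform code: the function names are hypothetical, and normalizing the domain_id to upper case is an assumption made here for display purposes.

```python
import string
from typing import Optional

# The 18 valid domain identifiers, 'A' through 'R'.
VALID_DOMAIN_IDS = set(string.ascii_uppercase[:18])


def normalize_domain_id(domain_id: str) -> str:
    """Return the upper-case domain_id, or raise ValueError if not A-R."""
    candidate = domain_id.strip().upper()
    if candidate not in VALID_DOMAIN_IDS:
        raise ValueError(f"domain_id must be a letter A-R, got {domain_id!r}")
    return candidate


def domain_label(domain_id: str, domain_tag: Optional[str] = None) -> str:
    """Prefer the domain_tag when one exists; otherwise use the domain_id."""
    return domain_tag if domain_tag else normalize_domain_id(domain_id)


def hostname_label(hostname: Optional[str]) -> str:
    """Show the Solaris hostname, or 'Unknown' when none is assigned."""
    return hostname if hostname else "Unknown"


print(domain_label("r"), hostname_label(None))
```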

The following is a list of domain statuses:

Domain status reflects two cases. The first is that dsmd is busy trying to recover the domain and the second is that dsmd has given up trying to recover the domain. In the second case you will always see "Domain Down." In the first case you will either see "Domain Down" or some other status. To recover from a "Domain Down" in either case, use:

sc0:sms-user:> setkeyswitch off
sc0:sms-user:> setkeyswitch on

For examples and more information, see To Obtain Domain Status and refer to the showplatform man page.

showxirstate Command

showxirstate(1M) displays CPU dump information after sending a reset pulse to the processors. This save state dump can be used to analyze the cause of abnormal domain behavior. showxirstate creates a list of all active processors in that domain and retrieves the save state information for each processor.

showxirstate data resides, by default, in /var/opt/SUNWSMS/adm/domain_id/dump.

For examples and more information, refer to the showxirstate man page.

Solaris Software Heartbeat

During normal operation, the Solaris environment produces a periodic heartbeat indicator readable from the SC. dsmd detects the absence of heartbeat updates for a running Solaris system as a hung Solaris. Hangs are not detected for any software components other than the Solaris software.

Note - The Solaris software heartbeat should not be confused with the SC-to-SC (hardware) heartbeat or the heartbeat network, both of which are used to determine the health of failover. For more information, see SC Heartbeats.

The only reflection of the Solaris heartbeat occurs when dsmd detects a failure to update the Solaris heartbeat of sufficient duration to indicate that the Solaris software is hung. Upon detection of a Solaris software hang, dsmd will conduct an ASR.
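The detection rule (no heartbeat update for a sufficiently long interval means the software is treated as hung) can be sketched as below. This Python example is not dsmd's implementation, whose internals are not described here: the HeartbeatMonitor class, the representation of the heartbeat as a changing integer, and the 30-second timeout are all assumptions for illustration.

```python
from typing import Optional


class HeartbeatMonitor:
    """Flag a hang when the heartbeat value stops changing for too long."""

    def __init__(self, hang_timeout_s: float) -> None:
        self.hang_timeout_s = hang_timeout_s
        self.last_value: Optional[int] = None
        self.last_change_s: Optional[float] = None

    def observe(self, heartbeat_value: int, now_s: float) -> bool:
        """Record one poll; return True if the software looks hung."""
        if self.last_value is None or heartbeat_value != self.last_value:
            # First poll or a fresh update: restart the timer.
            self.last_value = heartbeat_value
            self.last_change_s = now_s
            return False
        return (now_s - self.last_change_s) >= self.hang_timeout_s


monitor = HeartbeatMonitor(hang_timeout_s=30.0)  # hypothetical timeout
for value, t in [(1, 0.0), (2, 10.0), (2, 25.0), (2, 40.0)]:
    print(t, monitor.observe(value, t))
```

In this sketch a single late update is tolerated; only an unchanged value that persists for the full timeout is reported, which mirrors the "sufficient duration" condition in the text.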

Hardware Status

The hardware status functions report information about the hardware configuration, hardware failures detected, and platform environmental state.

Hardware Configuration

The following hardware configuration status is available from the Sun Fire 15K system management software:

Note - The hardware configuration status available to SMS running on the SC includes no information about the I/O configuration, such as where I/O adaptors are plugged in and what devices are attached to those I/O adaptors. Such information is available only to the software running on the domain that owns the I/O adaptor.

The hardware configuration supported by the functions described in this section excludes I/O adaptors and I/O devices. showboards displays all hardware components that are present.

As described in Blacklist Editing, the current contents of the component blacklist(s) can always be viewed and altered.

Environmental Status

The following hardware environmental measurements are available:

showenvironment displays every environmental measurement that can be taken within the Sun Fire 15K rack.

To Display the Environment Status for Domain A

1. Log in to the SC.

Platform administrators can view any environment status on the entire platform. Domain administrators can see the environment status only for those domains for which they have privileges.

2. Type:

sc0:sms-user:> showenvironment -d A

As described in HPU LEDs, the operating indicator LEDs on Sun Fire 15K HPUs show which HPUs are powered on, and the OK to remove indicator LEDs show which HPUs can be unplugged.

Hardware Error Status

dsmd monitors the Sun Fire 15K hardware operational status and reports errors. The occurrence of some errors is reported directly to the SC (for example, the error register(s) in every ASIC propagate to the SBBC on the SC, which provides an error summary register). Although the occurrence of some errors is indicated by an interrupt delivered to the SC, some error states may require the SC to monitor hardware registers for error indications. When a hardware error is detected, esmd follows the established procedures for collecting and clearing the hardware error state.

The following types of errors can occur on Sun Fire 15K hardware:

Hardware error status is generally not reported as a status. Rather, event handling functions perform various actions when hardware errors occur, such as logging errors, initiating ASR, and so forth. These functions are discussed in Chapter 9.

Note - As described in HPU LEDs, the fault LEDs, after POST completion, identify Sun Fire 15K HPUs in which faults have been discovered since they were last powered on or submitted to a power-on reset.

SC Hardware and Software Status

Proper operation of SMS depends upon proper operation of the hardware and the Solaris software on the SC. The ability to support automatic failover from the main to the spare system controller requires properly functioning hardware and software on the spare. SMS software running on the main system controller must either be functioning sufficiently to diagnose a software or hardware failure in a manner that can be detected by the spare or it must fail in a manner that can be detected by the spare.

SC-POST determines the status of system controller hardware. It tests and configures the system controller at power-on or power-on reset.

The SC will not boot if the SC fails to function.

If the control board fails to function, the SC will normally boot, but without access to the control board devices. The level of hardware functionality required to boot the system controller is essentially the same as that required for a standalone SC.

SC-POST writes diagnostic output to the SC console serial port (TTY-A). Additionally, SC-POST leaves a brief diagnostics status summary message in an NVRAM buffer that can be read by a Solaris driver and logged and/or displayed when the Solaris software boots.

SC firmware and software display information to identify and service SC hardware failures.

SC firmware and software provide a software interface that verifies that the system controller hardware is functional. This selects a working system controller as main in a high-availability SC configuration.

The system controller LEDs provide visible status regarding power and detected hardware faults as described in HPU LEDs.

Solaris software provides a level of self-diagnosis and automatic recovery (panic and reboot). Solaris software utilizes the SC hardware watchdog logic to trap hang conditions and force an automatic recovery reboot.

There are three hardware paths of communication between the SCs (two Ethernet connections, the heartbeat network, and one SC-to-SC heartbeat signal) that are used in the high-availability SC configuration by each SC to detect hangs or failures on the other SC.

SMS practices self-diagnosis and institutes automatic failure recovery procedures, even in non-high-availability SC configurations.

Upon recovery, SMS software either takes corrective actions as necessary to restore the platform hardware to a known, functional configuration or reports the inability to do so.

SMS software records and logs sufficient information to allow engineering diagnosis of single-occurrence software failures in the field.

SMS software takes a noticeable interval to initialize itself and become fully functional. The user interfaces behave predictably during this interval. Any rejections of user commands are clearly identified as due to system initialization with advice to try again after a suitable interval.

SMS software implementation uses a distributed client/server architecture. Any errors encountered during SMS initialization, due to attempts to interact with a process that has not yet completed initialization, are dealt with silently.