CHAPTER 7
Status functions return measured values that characterize the state of the server hardware or software. As such, these functions are used both to provide values for status displays and input to monitoring software that periodically polls status functions and verifies that the values returned are within normal operational limits. Monitoring and event detection functions that use the status functions are described in this chapter.
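The polling pattern described above can be sketched as follows. This is a minimal illustration, not SMS code: the `Limit` class, the `poll_once` function, the sensor names, and the numeric limits are all hypothetical, and each status function is modelled as a plain callable that returns a number.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Limit:
    """Inclusive normal operating range for one measured value (hypothetical)."""
    low: float
    high: float

    def contains(self, value: float) -> bool:
        return self.low <= value <= self.high


def poll_once(
    status_functions: Dict[str, Callable[[], float]],
    limits: Dict[str, Limit],
) -> List[str]:
    """Call every status function once and return the names that are out of range."""
    out_of_range = []
    for name, read_value in status_functions.items():
        value = read_value()
        if not limits[name].contains(value):
            out_of_range.append(name)
    return out_of_range


# Illustrative readings only; real values would come from the hardware.
sensors = {
    "cpu_temp_c": lambda: 95.0,
    "core_voltage_v": lambda: 1.5,
}
limits = {
    "cpu_temp_c": Limit(5.0, 85.0),
    "core_voltage_v": Limit(1.4, 1.6),
}
alerts = poll_once(sensors, limits)
print(alerts)
```

A monitoring process would call `poll_once` on a timer and hand any returned names to its event handling logic.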
The software state consists of status information provided by the software running in a domain. The identity of the software component currently running (for example, POST, OpenBoot PROM, or Solaris software) is available. Additional status information is available (booting, running, panicking).
If no options are specified, showboards displays all DCUs for the platform administrator, including those that are assigned or available. For the domain administrator or configurator, showboards displays only the DCUs in domains for which the user has privileges, including boards that are assigned, or that are available and in the domain's available component list.
For examples and more information, see To Obtain Board Status and refer to the showboards man page.
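The visibility rule that showboards applies when no options are given can be modelled as a small filter. This is a sketch under stated assumptions: the `Board` record, its field names, and `visible_boards` are hypothetical and do not reflect how SMS stores board or privilege data.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Optional


@dataclass(frozen=True)
class Board:
    location: str                 # illustrative label such as "SB0"
    assigned_to: Optional[str]    # domain_id, or None when the board is available
    acl_domains: FrozenSet[str]   # domains whose available component list names it


def visible_boards(
    boards: Iterable[Board],
    is_platform_admin: bool,
    user_domains: FrozenSet[str],
) -> List[Board]:
    """Return the boards a user would see when no options are given."""
    if is_platform_admin:
        # Platform administrators see every DCU, assigned or available.
        return list(boards)
    visible = []
    for board in boards:
        if board.assigned_to is not None:
            # Assigned boards: shown only for domains the user has privileges on.
            if board.assigned_to in user_domains:
                visible.append(board)
        elif board.acl_domains & user_domains:
            # Available boards: shown only if in one of those domains' lists.
            visible.append(board)
    return visible


boards = [
    Board("SB0", "A", frozenset({"A"})),
    Board("SB1", None, frozenset({"A", "B"})),
    Board("SB2", "B", frozenset({"B"})),
    Board("SB3", None, frozenset()),
]
print([b.location for b in visible_boards(boards, False, frozenset({"A"}))])
```

In this model a domain A administrator sees SB0 (assigned to A) and SB1 (available and listed for A), but not SB2 or SB3.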
showdevices (1M) displays configured physical devices on system boards and the resources made available by these devices. Usage information is provided by applications and subsystems that are actively managing system resources. The predicted impact of a system board DR operation may be optionally displayed by performing an offline query of managed resources.
For examples and more information, see To Obtain Device Status and refer to the showdevices man page.
showenvironment (1M) displays environmental data, including Location, Device, Sensor, Value, Unit, Age, and Status. For fan trays, Power, Speed, and Fan Number are displayed. For bulk power, Power, Value, Unit, and Status are shown.
If a domain (domain_id | domain_tag) is specified, environmental data relating to that domain is displayed, provided that the user has domain privileges for that domain. If no domain is specified, all domain data that the user is permitted to see is displayed.
DCUs (for example, CPU, I/O) belong to a domain, and you must have domain privileges to view their status. Environmental data relating to such things as fan trays, bulk power, or other boards is displayed without domain permissions. With the -p option, you can also request individual reports for temperatures, voltages, currents, faults, bulk power status, and fan tray status. If the -p option is not present, all reports are shown.
For examples and more information, see Environmental Status and refer to the showenvironment man page.
showobpparams (1M) displays OpenBoot PROM bringup parameters. showobpparams allows a domain administrator to display the virtual NVRAM and REBOOT parameters passed to OpenBoot PROM by setkeyswitch (1M).
For examples and more information, see Setting the OpenBoot PROM Variables and refer to the showobpparams man page.
A domain is identified by a domain_tag if one exists. Otherwise, it is identified by the domain_id, a letter in the set A-R. The letter is case insensitive. The Solaris hostname is displayed if one exists. If a hostname has not been assigned to a domain, Unknown is printed.
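The identification rule above can be expressed as a few small helpers. This is only a sketch: the function names are hypothetical, and returning the domain_id in uppercase is an assumption made for consistency, since the text says only that the letter is case insensitive.

```python
import string
from typing import Optional

# The 18 valid domain identifiers, A through R.
VALID_DOMAIN_IDS = frozenset(string.ascii_uppercase[:18])


def normalize_domain_id(domain_id: str) -> str:
    """Validate a domain_id case-insensitively and return it in uppercase."""
    letter = domain_id.strip().upper()
    if letter not in VALID_DOMAIN_IDS:
        raise ValueError(f"domain_id must be a letter A-R, got {domain_id!r}")
    return letter


def domain_label(domain_id: str, domain_tag: Optional[str] = None) -> str:
    """Prefer the domain_tag when one exists; otherwise use the domain_id."""
    return domain_tag if domain_tag else normalize_domain_id(domain_id)


def hostname_label(hostname: Optional[str] = None) -> str:
    """Show the Solaris hostname, or Unknown when none has been assigned."""
    return hostname if hostname else "Unknown"


print(domain_label("r"), hostname_label(None))
```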
Running Domain POST
In OBP Callback
Domain Exited OBP
OBP in sync Callback to OS
In OBP Error Reset
Solaris Halted in OBP
Environmental Domain Halt
Booting Solaris Failed
Loading Solaris Failed
Solaris Quiesce In-Progress
Solaris Resume In-Progress
Solaris Panic Debug
Solaris Panic Continue
Solaris Panic Dump
Solaris Panic Exit
Domain status reflects two cases: either dsmd is still trying to recover the domain, or dsmd has given up trying to recover it. In the second case, the status is always "Domain Down." In the first case, the status is either "Domain Down" or some other status. To recover from a "Domain Down" in either case, use:
For examples and more information, see To Obtain Domain Status and refer to the showplatform man page.
showxirstate (1M) displays CPU dump information after sending a reset pulse to the processors. This save state dump can be used to analyze the cause of abnormal domain behavior. showxirstate creates a list of all active processors in that domain and retrieves the save state information for each processor.
During normal operation, the Solaris environment produces a periodic heartbeat indicator readable from the SC. dsmd interprets the absence of heartbeat updates from a running Solaris system as a Solaris hang. Hangs are not detected for any software components other than the Solaris software.
Note - The Solaris software heartbeat should not be confused with the SC-to-SC (hardware) heartbeat or the heartbeat network, both used to determine the health of failover. For more information, see SC Heartbeats.
The Solaris heartbeat is reflected in domain status only when dsmd detects a failure to update the heartbeat that lasts long enough to indicate that the Solaris software is hung. Upon detecting a Solaris software hang, dsmd conducts an ASR.
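Hang detection of this kind amounts to a staleness check on a counter. The following is a minimal sketch, not the dsmd implementation: the `HeartbeatWatcher` class, the idea that the heartbeat is an integer that changes on each update, and the 30-second threshold are all assumptions made for illustration. The caller is assumed to invoke `observe` only while the Solaris software is running in the domain.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HeartbeatWatcher:
    hang_timeout_s: float               # assumed threshold; not a documented value
    last_count: Optional[int] = None    # most recent heartbeat value seen
    last_change_s: float = 0.0          # time at which that value first appeared

    def observe(self, count: int, now_s: float) -> bool:
        """Record one poll and return True if the heartbeat looks hung."""
        if count != self.last_count:
            # The heartbeat advanced, so the domain is alive; restart the clock.
            self.last_count = count
            self.last_change_s = now_s
            return False
        # No update since the last change: hung once the timeout has elapsed.
        return (now_s - self.last_change_s) >= self.hang_timeout_s


watcher = HeartbeatWatcher(hang_timeout_s=30.0)
samples = [(1, 0.0), (2, 10.0), (2, 25.0), (2, 40.0), (3, 41.0)]
results = [watcher.observe(count, now) for count, now in samples]
print(results)
```

Passing the time in as an argument, rather than reading the clock inside `observe`, keeps the check deterministic and easy to test.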
Hardware components physically present on each board (as detected by POST)
Hardware components not in use because they failed POST
Presence or absence of all hot-pluggable units (HPUs) (for example, system boards)
Hardware components not in use because they were on the blacklist when POST was invoked (see Power-On Self-Test (POST) )
Contents of the SEEPROM for each FRU including the part number and serial number
Note - The hardware configuration status available to SMS running on the SC includes no information about the I/O configuration, such as where I/O adaptors are plugged in and what devices are attached to those I/O adaptors. Such information is available only to the software running on the domain that owns the
As described in Blacklist Editing , the current contents of the component blacklist(s) can always be viewed and altered.
Power voltage and amperage
Fan status (stopped, low-speed, high-speed, failed)
As described in HPU LEDs, the operating indicator LEDs on Sun Fire 15K HPUs show which HPUs are powered on, and the OK to remove indicator LEDs show which HPUs can be unplugged.
dsmd monitors the Sun Fire 15K hardware operational status and reports errors. The occurrence of some errors is reported directly to the SC (for example, the error register(s) in every ASIC propagate to the SBBC on the SC, which provides an error summary register). Although the occurrence of some errors is indicated by an interrupt delivered to the SC, some error states may require the SC to monitor hardware registers for error indications. When a hardware error is detected, esmd follows the established procedures for collecting and clearing the hardware error state.
Domain stops, which are fatal hardware errors that terminate all hardware operations in a domain
Record stops, which cause the hardware to stop collecting transaction history when a data transfer error (for example, parity) occurs
SPARC processor error conditions such as RED_state/watchdog reset
Nonfatal ASIC-detected hardware failures
Hardware error status is generally not reported as a status. Rather, event handling functions perform various actions when hardware errors occur, such as logging errors and initiating ASR. These functions are discussed in Chapter 9.
Note - As described in HPU LEDs, the fault LEDs, after POST completion, identify Sun Fire 15K HPUs in which faults have been discovered since they were last powered on or subjected to a power-on reset.
Proper operation of SMS depends upon proper operation of the hardware and the Solaris software on the SC. The ability to support automatic failover from the main to the spare system controller requires properly functioning hardware and software on the spare. SMS software running on the main system controller must either be functioning sufficiently to diagnose a software or hardware failure in a manner that can be detected by the spare or it must fail in a manner that can be detected by the spare.
If the control board fails to function, the SC will normally boot, but without access to the control board devices. The level of hardware functionality required to boot the system controller is essentially the same as that required for a standalone SC.
SC-POST writes diagnostic output to the SC console serial port (TTY-A). Additionally, SC-POST leaves a brief diagnostics status summary message in an NVRAM buffer that can be read by a Solaris driver and logged and/or displayed when the Solaris software boots.
SC firmware and software provide a software interface that verifies that the system controller hardware is functional. This interface is used to select a working system controller as main in a high-availability SC configuration.
The system controller LEDs provide visible status regarding power and detected hardware faults, as described in HPU LEDs.
Solaris software provides a level of self-diagnosis and automatic recovery (panic and reboot). Solaris software utilizes the SC hardware watchdog logic to trap hang conditions and force an automatic recovery reboot.
There are three hardware paths of communication between the SCs: the two Ethernet connections of the heartbeat network and one SC-to-SC heartbeat signal. In the high-availability SC configuration, each SC uses these paths to detect hangs or failures on the other SC.
SMS software takes a noticeable interval to initialize itself and become fully functional. The user interfaces behave predictably during this interval. Any rejections of user commands are clearly identified as due to system initialization with advice to try again after a suitable interval.
SMS software implementation uses a distributed client/server architecture. Any errors encountered during SMS initialization, due to attempts to interact with a process that has not yet completed initialization, are dealt with silently.
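One common way for a client to absorb such initialization errors is to retry quietly until the server process is ready. The sketch below illustrates that general pattern only; it is not how SMS is implemented. The `StillInitializing` exception, the `call_with_retry` helper, and the attempt count and delay are hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class StillInitializing(Exception):
    """Hypothetical error raised when a server process is not yet ready."""


def call_with_retry(
    operation: Callable[[], T],
    attempts: int = 5,
    delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, silently retrying while the server is still initializing."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except StillInitializing:
            if attempt == attempts - 1:
                # Out of retries: surface the error so the caller can report it.
                raise
            sleep(delay_s)
    raise AssertionError("unreachable")


# Illustrative server stub that becomes ready on the third call.
state = {"calls": 0}


def query_status() -> str:
    state["calls"] += 1
    if state["calls"] < 3:
        raise StillInitializing("try again after a suitable interval")
    return "ok"


print(call_with_retry(query_status, delay_s=0.0))
```

The `sleep` parameter is injectable so that the delay can be replaced in tests. When the retries are exhausted, the error propagates, which matches the user-facing behavior of rejecting the command with advice to try again later.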