CHAPTER 9

Domain Events

Event monitoring periodically checks the domain and hardware status to detect conditions that need to be acted upon. The action taken is determined by the condition and can involve reporting the condition or initiating automated procedures to deal with it. This chapter describes the events that are detected by monitoring and the requirements with respect to actions taken in response to detected events.


Message Logging

SMS logs all significant events (those that require actions other than logging or updating user monitoring displays) in message files on the SC. The log includes information to support subsequent servicing of the hardware or software.

SMS writes log messages for significant hardware events to the platform log file, /var/opt/SUNWSMS/adm/platform/messages.
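For example (illustrative only; showlogs, described later in this chapter, is the supported interface for viewing log information), new platform log entries can be watched as they arrive with the standard Solaris tail command:

sc0:sms-user:> tail -f /var/opt/SUNWSMS/adm/platform/messages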

The actions taken in response to events that crash domain software include automatic system recovery (ASR) reboots of all affected domains, provided that the domain hardware (or a bootable subset thereof) meets the requirements for safe and correct operation.

SMS logs all significant actions (those other than logging or updating user monitoring displays) taken in response to an event. Log messages for significant domain software events and their response actions are written to the message log file for the affected domain, /var/opt/SUNWSMS/adm/domain_id/messages.

SMS writes log messages for significant hardware events that visibly affect one or more domains to /var/opt/SUNWSMS/adm/domain_id/messages for each affected domain.

SMS also logs domain console, syslog, POST, and dump information, and manages sms_core files.

Log File Maintenance

SMS maintains SC-resident copies of all server information that it logs. Use the showlogs(1M) command to access log information.

The platform message log file can be accessed only by administrators for the platform, using:

sc0:sms-user:> showlogs

SMS maintains separate log files for each domain. Log information relevant to a configured domain can be accessed only by administrators for that domain, using:

sc0:sms-user:> showlogs -d domain_id|domain_tag  

SMS maintains copies of domain syslog files on the SC in /var/opt/SUNWSMS/adm/domain_id/syslog. The syslog information can be accessed only by administrators for that domain, using:

sc0:sms-user:> showlogs -d domain_id|domain_tag  -s

Solaris console output logs are maintained to provide valuable insight into what happened before a domain crashed. Console output for a crashed domain is available on the SC in /var/opt/SUNWSMS/adm/domain_id/console. Console information can be accessed only by administrators for that domain, using:

sc0:sms-user:> showlogs -d domain_id|domain_tag  -c

XIR state dumps, generated by the reset command, can be displayed using showxirstate . For more information refer to the showxirstate man page.
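A hedged example follows; the -d domain selector is assumed to behave as it does for the other SMS commands shown in this chapter, so confirm the exact options in the showxirstate man page for your release:

sc0:sms-user:> showxirstate -d domain_id|domain_tag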

Domain post logs are for service diagnostic purposes and are not displayed by showlogs or any SMS CLI.
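If needed for service purposes, the POST log files can be listed directly on the SC; in this illustrative example, A stands in for the domain_id directory:

sc0:sms-user:> ls -lt /var/opt/SUNWSMS/adm/A/post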

The /var/tmp/sms_core.daemon files are binaries and are not viewable as text.

The availability of various log files on the SC supports analysis and correction of problems that prevent a domain or domains from booting. For more information refer to the showlogs man page.



Note - Panic dumps for panicked domains are available in the /var/crash logs on the domain and not on the SC.
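As an illustration, the saved panic dumps can be listed on the domain itself (domain# represents a root shell on the domain); this assumes the default savecore directory, which dumpadm(1M) on the domain can confirm:

domain# ls -l /var/crash/`uname -n`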



The following table lists the SMS log information types and their descriptions.

TABLE 9-1 SMS Log Type Information

Firmware Versioning
    Unsuitable configuration of firmware version at firmware invocation is automatically corrected and logged.

Power On Self Test
    LED fault; platform and domain messages detailing why a fault LED was illuminated.

Power Control
    All power operations are logged.

Power Control
    Power operations that violate hardware requirements or hardware recommended procedures.

Power Control
    Use of override to forcibly complete a power operation.

Domain Console
    Automatic logging of console output to a standard file.

Hardware Configuration
    Part numbers are used to identify board type in message logs.

Event Monitoring and Actions
    All significant environmental events (those that require taking action).

Event Monitoring and Actions
    All significant actions taken in response to environmental events.

Domain Event Monitoring and Actions
    All significant domain software events and their response actions.

Event Monitoring and Actions
    Significant hardware events written to the platform log.

Domain Event Monitoring and Actions
    Significant hardware events that visibly affect one or more domains are written to the domain(s) log.

Domain Boot Initiation
    Initiation of each boot and the passage through each significant stage of booting a domain is written to the domain log.

Domain Boot Failure
    Boot failures are logged to the domain log.

Domain Boot Failures
    All ASR recovery attempts are logged to the domain log.

Domain Panic
    Domain panics are logged to the domain log.

Domain Panic
    All ASR recovery attempts are logged to the domain log.

Domain Panic Hang
    Each occurrence of a domain hang and its accompanying information is logged to the domain log.

Domain Panic
    All ASR recovery attempts after a domain panic and hang are logged to the domain log.

Repeated Domain Panic
    All ASR recovery attempts after repeated domain panics are logged to the domain message log.

Solaris OS Hang Events
    All operating system hang events are logged to the domain message log.

Solaris OS Hang Events
    All OS hang events result in a domain panic in order to obtain a core image for analysis of the Solaris hang. This information and subsequent recovery action is logged to the domain message log.

Solaris OS Hang Events
    SMS monitors for the inability of the domain software to satisfy the request to panic. Upon determining noncompliance with the panic request, SMS aborts the domain and initiates an ASR reboot. All subsequent recovery action is logged to the domain message file.

Hot-Plug Events
    All HPU insertion events of system boards to a domain are logged in the domain message log.

Hot-Unplug Events
    All HPU removals are logged to the platform message log.

Hot-Unplug Events
    All HPU removals from a domain are logged to the domain message log.

POST-initiated Configuration Events
    All POST-initiated hardware configuration changes are logged in /var/opt/SUNWSMS/adm/domain_id/post.

Environmental Events
    All sensor measurements outside of acceptable operational limits are logged as environmental events to the platform log file.

Environmental Events
    All environmental events that affect one or more domains are logged to the domain message log.

Environmental Events
    Significant actions taken in response to environmental events are logged to the platform message log.

Environmental Events
    Significant actions taken in response to environmental events within a domain are logged to the domain message log.

Hardware Error Events
    Hardware error and related information is logged to the platform message log.

Hardware Error Events
    Hardware error and related information within a domain is logged to the domain message file.

Hardware Error Events
    Log entries about hardware error for which data was collected include the name of the data file(s).

Hardware Error Events
    All significant actions taken in response to hardware error events are logged to the platform message log.

Hardware Error Events
    All significant actions taken in response to hardware error events affecting a domain(s) are logged to the domain(s) message log.

SC Failure Events
    All SC hardware failure and related information is logged to the platform message log.

SC Failure Events
    The occurrence of an SC failover event is logged to the platform message log.


Log File Management

SMS manages the log files, as necessary, to keep the SC disk utilization within acceptable limits.

The message log daemon (mld) monitors message log size, file count per directory, and age every 10 minutes. mld takes action when the first of these limits is reached.

TABLE 9-2 MLD Default Settings

                     File Size (in Kb)   File Count   Days to Keep
platform messages          2500              10             0
domain messages            2500              10             0
domain console             2500              10             0
domain syslog              2500              10             0
domain post               20000*            1000            0
domain dump               20000*            1000            0
sms_core.daemon           50000               2             0

* total per directory, not per file


Assuming 20 directories, the defaults represent approximately 4Gbytes of stored logs.
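To see how much SC disk space the logs are actually consuming, the standard Solaris du command can be run against the SMS log tree (illustrative only):

sc0:sms-user:> du -sk /var/opt/SUNWSMS/adm/*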




Caution - The parameters shown above are stored in /etc/opt/SUNWSMS/config/mld_tuning. mld must be stopped and restarted for any changes to take effect. Only an administrator experienced with system disk utilization should edit this file. Improperly changing the parameters in this file could flood the disk and hang or crash the SC.
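Reviewing the current settings is harmless; for example, the tuning file can simply be viewed before deciding whether any change is warranted:

sc0:sms-user:> cat /etc/opt/SUNWSMS/config/mld_tuning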



For more information, refer to the mld and showlogs man pages, and see Message Logging Daemon .


Domain Reboot Events

SMS monitors domain software status (see Software Status ) to detect domain reboot events.

Domain Reboot Initiation

Because the domain software cannot reboot itself, SMS controls the initial sequence of every domain reboot. Consequently, SMS is always aware of domain reboot initiation events.

SMS software logs the initiation of each reboot and the passage through each significant stage of booting a domain to the domain-specific log file.

Domain Boot Failure

SMS software detects all domain reboot failures.

Upon detecting a domain reboot failure, SMS logs the reboot failure event to the domain-specific message log.

SC resident per-domain log files are available for failure analysis. In addition to the reboot failure logs, SMS can maintain duplicates of important domain-resident logs and transcripts of domain console output on the SC as described in Log File Maintenance .

Domain reboot failures are handled as follows:

SMS tries all ASR methods at its disposal to boot a domain that has failed to boot. All recovery attempts are logged in the domain-specific message log.


Domain Panic Events

When a domain panics, it informs dsmd so that a recovery reboot can be initiated. The panic is reported as a domain software status change (see Software Status).

Domain Panic

dsmd is informed when the Solaris software on a domain panics.

Upon detecting a domain panic, dsmd logs the panic event, including related information, to the domain-specific message log.

SC resident per-domain log files are available to assist in domain panic analysis. In addition to the panic logs, SMS can maintain duplicates of important domain-resident logs and transcripts of domain console output on the SC as described in Log File Maintenance .

In general, after an initial panic where there has been no prior indication of hardware errors, SMS requests that a fast reboot be tried to bring up the domain. For more information, see Fast Boot .

After a panic event, dsmd attempts an ASR reboot of the panicked domain. This recovery action is logged in the domain-specific message log.

Domain Panic Hang

The Solaris panic dump logic has been redesigned to minimize the possibility of hangs at panic time. In a panic situation, Solaris software may operate differently, either because normal functions are shut down or because they are disabled by the panic. An ASR reboot of a panicked Solaris domain is eventually started, even if the panicked domain hangs before it can request a reboot.

Since the normal heartbeat monitoring (see Solaris Software Hang Events ) of a panicked domain may not be appropriate or sufficient to detect situations where a panicked Solaris domain will not proceed to request an ASR reboot, dsmd takes special measures as necessary to detect a domain panic hang event.

Upon detecting a panic hang event, dsmd logs each occurrence, including related information, to the domain-specific message log.

Upon detecting a domain panic hang, SMS aborts the domain (see Domain Abort/Reset) and initiates an ASR reboot of the domain. dsmd logs these recovery actions in the domain-specific message log.

SC resident log files are available to assist in panic hang analysis. In addition to the panic hang event logs, dsmd maintains duplicates of important domain-resident logs and transcripts of domain console output on the SC as described in Log File Maintenance .

Repeated Domain Panic

If a second domain panic is detected shortly after recovering from a panic event, dsmd classifies the domain panic as a repeated domain panic event.

In addition to the standard logging actions that occur for any panic, the following action is taken when attempting to reboot after the repeated domain panic event.

With each successive repeated domain panic event, SMS attempts a full-test-level boot against the next untried administrator-specified degraded configuration.

After all degraded configurations have been tried, successive repeated domain panic events will continue full-test-level boots using the last specified degraded configuration.

Upon determining that a repeated domain panic event has occurred, dsmd tries the ASR methods at its disposal to boot a stable domain software environment. dsmd logs all recovery attempts in the domain-specific message log.


Solaris Software Hang Events

dsmd monitors the Solaris heartbeat described in Solaris Software Heartbeat in each domain while Solaris software is running. When the heartbeat indicator is not updated for a period of time, a Solaris software hang event occurs.

dsmd detects Solaris software hangs.

Upon detecting a Solaris hang, dsmd logs the hang event, including related information, to the domain-specific message log.

Upon detecting a Solaris hang, dsmd requests the domain software to panic in order to obtain a core image for analysis of the Solaris hang (see Domain Abort/Reset). SMS logs this recovery action in the domain-specific message log.

dsmd monitors for the inability of the domain software to satisfy the panic request. Upon determining noncompliance with the panic request, dsmd aborts the domain (see Domain Abort/Reset) and initiates an ASR reboot. dsmd logs these recovery actions in the domain-specific message log.

Although the core image taken as a result of the panic will only be available for analysis from the domain, SC resident log files are available to assist in domain hang analysis. In addition to the Solaris hang event logs, dsmd can maintain duplicates of important domain-resident logs and transcripts of domain console output on the SC.


Hardware Configuration Events

Changes to the hardware configuration status are considered hardware configuration events. esmd detects the following hardware configuration events on the Sun Fire 15K system.

Hot-Plug Events

The insertion of a hot-pluggable unit (HPU) is a hot-plug event. All HPU insertions of system boards into a domain are logged to the domain message log.

Hot-Unplug Events

The removal of a hot-pluggable unit (HPU) is a hot-unplug event. All HPU removals are logged to the platform message log, and removals from a domain are also logged to the domain message log.

POST-Initiated Configuration Events

POST can run against different server components at different times due to domain-related events such as reboots and dynamic reconfigurations. As described in Hardware Configuration, SMS includes status from POST, identifying components that failed testing. Consequently, changes in the POST status of a component are considered hardware configuration events. SMS logs POST-initiated hardware configuration changes to the platform message log.


Environmental Events

In general, environmental events are detected when hardware status measurements exceed normal operational limits. Acceptable operational limits depend upon the hardware and the server configuration.

esmd verifies that measurements returned by each sensor are within acceptable operational limits. esmd logs all sensor measurements outside of acceptable operational limits as environmental events to the platform log file.

esmd also logs significant actions taken in response to an environmental event (those beyond logging information or updating user displays) to the platform log file.

esmd logs significant environmental event response actions that affect one or more domain(s) to the log file(s) of the affected domain(s).
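As an illustration, esmd entries can be filtered out of the platform message log with a standard grep; the assumption that log lines are tagged with the daemon name may not hold for every entry, so adjust the search string to match your logs:

sc0:sms-user:> grep esmd /var/opt/SUNWSMS/adm/platform/messages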

esmd handles environmental events by removing from operation the hardware that has experienced the event (and any other hardware dependent upon the disabled component). Hardware can be left in service, however, if continued operation of the hardware will not harm the hardware or cause hardware functional errors.

The options for handling environmental events are dependent upon the characteristics of the event. All events have a time frame during which the event must be handled. Some events kill the domain software; some do not. Event response actions are such that esmd responds within the event time frame.

There are a number of responses esmd can make to environmental events, such as increasing fan speeds. In response to a detected environmental event that requires a power off, esmd undertakes one of the following corrective actions:

If the software is still running and a viable domain configuration remains after the affected hardware is removed, a remote DR operation to remove the hardware from the domain allows it to continue running in degraded mode.

If either of the last two options takes longer than the allotted time for the given environmental condition, esmd will immediately power off the component regardless of the state of the domain software.

SMS illuminates the fault indicator LED on any hot-pluggable unit that can be identified as the cause of an environmental event.

So long as the environmental event response actions do not include shutdown of the system controller(s), all domain(s) whose software operations were terminated by an environmental event or the ensuing response actions are subject to ASR reboot as soon as possible.

ASR reboot begins immediately if there is a bootable set of hardware that can be operated in accordance with constraints imposed by the Sun Fire 15K system to assure safe and correct operation.



Note - Loss of system controller operation (for example, by the requirement to power both SCs down) eliminates all possibility of Sun Fire 15K platform self-recovery actions being taken. In this situation, some recovery actions can require human intervention, so although an external monitoring agent may not be able to recover the Sun Fire 15K platform operation, that monitoring agent may serve an important role in notifying an administrator about the Sun Fire 15K platform shutdown.



The following sections provide more detail about each type of environmental event that can occur on the Sun Fire 15K system.

Over-Temperature Events

esmd monitors temperature measurements from Sun Fire 15K hardware for values that are too high. There is a critical temperature threshold that, if exceeded, is handled as quickly as possible by powering off the affected hardware. High, but not critical, temperatures are handled by attempting slower recovery actions.

Power Failure Events

There is very little opportunity to do anything when a full power failure occurs. The entire platform, domains as well as SCs, is shut off when the plug is pulled, without the benefit of a graceful shutdown. The ultimate recovery action occurs when power is restored (see Power-On Self-Test (POST)).

Out-of-Range Voltage Events

Sun Fire 15K power voltages are monitored to detect out-of-range events. The handling of out-of-range voltages follows the general principles outlined at the beginning of Environmental Events .

Under-Power Events

In addition to checking for adequate power before powering on any boards, as mentioned in Power Control, the failure of a power supply could leave the server inadequately powered. The system is equipped with power supply redundancy in the event of failure. esmd does not take any action (other than logging) in response to a bulk power supply hardware failure. The handling of under-power events follows the general principles outlined at the beginning of Environmental Events.

Fan Failure Events

esmd monitors fans for continuing operation. Should a fan fail, a fan failure event occurs. The handling of fan failures follows the general principles outlined at the beginning of Environmental Events.


Hardware Error Events

As described in Hardware Error Status , the occurrence of Sun Fire 15K hardware errors is recognized at the SC by more than one mechanism. Of the errors that are directly visible to the SC, some are reported directly by PCI interrupt to the UltraSPARC IIi processor on the SC, and others are detected only through monitoring of the Sun Fire 15K hardware registers.

There are other hardware errors that are detected by the processors running in a domain. The domain software detects the occurrence of those errors and reports them to the SC. Like the mechanism by which the SC becomes aware of the occurrence of a hardware error, the error state retained by the hardware after a hardware error is dependent upon the specific error.

dsmd implements the mechanisms necessary to detect all SC-visible hardware errors.

dsmd implements domain software interfaces to accept reports of domain-detected hardware errors.

dsmd collects hardware error data and clears the error state.

dsmd logs the hardware error and related information as required, to the platform message log.

dsmd logs the hardware error to the domain message log file for all affected domain(s).

Data collected in response to a hardware error that is not suitable for inclusion in a log file may be saved in uniquely named file(s) in /var/opt/SUNWSMS/adm/ domain_id /dump on the SC.
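These dump files can be listed directly on the SC; in the example below, A again stands in for the domain_id directory:

sc0:sms-user:> ls -l /var/opt/SUNWSMS/adm/A/dump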

SMS illuminates the fault indicator LED on any hot-pluggable unit that can be identified as the cause of a hardware error.

The actions taken in response to hardware errors (other than collecting and logging information as described above) are twofold. First, it may be possible to eliminate the further occurrence of certain types of hardware errors by eliminating from use the hardware identified to be at fault.

Second, all domains that crashed either as a result of a hardware error or were shut down as a consequence of the first type of action are subject to ASR reboot actions.



Note - Even in the absence of actions to remove from use hardware identified to be at fault, the ASR reboot actions are subject to full POST verification. POST will eliminate any hardware components that fail testing from the hardware configuration.



In response to each detected hardware error and each domain-software-reported hardware error, dsmd undertakes corrective actions.

ASR reboot with full POST verification will be initiated for each domain brought down by a hardware error or subsequent actions taken in response to that error.



Note - Problems with the ASR reboot of a domain after a hardware error are detected as domain boot failure events and subject to the recovery actions described in Domain Boot Failure.



dsmd logs all significant actions (those beyond logging information or updating user displays) taken in response to a hardware error in the platform log file. When a hardware error affects one or more domains, dsmd logs the significant response actions in the message log files of the affected domain(s).

The following sections summarize the types of hardware errors expected to be detected and handled on the Sun Fire 15K system.

Domain Stop Events

Domain stops are uncorrectable hardware errors that immediately terminate the affected domain(s). Hardware state dumps are taken before dsmd initiates an ASR reboot of the affected domain(s). These files are located in /var/opt/SUNWSMS/adm/domain_id/dump. dsmd logs the event in the domain log file.

CPU-Detected Events

A RED_state or Watchdog reset traps to low-level domain software (OpenBoot PROM or kadb), which reports the error and requests initiation of an ASR reboot of the domain.

An XIR signal (reset -x) also traps to low-level domain software (OpenBoot PROM or kadb), which retains control of the software. The domain must be rebooted manually.
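For example, assuming reset accepts the same -d domain selector used by the other SMS commands in this chapter, an XIR can be sent from the SC as shown below; consult the reset man page for the exact options in your release:

sc0:sms-user:> reset -d domain_id|domain_tag -x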

Record Stop Events

Correctable data transmission errors (for example, parity errors) can stop the normal transaction history recording feature of Sun Fire 15K ASICs. SMS reports a transmission error as a record stop. SMS dumps the transaction history buffers of the Sun Fire 15K ASICs and re-enables transaction history recording when a record stop is handled. dsmd records record stops in the domain log file.

Other ASIC Failure Events

ASIC-detected hardware failures other than domain stop or record stop include console bus errors, which may or may not impact a domain. The hardware itself will not abort any domain, but the domain software may not survive the impact of the hardware failure and may panic or hang. dsmd logs the event in the domain log file.


SC Failure Events

SMS monitors the main SC hardware and running software status as well as the hardware and running software of the spare SC, if present. In a high-availability SC configuration, SMS handles failures of the hardware or software on the main SC or failures detected in the hardware control paths (for example, console bus, or internal network connections) to the main SC by an automatic SC failover process. This cedes main responsibilities to the spare SC and leaves the former main SC as a (possibly crippled) spare.

SMS monitors the hardware of the main and spare SCs for failures.

SMS logs the hardware failure and related information to the platform message log.

SMS illuminates the fault indicator LED on a system controller with an identified hardware failure.
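If your SMS release provides the showfailover(1M) command, the current failover state can be checked from the main SC (hedged example; refer to the man page for options):

sc0:sms-user:> showfailover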

For more information, see SC Failover .