CHAPTER 10

SSP Internals

SSP operations are generally performed by a set of daemons and commands. This chapter provides an overview of how the SSP works and describes the SSP 3.5 daemons, processes, commands, and system files. For more information about daemons, commands, and system files, refer to the Sun Enterprise 10000 SSP 3.5 Reference Manual.


Startup Flow

The events that take place when the SSP boots are as follows:

  1. User powers on the SSP (monitor, CPU/disk, and CD-ROM). The SSP boots automatically.

  2. During the SSP boot process, the /etc/rc2.d/s99ssp startup script is called when the system enters run level 2. This script starts ssp_startup, which is responsible for starting other SSP daemons. If any of these SSP daemons die, ssp_startup restarts them.

  3. ssp_startup first initiates the following SSP daemons on both the main and spare SSP: machine_server, fad, and fod. The fod daemon determines the role of the SSP by first querying the fod daemon on the other SSP. If this query is not successful, fod will connect to the control board to determine the SSP role.

    If the SSP is the main, ssp_startup also initiates the following daemons: datasyncd, cbs, straps, snmpd, edd, and, if domains are running, obp_helper and netcon_server. ssp_startup also calls cb_reset to start control board initialization. The control board server (CBS) connects to the primary control board, which is responsible for the JTAG interface.

    If the SSP is the spare, ssp_startup is complete.

    ssp_startup monitors the role of the SSP. If a role change is detected, ssp_startup initiates an SSP failover. After the failover, ssp_startup will configure the spare SSP as the new main SSP and initiate the daemons (listed above) needed for the new main SSP.

  4. When you get a message in the platform message file indicating that the startup of the SSP as the main or spare is complete, you can use SSP 3.5 commands such as domain_create (1M) or bringup (1M).
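The role-dependent daemon selection described in step 3 can be summarized in a short sketch. The Python below is illustrative only: the function name, its arguments, and the list representation are assumptions made for this example, and nothing here reflects how ssp_startup is actually implemented.

```python
# Daemons started on every SSP, regardless of role (step 3).
COMMON_DAEMONS = ["machine_server", "fad", "fod"]

# Additional daemons started only on the main SSP.
MAIN_ONLY_DAEMONS = ["datasyncd", "cbs", "straps", "snmpd", "edd"]

# Daemons started on the main SSP only when domains are running.
PER_DOMAIN_DAEMONS = ["obp_helper", "netcon_server"]


def daemons_for_role(role, domains_running=False):
    """Return the daemon names this chapter lists for an SSP role.

    `role` is "main" or "spare". The function and its arguments are
    hypothetical names used for illustration.
    """
    if role not in ("main", "spare"):
        raise ValueError(f"unknown SSP role: {role!r}")
    daemons = list(COMMON_DAEMONS)
    if role == "main":
        daemons += MAIN_ONLY_DAEMONS
        if domains_running:
            daemons += PER_DOMAIN_DAEMONS
    return daemons
```

For a spare SSP the sketch returns only the three common daemons, which matches the statement that ssp_startup is complete at that point.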


Sun Enterprise 10000 Client/Server Architecture

The Sun Enterprise 10000 system control board interface is accessed over an Ethernet connection using the TCP/IP protocol. The control board executive, CBE, runs on the control board. The control board server, cbs (1M), runs on the SSP and makes service requests. The SSP control board server provides services to SSP clients.

FIGURE 10-1 illustrates the Sun Enterprise 10000 system client/server architecture:

FIGURE 10-1 Sun Enterprise 10000 Client/Server Architecture



Note - There is one instance of edd(1M) for the platform supported by the SSP. There is one instance of obp_helper(1M) and netcon_server(1M) for each domain on the platform.




SSP Daemons

The SSP daemons play a central role on the SSP. The following table briefly describes these daemons.

TABLE 10-1 SSP Daemons

Name

Description

cbs

The control board server provides central access to the Sun Enterprise 10000 control board for client programs running on the SSP.

edd

The event detector daemon initiates event monitoring on the control boards. When a monitoring task detects an event, edd (1M) runs a response action script.

fad

The file access daemon provides distributed file access services to SSP clients that need to monitor, read, and write to the SSP configuration files.

fod

The failover daemon monitors SSP components (connections to the SSPs, control boards, and domains) and SSP resources for failure conditions that prevent the proper operation of the main SSP.

datasyncd

The data synchronization daemon propagates SSP configuration data and specified files from the main SSP to the spare. This synchronization keeps SSP data on the spare SSP current with the main SSP for failover purposes.

machine_server

The machine server daemon routes platform and domain messages to the proper messages file. See machine_server (1M).

netcon_server

The netcon server daemon is the connection point for all netcon (1M) clients. netcon_server (1M) is responsible for communication to the domains.

obp_helper

The OpenBoot PROM (OBP) helper daemon runs OpenBoot. obp_helper (1M) is responsible for providing services to OBP, such as NVRAM simulation, IDPROM simulation, and time of day.

snmpd

The SNMP proxy agent listens to a UDP port for incoming requests and also services the group of objects specified in Ultra-Enterprise-10000.mib.

straps

The SNMP trap sink server listens to the SNMP trap port for incoming trap messages and forwards received messages to all connected clients.

xntpd / ntpd

The network time protocol (NTP) daemon provides time synchronization services. ntpd is used in the Solaris 2.6, Solaris 7, Solaris 8, and Solaris 9 operating environments and replaces the xntpd daemon used in the Solaris 2.5.1 operating environment. For details on ntpd, see the Sun Enterprise 10000 SSP 3.5 Installation Guide and Release Notes and the xntpd (1M) man page.


Event Detector Daemon

The event detector daemon, edd (1M), is a key component in providing the reliability, availability, and serviceability (RAS) features of the Sun Enterprise 10000 system. edd (1M) initiates event monitoring on the Sun Enterprise 10000 control board, waits for an event to be generated by the event detection monitoring task running on the control board, and then responds to the event by executing a response action script on the SSP. The conditions that generate events and the response taken to events are fully configurable.

edd (1M) provides the mechanism for event management, but does not handle the event detection monitoring directly. Event detection is handled by an event monitoring task that runs on the control board. edd (1M) configures the event monitoring task by downloading a vector that specifies the event types to be monitored. Event handling is provided by response action scripts, which edd (1M) invokes on the SSP when an event is received.

At SSP startup, edd (1M) obtains many of its initial control parameters from the following:

The RAS features are provided by several collaborative programs. The control board within the platform runs a control board executive (CBE) program that communicates through the Ethernet with a control board server daemon, cbs (1M), on the SSP. These two components provide the data link between the platform and the SSP.

The SSP provides a set of interfaces for accessing the control board through the control board server and the simple network management protocol (SNMP) agent. edd (1M) uses the control board server interface to configure the event detection monitoring task on the control board executive (FIGURE 10-2).

FIGURE 10-2 Uploading Event Detection Scripts

After it is configured, the event detection monitoring task polls various conditions within the platform, including environmental conditions, signature blocks, power supply voltages, performance data, and so forth. If an event detection script detects a change of state that warrants an event, an event message containing the pertinent information is generated and delivered to the control board server, cbs (1M). Upon receipt of the event message, the control board server delivers the event to the SNMP agent, which in turn generates an SNMP trap (FIGURE 10-3).

FIGURE 10-3 Event Recognition and Delivery

Upon receipt of an SNMP trap, edd (1M) determines whether to initiate a response action. If a response action is required, edd (1M) runs the appropriate response action script as a subprocess (FIGURE 10-4).

FIGURE 10-4 Response Action

Event messages of the same type or related types can be generated while the response action script is running. Some of these secondary event messages may be meaningless or unnecessary if a response action script is already running for a similar event. For example, when edd (1M) runs a response action script for an overtemperature event, additional overtemperature events can be generated by the event monitoring scripts. edd (1M) does not respond to those overtemperature events (generated in response to the same overtemperature condition) until the first response script has finished. It is the responsibility of applications, such as edd (1M), to filter the events they will respond to as necessary. The cycle of event processing is completed at this point.
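The filtering behavior described above, in which repeated events of one type are ignored while a response script for that type is still running, can be sketched as follows. This is a minimal illustration in Python under stated assumptions: the class name, method names, and the use of plain strings for event types are hypothetical, and the sketch does not reproduce the internal logic of edd (1M).

```python
class ResponseFilter:
    """Tracks event types that already have a response script running."""

    def __init__(self):
        # Event types with a response script currently in progress.
        self._active = set()

    def should_respond(self, event_type):
        """Return True, and mark the type active, if no script is running for it."""
        if event_type in self._active:
            return False  # secondary event for a condition already being handled
        self._active.add(event_type)
        return True

    def script_finished(self, event_type):
        """Record that the response script for event_type has exited."""
        self._active.discard(event_type)
```

In this model, a second overtemperature event that arrives before script_finished is called is dropped, and a later one is handled normally.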

The edd (1M) response to a domain crash is another example of how edd (1M) responds to an event. After a domain crash, edd (1M) invokes the bringup (1M) script. The bringup (1M) script runs the POST program, which tests Sun Enterprise 10000 components. It then uses the obp_helper (1M) daemon to download and begin execution of OBP in the domain specified by the SUNW_HOSTNAME environment variable. This happens only if a domain fails (for example, after a kernel panic), in which case it is rebooted automatically. After a halt or shutdown, you must manually run bringup (1M), which then causes OBP to be downloaded and run.

Control Board Server

The control board server (CBS) runs on the SSP. Whenever a client program running on the SSP needs to access the Sun Enterprise 10000 system, the communication is funneled through cbs (1M). cbs (1M), in turn, communicates directly with a control board executive (CBE) running on the primary control board in the Sun Enterprise 10000 system. The primary control board provides the JTAG interface. cbs (1M) converts client requests to the control board management protocol (CBMP) that is understood by CBE. The following diagram shows how the CBS and CBEs are connected.

FIGURE 10-5 CBS Communication Between SSP and Sun Enterprise 10000 System

cbs (1M) relies on the cb_config (4) file to determine the platform it will manage, and the control board with which it will interact. Do not directly modify the cb_config (4) file; it is automatically maintained by domain management tools and commands.

File Access Daemon

The file access daemon, fad (1M), is responsible for providing distributed file access services, such as file locking, to all SSP clients that need to monitor, read, and write changes to SSP configuration files. Once a file is locked by a client, other clients are prevented from locking that file until the first client releases the lock.
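The exclusive-lock behavior described above can be modeled with a small in-memory lock table. The Python sketch below is only an illustration: the class and method names are hypothetical, client identifiers are plain strings, and allowing the current owner to lock the same file again is an assumption that this chapter does not confirm for fad (1M).

```python
class LockTable:
    """Toy model of exclusive file locks handed out to clients."""

    def __init__(self):
        self._owners = {}  # file path -> identifier of the client holding the lock

    def lock(self, path, client):
        """Grant the lock unless a different client already holds it."""
        owner = self._owners.get(path)
        if owner is not None and owner != client:
            return False
        self._owners[path] = client
        return True

    def unlock(self, path, client):
        """Release the lock; only the current owner may release it."""
        if self._owners.get(path) != client:
            return False
        del self._owners[path]
        return True
```

A second client that calls lock on a held file gets False until the first client calls unlock.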

Failover Daemon

The failover daemon, fod (1M), continuously monitors the following to detect a failure condition that prevents the proper operation of the main SSP:

The fod daemon runs on both the main and spare SSP. Depending on the type of failure condition detected, the fod daemon either initiates a control board failover, or it works with ssp_startup to initiate an SSP failover. The following section identifies the failover detection points and the conditions that initiate or disable a failover.

Failover Detection Points

The following figure illustrates the standard layout of a dual SSP and control board configuration required for automatic failover. The numbers identify points of failure that are detected by the fod daemon, and are summarized in TABLE 10-2.

FIGURE 10-6 Automatic Failover Detection Points

The following table summarizes each failure condition and the resulting failover actions. For each failure point, refer to the detailed description of that failure point provided in the next section.

TABLE 10-2 Summary of Failover Detection Points and Actions

                                        SSP         SSP Failover   Control Board   Control Board
Failure Point                           Failover    Disabled       Failover        Failover Disabled

1 Main SSP to Domains                   X           -              -               -
2 Spare SSP to Domains                  -           X              -               -
3 Main SSP                              X           -              -               -
4 Spare SSP                             -           X              -               -
5 Main SSP to Spare Hub                 -           X              -               X
6 Spare SSP to Main Hub                 -           X              -               -
7 Main SSP to Main Hub                  X           -              -               -
8 Spare SSP to Spare Hub                -           X              -               -
9 Main Hub                              -           -              X               -
10 Spare Hub                            -           X              -               X
11 Primary Control Board to Main Hub    -           -              X               -
12 Spare Control Board to Spare Hub     -           -              -               X
13 Primary Control Board                -           -              X               -
14 Spare Control Board                  -           -              -               X
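For scripting or quick reference, TABLE 10-2 can be expressed as a lookup structure. The Python sketch below is an illustration, not part of the SSP software: the constant and function names are hypothetical, and the column assignments follow the per-point descriptions in the next section. The entry for point 10 is an assumption, because the description of that point does not name the resulting actions.

```python
SSP_FAILOVER = "SSP failover"
SSP_DISABLED = "SSP failover disabled"
CB_FAILOVER = "control board failover"
CB_DISABLED = "control board failover disabled"

# Failure point number -> (name, set of resulting actions).
FAILOVER_ACTIONS = {
    1: ("Main SSP to Domains", {SSP_FAILOVER}),
    2: ("Spare SSP to Domains", {SSP_DISABLED}),
    3: ("Main SSP", {SSP_FAILOVER}),
    4: ("Spare SSP", {SSP_DISABLED}),
    5: ("Main SSP to Spare Hub", {SSP_DISABLED, CB_DISABLED}),
    6: ("Spare SSP to Main Hub", {SSP_DISABLED}),
    # Point 7 falls back to a control board failover if the SSP failover fails.
    7: ("Main SSP to Main Hub", {SSP_FAILOVER}),
    8: ("Spare SSP to Spare Hub", {SSP_DISABLED}),
    9: ("Main Hub", {CB_FAILOVER}),
    # Assumed by analogy with point 5; not stated in the description of point 10.
    10: ("Spare Hub", {SSP_DISABLED, CB_DISABLED}),
    11: ("Primary Control Board to Main Hub", {CB_FAILOVER}),
    12: ("Spare Control Board to Spare Hub", {CB_DISABLED}),
    13: ("Primary Control Board", {CB_FAILOVER}),
    14: ("Spare Control Board", {CB_DISABLED}),
}


def actions_for(point):
    """Return the set of actions recorded for a failure point number."""
    return FAILOVER_ACTIONS[point][1]


def points_with(action):
    """Return the sorted failure point numbers that result in `action`."""
    return sorted(p for p, (_, acts) in FAILOVER_ACTIONS.items() if action in acts)
```

For example, points_with(CB_FAILOVER) lists the three failure points that trigger a control board failover.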


Description of Failover Detection Points

This section provides a detailed description of each failover detection point identified in TABLE 10-2:

  1. Main SSP to Domains Failure

    The main SSP detects this failure of the public network interface on the main SSP to the domains and initiates an SSP failover.

    The public network interface failure is not fatal to the main SSP, but it affects dynamic reconfiguration (DR), Sun Enterprise Cluster, and Sun Management Center operations. This failure:

    • Prevents DR operations from communicating with the DR daemons in the active domains

    • Restricts netcon sessions to the JTAG interface

    • Prevents the net booting of the SSP

    • Makes the CD-ROM inaccessible

    • Prevents the main SSP in a Sun Enterprise Cluster configuration from shutting down cluster nodes in a split-brain situation, which could allow a potential corruption of the cluster database

    • Prevents Sun Management Center from querying domains about their current state and configuration



    Note - The fod daemon monitors connections between the SSPs and the Sun Enterprise 10000 domains less frequently than the connections between the SSPs and the control boards. If the main SSP cannot communicate with the domains, but the spare SSP can communicate with some or all of the domains, this failure condition must persist for 25 minutes before a failover is triggered. After 25 minutes, the fod daemon will initiate a failover, provided that the spare SSP can communicate with the primary control board and the spare SSP has sufficient memory and disk space.



  2. Spare SSP to Domains Failure

    The spare SSP detects this failure of the public network interface on the spare SSP to domains. This public interface failure does not cause a loss in critical SSP functionality, but it can affect dynamic reconfiguration, Sun Remote Services (SRS), Sun Management Center, and the Sun Cluster console.

    As a result, SSP failover is disabled.

  3. Main SSP Failure

    A failure in the main SSP can be caused by the following:

    • The depletion of SSP resources, such as virtual memory or disk space. The main SSP detects this failure and initiates a failover.

    • A system crash, which is detected by the spare SSP and the control boards. The spare SSP initiates the failover.

  4. Spare SSP Failure

    Both control boards and the main SSP detect this spare SSP failure. This failure disables SSP failover.

  5. Main SSP to Spare Hub Failure

    Both SSPs detect this failure of the control board network connection from the main SSP to the spare hub and spare control board. Both SSP and control board failover are disabled.

  6. Spare SSP to Main Hub Failure

    Both SSPs and the primary control board detect this failure of the control board network connection from the spare SSP to the main hub and primary control board.

    SSP failover is disabled because the spare SSP cannot monitor the SSP as required.

  7. Main SSP to Main Hub Failure

    Both SSPs and the primary control board detect this failure of the control board network connection from the main SSP to the main hub and primary control board. When connectivity from the spare SSP to the primary control board is verified, an SSP failover is attempted. If the SSP failover is unsuccessful, a control board failover occurs instead.

  8. Spare SSP to Spare Hub Failure

    Both SSPs and the spare control board detect this failure of the control board network connection from the spare SSP to the spare hub and spare control board. SSP failover is disabled.

  9. Main Hub Failure

    Both SSPs and the primary control board detect this failure of the main hub and all connections to the primary control board. If connectivity to the domains exists and the domains are running, this failure causes a partial control board failover to the spare control board (JTAG failover only). If no domains are currently running, this failure causes a complete control board failover (JTAG and system clock failover).

    If a partial control board failover occurs, note that full control board functionality is retained, even though the JTAG interface and system clock are split between the primary and spare control boards.

  10. Spare Hub Failure

    Both SSPs and the spare control board detect this failure of the spare hub and all connections to the spare control board.

  11. Primary Control Board to Main Hub Failure

    Both SSPs and the primary control board detect this failure of the control board network connection from the main hub to the primary control board. If domains are running, this failure causes a partial control board failover (JTAG only) to the spare control board. If no domains are running, this failure causes a full control board failover.

    If a partial control board failover occurs, note that full control board functionality is retained, even though the JTAG interface and system clock are split between the primary and spare control boards.

  12. Spare Control Board to Spare Hub Failure

    Both SSPs and the spare control board detect this failure of the control board network connection from the spare hub to the spare control board. This failure disables the control board failover.

  13. Primary Control Board Failure

    Both SSPs detect this failure. If domains are running, this failure causes a partial control board failover (JTAG only) to the spare control board. If no domains are running, this failure causes a full control board failover.

    If a partial control board failover occurs, note that full control board functionality is retained, even though the JTAG interface and system clock are split between the primary and spare control boards.

  14. Spare Control Board Failure

    Both SSPs detect this failure, which disables a control board failover.
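Two of the decision rules in the descriptions above lend themselves to a compact sketch: the choice between a partial and a full control board failover (points 9, 11, and 13), and the 25-minute persistence rule for a loss of domain connectivity (point 1). The Python below is a simplified illustration. The function and parameter names are hypothetical, the extra domain connectivity condition given for point 9 is omitted, and the sketch is not the logic of fod (1M) itself.

```python
from datetime import timedelta

# Persistence period stated in the note for failure point 1.
DOMAIN_FAILURE_PERSISTENCE = timedelta(minutes=25)


def control_board_failover_scope(domains_running):
    """Return "partial" (JTAG only) if domains are running, else "full"."""
    return "partial" if domains_running else "full"


def domain_link_failover_due(failure_duration, spare_reaches_primary_cb,
                             spare_has_resources):
    """Decide whether a main-SSP-to-domains failure should trigger a failover.

    All three conditions from the note must hold: the failure has persisted
    for 25 minutes, the spare SSP can reach the primary control board, and
    the spare SSP has sufficient memory and disk space.
    """
    return (failure_duration >= DOMAIN_FAILURE_PERSISTENCE
            and spare_reaches_primary_cb
            and spare_has_resources)
```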

Data Synchronization Daemon

The data synchronization daemon, datasyncd (1M), propagates all SSP configuration information from the main to spare SSP. The datasyncd daemon uses a data propagation list that identifies the SSP and non-SSP files to be monitored and propagated. You use the setdatasync (1M) command to add non-SSP files to the data propagation list.

The datasyncd daemon runs on the main SSP and works with the fad daemon to monitor updates to SSP files on the main SSP. The datasyncd daemon then copies these updated files to the spare SSP, so that data on both SSPs is synchronized.
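The idea of a data propagation list can be illustrated with a small function that decides which listed files need to be copied to the spare. This Python sketch is an assumption-laden model: comparing modification times stands in for the update notifications that datasyncd receives through the fad daemon, and the function name and dictionary inputs are hypothetical.

```python
def files_to_propagate(propagation_list, main_mtimes, spare_mtimes):
    """Return the listed paths whose copy on the spare is missing or older.

    propagation_list: iterable of file paths to keep synchronized.
    main_mtimes / spare_mtimes: dicts mapping path -> modification time.
    """
    stale = []
    for path in propagation_list:
        main_time = main_mtimes.get(path)
        if main_time is None:
            continue  # file does not exist on the main SSP; nothing to copy
        # A missing spare copy is treated as infinitely old.
        if spare_mtimes.get(path, float("-inf")) < main_time:
            stale.append(path)
    return stale
```

Files that are not on the propagation list are never returned, even if they differ between the two SSPs.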

OpenBoot PROM

On the domain, OpenBoot PROM (OBP) is not a hardware PROM; it is loaded from a file on the SSP. An SSP file also replaces the traditional OBP NVRAM and idprom (hostid).

The OBP file is located under a directory path that is specific to the SunOS release. SunOS 5.6 corresponds to the Solaris 2.6 operating environment, SunOS 5.7 corresponds to the Solaris 7 operating environment, SunOS 5.8 corresponds to the Solaris 8 operating environment, and SunOS 5.9 corresponds to the Solaris 9 operating environment. You can determine your SunOS version with uname -r. For example, under SunOS 5.7, the OBP file is located in the following directory:

/opt/SUNWssp/release/Ultra-Enterprise-10000/5/7/hostobjs/obp

where the /5/7 portion of the path corresponds to the SunOS version number. If your release contains a different version of the operating system, that portion of the path will be different.
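The mapping from a SunOS release string to the OBP location can be sketched as a small helper. The Python below assumes that the pattern shown for SunOS 5.7 applies unchanged to other two-part release numbers, as the text suggests; the function name is hypothetical, and release strings with more than two parts are rejected because this chapter gives no path for them.

```python
OBP_PATH_TEMPLATE = "/opt/SUNWssp/release/Ultra-Enterprise-10000/{major}/{minor}/hostobjs/obp"


def obp_path(sunos_release):
    """Build the OBP path for a release string such as "5.7" (from uname -r)."""
    parts = sunos_release.strip().split(".")
    if len(parts) != 2 or not all(p.isdigit() for p in parts):
        raise ValueError(f"unsupported SunOS release string: {sunos_release!r}")
    major, minor = parts
    return OBP_PATH_TEMPLATE.format(major=major, minor=minor)
```

Calling obp_path("5.7") reproduces the directory shown above.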

The primary task of OBP is to boot and configure the operating system from either a mass storage device or from a network. OBP also provides extensive features for testing hardware and software interactively. As part of the boot procedure, OBP probes all the SBus slots on all the system boards and builds a device tree. This device tree is passed on to the operating system. The device tree is ultimately visible using the command prtconf (for more information, see the SunOS prtconf (1M) man page).

OBP also interprets and runs FCode on SBus cards, which provides loadable, simple drivers for accomplishing boot. In addition, it provides a kernel debugger, which is always loaded.

The following sections describe how the obp_helper daemon and download_helper file control the OBP.

obp_helper Daemon

obp_helper (1M) is responsible for starting processors other than the boot processor. It communicates with OBP through bootbus SRAM (BBSRAM), responding to requests to supply the time-of-day, get or put the contents of the pseudo-EEPROM, and release slave processors when in multiprocessor mode. To release the slave processors, obp_helper (1M) must load download_helper into the BBSRAM of all the slave processors, place an indication in BBSRAM that it is a slave processor, then start the processor by releasing the bootbus controller reset.

The bringup (1M) command starts obp_helper (1M) in the background, which kills the previous obp_helper (1M), if one exists. obp_helper (1M) runs download_helper and subsequently downloads and runs OBP.

For more information, see the obp_helper (1M) and bringup (1M) man pages and the download_helper File section.

download_helper File

download_helper enables programs to be downloaded to the memory used by a domain instead of BBSRAM. This provides an environment in which host programs can run without having to know how to relocate themselves to memory. These programs can be larger than BBSRAM.

download_helper works by running a protocol through a mailbox in BBSRAM. The protocol has commands for allocating and mapping physical to virtual memory, and for moving data between a buffer in BBSRAM and virtual memory. When complete, the thread of execution is usually passed to the new program at an entry point provided by the SSP. After this occurs, download_helper lives on in BBSRAM so it can provide reset handling services. Normally, you do not need to be concerned with the download helper; it is used only by the obp_helper (1M) daemon. See the obp_helper (1M) man page for more information.
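The general shape of such a protocol, in which a program larger than the transfer buffer is moved into mapped memory one buffer-sized piece at a time, can be shown with a toy model. Everything in the Python sketch below is hypothetical: the class, the method names, the buffer size, and the use of a dictionary as "memory" are assumptions for illustration and do not describe the real mailbox commands or BBSRAM layout.

```python
BUFFER_SIZE = 16  # hypothetical transfer-buffer size in bytes, tiny for the demo


class ToyDownloadHelper:
    """Toy stand-in for the domain side of a mailbox-style download protocol."""

    def __init__(self):
        self.memory = {}  # virtual base address -> bytearray of mapped memory

    def allocate_and_map(self, vaddr, size):
        """Model the 'allocate and map' command by reserving a region."""
        self.memory[vaddr] = bytearray(size)

    def copy_from_buffer(self, vaddr, offset, buffer):
        """Model the 'move data' command: copy one buffer into mapped memory."""
        if len(buffer) > BUFFER_SIZE:
            raise ValueError("chunk larger than the transfer buffer")
        region = self.memory[vaddr]
        if offset + len(buffer) > len(region):
            raise ValueError("write past end of mapped region")
        region[offset:offset + len(buffer)] = buffer


def download(helper, vaddr, program):
    """Send `program` (bytes) through the helper in buffer-sized chunks."""
    helper.allocate_and_map(vaddr, len(program))
    for offset in range(0, len(program), BUFFER_SIZE):
        helper.copy_from_buffer(vaddr, offset, program[offset:offset + BUFFER_SIZE])
    return vaddr  # in this toy model, the load address doubles as the entry point
```

The point of the model is that the program size is limited by the mapped region, not by the size of the transfer buffer.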


POST

Power-on self-test (POST) probes and tests the components of uninitialized Sun Enterprise 10000 system hardware, configures what it deems worthwhile into a coherent initialized system, and hands it off to OpenBoot PROM (OBP). POST passes to OBP a list of only those components that have been successfully tested; those in the blacklist (4) file are excluded.

hpost (1M) is the SSP-resident executable program that controls and sequences the operations of POST. hpost (1M) reads directives in the optional file .postrc (see postrc (4)) before it begins operation with the host.



Caution - Running hpost(1M) outside of the bringup(1M) command can cause the system to fail. When run by itself, hpost(1M) does not check the state of the platform and causes fatal resets.



POST looks at blacklist (4), which is on the SSP, before preparing the system for booting. blacklist (4) specifies the Sun Enterprise 10000 components that POST must not configure.

POST stores the results of its tests in an internal data structure called a board descriptor array. The board descriptor array contains status information for most of the major components of the Sun Enterprise 10000 system, including information about the UltraSPARC modules.

POST attempts to connect and disconnect each system board, one at a time, to the system centerplane. POST then connects all the system boards that passed the tests to the system centerplane.
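The selection rule described in this section, in which OBP receives only components that passed testing and are not blacklisted, can be sketched as a simple filter. The Python below is an illustration only: the function name and the component names are hypothetical, and the sketch does not reflect the blacklist (4) file format or the layout of the board descriptor array.

```python
def components_for_obp(test_results, blacklist):
    """Return the components that would be handed to OBP.

    test_results: dict mapping component name -> True if it passed POST.
    blacklist: set of component names that must not be configured.
    """
    return [name for name, passed in test_results.items()
            if passed and name not in blacklist]
```

For example, with results {"board0": True, "board1": False, "board2": True} and a blacklist containing "board2", only "board0" is returned.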


Environment Variables

Most of the necessary environment variables are set when the ssp user logs in. TABLE 10-3 describes the environment variables.



Note - Do not change the values for the following environment variables, except for SUNW_HOSTNAME.



TABLE 10-3 Environment Variables

SUNW_HOSTNAME

Name of the domain controlled by the SSP. You set this variable to the host name of the domain on which you are performing operations.

SSPETC

Path to the directory containing miscellaneous SSP-related files.

SSPLOGGER

Path to the directory containing the platform logs and directories for domain logs.

SSPOPT

Path to the SSP package binaries, libraries, and object files.

SSPVAR

Path to the directory where modifiable files reside.
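A script running as the ssp user could check that these variables are present before doing any work. The Python sketch below is a minimal example under stated assumptions: the helper names are hypothetical, the variable names are those listed in TABLE 10-3, and the script only reads the values, in keeping with the note above.

```python
import os

# Variable names taken from TABLE 10-3.
SSP_VARIABLES = ("SUNW_HOSTNAME", "SSPETC", "SSPLOGGER", "SSPOPT", "SSPVAR")


def ssp_environment(environ=None):
    """Return a dict of the SSP variables; unset variables map to None."""
    env = os.environ if environ is None else environ
    return {name: env.get(name) for name in SSP_VARIABLES}


def missing_variables(environ=None):
    """Return the names of SSP variables that are unset or empty."""
    return [name for name, value in ssp_environment(environ).items() if not value]
```

Passing an explicit dictionary instead of relying on os.environ makes the helpers easy to try out on a machine that is not an SSP.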