CHAPTER 9

Dual Control Board Handling

A platform can be configured with dual control boards for redundancy purposes. One of the control boards is identified as the primary control board and the other control board is considered the spare. The switchover from the primary control board to the spare when a failure occurs is called control board failover. This failover is done automatically. If necessary, you can also force a control board failover.

This chapter explains how control boards function in a dual configuration and how control board failover works.



Note - You can have dual control boards in a single SSP configuration, as well as in a dual SSP configuration (main and spare SSP). Control board failover works the same in either a single or dual SSP configuration.




Control Board Executive

The control board executive (CBE) runs on the control board and facilitates communication between the SSP and the platform.

When power is applied, both control boards boot from the main SSP. After the CBE is booted, it waits for the control board server and the fod (failover) daemon running on the SSP to establish a connection. The connections between the fod daemon and the control board facilitate SSP and control board failover.

A failover task within CBE enables the main and spare SSP to establish connections for monitoring failover conditions. This task listens for and accepts TCP/IP connections from the fod daemons running on the main and spare SSP. The failover task also reads and transmits heartbeat messages to the fod daemons on both the main and spare SSP.
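
To make the heartbeat idea concrete, the following Python sketch shows one way a monitor might decide that a peer has gone silent. It is not the CBE or fod implementation. The class name HeartbeatMonitor, the 5-second timeout, and the peer names are all assumptions chosen for illustration, because this chapter does not specify the heartbeat interval or message format.

```python
import time


class HeartbeatMonitor:
    """Track the last heartbeat time per peer (hypothetical illustration)."""

    def __init__(self, timeout_s=5.0, clock=time.monotonic):
        # timeout_s is an assumed value; the manual gives no interval.
        self.timeout_s = timeout_s
        self.clock = clock          # injectable so the logic can be tested
        self.last_seen = {}

    def record(self, peer):
        """Note that a heartbeat message just arrived from peer."""
        self.last_seen[peer] = self.clock()

    def silent_peers(self):
        """Return, in sorted order, peers whose last heartbeat is too old."""
        now = self.clock()
        return sorted(
            peer for peer, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )
```

A caller would invoke record() each time a heartbeat is read and poll silent_peers() periodically. What action follows a missed heartbeat is decided by the fod daemon and is outside the scope of this sketch.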


Primary Control Board

When the control board server running on the SSP connects to the CBE running on a control board, the CBE asserts the control board as the primary control board. The primary control board is responsible for the JTAG interface, which enables control board components to communicate with other Sun Enterprise 10000 system components so that the Sun Enterprise 10000 system can be monitored and configured. The primary control board also provides the system clock, which synchronizes and controls the speed of the centerplane, CPU clock, and system boards.


Control Board Server

After the SSP is booted, the control board server (CBS) is started automatically, as are several other daemons, including the fod daemon. The CBS is responsible for all nonfailover communication between the SSP and the primary control board.

The CBS attempts to connect only to the primary control board identified in the control board configuration file.



Note - Do not manually modify the control board configuration file. Use the ssp_config(1M) command to change the control board configuration.



The format of the control board configuration file is as follows:

platform_name:platform_type:cb0_hostname:status0:cb1_hostname:status1

where:

platform_name is the name assigned by the system administrator.

platform_type is Ultra-Enterprise-10000.

cb0_hostname is the host name for control board 0, if available.

status0 indicates whether control board 0 is the primary control board (P indicates primary; anything else indicates non-primary).

cb1_hostname is the host name for control board 1, if available.

status1 indicates whether control board 1 is the primary control board, using the same convention as status0.

For example:

xf2:Ultra-Enterprise-10000:xf2-cb0:P:xf2-cb1:

This example indicates that the xf2 platform has two control boards, xf2-cb0 and xf2-cb1, and that xf2-cb0 is specified as the primary. See the cb_config(4) man page for more information.
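
The field layout described above can be illustrated with a short Python sketch that reads one entry and reports which control board is flagged as primary. The names CbConfig and parse_cb_config are hypothetical and are not part of the SSP software. The sketch only reads an entry; use ssp_config(1M) for any changes.

```python
from typing import NamedTuple, Optional


class CbConfig(NamedTuple):
    """One control board configuration entry (illustrative type)."""
    platform_name: str
    platform_type: str
    cb0_hostname: str
    status0: str
    cb1_hostname: str
    status1: str

    def primary(self) -> Optional[str]:
        """Return the host name flagged P, or None if neither is."""
        if self.status0 == "P":
            return self.cb0_hostname or None
        if self.status1 == "P":
            return self.cb1_hostname or None
        return None


def parse_cb_config(line: str) -> CbConfig:
    """Split one colon-separated entry into its six fields."""
    fields = line.strip().split(":")
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}")
    return CbConfig(*fields)


cfg = parse_cb_config("xf2:Ultra-Enterprise-10000:xf2-cb0:P:xf2-cb1:")
print(cfg.primary())  # prints xf2-cb0
```

The trailing colon in the example entry yields an empty status1 field, which the sketch treats as non-primary.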

The port used for communication between the control board server and the control board executive is specified in /tftpboot/xxxxxxxx.cb_port, where xxxxxxxx is the control board IP address represented in hexadecimal format.

Control Board Executive Image and Port Specification Files

The main SSP is the boot server for the control board. Two files are downloaded by the control board boot PROM during boot time: the image of CBE and the port number specification file. These files are located in /tftpboot on the SSP and the naming conventions are:

/tftpboot/xxxxxxxx for the CBE image
/tftpboot/xxxxxxxx.cb_port for the port number

where xxxxxxxx is the control board IP address in hex format.

For example, if the IP address of xf2-cb0 is 129.153.3.19, the files for control board xf2-cb0 are:

/tftpboot/81990313
/tftpboot/81990313.cb_port
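
The naming convention above amounts to writing each octet of the IP address as two hexadecimal digits. The following Python sketch derives both file names from a dotted address. The function name tftpboot_names is hypothetical, and the use of uppercase hexadecimal digits is an assumption, since the example in this chapter contains no letters.

```python
import ipaddress


def tftpboot_names(ip: str) -> tuple:
    """Return (image_path, port_file_path) for a control board IPv4 address."""
    packed = ipaddress.IPv4Address(ip).packed  # four bytes, network order
    hex_name = packed.hex().upper()            # uppercase is an assumption
    return f"/tftpboot/{hex_name}", f"/tftpboot/{hex_name}.cb_port"


print(tftpboot_names("129.153.3.19"))
# prints ('/tftpboot/81990313', '/tftpboot/81990313.cb_port')
```

Here 129, 153, 3, and 19 become 81, 99, 03, and 13, which matches the file names shown for xf2-cb0.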


Automatic Failover to the Spare Control Board

Control board failover is automatically enabled upon SSP installation or upgrade. The fod daemon performs failover monitoring of the control boards and other failover components. If the primary control board is not functioning properly, the fod daemon will trigger an automatic failover to the spare control board. A control board failure can be caused by

Note that under certain failure conditions the fod daemon can disable a control board failover. For a detailed description of the failure conditions and a summary of the failover actions performed, see Chapter 10.

A control board failover can be either partial or complete, depending on whether domains are running:

Managing Control Board Failover

You can enable, disable, or force a control board failover as explained in the following procedures. Use the setfailover(1M) command on the main SSP to manage the failover state. For example, after a control board failover occurs, you must use the setfailover(1M) command to re-enable the control board failover capability.


To Disable Control Board Failover

As user ssp on the main SSP, type:

ssp% setfailover -t cb off

Control board failover remains disabled until you enable it. To determine whether control board failover was disabled, use the showfailover(1M) command to verify the failover state, as explained in Obtaining Control Board Failover Information.


To Enable Control Board Failover

As user ssp on the main SSP, type:

ssp% setfailover -t cb on

Control board failover is activated when all the connection links are functioning properly. If any failed connections exist, control board failover is not enabled. You can use the showfailover(1M) command to verify that control board failover is enabled and review the connection status.


To Force a Complete Control Board Failover



Note - To force a complete control board failover, in which both the JTAG connection and the system clock source move from the primary control board to the spare, you must shut down any running domains, then power off and power on all system boards before you switch control boards. If you do not shut down all the domains, a partial control board failover occurs: the JTAG connection moves to the spare control board, but the system clock source remains on the former primary control board.



1. If any domains are running, shut down those domains using the standard shutdown(1M) command.

2. Log in to the main SSP as user ssp.

3. To ensure that domains do not arbstop, do the following:

    a. Stop event detection monitoring.

    ssp% edd_cmd -x stop

    b. Power off all of the system boards.

    ssp% power -off -all

    c. Power on all of the system boards.

    ssp% power -on -all

    d. Start event detection monitoring.

    ssp% edd_cmd -x start

4. Type the following to force the control board failover:

ssp% setfailover -t cb force

5. Issue the bringup(1M) command for all domains.

6. Re-enable control board failover as described in To Enable Control Board Failover.
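
The SSP commands in steps 3 and 4 must run in a fixed order. As an illustration only, the following Python sketch lists them in sequence and, by default, returns them for review without executing anything. The names FORCE_FAILOVER_STEPS and run_steps are hypothetical, and the sketch assumes that a Python interpreter is available on the SSP, that it runs as user ssp on the main SSP, and that all domains have already been shut down (step 1). The bringup(1M) and re-enable steps are left out because they depend on your domains.

```python
import subprocess

# Commands taken from steps 3 and 4 of the procedure, in order.
FORCE_FAILOVER_STEPS = [
    ["edd_cmd", "-x", "stop"],             # 3a: stop event detection
    ["power", "-off", "-all"],             # 3b: power off system boards
    ["power", "-on", "-all"],              # 3c: power on system boards
    ["edd_cmd", "-x", "start"],            # 3d: restart event detection
    ["setfailover", "-t", "cb", "force"],  # 4: force the failover
]


def run_steps(steps, dry_run=True):
    """Process each command in order and return them as strings.

    With dry_run=True (the default) nothing is executed. With
    dry_run=False each command is run and the first failure raises
    subprocess.CalledProcessError, so later steps are not attempted.
    """
    planned = []
    for argv in steps:
        planned.append(" ".join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)
    return planned


for command in run_steps(FORCE_FAILOVER_STEPS):
    print("ssp%", command)
```

Stopping at the first failed command is a deliberate choice in this sketch: forcing the failover after a failed power cycle would risk the partial failover that the procedure is meant to avoid.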

Obtaining Control Board Failover Information

Use the showfailover(1M) command on the main SSP to obtain the SSP and control board failover states and the status of the private connection links. The output also lists the names of the SSPs and control boards and identifies the control boards responsible for the JTAG interface and the system clock. For details on the failover information displayed, see Obtaining Failover Status Information.

The following example shows the information displayed for a control board failover in which the primary control board failed.

ssp% showfailover  
Failover State:
     SSP Failover: Active
     CB Failover:  Failed
Failover Connection Map:
     Main SSP to Spare SSP thru Main Hub:   GOOD
     Main SSP to Spare SSP thru Spare Hub:  GOOD
     Main SSP to Primary Control Board:     FAILED
     Main SSP to Spare Control Board:       GOOD
     Spare SSP to Main SSP thru Main Hub:   GOOD
     Spare SSP to Main SSP thru Spare Hub:  GOOD
     Spare SSP to Primary Control Board:    FAILED
     Spare SSP to Spare Control Board:      GOOD
SSP/CB Host Information
     Main SSP:                              xf12-ssp
     Spare SSP:                             xf12-ssp2
     Primary Control Board (JTAG source):   xf12-cb1
     Spare Control Board:                   xf12-cb0
     System Clock source:                   xf12-cb1
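
Output in the layout shown above can be scanned by a script to list the links that are not GOOD. The following Python sketch assumes exactly that layout, with unindented section headings and indented "name: value" entries. The function names parse_showfailover and failed_links are hypothetical, and the output format may differ between SSP releases.

```python
def parse_showfailover(text):
    """Group indented 'name: value' lines under their section headings."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("ssp%"):
            continue  # skip blank lines and the echoed command
        if not raw[0].isspace():
            # Unindented line: a section heading, with or without a colon.
            current = line.rstrip(":")
            sections[current] = {}
        elif current is not None:
            # Indented line: split at the first colon.
            key, _, value = line.partition(":")
            sections[current][key.strip()] = value.strip()
    return sections


def failed_links(sections):
    """Return the connection names whose state is anything but GOOD."""
    links = sections.get("Failover Connection Map", {})
    return [name for name, state in links.items() if state != "GOOD"]
```

Applied to the example output, failed_links would report the two links to the primary control board, which is consistent with a CB Failover state of Failed.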

You can also use Hostview to verify the type of control board failover (complete or partial). When you use Hostview to verify a control board, the "J" (JTAG) and "C" (system clock source) characters indicate which control board manages the JTAG interface and system clock.

FIGURE 9-1 shows an example Hostview window after a partial control board failover. One control board handles the JTAG interface, while the other serves as the system clock source.

FIGURE 9-1 Example Hostview Window After a Partial Control Board Failover
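
The distinction between a partial and a complete failover can also be expressed as a simple comparison of the JTAG source and the system clock source. The following Python sketch is an illustration only. The function name failover_kind is hypothetical, and the result is meaningful only when you already know that a failover has occurred, because both sources are also on the same board before any failover.

```python
def failover_kind(jtag_source: str, clock_source: str) -> str:
    """Classify a failover that has already occurred.

    After a complete failover both roles are on the same (former spare)
    board; after a partial failover the clock stays on the former primary.
    """
    return "complete" if jtag_source == clock_source else "partial"


print(failover_kind("xf12-cb1", "xf12-cb1"))  # prints complete
print(failover_kind("xf12-cb1", "xf12-cb0"))  # prints partial
```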

After Control Board Failover

After a control board failover occurs, you must perform certain recovery tasks: