CHAPTER 8

SSP Failover

SSP provides an automatic failover capability that switches the main SSP to the spare within several minutes of detecting a failover condition, without operator intervention. A failover condition is a point of failure that occurs between the main and spare SSP, their control boards, or their network connections. The automatic failover mechanism continuously monitors both SSPs and their related components to detect a failover condition.

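This chapter explains the required main and spare SSP architecture, how automatic failover works and how to control it, how to obtain failover status information, how to manage data synchronization and command synchronization, and how to recover after an SSP failover.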



Note - You can have SSP failover, control board failover, or both. For information on automatic failover for control boards, see Chapter 9. For details on how the SSP, control board, and hub components must be configured for the various types of failover (SSP failover, control board failover, or both), refer to the Sun Enterprise 10000 SSP 3.5 Installation Guide and Release Notes.




Required Main and Spare SSP Architecture

For automatic SSP and control board failover to function properly, you must set up your dual SSP configuration as illustrated in the following figure.

FIGURE 8-1 Dual SSP Configuration Required for Automatic Failover

FIGURE 8-1 shows the SSP, control board, and hub configuration required for dual SSP and control board failover (two SSPs, two hubs, and two control boards). Refer to the Sun Enterprise 10000 SSP 3.5 Installation Guide and Release Notes for details on the other configurations (for example, you can have a single SSP configuration with two control boards) supported by the failover feature and the prerequisites for implementing automatic failover.


Maintaining a Dual SSP Configuration

To maintain a dual SSP configuration for failover purposes, note the following:


Maintaining a Single SSP Configuration

In single and dual SSP configurations, the SSP configuration files are copied to the /tmp directory for data synchronization purposes. (For information on data synchronization, see Managing Data Synchronization.) However, for single SSP configurations it is suggested that you run the setdatasync clean command on a regular basis to reduce the number of SSP message and log files that accumulate in the /tmp directory. For additional details on using the setdatasync clean command, see To Remove the Data Propagation List and the setdatasync (1M) man page.


How Automatic Failover Works

Automatic failover of the main to the spare SSP is accomplished through the following:

The following sections provide an overview of the basic SSP failover situations and the various ways to control automatic failover.

SSP Failover Situations

An automatic failover is triggered when a failure in the dual SSP configuration affects the proper operation of the main SSP. Failure points can be caused by the following:

However, note that failover will not occur when it has been disabled by operator request or when certain failure conditions prevent the failover. The various failure conditions and the resulting failover actions are summarized in Chapter 10, which identifies and explains the different points of failure detected by the failover process.

SSP Failover State Changes

After a failover occurs, you can obtain failover status information by running the showfailover (1M) command on the working SSP. For details, see Obtaining Failover Status Information. Note that the failover status information displayed reflects the failover state at the time you run the showfailover command.

The following state changes occur after an SSP failover:

Controlling Automatic SSP Failover

The SSP failover capability is automatically enabled upon SSP installation or upgrade. You control the failover state through the setfailover (1M) command, which enables you to do the following:

For additional information, see the setfailover (1M) man page.


To Disable SSP Failover

1. As user ssp on the main SSP, type:

ssp% setfailover off

SSP failover remains disabled until you enable it, as explained in the next procedure.



Note - If you reboot both the main and spare SSP, failover is automatically re-enabled.



2. Run the showfailover (1M) command to verify that failover was disabled.

For details, see Obtaining Failover Status Information. The failover state should be listed as Disabled.
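If you script this verification, the state can be read from the showfailover output. The following is a minimal sketch, assuming the output layout shown in Obtaining Failover Status Information. The function name ssp_failover_state is hypothetical and is not part of SSP.

```shell
#!/bin/sh
# Hypothetical helper (not an SSP command): reads showfailover(1M) output
# on standard input and prints the value of the "SSP Failover:" field.
# ASSUMPTION: the field is laid out as in the example output in this chapter.
ssp_failover_state() {
    awk '/SSP Failover:/ { print $NF; exit }'
}

# Example use on the main SSP (commented out because it needs SSP installed):
# showfailover | ssp_failover_state
```

For the example output in this chapter, the helper prints Disabled.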


To Enable SSP Failover

When you use the setfailover (1M) command to enable failover after it has been disabled, the connection states are checked before failover is enabled. All connection links must be functioning properly before failover can be enabled. If any failed connections exist, failover is not enabled.

1. As user ssp on the main SSP, type:

ssp% setfailover on

SSP failover is enabled if both SSPs and all their connection links are working.

2. Run the showfailover (1M) command to verify that failover was enabled.

For details on reviewing the failover state and connection status, see Obtaining Failover Status Information.



Note - Wait several minutes before verifying the failover state. During this time, the setfailover command checks the control board connections before activating SSP failover.
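Because the state change can take several minutes, a script can poll showfailover instead of checking once. The following sketch shows one way to do that. The function name, the retry count, and the delay are hypothetical, and the state name that SSP reports when failover is enabled is supplied by the caller because it is not shown in this chapter.

```shell
#!/bin/sh
# Hypothetical helper (not an SSP command).
# Usage: wait_for_failover_state EXPECTED TRIES DELAY_SECONDS
# Runs showfailover(1M) up to TRIES times, DELAY_SECONDS apart, and returns 0
# as soon as the "SSP Failover:" field equals EXPECTED, or 1 if it never does.
wait_for_failover_state() {
    expected=$1
    tries=$2
    delay=$3
    while [ "$tries" -gt 0 ]; do
        state=$(showfailover | awk '/SSP Failover:/ { print $NF; exit }')
        if [ "$state" = "$expected" ]; then
            return 0
        fi
        tries=$((tries - 1))
        if [ "$tries" -gt 0 ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# Example use (commented out because it needs SSP installed).
# ASSUMPTION: "Active" is only a guess at the enabled state name, based on
# the CB Failover value in the example output.
# wait_for_failover_state Active 10 30 || echo "failover not enabled yet"
```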




To Force a Failover to the Spare SSP



Note - Before forcing an SSP failover, be sure that both the main and spare SSP are synchronized. Use the showdatasync (1M) command to review the status of data synchronization between the main and spare SSP. For details, see Obtaining Data Synchronization Information.



1. As user ssp on the main SSP, type:

ssp% setfailover force

The setfailover command checks the data synchronization state before forcing a failover. The forced failover will not occur if any of the following conditions exist:

You can run the showdatasync (1M) command to obtain information on the synchronization state.

2. Run the showfailover (1M) command to verify that the forced failover occurred and review the failover state and connection status.

For details, see Obtaining Failover Status Information.

3. Re-enable SSP failover, as explained in To Enable SSP Failover.


To Modify the Memory or Disk Space Threshold in the ssp_resource File

When available memory or disk space drops below a certain threshold, a failover occurs. You can change these thresholds, which are stored in the ssp_resource (4) file, by using the setfailover (1M) command.

1. As user ssp on the main SSP, do one of the following:

2. Verify the updated threshold value by using the setfailover (1M) command with only the -m or -d option.

Obtaining Failover Status Information

Use the showfailover (1M) command on the main SSP to display failover status information. The following example shows the failover information displayed.

ssp% showfailover  
Failover State:
     SSP Failover: Disabled
     CB Failover:  Active
Failover Connection Map:
     Main SSP to Spare SSP thru Main Hub:   FAILED
     Main SSP to Spare SSP thru Spare Hub:  FAILED
     Main SSP to Primary Control Board:     GOOD
     Main SSP to Spare Control Board:       GOOD
     Spare SSP to Main SSP thru Main Hub:   FAILED
     Spare SSP to Main SSP thru Spare Hub:  FAILED
     Spare SSP to Primary Control Board:    FAILED
     Spare SSP to Spare Control Board:      FAILED
SSP/CB Host Information
     Main SSP:                              xf12-ssp
     Spare SSP:                             xf12-ssp2
     Primary Control Board (JTAG source):   xf12-cb1
     Spare Control Board:                   xf12-cb0
     System Clock source:                   xf12-cb1

The failover status includes the failover state, the failover connection map, and the SSP and control board host information.

You can also obtain information about the role of the current SSP by specifying the showfailover (1M) command with the -r option. The SSP role is either UNKNOWN (SSP role has not been determined), MAIN, or SPARE.

For additional details on the showfailover (1M) command, see the showfailover (1M) man page.
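When the connection map is long, it can help to list only the links that have failed. The following sketch filters the showfailover output for lines whose status is FAILED. The function name failed_links is hypothetical, and the sketch assumes the output layout shown in the example above.

```shell
#!/bin/sh
# Hypothetical helper (not an SSP command): reads showfailover(1M) output on
# standard input and prints the name of each connection marked FAILED.
# ASSUMPTION: each connection line has the form "<name>:  <STATUS>".
failed_links() {
    awk '$NF == "FAILED" {
        sub(/^[ \t]+/, "")               # drop leading indentation
        sub(/:[ \t]*FAILED[ \t]*$/, "")  # drop the status column
        print
    }'
}

# Example use on the main SSP (commented out because it needs SSP installed):
# showfailover | failed_links
```

An empty result means that no link in the connection map is reported as FAILED.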

Managing Data Synchronization

The data synchronization process copies any changes to the SSP configuration or specified user files on the main SSP to the spare SSP. As part of this process, the files to be copied are listed in a data synchronization queue so that you can see which files will be copied from the main to the spare SSP. You can use the showdatasync (1M) command to see which files are in the queue.

If you have user-created files (non-SSP files that are not contained in the SSP directories) that must be maintained on the spare SSP for failover purposes, you must identify these files in a data propagation list (/var/opt/SUNWssp/.ssp_private/user_file_list). The datasyncd daemon uses this list to determine which files to copy from the main SSP to the spare.

By default, the data synchronization process checks for any changes to the user-created files on the main SSP every 60 minutes. You can use the setdatasync command to set the interval at which the data propagation list is to be checked for modifications (see To Add a File to the Data Propagation List). The interval starts from the time at which a file is added to the data propagation list. The files in this list are propagated to the spare SSP only when they have changed since the last interval check.



Note - The data synchronization daemon uses the available disk space in the /tmp directory to copy files from the main SSP to the spare. If you have files to be copied that are larger than the available space in the /tmp directory, those files cannot be propagated. For example, if the data synchronization backup file (ds_backup.cpio) gets larger than the available space in /tmp, you must reduce the size of this file before data propagation can occur. For details on reducing the size of the data synchronization backup file, see To Reduce the Size of the Data Synchronization Backup File.



Use the setdatasync (1M) command to do the following:



Note - The files on the spare SSP are not monitored by the datasyncd daemon, which means that if you remove a user-created file on the spare SSP, the user file will not be automatically restored (copied) from the main to the spare SSP. In addition, do not remove SSP configuration files from the spare SSP.



For additional details, see the setdatasync (1M) man page.


To Add a File to the Data Propagation List

As user ssp on the main SSP, type:

ssp% setdatasync -i interval schedule filename 

where interval is the frequency, in minutes, at which the specified filename is to be checked as part of the data synchronization process. The specified file name must contain the absolute path. The files on the data propagation list are copied to the spare SSP only when those files change on the main SSP, and not each time the files are checked.


To Remove a File From the Data Propagation List

As user ssp on the main SSP, type:

ssp% setdatasync cancel filename 

where filename is the file to be removed from the data propagation list. The file name must contain the absolute path.


To Remove the Data Propagation List

The setdatasync clean command is useful for managing disk space in single SSP configurations, where the data propagation list can grow quite large and consume unnecessary disk space. It is possible for the /tmp directory to become full, which can cause the system to hang. You can run the setdatasync clean command as needed, either daily or weekly, to prevent the contents of the /tmp directory from growing too large. Or, you can automate the cleanup by using the cron (1M) command with a crontab (1) entry that runs the setdatasync clean command.



Note - Do not use this option when you have a dual SSP configuration because it can desynchronize data between the main and spare SSP.



As user ssp on the main SSP, type:

ssp% setdatasync clean  
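To automate the cleanup with cron, you need a crontab entry for user ssp. The following sketch builds a weekly entry and adds it to an existing crontab listing without installing anything. The schedule (02:00 every Sunday), the function names, and the path to setdatasync are assumptions for illustration; substitute the values for your site.

```shell
#!/bin/sh
# Sketch for a SINGLE SSP configuration only (see the Note above about
# dual SSP configurations).
# ASSUMPTION: the path below is illustrative; use the actual location of
# setdatasync on your SSP.
SETDATASYNC=${SETDATASYNC:-/opt/SUNWssp/bin/setdatasync}

# Prints a crontab(1) entry that runs "setdatasync clean" at 02:00 every
# Sunday. Fields: minute hour day-of-month month day-of-week command.
weekly_clean_entry() {
    printf '0 2 * * 0 %s clean\n' "$SETDATASYNC"
}

# Prints the crontab given on standard input with the entry appended,
# unless an identical line is already present. Nothing is installed.
with_clean_entry() {
    entry=$(weekly_clean_entry)
    awk -v entry="$entry" '
        { print }
        $0 == entry { found = 1 }
        END { if (!found) print entry }
    '
}

# Example use as user ssp (commented out because it changes the crontab):
# crontab -l | with_clean_entry | crontab
```

Keeping the merge step separate from the install step lets you review the resulting crontab before you load it.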


To Push a File to the Spare SSP

As user ssp on the main SSP, type:

ssp% setdatasync push filename 

where filename is the file to be moved to the spare SSP without adding the file to the data propagation list. The file name must contain the absolute path.


To Synchronize SSP Configuration Files Between the Main and the Spare SSP

Use this procedure to keep data between the main and spare SSP synchronized, for example, after SSP failover has been disabled and then re-enabled. If you want to archive an SSP configuration, use the ssp_backup (1M) command.

As user ssp on the main SSP, type:

ssp% setdatasync backup  

A data synchronization backup file (/tmp/ds_backup.cpio) of all SSP configuration data on the main SSP is created and then restored on the spare SSP. Note that the data synchronization backup differs from a backup created by the ssp_backup (1M) command:

The data synchronization backup can fail if the backup file exceeds the available disk space in the /tmp directory. For details on reducing the size of the data synchronization backup file, see the following procedure.


To Reduce the Size of the Data Synchronization Backup File

1. As superuser on the main SSP, run ssp_backup (1M) to create an archive of your SSP environment.

2. Before you run setdatasync backup, remove the following files to reduce the size of the data synchronization backup:

where x is the archive number of the file. Because these files are propagated from the new main SSP to the spare after a failover, you must remove these files on both the main and spare SSP to prevent regeneration of these files.

Obtaining Data Synchronization Information

Use the showdatasync (1M) command on the main SSP to obtain basic status information about data synchronization. The examples in this section show the different types of information displayed by the showdatasync command. For additional details, see the showdatasync (1M) man page.

The next example shows the file propagation status of the data synchronization process, the file currently propagated (none), and the number of files queued for data propagation (none). In this case, the status ACTIVE ARCHIVE indicates that a data synchronization backup is being performed.

ssp% showdatasync 
File Propagation Status:  ACTIVE ARCHIVE
Active File:              -
Queued files:             0

The following example shows the file propagation status of the data synchronization process, the name of the file currently being propagated, and the number of files queued for data propagation (none). In this case, the status ACTIVE indicates that the data synchronization process is enabled and functioning normally. The data synchronization backup file is the active file currently propagated.

ssp% showdatasync 
File Propagation Status:  ACTIVE
Active File:              /tmp/ds_backup.cpio
Queued files:             0
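The fields in the two examples above can also be read by a script, for instance to check the queue before you force a failover. The following is a minimal sketch, assuming the three-line layout shown above with spaces after each colon. The function name datasync_field is hypothetical and is not part of SSP.

```shell
#!/bin/sh
# Hypothetical helper (not an SSP command).
# Usage: showdatasync | datasync_field "FIELD NAME"
# Reads showdatasync(1M) output on standard input and prints the value of
# the named field, for example "Queued files" or "Active File".
# ASSUMPTION: each line has the form "<field name>:<spaces><value>".
datasync_field() {
    awk -F': *' -v name="$1" '$1 == name { print $2; exit }'
}

# Example use on the main SSP (commented out because it needs SSP installed):
# queued=$(showdatasync | datasync_field "Queued files")
# echo "files waiting to be propagated: $queued"
```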

The next example shows a data propagation list. Note that the INTERVAL indicates the frequency, in minutes, at which the file is to be checked for changes, as part of the data synchronization process.

ssp% showdatasync -l  
TIME PROPAGATED         INTERVAL     FILE
Mar 23 16:00:00         60           /tmp/t1
Mar 23 17:00:00         120          /tmp/t2

The example below shows the files queued for data synchronization:

ssp% showdatasync -Q  
FILE
/tmp/t1
/tmp/t2

Performing Command Synchronization

Command synchronization recovers user-defined commands that are interrupted by a failover and automatically reruns those commands on the new main SSP after a failover. Command synchronization does the following:

If you want user commands to be automatically recovered after a failover, you must prepare these user commands for synchronization as explained in the following sections.

Preparing User Commands for Automatic Restart

The runcmdsync (1M) command prepares a user command for automatic restart. runcmdsync adds the user command to the command synchronization list, which identifies the commands to be rerun after a failover.


To Prepare a User Command for Restart

As user ssp on the main SSP, type:

ssp% runcmdsync script_name [parameters] 

where:

script_name is the name of the user command to be restarted.

parameters are the options associated with the specified command.

The specified command will be rerun automatically on the new main SSP after a failover.

Preparing User Scripts for Automatic Recovery

If you want to resume processing of a user script from a certain marked point (location) within the script, you must include the following synchronization commands in the user script:

Each script must contain the initcmdsync and cancelcmdsync commands to initialize the script for synchronization and then remove the command from the command synchronization list, respectively. For details on the synchronization commands, see the cmdsync (1M) man page.



Note - These synchronization commands are intended for use by experienced programmers. You can use the runcmdsync (1M) command instead of the synchronization commands described in this section to prepare a script for recovery. However, the runcmdsync (1M) command will prepare the script so that it is rerun from the beginning and not from specified marker points.



The following procedures describe how to use these synchronization commands.



Note - After an SSP failover or in a single SSP configuration, SSP failover is disabled. When failover is disabled, scripts that contain synchronization commands will generate error messages to the platform log file and return non-zero exit codes. These error messages can be ignored.




To Create a Command Synchronization Descriptor

1. In your user script, type the following to create a command synchronization descriptor that identifies your script:

initcmdsync script_name [parameters]

where:

script_name is the name of the script.

parameters are the options associated with the specified script.

The output returned from the initcmdsync command serves as the command synchronization descriptor.


To Specify a Command Synchronization Marker Point

1. In your user script, type the following to mark an execution point from which processing can be resumed:

savecmdsync -M identifier cmdsync_descriptor 

where:

identifier is a positive integer that marks an execution point from which the script can be restarted.

cmdsync_descriptor is the command synchronization descriptor output by the initcmdsync command.


To Remove a Command Synchronization Descriptor

1. In your user script, type the following after the script termination sequence:

cancelcmdsync cmdsync_descriptor

where cmdsync_descriptor is the command synchronization descriptor output by the initcmdsync command. The specified descriptor is removed from the command synchronization list so that the user script is not run on the new main SSP after a failover.
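The three procedures above can be combined in a single user script. The following is a minimal sketch and is not the example script shipped with SSP. The step_one and step_two functions are hypothetical placeholders for your own work, and the marker identifiers 1 and 2 are arbitrary. The sketch does not show how a restarted script determines which marker to resume from; see the cmdsync (1M) man page for that.

```shell
#!/bin/sh
# Sketch of a user script that uses the synchronization commands.
# ASSUMPTIONS: step_one and step_two are hypothetical placeholders, and the
# marker identifiers 1 and 2 are arbitrary positive integers.

step_one() { echo "step one done"; }
step_two() { echo "step two done"; }

run_with_markers() {
    # Create the descriptor that identifies this script. "|| true" tolerates
    # the non-zero exit code returned when SSP failover is disabled.
    desc=$(initcmdsync "$0" "$@") || true

    step_one
    savecmdsync -M 1 "$desc" || true   # marker point after step one

    step_two
    savecmdsync -M 2 "$desc" || true   # marker point after step two

    # Normal completion: remove the descriptor so that the script is not
    # run on the new main SSP after a failover.
    cancelcmdsync "$desc" || true
}

# In a real script, call the function at the end of the file:
# run_with_markers "$@"
```

Tolerating non-zero exit codes keeps the script usable in a single SSP configuration, where the synchronization commands report errors that can be ignored.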

Obtaining Command Synchronization Information

Use the showcmdsync (1M) command on the main SSP to review the command synchronization list that identifies the user commands to be restarted on the new main SSP after an automatic failover.

The following is an example command synchronization list output by the showcmdsync (1M) command:

ssp% showcmdsync 
DESCRIPTOR      IDENTIFIER   CMD
         0              -1   c1 c2 a2

For further details, see the showcmdsync (1M) man page.

Example Script with Synchronization Commands

SSP provides an example user script that shows how the synchronization commands can be used. This script is located in the /opt/SUNWssp/examples/cmdsync directory. This directory also contains a README file that explains how the script works.

Recovering After an SSP Failover

After an SSP failover occurs, you must perform certain recovery tasks: