Netra High Availability Suite 3.0 1/08 Release Notes

The Netra High Availability Suite 3.0 1/08 Release Notes contain important and late-breaking information about the current release of the Netra™ High Availability (HA) Suite Foundation Service software. These notes contain known restrictions and workarounds to known bugs. In cases where there are differences between the release notes and the Netra HA Suite 3.0 1/08 documentation set, the information in the release notes takes precedence.

The Netra HA Suite 3.0 1/08 release is a set of Netra HA Suite patches to be applied to the Netra HA Suite 3.0 first customer shipment (FCS) release. See sections on specific types of patches presented later in these notes for a complete list of the patches to be applied.



What’s New Since 3.0

The following new features have been introduced since the release of the Netra HA Suite 3.0 software:

Solaris 10 1/06 OS Support on SPARC and x64 Platforms

The Netra HA Suite 3.0 1/08 software is supported for use with the Solaris™ 10 1/06 OS on SPARC® and x64 platforms only for platforms that are already supported for use with the Netra HA Suite 3.0 FCS release. For more information about the platforms supported for use with the Netra HA Suite 3.0 FCS release, see TABLE 4.

To support the Netra HA Suite 3.0 1/08 release on the Solaris 10 1/06 OS release, you must install the Solaris patches documented in Solaris OS Patches.

Solaris 10 8/07 OS Support on SPARC (Including CMT) and x64 Platforms

The Netra HA Suite 3.0 1/08 software is supported for use with the Solaris 10 8/07 OS on SPARC (including chip multithreading [CMT]) and x64 platforms. For more information, refer to TABLE 4. To support the Netra HA Suite 3.0 1/08 release on the Solaris 10 8/07 OS release, you must install the Solaris patches documented in Solaris OS Patches.

MontaVista Linux Carrier Grade Edition 4.0 Support on Netra CP3020 Servers

The Netra HA Suite 3.0 1/08 software is supported for use with the MontaVista Linux Carrier Grade Edition 4.0 OS (MV CGE 4.0) on only the Netra CT 900 servers equipped with Netra CP3020 blades. This is a 64-bit Linux distribution.

For more information, refer to TABLE 4.

WindRiver CGL Support on Netra CP3020/CP3220 Blades, Netra X4200 Server

The Netra HA Suite 3.0 1/08 software is supported for use with the WindRiver CGL OS on the Netra CT900 blade server with Netra CP3020 or CP3220 blades, as well as on the Netra X4200 rack-mounted server. TABLE 1 describes the supported platforms and 64-bit capabilities available with each supported version of the PNE-LE bundle release.


TABLE 1 WindRiver CGL Support

PNE-LE Bundle Release Version | Supported Platforms | 64-Bit Capabilities
1.4 | Netra CP3020 blades in a Netra CT900 blade server; Netra X4200 rack-mounted server | 64-bit Linux kernel, but only 32-bit user land libraries
2.0 | Netra CP3020/CP3220 blades in a Netra CT900 blade server; Netra X4200 rack-mounted server | Full 64-bit release


For PNE-LE bundle release 1.4, a patch is delivered for use with Netra HA Suite 3.0 1/08. If you want to run WindRiver CGL PNE-LE bundle release 2.0, contact your service representative. For more information, refer to TABLE 4.

Reduced Global Failover Time

The default detection delay for the heartbeat mechanism has been reduced on the Solaris OS from 900 milliseconds to 150 milliseconds, reducing the global failover time. On the Wind River Linux OS, the default value has been reduced to 300 milliseconds.

To change the default value, add Probe.DetectionDelay=value to the nhfs.conf file and reboot the node. Note that CMM.Probe.DetectionDelay is the deprecated name for Probe.DetectionDelay and can still be used.

Note that decreasing the value of Probe.DetectionDelay below 150 ms on Solaris OS and below 300 ms on Wind River Linux OS might lead to an unexpected loss of heartbeats and, as a result, nodes might unexpectedly leave the cluster. To avoid this situation, do not set Probe.DetectionDelay below the default values.

To return Probe.DetectionDelay to its previous default value, set the value to 900.
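As an illustration, the property could be set with a small shell helper such as the following sketch. The function name set_detection_delay, its argument order, and the path to nhfs.conf passed as the first argument are assumptions made for this example and are not part of the Netra HA Suite software. The helper refuses values below a caller-supplied minimum, in line with the guidance above, and it leaves any line that uses the deprecated CMM.Probe.DetectionDelay name untouched.

```shell
#!/bin/sh
# Hypothetical helper (not part of Netra HA Suite): set Probe.DetectionDelay
# in an nhfs.conf-style file made of KEY=VALUE lines.
#   $1 = path to the nhfs.conf file (location depends on your installation)
#   $2 = detection delay in milliseconds
#   $3 = lowest value to accept (150 on Solaris, 300 on Wind River Linux)
set_detection_delay() {
    conf=$1
    delay_ms=$2
    min_ms=$3
    if [ "$delay_ms" -lt "$min_ms" ]; then
        echo "Probe.DetectionDelay=$delay_ms is below the minimum of $min_ms ms" >&2
        return 1
    fi
    tmp="$conf.tmp.$$"
    # Copy every line except an existing Probe.DetectionDelay setting,
    # then append the new value so the property appears exactly once.
    awk 'index($0, "Probe.DetectionDelay=") != 1' "$conf" > "$tmp" || return 1
    printf 'Probe.DetectionDelay=%s\n' "$delay_ms" >> "$tmp"
    mv "$tmp" "$conf"
}

# Example: restore the previous 900 ms delay on a Solaris node, then reboot it.
#   set_detection_delay /path/to/nhfs.conf 900 150
```

The new value takes effect only after the node is rebooted.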

Synchronizing External Address Manager With Reliable Network File System on Switchover

If you use external addresses managed by External Address Manager (EAM) to access the Reliable Network File System (RNFS) from client nodes that are outside of the cluster, specify that the services should be synchronized when a switchover occurs. To enable this synchronization, set the EAM.SyncWithRNFS property in the nhfs.conf file to True. For information about this property, refer to the Netra High Availability Suite 3.0 1/08 Foundation Services Reference Manual.

New CMM Property Determines Whether Nodes Join Cluster at Startup

A new CMM property, CMM.StartUp.Join, allows you to define whether nodes will automatically try to join the cluster at startup. If the property is set to False, the node will not join the cluster at boot time. In this case, the node will join the cluster only upon request of the application through a CMM command. For information about this property, see the nhfs.conf(4) man page for Solaris or the nhfs.conf(5) man page for Linux, or refer to the Netra HA Suite reference manual for your operating system.

Sun Connection Inventory Support

The Netra HA Suite enables a service tag that can be automatically discovered and identified by the Sun™ Connection Inventory channel. For details about using Sun Connection’s Inventory channel to track and organize your Sun software and hardware, refer to:

https://sunconnection.sun.com/inventory

LDoms Support

Sun’s Logical Domains (LDoms) technology is a server virtualization and partitioning technology that enables the allocation of various system resources, such as memory, CPUs, I/O, and storage into partitions known as logical or virtual domains. Each logical domain can have an independent operating system, resources, and identity within a single computer system. Specialized service and control domains allow these resources to be managed using the Logical Domains Manager software.

For information about the LDoms configurations supported with this release of the Netra HA Suite Foundation Services, refer to LDoms.


Limitations of Supported Configurations

The following limitations apply to the configurations supported in this release of the Netra HA Suite software.

LDoms

Netra HA Suite 3.0 1/08 software is supported for use with LDoms 1.0.1 on Netra CP3060 and CP3260 ATCA blades (CMT) and Netra T2000 and T5220 servers (CMT). LDoms functionality is supported only on the Solaris 10 8/07 OS or newer.

The Netra HA Suite Foundation Services should be installed only in guest domains.

Netra CP3060 ATCA blades support only one physical disk drive, and this disk is owned by the control domain. Master eligible nodes and dataless non-master eligible nodes must use the virtual disk devices that are serviced by the control domain.

The control/service domain will be a single point of failure if Netra HA Suite is used with LDoms. If the control domain fails, all the other domains on the same system will also fail.

64-Node Cluster Configurations

Netra HA Suite 3.0 1/08 supports 64-node cluster configurations (a master node, a vice-master node, and 62 dataless nodes). However, configurations using Advanced Telecommunications Computing Architecture (ATCA)-based hardware have been qualified at the hardware level with a maximum of 12 nodes.

This release supports 64-node dataless clusters. Cluster performance (for example, the time required for switchover, failover, and boot) depends on the number of client nodes (master-ineligible nodes) in the cluster. When there are more than 18 client nodes, we suggest that you use server nodes (master-eligible nodes) that are more powerful than your client nodes to get expected performance results.

Service Limitations on a Netra HA Suite Cluster Running Linux

The following limitations exist when you run the Netra HA Suite software on a cluster where all or some nodes are running under Linux:


Service Availability Forum (SA Forum) Support

Netra HA Suite 3.0 1/08 software provides the following new functions through the SA Forum CLM API.

The initialViewNumber field of the saClmClusterNodeT structure is supported for use with the Netra HA Suite 3.0 1/08 software. For information, refer to the Netra High Availability Suite 3.0 1/08 Foundation Services SA Forum Programming Guide. For information about these functions, go to http://www.saforum.org/

The following values apply to the SA Forum/CLM man pages when they are used with the Netra High Availability Suite Foundation Services:


TABLE 2 Changes to the SA Forum/CLM Man Pages

Section in Man Page | Value Inherited From the SA Forum Organization Man Pages | Value For Use With Netra HA Suite
SYNOPSIS | Line that begins: cc [ flag... ] file... | cc [ flags... ] file... -lSaClm
SYNOPSIS | include xxx.h | include <saClm.h>
ATTRIBUTES | | SUNWnhsas-safclm-headers



Installation

Automated installation procedures described in the Netra High Availability Suite 3.0 1/08 Foundation Services Installation Guide have been adapted for the support of Solaris 10 8/07 OS on SPARC/CMT and x64 processors, and the support of MontaVista Linux Carrier Grade Edition 4.0 and WindRiver PNE-LE 1.4 Linux distributions.

All corresponding manual installation procedures have been detailed for the Solaris OS only, in the Netra High Availability Suite 3.0 1/08 Foundation Services Manual Installation Guide for the Solaris OS. No manual installation procedure is described for Linux.


Supported Hardware

TABLE 3 summarizes the hardware supported with the Netra HA Suite 3.0 1/08 software as of publication of this document. For more information about operating systems supported with Netra HA Suite for each platform, see TABLE 4.


TABLE 3 Supported Hardware Platforms for Netra HA Suite 3.0 1/08 Software

Servers

  • Netra™ 120 servers
  • Netra 240/440 servers
  • Netra CT 410/CT 810 servers
  • Netra CT 900 blade server (ATCA chassis)
  • Netra 1290 servers
  • Netra T2000 servers (chip multithreading [CMT])
  • Netra T5220 servers (CMT)
  • Netra X4200 (x64)
  • Sun Fire™ V210, V240, V440 servers

Boards

  • Netra CP2140, CP2160, and CP2500
  • Netra CP3010 ATCA SPARC blades
  • Netra CP3020 ATCA x64 blades
  • Netra CP3060 ATCA blades (CMT)
  • Netra CP3220 ATCA x64 blades
  • Netra CP3260 ATCA blades (CMT)

Ethernet Cards

  • Ethernet 10/100
  • 1 Gbit

Disks

  • SCSI disks
  • FC-AL disks
  • IDE disks
  • Sun StorEdge™ 3310 disk array




Note - For information about required patches and firmware versions for Netra CP3010 ATCA SPARC blades, Netra CP3020 or CP3220 ATCA x64 blades, Netra CP3060 or CP3260 ATCA CMT blades, or Netra CT 900 servers, refer to the appropriate release notes, which can be downloaded from: http://docs.sun.com.




Note - For information about iSCSI support with the Netra HA Suite, contact your support representative.


Suggested Switches

On Netra CT 900 servers, the Base Fabric Ethernet switches are interconnected with each other, as are the Extended Fabric Ethernet switches. This factory-preset configuration might lead to unexpected behavior on Linux if left unmodified because the redundant network interfaces used by Netra HA Suite are then in the same broadcast domain.

It is strongly suggested that you use redundant network interfaces in different broadcast domains. This can be achieved in a variety of ways. For example, you can disable the interconnect between switches or configure VLAN on switches. Having one interface on the Base Fabric and the other on Extended Fabric is not suggested because the technologies of the two fabrics differ.

Refer to the Netra CT 900 Server Administration and Reference Manual and the Netra CT 900 Server Switch Software Reference Manual for more information.

Mixed Hardware Configurations

The following mixed hardware configurations are supported for use on clusters running this release of the Netra HA Suite software.



Note - Blades running the Solaris OS are always MEN nodes, and blades running Linux are always NMEN nodes.



Supported Software Versions

This section lists the software you can use with the Netra HA Suite 3.0 1/08 and specifies the supported versions for different types of hardware.

Supported Operating Systems

TABLE 4 lists the servers and boards that are supported for each operating system (OS) version.


TABLE 4 Supported Operating Systems and Hardware

OS Version

Server and Boards in Use

Solaris 9 9/05 OS with Solaris 9 9/05 HW

  • Netra CP3010 ATCA SPARC blades
  • Netra CT 900 blade servers
  • Netra 120/240/440 servers
  • Sun Fire V210/V240/V440 servers
  • Netra CT 410/CT 810 blade servers
  • Netra CP2140, CP2160, and CP2500 SPARC blades

Solaris 10 1/06 OS

  • Netra CP3010 ATCA SPARC blades
  • Netra CP3020 ATCA x64 blades
  • Netra CT 900 blade server
  • Netra 120/240/440 servers
  • Netra 1290 servers
  • Sun Fire V210/V240/V440 servers

Solaris 10 8/07 OS

  • Netra CP3010 ATCA SPARC blades
  • Netra CP3020 ATCA x64 blades
  • Netra CP3060 ATCA CMT blades
  • Netra CP3220 ATCA x64 blades
  • Netra CP3260 ATCA CMT blades
  • Netra CT 900 blade servers
  • Netra 120/240/440 servers
  • Netra 1290 servers
  • Netra T2000 CMT servers
  • Netra T5220 CMT servers
  • Netra X4200 servers (Opteron x64)
  • Sun Fire V210/V240/V440 servers

MV CGE 4.0 Linux

  • Netra CP3020 ATCA x64 blades
  • Netra CT900 blade server

Wind River PNE-LE 1.4

  • Netra CP3020 ATCA x64 blades
  • Netra CT900 blade server
  • Netra X4200 servers (Opteron x64)

Wind River PNE-LE 2.0

  • Netra CP3020 or CP3220 ATCA x64 blades
  • Netra CT900 blade server
  • Netra X4200 servers (Opteron x64)

For example cluster configurations, see the Netra High Availability Suite 3.0 1/08 Foundation Services Getting Started Guide.

Volume Management Software

The following volume management software is supported for use with the Netra HA Suite 3.0 1/08 software:

Embedded Software

The following software is embedded with the release of Foundation Services 3.0:

The following versions of data replication software are supported on the specified versions of operating system.



Note - AVS 3.2 is not supported for use with the Foundation Services software.


Development Tools

The following development tools are supported for use with this release of the Foundation Services software:


Netra HA Suite 3.0 1/08 Patches

For the Netra HA Suite 3.0 1/08 software to be properly installed and operational, you must download a set of patches and apply them to the Netra HA Suite 3.0 FCS. To download the patches, visit the SunSolve℠ web site:

http://www.sun.com/sunsolve

TABLE 5 lists the required patches for each supported operating system.


TABLE 5 Required Netra HA Suite 3.0 1/08 Patches

Operating System | Required Patches
Solaris 9 9/05 OS or Solaris 9 9/05 HW OS | 124480
Solaris 10 1/06 or Solaris 10 8/07 | 124481 (SPARC), 124482 (x64)
MV CGE 4.0 | 124483
Wind River CGL PNE-LE 1.4 | 124484




Note - For Netra HA Suite 3.0 1/08 software, you must install revision -05 or later of these patches.


During the first reboot after patches 124481-05 or 124482-05 are applied on a Solaris cluster Master Eligible node, the following message appears on the console:


svc.startd[7]: Transitioning svc:/system/cgha/rnfs/server:default to  maintenance
because it completes a dependency cycle (see svcs -xv for details):
svc:/system/nws_rdcsyncd
svc:/system/nws_rdcsyncd:default
svc:/milestone/multi-user
svc:/milestone/multi-user:default
svc:/system/cgha/rnfs/server
svc:/system/cgha/rnfs/server:default 

This message appears when changes in the Netra HA Suite services are taken into account by the Solaris Service Management Facility (SMF), and it can be safely ignored. The maintenance state of svc:/system/cgha/rnfs/server:default is subsequently cleared, and the service is restarted correctly at the end of the boot.

Installing Patches for Netra HA Suite 3.0 1/08 on an Existing Cluster

If you already have a cluster up and running with the Netra HA Suite 3.0 software and you want to upgrade it to Netra HA Suite 3.0 1/08, install the above-mentioned Netra HA Suite patches using the procedure described in the README files delivered with the Netra HA Suite patches.



Note - If you are running the Solaris 10 1/06 OS, there are additional Solaris OS patches that must be installed before you install the Netra HA Suite patches. For more information, see Solaris OS Patches.


Installing Patches for Netra HA Suite 3.0 1/08 on a New Cluster

If you are installing a new cluster, you can use the nhinstall tool to perform an automated full installation of the Netra HA Suite 3.0 1/08 software. To do this, install the Netra HA Suite 3.0 FCS packages and patches on your installation server and follow the procedure described in the README files delivered with the Netra HA Suite patches.



Note - When installing a new cluster on the Solaris 10 OS, after the Solaris 10 8/07 Operating System and Netra HA Suite 3.0 1/08 software are installed, you must then install the latest recommended Solaris patches for the platform architecture. See the following description, “To Install the Latest Recommended Solaris Patches.”



To Install the Latest Recommended Solaris Patches

1. Install the Solaris 10 8/07 Operating System and Netra HA Suite 3.0 1/08 software.

Use the nhinstall tool to install the Netra HA Suite software.

2. Obtain the latest “Recommended Solaris Patch Cluster” for Solaris 10 and the platform architecture used by your system.

You can download these patches from:

http://sunsolve.sun.com

3. Disable the Netra HA Suite software.


% touch /etc/opt/SUNWcgha/not_configured 

4. Run the installation script.

The script is bundled with the patches.

The procedure is not “nhinstall friendly,” as the patch install script might require several reconfiguration reboots.

5. Re-enable the Netra HA Suite software.

After you have finished installing the patches, remember to re-enable the software by removing the not_configured file.
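Steps 3 and 5 amount to creating and then removing a marker file. The sketch below wraps the two operations in shell functions. The function names and the NHAS_MARKER variable are hypothetical conveniences for this example; only the default path comes from the procedure above. Because the patch install script might reboot the node several times, run the functions as separate manual steps rather than from one unattended script.

```shell
#!/bin/sh
# Marker file whose presence keeps the Netra HA Suite software disabled
# (path from step 3). NHAS_MARKER is a hypothetical override for this sketch.
nhas_marker() {
    echo "${NHAS_MARKER:-/etc/opt/SUNWcgha/not_configured}"
}

# Step 3: disable the Netra HA Suite software before patching.
nhas_disable() {
    touch "$(nhas_marker)"
}

# Step 5: re-enable the software once all patches are installed.
nhas_enable() {
    rm -f "$(nhas_marker)"
}

# Report the current state based on the marker file.
nhas_state() {
    if [ -f "$(nhas_marker)" ]; then echo disabled; else echo enabled; fi
}
```

Running nhas_state before leaving the node is a quick way to confirm that the marker file was removed.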


Solaris OS Patches

When installing the Foundation Services software, install the latest version of the following patches, which are available on the SunSolve web site, depending on the version of the Solaris OS that is installed on your system:

At a minimum, install the patch for the init s/init 3 sequence: 127111-09 or higher (SPARC) and 127112-09 or higher (x64).

At a minimum, install:

- 118833 (SPARC): Before installing patch 118833, install patches 118918-13, 119042-09, and 119578-30, in this order, and reboot the node.

- 118855 (x86): Before installing patch 118855, install patches 119043, 118344, 123840, 122035 (in this order) and reboot the node.

These patches must be manually installed if you want to upgrade an existing Netra HA Suite 3.0 cluster to Netra HA Suite 3.0 1/08.

These patches are automatically installed if you use the nhinstall tool. If you manually install the software, download these patches from SunSolve.
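For a manual installation, the prerequisite ordering described above can be captured in a short script. The following is a sketch only: the function name, the PATCHADD override, and the directory /var/tmp/patches are assumptions for illustration, and the patches are assumed to be already downloaded and unpacked into directories named after their patch IDs. patchadd is the standard Solaris patch installation command.

```shell
#!/bin/sh
# Hypothetical helper: install patches one by one, in the order given,
# and stop at the first failure so that later patches are not applied
# on top of a missing prerequisite.
#   $1      = directory that contains the unpacked patch directories
#   $2 ...  = patch IDs, in installation order
# PATCHADD can be set to another command for a dry run (for example, echo).
install_patches_in_order() {
    patch_dir=$1
    shift
    for patch_id in "$@"; do
        "${PATCHADD:-patchadd}" "$patch_dir/$patch_id" || {
            echo "patch $patch_id failed; stopping" >&2
            return 1
        }
    done
}

# Example (SPARC prerequisites for patch 118833; reboot the node afterward,
# then install 118833 itself):
#   install_patches_in_order /var/tmp/patches 118918-13 119042-09 119578-30
```

Setting PATCHADD=echo prints the patch paths in order without installing anything, which is a convenient way to check the sequence first.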



Note - On the Solaris 9 9/05 HW OS, these patches are not required.



SNDR Patch

The Netra HA Suite download contains one SNDR patch: 116710-03. This SNDR/AVS point patch replaces the SNDR patches released with the previous version of the software and should be installed only if you are running the Solaris 9 9/05 OS or Solaris 9 9/05 HW OS (SNDR 3.1). No patch should be installed for AVS 4.0.

This SNDR patch is available on SunSolve at http://sunsolve.sun.com/point.


Carrier Grade Transport Protocol (CGTP) Patches

No software patches for CGTP are required for the Solaris 9 9/05 OS and Solaris 9 9/05 HW OS or Solaris 10 OS.

A CGTP patch must be added to standard Linux kernels if you choose not to use the Linux kernel delivered with the Netra HA Suite 3.0 1/08 patches for Linux distributions. In this case, you must rebuild your Linux kernel using the CGTP source patch delivered with the Netra HA Suite 3.0 1/08 patches. For help rebuilding your kernel with CGTP, contact your authorized service representative.


Product Recommendations

The following sections describe recommended uses of particular functionalities and features of the Foundation Services.

Use of the reboot Command

When rebooting a master-eligible node on a running cluster, do not use the reboot command. Doing so will kill processes in an indeterminate order, effectively ignoring the required sequence for stopping services, which can lead to inconsistencies in data replication.

Instead, reboot a node using the steps provided in the Netra High Availability Suite 3.0 1/08 Foundation Services Cluster Administration Guide, which vary depending on the version of the operating system in use at your site.

Scheduling Major Tasks When the Cluster Is Unsynchronized

When a master-eligible node is reintegrated into the cluster (for example, after maintenance or failure), there is a period when disk partitions are resynchronizing. While a cluster is unsynchronized, the data on the master node disk is not fully backed up. Do not schedule major tasks when the cluster is unsynchronized.

Recovering From Node Failure

The symptoms of node corruption can include the presence of “maintenance required” messages, nodes remaining at run level 1, the inability to execute basic UNIX® commands (for example, ls, pwd, and cd), and the presence of messages about recovering the repository using archives.

If, when installing clusters, you experience any of these symptoms and determine that a node failure has occurred, manually recover the node(s) by following the procedures in the README file included with the Solaris 10 OS distribution (/lib/svc/share/README). For specific examples, refer to Section 2 of the README file.

File Locking

Due to issues with the Linux kernel, file locking is not supported for use with the Netra HA Suite 3.0 1/08 Foundation Services software. For more information, refer to Linux Known Issues.


Known Issues

The following subsections list known bugs and their workarounds where available.

Most Common Issues

TABLE 6 describes the issues most commonly encountered when using the Foundation Services, beginning with the issue of which you should be most aware.


TABLE 6 Most Commonly Experienced Issues

Bug

Description

5065254

NFS client server deadlock

The Solaris OS does not support the ability for a node with an NFS client to access data exported by an NFS server on the same host. In this case, if the NFS client writes large files to the NFS server, the OS deadlocks, and the node hangs. This can occur if an application on the vice-master fails over or is switched over to the master. The master node hangs and might be unable to function as a master. If you encounter this situation, reboot the hung node.

To avoid this error, do not use loopback NFS mounts. Use local paths in applications running on the master node. For example, use /SUNWcgha/local to access files instead of using remote paths like /SUNWcgha/remote.

You can also avoid this error by using NFS protocol version 2 on master-eligible nodes. For example, set NFS_CLIENT_VERSMAX=2 in /etc/default/nfs on the Solaris 10 OS and newer, or add the vers=2 option to /etc/vfstab for particular file systems (see mount_nfs(1M)).

4964345

SNDR sets using sector 0 fail, which is not detected by nhinstall/nhadm

Do not use sector zero in any slice that will be replicated. If you do, the cluster will hang on the final step of SNDR synchronization. You might encounter this situation if you install more than one disk on a node using nhinstall.

6208336

CMM_ETIMEDOUT error is displayed when performing a switchover on 64 nodes

On large clusters (for example, clusters of 40 or more nodes), this error appears on the master node when you perform a switchover using the nhcmmstat command, even if the switchover succeeds.

To resolve the error described above, Foundation Services allows you to define the timeout period (in seconds) for a particular run of the nhcmmstat command. This is done using the -m option of the nhcmmstat command. If you receive the above error, increase the value of this option.

The command line argument is -m <timeout>. The default value for this option is five seconds. The following example shows how to trigger a switchover with a timeout of 6 seconds.

/opt/SUNWcgha/sbin/nhcmmstat -m 6 -c so

6218803

DHCP table corruption

Each diskless node has its own dhcpagent file in the exported / (root) partition on the master server. For example, /export/root/[nodename]/etc/default/dhcpagent

It is possible for this file to become corrupted when the diskless node crashes or goes down without a file system sync. If this occurs, the diskless node will not boot until the file is repaired on the master server.

To avoid file corruption, you can keep local copies of such DHCP tables on the master and vice-master servers. Synchronizing these local copies is then outside the scope of the Netra HA Suite Foundation Services software and must be handled manually by the system administrator.

You can use the REPLICATED_DHCP_FILES=NO option in the cluster_definition.conf file when you use nhinstall.

6433544

ifconfig: plumb: bge1: No such file or directory

On first boot, ATCA diskless nodes might hang after not being able to configure bge1, eventually returning the preceding error.

To recover from this error, send a break to the hung nodes and boot them again. This problem occurs only during the first boot.


Linux Known Issues

TABLE 7 describes issues that exist using the Foundation Services with MontaVista Carrier Grade Edition and Wind River Linux.


TABLE 7 Known Issues for Linux Distributions

Bug

Description

6475071

CGTP fails on Linux occasionally

Configure CGTP’s gateway table on pure Linux clusters (gateway tables should normally be used only with heterogeneous, Linux-Solaris clusters). If a Linux cluster is installed using nhinstall, no further action is needed because the nhinstall tool will configure CGTP’s gateway table.

If the installation is done manually, without using nhinstall, the gateway table must also be manually populated.

See Example: Configuring CGTP’s Gateway Table on Linux for an example of configuring a gateway table for a three-node cluster.

6483297

CMM_ETIMEDOUT

On Linux, when multiple threads intensively use the CMM API or SA Forum CLM API, because of the way Linux schedules the threads, some calls might return CMM_ETIMEDOUT. Users can safely retry the operation.

6485586, 6472470, 6616254

local0.debug nhas.log

The syslog facility can be configured to log Netra HA Suite messages as described in the Netra High Availability Suite 3.0 1/08 Foundation Services Cluster Administration Guide.

On Linux, syslog can be very slow. Therefore, when configuring syslog to get Netra HA Suite messages of info or debug level, it is strongly suggested that you omit syncing of the log file after logging by prefixing entries in the /etc/syslog.conf file with the minus sign “-” as described in the syslog.conf(5) man page.

6489600

Locks are lost upon switch over or fail over on Linux

Due to an issue with the Linux kernel, file locking is not supported for use with the Netra HA Suite 3.0 1/08 Foundation Services. File locking on the replicated partitions works until a failover or switchover is triggered, at which point the locks are lost.

6675034

Confusing message from bonding when executing a switchover

When a switchover is requested, you might see a message on the console stating that a bond interface (bondX) has failed. You can safely ignore this message, as it has no real impact on the system.


Example: Configuring CGTP’s Gateway Table on Linux

Consider a cluster with two master-eligible nodes (MEN-1 and MEN-2) and one non-master-eligible node (NMEN), addressed as follows:


     MEN-1 - eth0 has 10.191.1.10
         eth1 has 10.191.2.10
         cgtp0 has 10.191.3.10
     MEN-2 - eth0 has 10.191.1.20
         eth1 has 10.191.2.20
         cgtp0 has 10.191.3.20
     NMEN -  eth0 has 10.191.1.30
         eth1 has 10.191.2.30
         cgtp0 has 10.191.3.30 
 

On MEN-1, the following must be done:


     echo "add 10.191.3.20 10.191.1.20 eth0" > /proc/net/cgtp0/gateway
     echo "add 10.191.3.20 10.191.2.20 eth1" > /proc/net/cgtp0/gateway
     echo "add 10.191.3.30 10.191.1.30 eth0" > /proc/net/cgtp0/gateway
     echo "add 10.191.3.30 10.191.2.30 eth1" > /proc/net/cgtp0/gateway

On MEN-2, the following must be done:


     echo "add 10.191.3.10 10.191.1.10 eth0" > /proc/net/cgtp0/gateway
     echo "add 10.191.3.10 10.191.2.10 eth1" > /proc/net/cgtp0/gateway
     echo "add 10.191.3.30 10.191.1.30 eth0" > /proc/net/cgtp0/gateway
     echo "add 10.191.3.30 10.191.2.30 eth1" > /proc/net/cgtp0/gateway

On NMEN, the following must be done:


     echo "add 10.191.3.20 10.191.1.20 eth0" > /proc/net/cgtp0/gateway
     echo "add 10.191.3.20 10.191.2.20 eth1" > /proc/net/cgtp0/gateway
     echo "add 10.191.3.10 10.191.1.10 eth0" > /proc/net/cgtp0/gateway
     echo "add 10.191.3.10 10.191.2.10 eth1" > /proc/net/cgtp0/gateway

To ensure that the gateway table entries are added automatically after a reboot (or ifdown/ifup), add the corresponding up lines to /etc/network/interfaces. For example, for MEN-1:


# CGTP interface
auto cgtp0
iface cgtp0 inet static
   ...
   up echo "add 10.191.3.20 10.191.1.20 eth0" > /proc/net/cgtp0/gateway
   up echo "add 10.191.3.20 10.191.2.20 eth1" > /proc/net/cgtp0/gateway
   up echo "add 10.191.3.30 10.191.1.30 eth0" > /proc/net/cgtp0/gateway
   up echo "add 10.191.3.30 10.191.2.30 eth1" > /proc/net/cgtp0/gateway 
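Because each peer needs one entry per redundant interface, the entries can also be generated from a list of peer addresses. The sketch below is illustrative only: the function names and the three-column input format (CGTP address, eth0 address, eth1 address) are assumptions for this example. Only the "add" entry syntax and the /proc/net/cgtp0/gateway path come from the commands above.

```shell
#!/bin/sh
# Hypothetical helper: read one peer per line from standard input, in the
# form "<cgtp address> <eth0 address> <eth1 address>", and print the two
# gateway-table entries that the peer needs.
cgtp_gateway_entries() {
    while read -r cgtp_addr eth0_addr eth1_addr; do
        # Skip blank lines.
        [ -n "$cgtp_addr" ] || continue
        printf 'add %s %s eth0\n' "$cgtp_addr" "$eth0_addr"
        printf 'add %s %s eth1\n' "$cgtp_addr" "$eth1_addr"
    done
}

# Hypothetical helper: write each entry read from standard input to the
# gateway table, one write per entry, as in the manual commands.
apply_cgtp_gateway_entries() {
    gateway_file=${1:-/proc/net/cgtp0/gateway}
    while read -r entry; do
        echo "$entry" > "$gateway_file"
    done
}

# Example for MEN-1, whose peers are MEN-2 and NMEN:
#   printf '%s\n' \
#       '10.191.3.20 10.191.1.20 10.191.2.20' \
#       '10.191.3.30 10.191.1.30 10.191.2.30' |
#       cgtp_gateway_entries | apply_cgtp_gateway_entries
```

Running cgtp_gateway_entries alone on the MEN-1 input prints the same four entries as the manual commands for MEN-1, so the output can be reviewed before anything is written to the gateway table.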

Cluster Membership Manager (CMM)


TABLE 8 Known Issues for CMM

Bug

Description

4697437

Notifications of Diskless Node State Transitions Can Be Lost

Notifications that describe the difference between an initial state and a final state are emitted by the CMM on the master node when the cluster membership changes. The CMM running on a diskless node can miss notifications for transitory states. For example, when a cluster passes through three states (CC1, CC2, and CC3), one notification should be emitted to describe the transition from CC1 to CC2, and then another to describe the transition from CC2 to CC3. In this release of the product, a diskless node might receive only the notification for the overall transition from CC1 to CC3. The diskless node might miss the notification for the transient state CC2.

When a cluster passes from state CC1 to CC2, and then back to state CC1, the diskless node might not receive any notification.

4746183

Single Point of Failure Occurs Immediately After Switchover

A single point of failure exists for a brief period of time after a switchover. It lasts until Reliable NFS receives the MASTER_ELECTED and VICE_MASTER_ELECTED notifications from CMM.

Use the nhcmmstat tool to check which notifications have been received.

If the newly elected master node reboots before notifications are received, refer to the Netra High Availability Suite 3.0 1/08 Foundation Services Cluster Administration Guide for information about how to recover a cluster.

4740446

Switchover Is Initiated Even Though the CMM_FLAG_SYNCHRO_NEEDED Flag Is Set

There is a small time frame between the issuance of a command to change the synchronization state at the API level and the moment when the nhcmmd daemon handles the command. If a switchover request is issued within this time frame, the switchover request is accepted even if the cluster is no longer synchronized. In this scenario, a call from Reliable NFS to clear the CMM_FLAG_SYNCHRO_NEEDED flag will fail because a switchover is in progress. Therefore, the master node reboots and the replication stops until the vice-master node is rebooted.

Verify that the CMM_FLAG_SYNCHRO_NEEDED flag is clear before requesting a switchover. To recover, reboot the vice-master node.

4749139

Library Clients Should Rely on Local Notifications Only

When a master-eligible node is elected as the vice-master node, the master node notifies the other peer nodes just before the data in the master node API module is updated. As a result, the cmm_vicemaster_getinfo() function called on the master node can fail and return a CMM_ESRCH error, even though the CMM library clients on the other peer nodes have already received the CMM_VICEMASTER_ELECTED notification.

See the Netra High Availability Suite 3.0 1/08 Foundation Services CMM Programming Guide for more information.
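One way for a client to rely on local notifications only is sketched below. This is an illustration, not the CMM library API: the ViceMasterTracker class and the query callable are hypothetical, and the sketch assumes that the notification is delivered on a node only after that node's own data has been updated.

```python
from typing import Callable, Optional


class ViceMasterTracker:
    """Query vice-master information only after a local notification."""

    def __init__(self, query_vicemaster: Callable[[], str]) -> None:
        # query_vicemaster is a hypothetical stand-in for a call such as
        # cmm_vicemaster_getinfo(); it returns a node name in this sketch.
        self._query = query_vicemaster
        self._notified_locally = False

    def on_local_notification(self, name: str) -> None:
        # Call this from the notification handler that runs on this node.
        if name == "CMM_VICEMASTER_ELECTED":
            self._notified_locally = True

    def vicemaster(self) -> Optional[str]:
        # What other nodes have already seen is deliberately ignored.
        if not self._notified_locally:
            return None
        return self._query()
```

A caller treats a None result as "no vice-master known yet on this node" and tries again after the next local notification.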

4845598

Diskless Node Emits CMM_INVALID_CLUSTER Notification When Master Is Disqualified

When the master node is disqualified by the cmm_membership_qualif function, the nhcmmd daemon on an associated diskless node might emit a CMM_INVALID_CLUSTER notification. Ignore the notification. The cluster is up and running.

4928087

Switchover + full synch operation generates a duplicate floating address

When you perform a switchover (/opt/SUNWcgha/sbin/nhcmmstat -c so) in parallel with a full synchronization (/opt/SUNWcgha/sbin/nhcrfsadm -f) while the two master-eligible nodes are synchronized, the following events occur:

  • The nhcmmd daemon engages the switchover.
  • Starting a full synchronization of the master-eligible nodes changes their state from READY to SYNCHRO NEEDED.
  • The vice-master becomes master and sets its master IP address to UP. (The master IP address is always plumbed on both master-eligible nodes, but it is set to DOWN on the vice-master node.)

As a result of this sequence of events, Reliable NFS cannot set the master IP address to DOWN, because this action cannot take place while a full synchronization is in progress.

If you encounter this problem, wait until the full synchronization is complete. This might take some time.


Reliable NFS Known Issues


TABLE 9 Known Issues for Reliable NFS

Bug

Description

5065254

NFS client server deadlock

4964345

SNDR sets using sector 0 fail, which is not detected by nhinstall/nhadm


CGTP Known Issues


TABLE 10 Known Issues for CGTP

Bug

Description

4740370

CGTP Broadcast IRE Are Not Recreated After plumb or unplumb

Use of the ifconfig command to plumb or unplumb the CGTP interface is not supported. Using the ifconfig command in this way can lead to an unexpected cluster outage.

Acting on a single interface can make CGTP broadcasts inoperative. Broadcasts replicated by CGTP might not be delivered if one of the underlying incoming interfaces is down or, for the same reason, has been unplumbed. CGTP broadcasts cannot survive the abrupt unplumbing and replumbing of the underlying network interfaces.

The only way for CGTP broadcasts to survive an ifconfig unplumb is to perform the following operations in this order:

  • Delete the CGTP routes that cross the interface being unplumbed.
  • Unplumb the interface.
  • Replumb a new interface.
  • Redeclare the previous CGTP routes.

6475071

“CGTP fails on Linux occasionally”

See Linux Known Issues for information.


Reliable Boot Service (RBS) Known Issues


TABLE 11 Known Issues for RBS

Bug

Description

6208336

CMM_ETIMEDOUT errors are displayed when performing a switchover on 64 nodes

6218803

DHCP table corruption

6267056

Only ViceMaster Node will stay up in the cluster when performing switchover+full synchronization

When you perform a switchover (/opt/SUNWcgha/sbin/nhcmmstat -c so), there is a brief window during which launching a full synchronization (/opt/SUNWcgha/sbin/nhcrfsadm -f) causes the cluster to lose its master node (and, therefore, the diskless nodes), so that only the vice-master node stays up.

If you encounter this problem, recover the cluster by flushing the SNDR configuration. For more information about recovering a cluster, see the Netra High Availability Suite 3.0 1/08 Foundation Services Troubleshooting Guide.

6290647

Warnings when MEN rejoins large cluster

On large clusters, when one of the master-eligible nodes (MENs) rejoins the cluster, warnings such as the following might appear on the console of the joining node: /var/run/CMM_xxx_00000000 fails: Resource temporarily unavailable

This warning means that some membership or mastership notifications have been lost on that node. Consider having client applications bring their view of the cluster back to a coherent state by calling the CMM API cmm_member_getall() function or the SAF CLM API saClmClusterTrack() function with the SA_TRACK_CURRENT flag.
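The resynchronization step can be sketched as follows. The sketch assumes that the client keeps a cached mapping from node name to role and that a full snapshot, such as the result of cmm_member_getall(), can be converted to the same form. The resync function, the node names, and the role strings are hypothetical and are not part of either API.

```python
from typing import Dict, List


def resync(cached: Dict[str, str], snapshot: Dict[str, str]) -> Dict[str, List[str]]:
    """Replace a possibly stale cached view with a full snapshot.

    Both arguments map a node name to its role. The return value lists
    what the cached view had missed, so the application can react to it.
    """
    changes = {
        "joined": sorted(set(snapshot) - set(cached)),
        "left": sorted(set(cached) - set(snapshot)),
        "role_changed": sorted(
            node
            for node in set(cached) & set(snapshot)
            if cached[node] != snapshot[node]
        ),
    }
    # The snapshot is authoritative: overwrite the cached view in place.
    cached.clear()
    cached.update(snapshot)
    return changes


# Hypothetical data: the client missed a switchover and the return of node3.
view = {"node1": "master", "node2": "vice-master"}
fresh = {"node1": "vice-master", "node2": "master", "node3": "in"}
missed = resync(view, fresh)
```

Replacing the whole view, instead of replaying individual events, avoids depending on any notification that might have been lost.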



Documentation Details

The following table lists the guides that make up the current documentation set and briefly describes the type of information they contain. The documentation can be found at:

http://docs.sun.com/app/docs/prod/netra.avail


Application | Title | Part Number
Late-breaking news | Netra High Availability Suite 3.0 1/08 Release Notes | 819-5249-14
Introduction to concepts | Netra High Availability Suite 3.0 1/08 Foundation Services Overview | 819-5240-13
Basic setup, supported hardware and configurations | Netra High Availability Suite 3.0 1/08 Foundation Services Getting Started Guide | 819-5241-13
Automated installation methods | Netra High Availability Suite 3.0 1/08 Foundation Services Installation Guide | 819-5242-13
Detailed installation methods | Netra High Availability Suite 3.0 1/08 Foundation Services Manual Installation Guide for the Solaris OS | 819-5237-13
Cluster administration | Netra High Availability Suite 3.0 1/08 Foundation Services Cluster Administration Guide | 819-5235-13
Using the Cluster Membership Manager | Netra High Availability Suite 3.0 1/08 Foundation Services CMM Programming Guide | 819-5236-13
Using the Service Availability (SA) Forum CMM API | Netra High Availability Suite 3.0 1/08 Foundation Services SA Forum Programming Guide | 819-5246-13
Using the Node Management Agent | Netra High Availability Suite 3.0 1/08 Foundation Services NMA Programming Guide | 819-5239-13
Configuring outside the cluster using CGTP | Netra High Availability Suite 3.0 1/08 Foundation Services Standalone CGTP Guide | 819-5247-13
Man pages for Foundation Services features and APIs using the Solaris OS | Netra High Availability Suite 3.0 1/08 Foundation Services Reference Manual | 819-5244-13
Man pages for Foundation Services features and APIs using Linux | Netra High Availability Suite 3.0 1/08 Foundation Services Linux Reference Manual | 819-5245-12
Common problems | Netra High Availability Suite 3.0 1/08 Foundation Services Troubleshooting Guide | 819-5248-13
Definitions and acronyms | Netra High Availability Suite 3.0 1/08 Foundation Services Glossary | 819-5238-13


Documentation Known Issues

The following known issues exist in this release of the Netra HA Suite Foundation Services documentation set.

Definition for the init. election Field in the nhcmmstat Man Page

The init. election field is currently not documented in the nhcmmstat man page or in the Netra High Availability Suite 3.0 1/08 Foundation Services Reference Manual. The following definition applies to this field:

init. election
The election number the cluster had when the node joined the cluster. The election number is increased each time there is a change in the cluster membership, so a node joining the cluster before another node will have a lower election number than the latter.
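As a small illustration of this definition, the following sketch orders nodes by their init. election value to recover the order in which they joined. The node names and values are invented, and the sketch assumes that the values have already been read from the nhcmmstat output by some other means.

```python
from typing import Dict, List


def join_order(init_election: Dict[str, int]) -> List[str]:
    """Return node names from earliest to latest join.

    A lower init. election value means the node joined earlier. Nodes with
    equal values are ordered by name, which is an arbitrary tie-break.
    """
    return sorted(init_election, key=lambda node: (init_election[node], node))


# Hypothetical values for three nodes.
order = join_order({"node3": 7, "node1": 2, "node2": 4})
```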

Man Page Listed Erroneously

The Intro(1M) man page for Solaris erroneously lists a man page for an nhpmdadmwrapper(1M) command. This command is not available, and its man page is not included with this distribution.