How enhanced availability is achieved

This topic discusses how the Endeca Server cluster ensures enhanced availability of query processing for the data domain clusters it hosts.

Important: The Endeca Server cluster provides enhanced availability, but does not provide high availability. This topic discusses the cluster behavior that enables enhanced availability, and notes instances where system administrators need to take action to restore services.
The following sections discuss the Endeca Server cluster behavior for providing enhanced availability:

This topic discusses Endeca Server clusters that contain more than one running instance of the Endeca Server. Although you can configure a single-node data domain hosted by a single Endeca Server instance, such configurations are suitable only for development environments, because they do not guarantee the availability of query processing for the data domain: if the single Endeca Server node fails, its Dgraph process shuts down and the data domain becomes unavailable.

Availability of Endeca Server nodes

In an Endeca Server cluster with more than one Endeca Server instance, an ensemble of Cluster Coordinator services running on a subset of the nodes ensures enhanced availability of the Endeca Server nodes in the cluster.

When an Endeca Server node in an Endeca Server cluster goes down, all Dgraph nodes hosted on it go down as well, along with the Cluster Coordinator service, if one is running on that node. As long as the Endeca Server cluster consists of more than one node, this does not disrupt the processing of non-updating user requests for the data domains. (It may, however, affect the Cluster Coordinator ensemble; see Availability of Cluster Coordinator services.)

If an Endeca Server node fails, the Endeca Server cluster is notified and stops routing all requests to the data domain nodes hosted on that Endeca Server node, until you restart the Endeca Server node.

Consider an example: a three-node data domain cluster hosted on a three-node Endeca Server cluster, where each Endeca Server node hosts one Dgraph node for the data domain. In this case:
  • If one Endeca Server node fails, incoming requests will be routed to the remaining nodes.
  • If the Endeca Server node that fails happens to be the node that hosts the leader node for the data domain cluster, the Endeca Server cluster selects a new leader node for the data domain from the remaining Endeca Server nodes and routes subsequent requests accordingly. This ensures availability of the leader node for a data domain.
  • If the Endeca Server node goes down, the data domain nodes (Dgraphs) it is hosting are not moved to another Endeca Server node. If your data domain has more than two nodes dedicated to processing queries, the data domain continues to function. Otherwise, query processing for this data domain may stop until you restart the Endeca Server node.
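The routing behavior described in these bullets can be sketched as a simple round-robin router that skips failed nodes. This is an illustrative toy model only; the class and method names are hypothetical and do not reflect the actual Endeca Server implementation.

```python
# Hypothetical sketch of routing non-updating requests away from a failed
# Endeca Server node. Names are illustrative, not the actual Endeca API.

class DataDomainRouter:
    def __init__(self, nodes):
        self.nodes = list(nodes)      # Endeca Server nodes hosting Dgraph nodes
        self.healthy = set(nodes)     # nodes currently eligible for requests
        self._next = 0

    def mark_failed(self, node):
        # The cluster is notified of the failure and stops routing to the node.
        self.healthy.discard(node)

    def mark_restarted(self, node):
        # The node rejoins the cluster after an administrator restarts it.
        self.healthy.add(node)

    def route(self):
        # Round-robin over healthy nodes only.
        if not self.healthy:
            raise RuntimeError("no healthy nodes: query processing stops")
        for _ in range(len(self.nodes)):
            node = self.nodes[self._next % len(self.nodes)]
            self._next += 1
            if node in self.healthy:
                return node

router = DataDomainRouter(["node1", "node2", "node3"])
router.mark_failed("node2")
# Requests now go only to the two remaining nodes.
assert all(router.route() in {"node1", "node3"} for _ in range(10))
```

Note that, as the surrounding text states, a failed node's Dgraphs are not moved elsewhere; the router simply excludes the node until it is restarted and rejoins.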

When you restart the failed Endeca Server node, its processes are restarted by the Endeca Server cluster. Once the node rejoins the cluster, it will rejoin any data domain clusters for the data domains it hosts. Additionally, if the node hosts a Cluster Coordinator, it will also rejoin the ensemble of Cluster Coordinators.

Availability of data domain nodes

The ensemble of Cluster Coordinator services running on a subset of Endeca Server nodes in the cluster ensures the enhanced availability of the data domain cluster nodes and services:
  • Failure of the leader node. When the leader node goes offline, the Endeca Server cluster elects a new leader node and starts sending updates to it. During this stage, follower nodes continue maintaining a consistent view of the data and answering queries. When the former leader node is restarted and rejoins the cluster, it becomes one of the follower nodes. Note that it is also possible for the leader node to be restarted and rejoin the cluster before the Endeca Server cluster needs to appoint a new leader; in that case, the node continues to serve as the leader node.

    If the leader node in the data domain changes, the Endeca Server continues routing those requests that require the leader node to the Endeca Server cluster node hosting the newly appointed leader node.

    Note: If the leader node in the data domain cluster fails while an outer transaction is in progress, the outer transaction is not applied and is automatically rolled back. In this case, a new outer transaction must be started. For information on outer transactions, see the section about the Transaction Web Service in the Oracle Endeca Server Developer's Guide.
  • Failure of a follower node. When one of the follower nodes goes offline, the Endeca Server cluster starts routing requests to other available nodes, and attempts to restart the Dgraph process for this follower node. Once the follower node rejoins the cluster, the Endeca Server adjusts its routing information accordingly.
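The leader and follower failover behavior above can be sketched as a toy state machine. The names are hypothetical, and the single-line "election" stands in for the actual election performed by the Cluster Coordinator ensemble.

```python
# Toy sketch of leader failover in a data domain cluster; illustrative only,
# not the Cluster Coordinator's actual election protocol.

class DataDomainCluster:
    def __init__(self, nodes):
        self.online = list(nodes)
        self.leader = nodes[0]        # one leader node handles updating requests

    def node_failed(self, node):
        self.online.remove(node)
        if node == self.leader and self.online:
            # Elect a new leader; followers keep answering queries meanwhile.
            self.leader = self.online[0]

    def node_restarted(self, node):
        # A former leader rejoins as a follower if a new leader was elected.
        self.online.append(node)

cluster = DataDomainCluster(["n1", "n2", "n3"])
cluster.node_failed("n1")          # the leader goes offline
assert cluster.leader == "n2"      # a new leader has been elected
cluster.node_restarted("n1")       # the old leader rejoins as a follower
assert cluster.leader == "n2"
```

The key property the sketch illustrates is that a leader always exists while any node remains online, so requests that require the leader can always be routed somewhere.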

Availability of Cluster Coordinator services

The Cluster Coordinator services themselves must be highly available. The following statements describe the requirements in detail:
  • Each Endeca Server node in the Endeca Server cluster can optionally be configured at deployment time to host a Cluster Coordinator instance. To ensure availability of the Cluster Coordinator service, it is recommended that you deploy the Cluster Coordinator instances in a cluster of their own, known as an ensemble, by configuring a subset of the Endeca Server nodes to host Cluster Coordinator services at deployment time. As long as a majority of the ensemble is running, the Cluster Coordinator service is highly available and its services are used by the Endeca Server cluster and the data domain clusters hosted in it. Because the Cluster Coordinator requires a majority, it is best to start an odd number of instances; in practice, this means starting the Cluster Coordinator service on at least three Endeca Server nodes in the Endeca Server cluster. An Endeca Server node that is configured to host a Cluster Coordinator is responsible for keeping its Cluster Coordinator process up: it starts the service when the Endeca Server starts, and restarts it should it stop running.

    To summarize, although the Cluster Coordinator can run on a single node, ensuring high availability requires running the Cluster Coordinator service on at least three nodes (or a larger odd number of nodes) in any Endeca Server cluster. This prevents the Cluster Coordinator service itself from becoming a single point of failure. For information on deploying the Cluster Coordinator in the cluster, see the Oracle Endeca Server Installation Guide.
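The majority requirement also explains why an odd ensemble size is recommended: adding one node to an odd-sized ensemble raises the quorum without raising the number of tolerable failures. The following is generic quorum arithmetic, not anything Endeca-specific:

```python
# Quorum arithmetic for a majority-based ensemble (generic, not Endeca-specific).

def quorum(n):
    # Smallest majority of an n-member ensemble.
    return n // 2 + 1

def tolerable_failures(n):
    # Failures the ensemble survives while still holding a majority.
    return n - quorum(n)

for n in (1, 2, 3, 4, 5):
    print(f"ensemble={n}  quorum={quorum(n)}  tolerates={tolerable_failures(n)}")

# A 3-node ensemble tolerates 1 failure; a 4-node ensemble still tolerates
# only 1, which is why odd ensemble sizes are recommended.
assert tolerable_failures(3) == tolerable_failures(4) == 1
assert tolerable_failures(5) == 2
```

A single-instance ensemble tolerates zero failures, which is exactly the single-point-of-failure situation the next bullet describes.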

  • If you do not configure at least three Endeca Server nodes to run the Cluster Coordinator service, the Cluster Coordinator service becomes a single point of failure. Should the Cluster Coordinator service fail, access to the data domain clusters hosted in the Endeca Server cluster becomes read-only: you cannot create, resize, start, stop, or otherwise change data domains, and you cannot define data domain profiles. You can still send read queries to the data domains and perform read operations with the Cluster and Manage Web Services, such as listing data domains or listing nodes. No updates, writes, or changes of any kind are possible while the Cluster Coordinator service in the Endeca Server cluster is down; this applies to both the Endeca Server cluster and the data domain clusters hosted in it. To recover from this situation, restart or replace the Endeca Server instance that was running the failed Cluster Coordinator (the required action depends on the nature of the failure).
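The read-only fallback described above can be sketched as a facade that refuses changes while the coordinator ensemble lacks a majority but keeps serving reads. The class and method names are hypothetical and do not correspond to the Endeca Server's actual web service operations.

```python
# Hypothetical sketch of the read-only fallback: changes are refused while the
# Cluster Coordinator ensemble lacks a majority; reads continue. Illustrative
# only; not the actual Cluster or Manage Web Service interfaces.

class EndecaServerFacade:
    def __init__(self, ensemble_size, ensemble_up):
        self.ensemble_size = ensemble_size   # configured coordinator instances
        self.ensemble_up = ensemble_up       # coordinator instances running

    def has_quorum(self):
        return self.ensemble_up > self.ensemble_size // 2

    def read(self, request):
        # Read queries and read-only operations (e.g. listing data domains)
        # still work without the coordinator.
        return f"ok: {request}"

    def write(self, request):
        # Creating, resizing, starting, stopping, or changing data domains
        # requires a working coordinator ensemble.
        if not self.has_quorum():
            raise RuntimeError("coordinator unavailable: cluster is read-only")
        return f"applied: {request}"

# A single coordinator instance that has failed: reads succeed, writes do not.
server = EndecaServerFacade(ensemble_size=1, ensemble_up=0)
assert server.read("list data domains").startswith("ok")
try:
    server.write("create data domain")
except RuntimeError as e:
    print(e)
```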