This topic discusses how the Endeca Server cluster ensures
enhanced availability of query processing for the data domain clusters.
Important: The Endeca Server cluster provides enhanced
availability, but does not provide high availability. This topic discusses the
cluster behavior that enables enhanced availability, and notes instances where
system administrators need to take action to restore services.
The sections below discuss how the Endeca Server cluster behaves to
provide enhanced availability.
This topic discusses clusters of Endeca Server nodes that contain more
than one running instance of the Endeca Server. Although you can configure
single-node data domains hosted by single Endeca Server instances, use single
Endeca Server instances only in development environments, because they
do not guarantee the availability of query processing for the data domain.
Specifically, in a single-node data domain hosted by a single Endeca Server
instance, a failure of the Endeca Server node shuts down the Dgraph process.
Availability of Endeca Server nodes
In an Endeca Server cluster with more than one
Endeca Server instance, an ensemble of Cluster Coordinator services running
on a subset of the cluster's nodes ensures enhanced availability of the
Endeca Server nodes.
When an Endeca Server node in an Endeca Server cluster goes down, all
Dgraph nodes hosted on it go down, as does the Cluster Coordinator service if
one is running on that node. As long as the Endeca Server cluster
consists of more than one node, this does not disrupt the processing of
non-updating user requests for the data domains. (It may negatively affect the
Cluster Coordinator services. For more information, see
Availability of Cluster Coordinator services.)
If an Endeca Server node fails, the Endeca Server cluster is notified
and stops routing all requests to the data domain nodes hosted on that Endeca
Server node, until you restart the Endeca Server node.
The following example illustrates this case. Consider a single three-node
data domain cluster hosted on an Endeca Server cluster of three nodes,
where each Endeca Server node hosts one Dgraph node
for the data domain. In this case:
- If one Endeca Server node
fails, incoming requests will be routed to the remaining nodes.
- If the Endeca Server node
that fails happens to be the node that hosts the leader node for the data
domain cluster, the Endeca Server cluster selects a new leader node for the
data domain from the remaining Endeca Server nodes and routes subsequent
requests accordingly. This ensures availability of the leader node for a data
domain.
- If the Endeca Server node
goes down, the data domain nodes (Dgraphs) it is hosting are not moved to
another Endeca Server node. If your data domain has more than two nodes
dedicated to processing queries, the data domain continues to function.
Otherwise, query processing for this data domain may stop until you restart the
Endeca Server node.
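The routing behavior in this example can be sketched as a toy model. This is a minimal illustration, not the Endeca Server implementation: the ToyCluster class, the node names es1 to es3, the round-robin routing, and the rule that the first surviving node becomes the leader are all assumptions made for the sketch. The source does not specify how requests are balanced or how a new leader is chosen.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class ToyCluster:
    """Hypothetical model: each Endeca Server node hosts one Dgraph node
    of a single data domain."""
    nodes: List[str]
    leader: Optional[str]
    down: Set[str] = field(default_factory=set)

    def available(self) -> List[str]:
        """Nodes that are currently running, in their configured order."""
        return [n for n in self.nodes if n not in self.down]

    def fail(self, node: str) -> None:
        """Mark a node as failed; its Dgraph node is NOT moved elsewhere."""
        self.down.add(node)
        if node == self.leader:
            # Assumed policy: promote the first surviving node to leader.
            survivors = self.available()
            self.leader = survivors[0] if survivors else None

    def restart(self, node: str) -> None:
        """A restarted node rejoins the cluster and is routable again."""
        self.down.discard(node)

    def route(self, request_id: int) -> str:
        """Route a non-updating request to a running node (round-robin)."""
        targets = self.available()
        if not targets:
            raise RuntimeError("no data domain nodes available")
        return targets[request_id % len(targets)]


cluster = ToyCluster(nodes=["es1", "es2", "es3"], leader="es1")
cluster.fail("es1")                      # the node hosting the leader fails
print(cluster.leader)                    # a surviving node is now the leader
print({cluster.route(i) for i in range(6)})  # only running nodes get requests
cluster.restart("es1")                   # the node rejoins after a restart
```

In this model a failed node simply drops out of the routing targets until it is restarted, which mirrors the statement that Dgraph nodes are not relocated to other Endeca Server nodes.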
When you restart the failed Endeca Server node, its processes are
restarted by the Endeca Server cluster. Once the node rejoins the cluster, it
will rejoin any data domain clusters for the data domains it hosts.
Additionally, if the node hosts a Cluster Coordinator, it will also rejoin the
ensemble of Cluster Coordinators.
Availability of data domain nodes
The ensemble of Cluster Coordinator
services running on a subset of Endeca Server nodes in the cluster ensures the
enhanced availability of the data domain cluster nodes and services.
Availability of Cluster Coordinator services
The Cluster Coordinator services themselves
must be highly available. The following statements describe the requirements in
detail:
- Each Endeca Server node
in the Endeca Server cluster can optionally be configured at deployment time to
host a Cluster Coordinator instance. To ensure availability of the Cluster
Coordinator service, it is recommended that you configure a subset of the
Endeca Server nodes to host Cluster Coordinator instances, so that these
instances form a cluster of their own, known as an ensemble. As long as a
majority of the ensemble is running, the Cluster Coordinator service is highly
available and its services are used by the Endeca Server cluster and the data
domain clusters hosted in it. Because the Cluster Coordinator requires a
majority, it is best to start an odd number of its instances. This means that
the Cluster Coordinator service must be started on at least three Endeca
Server nodes in the Endeca Server cluster. An Endeca Server node that is
configured to host a Cluster Coordinator is responsible for keeping the
Cluster Coordinator process it hosts running: it starts the Cluster
Coordinator service when the Endeca Server starts, and restarts it should it
stop running.
To summarize, although the Cluster Coordinator can run on a single
node, to ensure high availability of the Cluster Coordinator services, the
Cluster Coordinator service must run on at least three nodes (or on a larger
odd number of nodes) in any Endeca Server cluster. This
prevents the Cluster Coordinator service itself from being a single point of
failure. For information on deploying the Cluster Coordinator in the cluster,
see the
Oracle Endeca Server Installation Guide.
- If you do not configure at
least three Endeca Server nodes to run the Cluster Coordinator service, the
Cluster Coordinator service will be a single point of failure. Should the
Cluster Coordinator service fail, access to the data domain clusters hosted in
the Endeca Server cluster becomes read-only. This means that it is not possible
to change the data domains in any way. You cannot create, resize, start, stop,
or change data domains; you also cannot define data domain profiles. You can
send read queries to the data domains and perform read operations with the
Cluster and Manage Web Services, such as listing data domains or listing nodes.
No updates, writes, or changes of any kind are possible while the Cluster
Coordinator service in the Endeca Server cluster is down — this applies to both
the Endeca Server cluster and data domain clusters. To recover from this
situation, the Endeca Server instance that was running a failed Cluster
Coordinator must be restarted or replaced (the action required depends on the
nature of the failure).
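The majority requirement described above can be made concrete with a little arithmetic. The following is a minimal sketch, assuming that "a majority" means strictly more than half of the configured ensemble and that losing the majority is what causes the cluster to become read-only. The function names are hypothetical and are not part of any Endeca Server API.

```python
def majority(ensemble_size: int) -> int:
    """Smallest number of running instances that is more than half."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many instances can fail while a majority is still running."""
    return ensemble_size - majority(ensemble_size)


def cluster_mode(ensemble_size: int, running: int) -> str:
    """Assumed behavior: without a majority, only read operations work."""
    return "read-write" if running >= majority(ensemble_size) else "read-only"


for size in (1, 3, 4, 5):
    print(size, majority(size), tolerated_failures(size))
# 1 1 0
# 3 2 1
# 4 3 1
# 5 3 2
```

Under these assumptions, a single Cluster Coordinator tolerates no failures, three instances tolerate one, and a fourth instance adds no extra tolerance over three. This is why an odd number of at least three instances is the recommended configuration.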