Failover concept

Functional principle


The failover concept exists to ensure the availability of the Node Controller.

There is always exactly one active Node Controller in a cluster. The number of Working Nodes is unlimited and can be changed at any time.

In principle, each Working Node (by license and configuration) can be in working mode as well as in controller mode. The modes can change during operation. Node Controllers (by license and configuration) can only be in controller mode. Changing the operation mode of a node can occur automatically due to a detected failover, or it can be explicitly initiated by the user (in the Control Center or via HTTP).

The Node Controller that most recently went online is always the active Node Controller. A previously active Node Controller shuts down, unless it was a Working Node by license and configuration before; in that case, it changes back to a Working Node and does not shut down.

A valid operating state with a Node Controller and Working Node(s) must be reached at least once, because Working Nodes receive their setup from the Node Controller at startup.

If the connection of any Working Node to the Node Controller is interrupted, a failover is triggered. Note: See the Heartbeat interval argument of parameter addNode and parameter setExternalUrl below.

Load Balancing service setup


To use the failover functionality, you must enable the load balancing service. This requires adjustments to the following configuration files.

Adjusting configuration file ./etc/factory.xml on all nodes of the cluster


On all nodes (Node Controller and Working Nodes), the "LoadBalanceService" (see below) must be activated in the configuration file ./etc/factory.xml. Important note: The entry must be inserted before the "StartupService" (see below), otherwise the failover functionality is not available.


./etc/factory.xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE Configure PUBLIC
  "-//Lobster//DTD Configure 1.0//EN"
  "http://www.lobster.de/dtd/configure_1_1.dtd">

<Configure class="com.ebd.hub.services.ServiceFactory">
  <!-- set a unique id for this factory - needed for cluster and load balance -->
  <Set name="id">factoryID</Set>

  .
  .
  .

  <!-- service for lb/ha cluster -->
  <Call name="addService">
    <Arg>com.ebd.hub.services.ha.LoadBalanceService</Arg>
    <Arg>etc/loadbalance.xml</Arg>
  </Call>

  .
  .
  .

  <!-- service to start applications... -->
  <Call name="addService">
    <Arg>com.ebd.hub.services.startup.StartupService</Arg>
    <Arg>etc/startup.xml</Arg>
  </Call>


Parameter

Description

id

Required for the identification and loading of the configuration. Important note: Must be unique in a cluster.
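
For illustration, the following fragments sketch how this id relates to the Nodename argument of addNode in ./etc/loadbalance.xml (see below). The value node-a is only a placeholder, not a value from the product:

<!-- ./etc/factory.xml of this node -->
<Set name="id">node-a</Set>

<!-- ./etc/loadbalance.xml (identical on all nodes): the first argument of this
     node's addNode entry must carry the same value -->
<Call name="addNode">
  <Arg>node-a</Arg>
  ...
</Call>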

Adjusting configuration file ./etc/loadbalance.xml on all nodes of the cluster


On all nodes (Node Controller and Working Nodes) the file ./etc/loadbalance.xml (or the file specified in the configuration file ./etc/factory.xml for the LoadBalanceService) must be present and adjusted.

The same file is used on all nodes, which means you only need to create/adjust it once and then you can simply copy it to all existing nodes.


./etc/loadbalance.xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE Configure PUBLIC "-//Lobster//DTD Configure 1.0//EN"
  "http://www.lobster.de/dtd/configure_1_1.dtd">
<Configure class="com.ebd.hub.services.ha.LoadBalanceService">
  <!-- If load balancing should be enabled -->
  <Set name="enableLoadbalance">true</Set>
  <!-- If failover should be enabled -->
  <Set name="enableFailover">true</Set>

  <!-- Number of jobs the Node Controller handles itself before delegating -->
  <Set type="int" name="controllerRequestSize">10</Set>
  <!-- Only jobs with an average runtime below this value (in ms) are considered -->
  <Set type="long" name="controllerThreshold">5000</Set>

  <!-- Should nodes try to reconnect after a lost connection, interval in seconds -->
  <Set name="reconnectInterval">1</Set>
  <!-- Should a failover be done when shutting down the node controller -->
  <Set name="failoverAtShutdown">false</Set>
  <!-- Configure when emails should be sent -->
  <Set name="emailByLogoff">false</Set>
  <Set name="emailByLogin">false</Set>
  <Set name="emailByLostConnection">false</Set>
  <!-- Add more recipients to inform via email -->
  <!-- <Call name="addEmail"><Arg></Arg></Call> -->
  <!-- Use external proxy service instead of externalUrl -->
  <!-- <Call name="setProxySettings">
    <Arg>localhost</Arg>
    <Arg type="long">500</Arg>
    <Arg type="long">100</Arg>
    <Arg type="int">9006</Arg>
  </Call> -->
  <!-- Behaviour like old externalUrl -->
  <Call name="setExternalUrl">
    <Arg>https://www.google.de</Arg>
    <!-- Timeout for HTTP request -->
    <Arg type="int">1000</Arg>
  </Call>
  <!-- Node information (one addNode entry per node of the cluster) -->
  <Call name="addNode">
    <!-- Nodename (has to be unique in cluster, use the id of that node's factory.xml) -->
    <Arg>FactoryID</Arg>
    <!-- Hostname (DNS, IPv4, IPv6) -->
    <Arg>ip-address</Arg>
    <!-- Retries before removing node from cluster -->
    <Arg type="int">3</Arg>
    <!-- Timeout for a read in ms -->
    <Arg type="long">300</Arg>
    <!-- Heartbeat interval to nodes in ms -->
    <Arg type="long">100</Arg>
    <!-- Port for failover handling -->
    <Arg type="int">2320</Arg>
    <!-- Preferred role in cluster (MASTER, SLAVE) -->
    <Arg>MASTER</Arg>
    <!-- Message port for load balancing -->
    <Arg type="int">8020</Arg>
  </Call>
  <!-- Node information of the second node -->
  <Call name="addNode">
    <!-- Nodename (has to be unique in cluster, use the id of that node's factory.xml) -->
    <Arg>FactoryID2</Arg>
    <!-- Hostname (DNS, IPv4, IPv6) -->
    <Arg>ip-address</Arg>
    <!-- Retries before removing node from cluster -->
    <Arg type="int">3</Arg>
    <!-- Timeout for a read in ms -->
    <Arg type="long">300</Arg>
    <!-- Heartbeat interval to nodes in ms -->
    <Arg type="long">100</Arg>
    <!-- Port for failover handling -->
    <Arg type="int">2320</Arg>
    <!-- Preferred role in cluster (MASTER, SLAVE) -->
    <Arg>SLAVE</Arg>
    <!-- Message port for load balancing -->
    <Arg type="int">8020</Arg>
  </Call>
</Configure>
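
To notify additional recipients, the addEmail call shown commented out above can be activated, for example as follows (the address is only a placeholder):

<!-- additional recipient for emailByLogoff / emailByLogin / emailByLostConnection -->
<Call name="addEmail"><Arg>cluster-admin@example.com</Arg></Call>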


Parameter

Description

enableLoadbalance

Activates the load balancing (and thus the distribution of jobs).

enableFailover

Activates the failover (i.e. the takeover of the role of the Node Controller by a Working Node in case of loss of the current Node Controller).

controllerRequestSize

Sets the number of jobs the Node Controller processes itself before it delegates jobs to Working Nodes. With the value 0, jobs are delegated to Working Nodes immediately, whereas with the value 10, delegation only starts once an 11th job would run simultaneously. See also controllerThreshold.

controllerThreshold

The setting controllerRequestSize can be further restricted by considering the average runtime (statistic) of a job. With the value 5000, only jobs with an average runtime of less than 5000 ms are considered. Note: This setting can be modified during runtime in the Control Center!

reconnectInterval

If a connection (channel) between two nodes is lost, the service tries to re-establish it (self-healing) at the specified interval (in seconds). Note: All nodes (Node Controller and Working Nodes) are connected to all other nodes, with two channels between any two nodes. A single failed channel does not yet lead to a failover. Note: We recommend setting the reconnectInterval to the next full second above the timeout.

failoverAtShutdown

Specifies whether a failover should be triggered when a Node Controller is shut down in a regular manner.

emailByLogoff

Determines whether an email is sent when a node logs out of another node. Attention: Results in increased email traffic. If a node logs out of an existing cluster of three nodes, three emails are sent.

emailByLogin

Determines whether an email is sent when a node logs into another node. Attention: Results in increased email traffic. If a node logs into an existing cluster of three nodes, three emails are sent.

emailByLostConnection

Determines whether an email is sent when the connection to a node is lost. Attention: Results in increased email traffic. If the connection of a node to an existing cluster of three nodes is lost, three emails are sent.

addEmail

Additional email addresses can be added, which will then be notified for the events emailByLogoff, emailByLogin and emailByLostConnection. The default email address is taken from the configuration file ./etc/startup.xml.

setExternalUrl

If this parameter is set, the specified URL is checked when the Node Controller is lost. If it can be reached, the Working Node will participate in determining the new Node Controller (Raft consensus algorithm). If not, the Working Node shuts down.

It is also checked whether other Working Nodes can be reached. If they are reachable and can still ping the current Node Controller, the Working Node shuts down.

Note: See also parameter setProxySettings.

setProxySettings

If an external proxy is configured, the external URL is not used. An external proxy can be used to prevent a "split brain", but is also a "single point of failure". Please contact support@lobster.de for more information or setup.

Note: If neither setExternalUrl nor setProxySettings is used, only a reachability check between the nodes is performed, which results in either a failover or a shutdown.

addNode

An entry must be created for each node contained in the cluster (Node Controller and Working Nodes). Arguments:


Nodename:

Here, the value of the id parameter in the node's configuration file ./etc/factory.xml must be used.

Hostname:

Host name (DNS) or IP address (IPv4 or IPv6) of the node.

Retries:

Number of failed heartbeat ping attempts before the node is removed from the cluster (and a failover is triggered).

Timeout:

The time to wait for the answer to a heartbeat ping (read timeout), in milliseconds.

Heartbeat interval:

The heartbeat ping interval in milliseconds. See also Timeout and Retries, and the rough timing example after this list.

Failover port:

The port for the failover mechanism. Default: 2320

Preferred role:

Use the value MASTER for a Node Controller and SLAVE for a Working Node. In a cluster, there may be only one master; the number of slaves is unlimited.

Message port for loadbalancing:

The port for the load balancing mechanism. Default: 8020
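
As a rough worked example (an assumption about how these values interact, not an exact specification): with the sample values above, a heartbeat ping is sent every 100 ms and up to 300 ms is waited for the answer; only after 3 consecutive failed attempts is the node considered lost. A failed Node Controller would therefore be detected after roughly 3 x (100 ms + 300 ms) = 1.2 seconds, after which the failover (or, depending on setExternalUrl / setProxySettings, a shutdown) is initiated. The sample reconnectInterval of 1 second also matches the recommendation above of using the next full second above the timeout of 300 ms.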

Deactivating SAP RequestListener


See section SAP RequestListener in Load Balance Failover.

Failure of the primary DMZ server


See section DMZ cluster.