Failover concept
Functional principle
The failover concept exists to ensure the availability of the Node Controller.
There is always exactly one active Node Controller in a cluster. The number of Working Nodes is unlimited and can be changed at any time.
In principle, each Working Node (by license and configuration) can operate in working mode as well as in controller mode, and the mode can change during operation. Node Controllers (by license and configuration) can only operate in controller mode. A node's operating mode can change automatically as the result of a detected failover, or the change can be initiated explicitly by the user (in the Control Center or via HTTP).
The Node Controller that most recently went online is always the active Node Controller. A previous Node Controller shuts down, unless it was a Working Node by license and configuration before; in that case it changes back to a Working Node and does not shut down.
A valid operating state with a Node Controller and Working Node(s) must be reached at least once, because Working Nodes receive their setup from the Node Controller at startup.
If the connection of any Working Node to the Node Controller is interrupted, a failover is triggered. Note: See the heartbeat interval of parameter addNode and parameter setExternalUrl.
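As a rough, illustrative estimate only (the exact detection logic is an assumption here): with the example values used in ./etc/loadbalance.xml below — a heartbeat interval of 100 ms, a read timeout of 300 ms and 3 retries — a Node Controller that stops responding would be considered lost after roughly 3 × (100 ms + 300 ms) ≈ 1.2 seconds, at which point the failover is triggered.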
Load Balancing service setup
To use the failover functionality, you must enable the load balancing service. This requires adjustments to the following configuration files.
Adjusting configuration file ./etc/factory.xml on all nodes of the cluster
On all nodes (Node Controller and Working Nodes), the "LoadBalanceService" (see below) must be activated in the configuration file ./etc/factory.xml. Important note: The entry must be inserted before the "StartupService" (see below), otherwise the failover functionality cannot be used.
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE Configure PUBLIC
    "-//Lobster//DTD Configure 1.0//EN"
    "http://www.lobster.de/dtd/configure_1_1.dtd">

<Configure class="com.ebd.hub.services.ServiceFactory">

    <!-- set a unique id for this factory - needed for cluster and load balance -->
    <Set name="id">factoryID</Set>

    ...

    <!-- service for lb/ha cluster -->
    <Call name="addService">
        <Arg>com.ebd.hub.services.ha.LoadBalanceService</Arg>
        <Arg>etc/loadbalance.xml</Arg>
    </Call>

    ...

    <!-- service to start applications... -->
    <Call name="addService">
        <Arg>com.ebd.hub.services.startup.StartupService</Arg>
        <Arg>etc/startup.xml</Arg>
    </Call>
Parameter | Description
id | Required for the identification and loading of the configuration. Important note: Must be unique in a cluster.
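As a minimal sketch of how the two files relate (the node name node-a is only a placeholder): the id set in ./etc/factory.xml is exactly the value that must be used as the Nodename argument of the corresponding addNode call in ./etc/loadbalance.xml (see below).

    <!-- ./etc/factory.xml on this node -->
    <Set name="id">node-a</Set>

    <!-- ./etc/loadbalance.xml, addNode entry for the same node -->
    <Call name="addNode">
        <Arg>node-a</Arg>
        <!-- further arguments: hostname, retries, timeout, heartbeat interval, ports, role -->
    </Call>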
Adjusting configuration file ./etc/loadbalance.xml on all nodes of the cluster
On all nodes (Node Controller and Working Nodes) the file ./etc/loadbalance.xml (or the file specified in the configuration file ./etc/factory.xml for the LoadBalanceService) must be present and adjusted.
The same file is used on all nodes, which means you only need to create/adjust it once and then you can simply copy it to all existing nodes.
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE Configure PUBLIC "-//Lobster//DTD Configure 1.0//EN"
    "http://www.lobster.de/dtd/configure_1_1.dtd">

<Configure class="com.ebd.hub.services.ha.LoadBalanceService">

    <!-- If loadbalance should be enabled -->
    <Set name="enableLoadbalance">true</Set>

    <!-- If failover should be enabled -->
    <Set name="enableFailover">true</Set>

    <Set type="int" name="controllerRequestSize">10</Set>

    <Set type="long" name="controllerThreshold">5000</Set>

    <!-- Should nodes try to reconnect after lost connection, interval in seconds -->
    <Set name="reconnectInterval">1</Set>

    <!-- Should a failover be done by shutting down the node controller -->
    <Set name="failoverAtShutdown">false</Set>

    <!-- Configure when emails should be sent -->
    <Set name="emailByLogoff">false</Set>
    <Set name="emailByLogin">false</Set>
    <Set name="emailByLostConnection">false</Set>

    <!-- Add more recipients to inform via email -->
    <!-- <Call name="addEmail"><Arg></Arg></Call> -->

    <!-- Use external proxy service instead of externalUrl -->
    <!--
    <Call name="setProxySettings">
        <Arg>localhost</Arg>
        <Arg type="long">500</Arg>
        <Arg type="long">100</Arg>
        <Arg type="int">9006</Arg>
    </Call>
    -->

    <!-- Behaviour like old externalUrl -->
    <Call name="setExternalUrl">
        <Arg>https://www.google.de</Arg>
        <!-- Timeout for Http-Request -->
        <Arg type="int">1000</Arg>
    </Call>

    <!-- Own node information -->
    <Call name="addNode">
        <!-- Nodename (has to be unique in cluster) -->
        <Arg>FactoryID</Arg>
        <!-- Hostname (DNS, IPv4, IPv6) -->
        <Arg>ip-address</Arg>
        <!-- Retries before removing node from cluster -->
        <Arg type="int">3</Arg>
        <!-- Timeout for a read in ms -->
        <Arg type="long">300</Arg>
        <!-- Heartbeat interval to nodes in ms -->
        <Arg type="long">100</Arg>
        <!-- Port for failover handling -->
        <Arg type="int">2320</Arg>
        <!-- Preferred role in cluster (MASTER, SLAVE) -->
        <Arg>MASTER</Arg>
        <!-- Message port for load balancing -->
        <Arg type="int">8020</Arg>
    </Call>

    <!-- Second node information -->
    <Call name="addNode">
        <!-- Nodename (has to be unique in cluster) -->
        <Arg>FactoryID2</Arg>
        <!-- Hostname (DNS, IPv4, IPv6) -->
        <Arg>ip-address</Arg>
        <!-- Retries before removing node from cluster -->
        <Arg type="int">3</Arg>
        <!-- Timeout for a read in ms -->
        <Arg type="long">300</Arg>
        <!-- Heartbeat interval to nodes in ms -->
        <Arg type="long">100</Arg>
        <!-- Port for failover handling -->
        <Arg type="int">2320</Arg>
        <!-- Preferred role in cluster (MASTER, SLAVE) -->
        <Arg>SLAVE</Arg>
        <!-- Message port for load balancing -->
        <Arg type="int">8020</Arg>
    </Call>

</Configure>
Parameter | Description
enableLoadbalance | Activates load balancing (and thus the distribution of jobs).
enableFailover | Activates the failover (i.e. the takeover of the role of the Node Controller by a Working Node if the current Node Controller is lost).
controllerRequestSize | The number of jobs the Node Controller processes itself before it delegates jobs to Working Nodes. The value 0 delegates jobs to Working Nodes immediately, whereas the value 10 only starts delegating once an 11th job is running simultaneously. See also controllerThreshold.
controllerThreshold | The setting controllerRequestSize can be further restricted by considering the average runtime (statistic) of a job. The value 5000 only considers jobs with an average runtime of less than 5000 ms. Note: This setting can be modified at runtime in the Control Center.
reconnectInterval | If a connection (channel) between two nodes is lost, the service tries to re-establish it (self-healing) at the specified interval (in seconds). Note: All nodes (Node Controller and Working Nodes) are connected to all other nodes, and there are two channels between any two nodes. A single failed channel does not yet lead to a failover. Note: We recommend setting the reconnectInterval to the next higher second relative to the timeout.
failoverAtShutdown | Specifies whether a failover should be triggered when a Node Controller is shut down in a regular manner.
emailByLogoff | Determines whether an email is sent when a node logs off from another node. Attention: Results in increased email traffic. If a node logs off from an existing cluster of three nodes, three emails are sent.
emailByLogin | Determines whether an email is sent when a node logs into another node. Attention: Results in increased email traffic. If a node logs into an existing cluster of three nodes, three emails are sent.
emailByLostConnection | Determines whether an email is sent when the connection to a node is lost. Attention: Results in increased email traffic. If the connection to a node in an existing cluster of three nodes is lost, three emails are sent.
addEmail | Additional email addresses can be added, which are notified on emailByLogoff, emailByLogin and emailByLostConnection (see the example below this table). The default email address is taken from the configuration file ./etc/startup.xml.
setExternalUrl | If this parameter is set, the specified URL is checked when the Node Controller is lost. If it can be reached, the Working Node participates in determining the new Node Controller (Raft consensus algorithm). If not, the Working Node shuts down. It is also checked whether other Working Nodes can be reached. If they are reachable and can still ping the current Node Controller, the Working Node shuts down. Note: See also parameter setProxySettings.
setProxySettings | If an external proxy is configured, the external URL is not used. An external proxy can be used to prevent a "split brain", but it is also a "single point of failure". Please contact support@lobster.de for more information or setup. Note: If neither setExternalUrl nor setProxySettings is used, only a reachability check between the nodes is performed, which results in either a failover or a shutdown.
addNode | An entry must be created for each node contained in the cluster (Node Controller and Working Nodes). Arguments: Nodename: The value of the id parameter in the node's configuration file ./etc/factory.xml must be used here. Hostname: URL or IP address of the node. Retries: Number of heartbeat ping attempts before a failover is triggered. Timeout: The waiting time after a heartbeat ping, in milliseconds. Heartbeat interval: The heartbeat ping interval in milliseconds. See also Timeout and Retries. Failover port: The port for the failover mechanism. Default: 2320. Preferred role: Use the value MASTER for a Node Controller and SLAVE for a Working Node. In a cluster there may only be one master; the number of slaves is unlimited. Message port for load balancing: The port for the load balancing mechanism. Default: 8020.
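As an example for addEmail (the address admin@example.com is only a placeholder), an additional recipient can be added by activating the commented-out call in ./etc/loadbalance.xml:

    <!-- Add more recipients to inform via email -->
    <Call name="addEmail"><Arg>admin@example.com</Arg></Call>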