Recognizing and Dealing with Critical System Conditions

Hardware Failures

The consequences of serious hardware failures, such as disk crashes, cannot be predicted, except that data loss is very likely. Safeguards against this, ranging from regular system and database backups to high availability through redundancy, are the responsibility of the user and are not covered here. See, however, the section Concepts for Cold Standby.

Virtualization


In virtualized environments, instability of the virtualization platform can be just as disruptive as a hardware failure. Again, this is the responsibility of the user and is not the subject of this manual.

System Resources


A program like Lobster_data needs various system resources while it runs: CPU time, free memory, free hard disk space, network connections, file handles, threads, and so on. Depending on the operating system, the hardware, and the system configuration, each of these resources is limited. All programs that run concurrently compete for them. Depending on the type of resource, temporary bottlenecks may occur.

When a resource is exhausted, a critical system state occurs in which the program can no longer operate as expected: it either blocks or crashes. This situation must be prevented at all costs, because once you can no longer work with Lobster_data, you can no longer deal with errors effectively either.

Particular attention should be paid to three specific resources.

  • The memory that has been allocated to the Java Virtual Machine (JVM).

  • The hard disk space, especially on the partition on which the Lobster Integration Server is installed and on the system partition.

  • The file handles, which are needed to read or write files and to establish network connections. A lack of file handles can lead to a deadlock.

If no free main memory or hard disk space is left, there is a risk of losing data that has already been received. In Lobster_data, a large number of individual measures ensure that such data loss is virtually ruled out as long as the program can execute as intended. Nevertheless, when an important system resource is exhausted, a small risk of data loss remains. Therefore, a full disk or an OutOfMemoryError must be avoided at all costs.

To monitor the disk space of selected partitions from a profile, you can use the function disk-free( [path a], [modus b,[unit c]]).
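For monitoring from outside Lobster_data, for example from a scheduled script, a comparable check can be sketched in Python. This is only an illustration and says nothing about how disk-free() itself works; the function names, the checked path, and the 10 % threshold are assumptions chosen for the example.

```python
import shutil


def free_space_percent(path: str) -> float:
    """Return the free space of the partition containing `path`, in percent."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    if usage.total == 0:
        # Some pseudo file systems report a size of zero; treat them as full.
        return 0.0
    return 100.0 * usage.free / usage.total


def is_disk_space_low(path: str, min_free_percent: float = 10.0) -> bool:
    """Return True if the partition has less free space than the threshold.

    The 10 % default is an arbitrary example value, not a recommendation.
    """
    return free_space_percent(path) < min_free_percent


if __name__ == "__main__":
    # "." stands in for the installation directory (an assumption);
    # add the system partition to the tuple as needed.
    for partition in (".",):
        state = "LOW" if is_disk_space_low(partition) else "ok"
        print(f"{partition}: {free_space_percent(partition):.1f}% free ({state})")
```

A script like this would typically be run periodically and connected to whatever alerting mechanism is already in place, so that a warning arrives well before the partition is full.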

Deadlocks


Deadlocks occur only rarely and only in heavily loaded systems. A hidden cause of an almost complete halt in processing can be a lack of free file handles. Each network connection requires a file handle, so file handles can become scarce under heavy network traffic. Handling a request from the network requires further file handles, for example to read a configuration file or to write temporary data to a temporary file. If none are free, these operations wait until file handles become available again. This can bring processing almost to a complete halt, which is resolved only when the timeouts are reached. The longer the timeouts are set, the longer this takes. If the network traffic does not decrease during this time, processing is immediately blocked again.

Unfortunately, we cannot provide a general method for automatically monitoring file handles across the large number of operating systems we support.
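On Linux specifically, one possible approach is to count the entries in /proc/<pid>/fd for the server process and compare the result with the soft limit listed in /proc/<pid>/limits. The following Python sketch assumes a Linux system with a mounted /proc file system and sufficient permissions to read the process's entries; it does not apply to other operating systems. The function names and the 80 % warning threshold are illustrative assumptions.

```python
import os
from typing import Optional


def count_open_file_handles(pid: int) -> Optional[int]:
    """Count the open file descriptors of process `pid` via /proc (Linux only).

    Returns None if /proc/<pid>/fd does not exist or cannot be read.
    """
    try:
        return len(os.listdir(f"/proc/{pid}/fd"))
    except OSError:
        return None


def soft_limit_open_files(pid: int) -> Optional[int]:
    """Read the soft "Max open files" limit of process `pid` from /proc.

    Returns None if the limit is unavailable or reported as "unlimited".
    """
    try:
        with open(f"/proc/{pid}/limits") as limits:
            for line in limits:
                if line.startswith("Max open files"):
                    # Columns: "Max open files <soft> <hard> files"
                    soft = line.split()[3]
                    return None if soft == "unlimited" else int(soft)
    except OSError:
        return None
    return None


def is_handle_usage_critical(pid: int, warn_ratio: float = 0.8) -> Optional[bool]:
    """Return True if the process uses at least `warn_ratio` of its soft limit.

    Returns None if the usage cannot be determined on this system.
    """
    used = count_open_file_handles(pid)
    limit = soft_limit_open_files(pid)
    if used is None or not limit:
        return None
    return used / limit >= warn_ratio


if __name__ == "__main__":
    # The script inspects itself here; in practice, pass the PID of the server's JVM.
    pid = os.getpid()
    print("open handles:", count_open_file_handles(pid))
    print("soft limit:  ", soft_limit_open_files(pid))
    print("critical:    ", is_handle_usage_critical(pid))
```

Because file handles also back network connections, a count that climbs steadily toward the limit under load is an early indication of the blocking behavior described above.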

The DMZ server is particularly vulnerable, because it may need to establish an internal connection for each external access. For a heavily loaded system, installing the DMZ server on a workstation operating system would therefore be a serious mistake.

Disruptions in the Network

Disruptions in the network limit availability. The risk of data loss is low here, but such disruptions unfortunately have a psychological effect: processing drops to almost zero, while at the same time the CPU load goes down and the memory usage shows no abnormalities. In this situation, most system administrators become restless, and the obvious idea is to restart the entire Lobster Integration Server. However, the server will not shut down, because processing that has already started has not yet terminated; it only ends when a timeout is reached. If the processing is terminated in this situation, i.e. the running program is forced to abort, data loss is possible. Recommendation: In this situation, use the procedure described in section Fast Shutdown by File force_stop. This reduces the risk of data loss.