Phase 1: Loading Data (OutOfMemoryError)
Database Queries
Memory overloads when retrieving data often occur in profiles that read from your database. Let's start with the simplest case: you use the time-driven Input Agent DB with a simple select statement. Lobster_data then does the following:
It executes the select statement, loads all of the result data from the database, and saves it as a CSV file (this also produces the job_xyz file, which you can select for restarts). Afterwards, the data is loaded like any other CSV file. We therefore strongly recommend that you also read the tip on saving memory when loading CSV data later in this guide.
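Lobster_data's internal classes are not shown here, but the memory-heavy pattern described above can be sketched in generic terms. The following is an illustrative Python sketch (not Lobster_data's actual implementation); the `orders` table, its columns, and the file name are made-up examples, and Python's built-in `sqlite3` module stands in for your real database.

```python
import csv
import sqlite3

# Build a tiny demo database. Table and column names are hypothetical,
# chosen only to make the example self-contained and runnable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(i, f"customer-{i}") for i in range(1000)],
)

# Memory-heavy pattern: fetchall() materializes the ENTIRE result set
# as a list of rows in memory before a single byte reaches the disk.
cur = conn.execute("SELECT id, customer FROM orders")
all_rows = cur.fetchall()  # the whole result set now lives in RAM

# Only after everything is in memory is the CSV file written out.
with open("orders.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "customer"])
    writer.writerows(all_rows)
```

With 1,000 rows this is harmless; the problem is that the peak memory use of `all_rows` grows linearly with the result set, which is exactly the strain described next.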
What strains the memory when retrieving data from the database is that all of the selected data must be held completely in memory before it is written to the hard drive, and it is transferred to the profile for processing while still in memory. The process is similar if you have set the time-driven Input Agent Custom Class with the class DefaultSQLCron instead.
Again, the data is transferred to memory in its entirety. Although this is fast, it is also quite memory-intensive. A few hundred or even a few thousand datasets should be no problem for either of the approaches listed above. However, once the number of datasets approaches the tens of thousands, you need a more robust solution: the class DefaultFileSQLCron. It works in essentially the same way as DefaultSQLCron, but with two key differences.
It uses cursor selects.
Instead of being collected in the memory, the data is written directly to the CSV file.
So, if you expect large volumes of data from the database, it is better to use the class DefaultFileSQLCron. And don't forget to also read the tips on parsing large CSV files later in this guide.
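The cursor-select approach can be sketched in the same generic terms as before. Again, this is an illustrative Python sketch rather than Lobster_data's actual code: the `orders` table, the batch size, and the file name are assumptions, with `sqlite3` standing in for your database. The point is that rows are pulled from the cursor in small batches and written straight to the CSV file, so memory use stays flat regardless of the result set size.

```python
import csv
import sqlite3

# Demo database with a larger, hypothetical orders table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(i, f"customer-{i}") for i in range(100_000)],
)

# Cursor-select pattern: fetch small batches from the open cursor and
# write each batch directly to the file. Only one batch (here 500 rows)
# is ever held in memory, no matter how many rows the select returns.
BATCH_SIZE = 500
row_count = 0
cur = conn.execute("SELECT id, customer FROM orders")
with open("orders_streamed.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "customer"])
    while True:
        batch = cur.fetchmany(BATCH_SIZE)
        if not batch:
            break
        writer.writerows(batch)
        row_count += len(batch)
```

Trading a little speed (many small fetches instead of one big one) for a constant memory footprint is exactly the design choice behind DefaultFileSQLCron.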
Distribute the Load
Even if a single profile manages just fine with the memory on its own, problems may arise when multiple memory-intensive profiles run at once. The demand for memory can be greatly reduced in the subsequent phases, but even Lobster_data has its limits (which are imposed by the system).
You should therefore do your best to ensure that, at the very least, large volumes of data are not all processed at the same time. Because this also has a major impact on performance, load equalising is covered in more detail in the section Performance.