Phase 1: Receiving data (performance)

You can do a lot to speed things up even at the data-receiving stage. Each individual improvement may be modest, but the cumulative effect adds up once you have dozens of active profiles.

Cron jobs: No more than necessary


Cron jobs are great. They regularly and independently check whether there is anything to do - not only on the local hard drive, but also on any server worldwide via FTP, HTTP, SCP, POP3, IMAP, etc.

Theoretically, you could contact a server in Australia via FTP every five seconds and search for data to process. Whether it would make sense to do so is another matter. Not only does such an FTP request claim the processor every few seconds; it also places a strain on the network. Quickly logging in, checking for data and logging out again is not much on its own. But it all adds up.

Certainly, there is data that must be processed as quickly as possible. But for exactly this purpose we offer event-driven Input Agents, for example for FTP or HTTP, that start a profile as soon as data is uploaded. Where possible, active data feeds of this kind should be favoured in such time-critical cases.

In many other situations, however, active data uploads are not possible. It may simply be that your partner refuses, and you cannot dictate to a global corporation. So the question is, how quickly does the data really need to be processed? Do you actually need to check every ten seconds, or could it be ten minutes? Or perhaps there are even fixed times when data is supplied? Let’s take the following example.


Your partner always uploads orders to a directory on their FTP server every evening between 5 pm and 6 pm. Sometimes it might take until 7 pm, but never later than that. Once the orders are there, however, they should be processed as soon as possible so that the lorries can set off early in the morning.

You could, of course, set up a cron job to check for data at one-minute intervals throughout the day. And it would serve no purpose whatsoever for around 23 hours. And there is never anything on the weekends anyway, so that is another two full days of pointless work for the cron job.

But you could also use a crontab rule to run checks every minute between 5 pm and 7 pm from Monday to Saturday, and every 15 minutes the rest of the time, to cover the unlikely event that something appears outside the usual pattern (or that the data is uploaded after 7 pm). We will leave out Sundays, since nothing ever happens then. Here's how it would look:
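
A sketch in classic five-field crontab notation (minute, hour, day of month, month, day of week; 1-6 = Monday to Saturday). The rule editor in your own scheduler may use a slightly different syntax:

    # Every minute between 5 pm and 7 pm, Monday to Saturday
    *    17-18      * * 1-6

    # Every 15 minutes during the rest of the day, Monday to Saturday
    */15 0-16,19-23 * * 1-6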

With these rules, the data will be retrieved from the server within one minute between 5 pm and 7 pm from Monday to Saturday, and within a maximum of 15 minutes during the rest of those days. That should be enough, right?

Checks are of course much less critical when they access the local hard drive, but bear in mind that mounted drives and directories also consume network resources! And for every cron job like this, a thread is running that constantly scans the directory for files.

But don't go overboard with efficiency either! Suppose data is supplied throughout the day and is anything but time-critical; surely checking once a day is enough then? Fine, but then we suddenly have to deal with a few hundred files in one go, tying up the system for an hour or more! It is better to clear out the accumulated files once an hour.
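
In the same crontab notation, such an hourly clean-out is a one-liner:

    # At minute 0 of every hour, every day
    0 * * * *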

Spreading cron jobs


If your system regularly suffers a massive slump in performance, or even a complete overload, at around the same time of day, check your cron jobs to see whether many of them start simultaneously. This can easily happen if, for example, you create a new profile as a copy of an old one, change a few settings, but leave the cron job timing untouched. Suddenly you have three dozen jobs starting at 8 am and fighting over resources, when most of them could have fetched their data at some point during the night and long since been finished.

And as we have just noted, running cron jobs too often uses up resources unnecessarily. But too few runs, which then have to process huge volumes of data in one go, are not right either. It is better to retrieve new data regularly, so that it can always be processed in manageable batches.

Of course, there is little you can do if your partners always upload their data at the same time. In that case, however, you should at least stagger the timing of those processes you do control. What is the point of two hours of full load per day while the system idles the rest of the time?
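
As an illustration, suppose three profiles were copied from one another and all inherited the same 8 am start. Written as crontab-style entries (the profile names are made up), staggering them might look like this:

    # Before: all three start at 8 am and fight over resources
    0 8 * * *   orders_partner_a
    0 8 * * *   orders_partner_b
    0 8 * * *   orders_partner_c

    # After: spread across the night, long finished by morning
    0 1 * * *   orders_partner_a
    0 3 * * *   orders_partner_b
    0 5 * * *   orders_partner_c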

Cron jobs for database queries


No doubt you read the chapter on memory problems very carefully. There we advised you to use the DefaultFileSQLCron class, because direct database queries and the DefaultSQLCron class both load the data directly into memory.

But what is good for memory is not quite so good for performance. So, as long as you know that your select statement will return relatively small amounts of data (a few hundred rows are absolutely fine), feel free to use a simple database Input Agent or the DefaultSQLCron, depending on your requirements. This makes the profile run faster, because the data does not have to be loaded back into memory from a file. The performance gain will not be huge, but it deserves a mention.
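
If you are not sure how large a result set might get, you can also cap it in the statement itself. A minimal sketch with a made-up orders table; the row-limiting clause differs between databases (FETCH FIRST, LIMIT, TOP):

    -- Fetch at most 500 unprocessed orders per run, oldest first.
    -- Standard SQL; MySQL/PostgreSQL use LIMIT 500 instead,
    -- SQL Server uses SELECT TOP 500.
    SELECT order_id, customer_no, order_date
    FROM   orders
    WHERE  processed = 0
    ORDER  BY order_date
    FETCH  FIRST 500 ROWS ONLY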