Memory versus performance
We now know that each individual dataset also generates a record of its own, which can easily be swapped out to disk. In extreme cases, this lets you process millions of datasets without running out of memory.
The only drawback is that things will eventually become extremely sluggish. Individually swapping out hundreds of thousands or millions of records after parsing, swapping them back in for mapping, swapping out the result of the mapping for each record, and finally swapping those results back in again for output adds up to a great deal of hard drive access, and at some point everything slows down tremendously. Because of this, it is sometimes useful to divide the data into blocks. Let's take the flat CSV structure as an example again, but with a few more datasets.
4711,Maier,Harald,Hamburg
4712,Müller,Hugo,Frankfurt
4713,Huber,Toni,Munich
4714,Schulz,Hans,Stuttgart
4715,Schmitz,Erwin,Hanover
4716,Peters,Hanne,Dortmund
What we want now is for each record to contain three datasets (in reality, you are more likely to want a hundred or even a thousand datasets per block). It would also be nice if the data looked like the following file.
Separator
4711,Maier,Harald,Hamburg
4712,Müller,Hugo,Frankfurt
4713,Huber,Toni,Munich
Separator
4714,Schulz,Hans,Stuttgart
4715,Schmitz,Erwin,Hanover
4716,Peters,Hanne,Dortmund
Of course, this is not how the data will arrive. Neither your partners nor your database will oblige by adding the separator line. But we offer the convenient preparser TokenStreamSplitter (there is an older variant, the TokenFileSplitter, but the newer preparser performs better).
For this example, we will add only two values to the configuration.
rows=3
header=Separator
This means the following: insert a line with the text Separator at the start of the data and then after every three rows. Next, we will insert a node at the top of the source structure that reacts to the Separator line and saves its text in a field. The Separator line is simply ignored during output. Once again, the path defined for the output node Node leads to SQL_node.
And now we have what we want. Every three datasets end up in one record.
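To make the effect of these two settings concrete, here is a small sketch in Python of what the splitting conceptually does to the data stream. It is not the TokenStreamSplitter itself, just an illustration; the function name insert_separators and the hard-coded sample lines are purely for demonstration.

# Sketch only: insert a separator line at the start of the data
# and then after every `rows` lines, mimicking rows=3, header=Separator.
def insert_separators(lines, rows=3, header="Separator"):
    out = []
    for i, line in enumerate(lines):
        if i % rows == 0:
            out.append(header)
        out.append(line)
    return out

data = [
    "4711,Maier,Harald,Hamburg",
    "4712,Müller,Hugo,Frankfurt",
    "4713,Huber,Toni,Munich",
    "4714,Schulz,Hans,Stuttgart",
    "4715,Schmitz,Erwin,Hanover",
    "4716,Peters,Hanne,Dortmund",
]
print("\n".join(insert_separators(data)))

Printed, the result is exactly the separated listing shown above, which the source structure can then read in blocks of three datasets.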
In a realistic scenario, you would have closer to a million datasets, which you could split into records of one thousand datasets each, or, if you have plenty of memory available, even ten thousand. This keeps your memory from being used up, while the number of records to swap drops from a million to a thousand, keeping the swapping processes in check.
This principle also works with fixed-length data, but only if every dataset is on a separate line. If each dataset is merely the same length (e.g. 512 characters) with no line breaks, then your only option is individual records.