Phase 2: Parsing data (OutOfMemoryError)

Swapping mechanism


What does parsing data have to do with memory use? After all, it’s still the same data. That’s true. However, logically segmenting data can have a significant effect on memory use. The reason for this is the swapping mechanism.

Once the set percentage of the maximum allocated memory has been used, a profile with the above setting transfers data to the hard drive, but only in whole records. How these records are created for the various document types is described in the section When does the parser start a new record? The point that interests us here is this: if all of the data is loaded into a single record, the swapping mechanism is undermined, because the record currently being worked on must be kept in memory. If, on the other hand, the data is split into multiple records, any records that are not currently being processed can be swapped out to disk, saving a lot of memory.
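
You can picture the mechanism with a small sketch. The following Java class is a hypothetical illustration, not the product's actual implementation: the class name, the threshold parameter, and the temp-file handling are all assumptions. It shows why swapping only works in whole records: a finished record can be serialized to disk, while the record currently being built must stay in the heap.

import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: swap whole, finished records to disk once heap
// usage crosses a threshold. Not the product's actual swapping code.
class SwappingRecordStore {
    private final double threshold;               // e.g. 0.8 = start swapping above 80% heap usage
    private final List<Serializable> inMemory = new ArrayList<>();
    private final List<Path> onDisk = new ArrayList<>();

    SwappingRecordStore(double threshold) {
        this.threshold = threshold;
    }

    void add(Serializable record) throws IOException {
        if (heapUsage() > threshold) {
            // Above the threshold: serialize the whole record to a temp file.
            Path tmp = Files.createTempFile("record", ".swp");
            try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(tmp))) {
                out.writeObject(record);
            }
            onDisk.add(tmp);
        } else {
            inMemory.add(record);                 // below the threshold: keep the record in the heap
        }
    }

    private static double heapUsage() {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        return (double) used / rt.maxMemory();
    }
}

If the parser produces only one giant record, add() is called just once, at the very end, so nothing is ever available for swapping while memory pressure builds.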

Splitting data


But sometimes even this is not enough. In that case, there is only one option: artificially splitting the data into multiple records. This task is carried out by the EdifactSegmentSplitter, much like the TokenStreamSplitter does it for the document types CSV and Fixed-length. Broadly speaking, you specify the segment that introduces the frequently recurring segment group. For an INVRPT, this is the LIN segment from SG9 (version D12A). You also specify the maximum number of these groups that a record may contain. The preparser then groups together the specified number of LIN segments (along with their PIA segments, IMD segments, etc.), adds a copy of everything that preceded them (UNB to SG8) at the top, and appends a UNT at the bottom. This may not produce very clean intermediate data, but it can be processed very well with standard means. The most important point is that we now have numerous records and therefore no more memory problems.
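
The following sketch shows the splitting idea in Java. It is not the actual EdifactSegmentSplitter; the segment handling, the placeholder UNT, and the simplified treatment of the message trailer are assumptions for illustration only. It copies the header (UNB to SG8) to the top of each new record and closes a record after at most maxGroups LIN groups.

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of splitting an EDIFACT message at a recurring
// trigger segment (LIN). Not the actual EdifactSegmentSplitter.
class EdifactSplitSketch {

    // Splits the segment list into records containing at most maxGroups LIN groups.
    static List<List<String>> split(List<String> segments, String trigger, int maxGroups) {
        List<String> header = new ArrayList<>();      // everything before the first LIN (UNB to SG8)
        List<List<String>> records = new ArrayList<>();
        List<String> current = null;
        int groups = 0;

        for (String seg : segments) {
            boolean isTrigger = seg.startsWith(trigger + "+");
            if (current == null && !isTrigger) {
                header.add(seg);                      // still collecting the header
                continue;
            }
            if (isTrigger && (current == null || groups == maxGroups)) {
                if (current != null) {
                    current.add("UNT+...'");          // close the previous record (placeholder UNT)
                    records.add(current);
                }
                current = new ArrayList<>(header);    // copy of the header at the top of the new record
                groups = 0;
            }
            if (isTrigger) {
                groups++;
            }
            current.add(seg);                         // the LIN segment and its PIA, IMD, ... segments
        }
        if (current != null) {
            current.add("UNT+...'");                  // close the last record; the original
            records.add(current);                     // trailer ends up here (simplified)
        }
        return records;
    }
}

With maxGroups = 500, for example, an INVRPT containing 50,000 LIN groups would yield 100 records, each small enough to be swapped independently.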