Phase 2: Parsing Data (OutOfMemoryError)
Swapping Mechanism
What does parsing data have to do with memory use? After all, it’s still the same data. That’s true. However, logically segmenting data can have a significant effect on memory use. The reason for this is the Lobster_data swapping mechanism.
Once Lobster_data has used the configured percentage of the maximum allocated memory, a profile with the above setting transfers data to the hard drive - but only in whole records. How these records are created for the various document types is described in the section When Does the Parser Start a New Record? The point of interest here is this: if Lobster_data loads all of the data into a single record, the swapping mechanism is undermined, because the record currently being worked on must be kept in memory. If, on the other hand, Lobster_data is able to split the data into multiple records, it can swap out any that are not currently being processed and save a lot of memory.
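To make the effect of record granularity tangible, here is a minimal Java sketch of a threshold-based swapping strategy. It is not Lobster_data's actual implementation; the class RecordSwapper, its method names, and the serialization to temporary files are purely illustrative assumptions. It demonstrates the point made above: the record currently being processed must remain in memory, while every other record can be written to disk once the configured memory threshold is crossed.

```java
import java.io.*;
import java.util.*;

/**
 * Illustrative sketch of a record-based swapping mechanism (not
 * Lobster_data's actual implementation). Once used heap exceeds a
 * configured percentage of the maximum, records that are not currently
 * being processed are serialized to temporary files on disk.
 */
public class RecordSwapper {

    private final double thresholdPercent;                      // e.g. 80.0
    private final Map<Integer, List<String>> inMemory = new HashMap<>();
    private final Map<Integer, File> swappedOut = new HashMap<>();

    public RecordSwapper(double thresholdPercent) {
        this.thresholdPercent = thresholdPercent;
    }

    /** Collects a parsed segment into the given record. */
    public void add(int record, String segment) {
        inMemory.computeIfAbsent(record, k -> new ArrayList<>()).add(segment);
    }

    /** True once used memory crosses the configured share of the max heap. */
    private boolean memoryPressure() {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        return used > rt.maxMemory() * (thresholdPercent / 100.0);
    }

    /** Keeps the record that is being worked on; swaps out all others. */
    public void relieve(int activeRecord) throws IOException {
        if (!memoryPressure()) {
            return;
        }
        Iterator<Map.Entry<Integer, List<String>>> it = inMemory.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Integer, List<String>> e = it.next();
            if (e.getKey() == activeRecord) {
                continue;                                        // must stay in memory
            }
            File tmp = File.createTempFile("record-" + e.getKey() + "-", ".swp");
            try (ObjectOutputStream out =
                     new ObjectOutputStream(new FileOutputStream(tmp))) {
                out.writeObject(new ArrayList<>(e.getValue()));
            }
            swappedOut.put(e.getKey(), tmp);                     // remember where it went
            it.remove();                                         // free the heap space
        }
    }
}
```

If everything ends up in one record, relieve() never finds anything it is allowed to swap out - which is exactly why splitting the data into many records matters.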
Splitting Data
But sometimes even this is not enough. In that case, there is only one option: artificially splitting the data into multiple records. This task is carried out by the EdifactSegmentSplitter, similar to what the TokenStreamSplitter does for the document types CSV and Fixed-length. Broadly speaking, you specify the segment that introduces the frequently recurring segment group. For an INVRPT, this is the LIN segment from SG9 (version D12A). You also specify the maximum number of these groups that a record may contain. The preparser then groups together the specified number of LIN segments (along with their PIA segments, IMD segments, etc.), adds a copy of everything that preceded them (UNB to SG8) at the top, and a UNT at the bottom. This may not produce particularly clean 'intermediate data', but it can be processed perfectly well with Lobster_data's standard means. The most important point is that we now have numerous records and therefore no more memory problems.
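The following Java sketch illustrates the splitting idea under simplified assumptions; it is not the actual EdifactSegmentSplitter. It assumes the interchange already arrives as a list of segment strings, treats everything before the first LIN as the header to be copied, and uses a placeholder string for the generated UNT trailer. What it shows is the principle described above: each chunk of at most maxGroups LIN groups gets a copy of the leading segments and a closing UNT, so every chunk forms a record that can be parsed on its own.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Simplified sketch of the splitting principle (not the real
 * EdifactSegmentSplitter). Input: the interchange as a list of segments
 * (UNB, UNH, ..., LIN, PIA, IMD, ..., UNT, UNZ). Output: several records,
 * each with a copy of the header and its own closing trailer.
 */
public class SegmentSplitterSketch {

    /** Splits the segment list into records of at most maxGroups LIN groups. */
    public static List<List<String>> split(List<String> segments, int maxGroups) {
        List<String> header = new ArrayList<>();       // UNB ... up to SG8
        int firstLin = 0;
        while (firstLin < segments.size()
                && !segments.get(firstLin).startsWith("LIN")) {
            header.add(segments.get(firstLin));
            firstLin++;
        }

        List<List<String>> records = new ArrayList<>();
        List<String> current = new ArrayList<>(header);
        int groupsInRecord = 0;

        for (int i = firstLin; i < segments.size(); i++) {
            String seg = segments.get(i);
            if (seg.startsWith("UNT") || seg.startsWith("UNZ")) {
                continue;                              // trailer is re-added per chunk
            }
            if (seg.startsWith("LIN")) {
                if (groupsInRecord == maxGroups) {
                    current.add("UNT+...'");           // placeholder trailer
                    records.add(current);
                    current = new ArrayList<>(header); // fresh copy of UNB..SG8
                    groupsInRecord = 0;
                }
                groupsInRecord++;
            }
            current.add(seg);                          // LIN, PIA, IMD, ... stay with their group
        }
        if (current.size() > header.size()) {          // flush the last partial chunk
            current.add("UNT+...'");
            records.add(current);
        }
        return records;
    }
}
```

The resulting records are exactly the "not very clean" intermediate data mentioned above: formally somewhat artificial, but small enough that the swapping mechanism can do its job.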