Phase 2: Data Parsing

Have you read the chapter on memory? And the issue with records? Great. Now let's turn the tables and look at speed.

CSV, Database, Excel and Fixed Length Input Formats

Methods that save memory are not necessarily the fastest. By setting up your structure to create many individual records, you avoid the critical 'out of memory' error. This is always the better solution if a profile has to process very large volumes of data (or might have to), or if several memory-intensive profiles could run at the same time.
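To make the idea concrete, here is a minimal sketch of record-wise parsing, assuming a simple semicolon-separated CSV file and a hypothetical handler callback. In practice you configure this through the profile rather than writing code; the point of the sketch is simply that only one record is held in memory at a time, so even huge files cannot exhaust the heap.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class RecordWiseParsing {

    // Hypothetical stand-in for whatever consumes a single record.
    interface RecordHandler {
        void handle(String[] fields);
    }

    /**
     * Reads a CSV file line by line and hands each line to the handler as its
     * own record. Only one record is in memory at any moment; the trade-off is
     * the per-record overhead of many small objects and handler calls.
     */
    static void parsePerRecord(Path csvFile, RecordHandler handler) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(csvFile)) {
            String line;
            while ((line = reader.readLine()) != null) {
                handler.handle(line.split(";", -1));
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // File name and field layout are made up for illustration.
        parsePerRecord(Path.of("orders.csv"),
                fields -> System.out.println("order id: " + fields[0]));
    }
}
```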

But what if you want to read a few thousand datasets from the database, but you have more than enough memory available and you can also control when the profile runs so that it does not have to fight with other profiles for memory? Then speed is likely to be more important to you. In this case, we advise you to use the flattest structure possible. This is because every hierarchical level in the tree, every additional node, takes a little time. Not much, but with a lot of datasets, it starts to add up. In the ideal situation, you will have no nodes at all, just fields.

[Figure: Phase 2 - Parsing: a record structure with no nodes, only fields]

This is as fast as it gets. And with a few more tricks during mapping, you can really turbocharge the process. More on this in the phase 3 section below. But keep an eye on your memory: rather than risk exhausting it, it is better to sacrifice a little performance.
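If you want to see why every additional node costs a little time, the following illustrative sketch (all names and field values are made up) builds the same field data once as flat records and once wrapped in a hypothetical header/position node per record, and times both. The nested version pays for extra object allocations and an extra level of traversal per record - the small overhead that adds up with many datasets.

```java
import java.util.ArrayList;
import java.util.List;

public class FlatVsNestedRecords {

    // Hypothetical structure node; every extra hierarchy level
    // means one more of these objects per record.
    static class Node {
        final List<Node> children = new ArrayList<>();
        final List<String> fields = new ArrayList<>();
    }

    /** Flat structure: each record is just its list of fields. */
    static List<List<String>> buildFlat(int records) {
        List<List<String>> result = new ArrayList<>(records);
        for (int i = 0; i < records; i++) {
            result.add(List.of("4711", "2023-01-01", "42.00"));
        }
        return result;
    }

    /** Nested structure: the same fields wrapped in header/position nodes. */
    static List<Node> buildNested(int records) {
        List<Node> result = new ArrayList<>(records);
        for (int i = 0; i < records; i++) {
            Node header = new Node();
            Node position = new Node();
            position.fields.addAll(List.of("4711", "2023-01-01", "42.00"));
            header.children.add(position);
            result.add(header);
        }
        return result;
    }

    public static void main(String[] args) {
        int n = 500_000;
        long t0 = System.nanoTime();
        List<List<String>> flat = buildFlat(n);
        long t1 = System.nanoTime();
        List<Node> nested = buildNested(n);
        long t2 = System.nanoTime();
        System.out.printf("flat:   %d records in %d ms%n", flat.size(), (t1 - t0) / 1_000_000);
        System.out.printf("nested: %d records in %d ms%n", nested.size(), (t2 - t1) / 1_000_000);
    }
}
```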

XML Input Format

Quite handily, XML parser version V3 is not only more memory-friendly than version V2, it is also faster. So when you expect large volumes of data, simply use V3 - and if the worst comes to the worst, V4. This new XML parser was already mentioned in the memory chapter and has its own dedicated tutorial. Depending on the data, it is between 10 and 50 times faster than any other parser.
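The internals of the V3 and V4 parsers are not shown here - you simply select the parser version in the profile. As a general illustration of why a streaming parser copes with large files so much better than one that first builds the complete document tree in memory, here is a sketch using Java's standard StAX API; the element name 'order' and the file name are assumptions.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StreamingXmlCount {

    /**
     * Counts <order> elements with a pull parser (StAX). The document is never
     * fully materialised in memory, which is what makes streaming parsers both
     * faster and lighter than DOM-style parsing for large files.
     */
    static long countOrders(Path xmlFile) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newFactory();
        long count = 0;
        try (InputStream in = Files.newInputStream(xmlFile)) {
            XMLStreamReader reader = factory.createXMLStreamReader(in);
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "order".equals(reader.getLocalName())) {
                    count++;
                }
            }
            reader.close();
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(countOrders(Path.of("orders.xml")) + " orders");
    }
}
```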

Other Formats

There is very little that can be tweaked here when it comes to parsing, but since these formats typically involve comparatively small volumes of data (per job), this is not much of a problem. Deactivating unused nodes for EDIFACT and X12 formats has already been covered in the memory chapter.