XML Parser V4

Advantages of V4


The XML parser V4 (add-on module) offers significant performance gains with very low memory consumption. Compared to the XML parser V3, the memory requirement drops to about 10%. For input files of up to about 100 MB, the runtime drops to about 5% of that of V3. Optional XPath filters can yield further significant performance gains.

In addition, extremely large XML input files of up to 250 GB can be parsed. However, the runtime then increases significantly again because of the necessary disk accesses.

Furthermore, the data from the XML input file can be combined into virtual elements (chunks) during parsing. Data from outside the parsed area can also be inserted into these chunks.

Preparser


If a preparser is used with the XML V4 parser, the backup file must be overwritten with the result of the preparser.


Settings


images/download/attachments/73599441/609.png images/download/attachments/73599441/610.png


(1) Must be set for V4.

(2) Specifies the tag name (element) below which parsing takes place. An entry is required here, even if the entire XML structure is to be parsed. If only a partial document is parsed, that part must itself be well-formed XML. Note: The entry can also be an XPath 1.0 expression, e.g. /inventory/books[@title="xxxx"]. Note: See also section Effect of an Entry in Field 'XML tag for record'.

(3) Must be set to V4.

(4) We do not want to create a new record per item. Instead, we want 2 item elements in each record, see (5). Therefore we create a new, artificial root element chunk. See section Generating Chunks below.

(5) The number of item elements per record, i.e. per chunk element, see (4). See section Generating Chunks below.

(6) We have two attributes date and ref in root element invoice. We need these in the mapping, so we activate this checkbox and all available attributes are copied into chunk (4).

(7) The specified elements, including their child elements, are copied into each chunk element (4). All XPath 1.0 expressions are allowed. See section Copy Attributes and Elements Redundantly into Each Record below.

(8) If the XML file is smaller than 2 GB, uncheck this checkbox; otherwise set it. The checkbox is set automatically if the file is larger than 2 GB, but the conservative parsing method Disk is then used.

In Memory

If enough main memory is available, choose this option to process everything in memory.

Disk

Only a part of the XML file is loaded into memory at a time. This works much like a cursor-based SELECT on a database.

Hybrid

Usually the best choice. Parsing takes place in mode Disk, but the generated records are processed in memory. This reduces hard disk access and thus increases the processing speed.

(9) Incoming files can be checked with semantic rules. See section Semantic Check.
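The cursor-like behaviour described for mode Disk in (8) can be illustrated outside the product with a short Python sketch. It is not the parser's actual implementation; it only shows the general idea of handing over one record at a time instead of holding the whole document in memory, using iterparse from the Python standard library. The function name stream_records and the shortened sample data are assumptions made for this illustration.

```python
import io
import xml.etree.ElementTree as ET

# Shortened, hypothetical sample data (not the full example file).
SAMPLE = b"""<invoice date="07.03.13" ref="R-0001">
  <positions>
    <item type="1"><pos>1</pos></item>
    <item type="0"><pos>2</pos></item>
    <item type="1"><pos>3</pos></item>
  </positions>
</invoice>"""


def stream_records(source, record_tag):
    """Yield one serialized record each time a record_tag element is closed."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == record_tag:
            yield ET.tostring(elem, encoding="unicode")
            # Drop the element's content once the record has been handed over,
            # so that already processed records do not accumulate in memory.
            elem.clear()


records = list(stream_records(io.BytesIO(SAMPLE), "item"))
print(len(records))
```

An in-memory approach would instead load the complete tree first (for example with ET.parse) and only then split it into records, which is faster for small files but needs memory proportional to the file size.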

Example


We will use the following XML file: test_EN.xml


<?xml version="1.0" encoding="ISO-8859-1"?>
<invoice date="07.03.13" ref="R-0001">
<header>
<customer>Lobster</customer>
<address>
<name>Lobster GmbH</name>
<street>Münchner Str. 15a</street>
<zip>82319</zip>
<city>Starnberg</city>
</address>
</header>
<positions>
<item type="1" desc="billing">
<pos>1</pos>
<article id="A-001" name="Article 1" price="1050" amount="1" />
<note>Attention- Glas!</note>
</item>
<item type="0" desc="return">
<pos>2</pos>
<article id="A-002" name="Article 2" price="920" amount="2" />
</item>
<item type="1" desc="billing">
<pos>3</pos>
<article id="A-003" name="Article 3" price="90" amount="3" />
<note>See counter</note>
</item>
</positions>
<footer>
<note code="001">Complete</note>
</footer>
</invoice>


If repeated subelements (here item) are to lead to multiple records, you should not specify the real root element (invoice) of the XML document in (2) (see section Effect of an Entry in Field 'XML tag for record'). Previous parsers (prior to V4) lost all attributes of the root element and any parent or sibling elements that were not within the element item. With the XML parser V4, data from these 'blind' areas of the XML document can be copied into each record, as described in section Copy Attributes and Elements Redundantly into Each Record below.
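The effect of choosing item as the record element can be sketched in Python. This is only an illustration under assumptions, not the behaviour of the product itself: it uses the standard library module xml.etree.ElementTree, which supports only a subset of XPath 1.0, and a shortened version of the example file. It shows that each item becomes its own record candidate, that a predicate can filter records, and that an isolated item contains neither the attributes of invoice nor the header data.

```python
import xml.etree.ElementTree as ET

# Shortened version of the example file (assumption: structure as shown above).
SAMPLE = """<invoice date="07.03.13" ref="R-0001">
  <header><customer>Lobster</customer></header>
  <positions>
    <item type="1" desc="billing"><pos>1</pos></item>
    <item type="0" desc="return"><pos>2</pos></item>
    <item type="1" desc="billing"><pos>3</pos></item>
  </positions>
</invoice>"""

root = ET.fromstring(SAMPLE)

# One record candidate per item element.
items = root.findall(".//item")

# A predicate narrows the selection, similar in spirit to an XPath filter.
billing = root.findall(".//item[@type='1']")

# An item on its own carries neither the root attributes nor the header data.
first = items[0]
has_date = "date" in first.attrib
has_customer = first.find(".//customer") is not None

print(len(items), len(billing), has_date, has_customer)
```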

Generating Chunks


If the structure of the input data would lead to a large number of small records, for example because it contains several million item elements, performance would suffer. In this case, it makes sense to combine several item elements in one record. The input data has no natural structure to support that, but you can force the parser to create a virtual element (4) (here chunk). This chunk element then appears as a root element that contains several item elements.

The generation of chunks with (4) and (5) is optional. If (4) remains empty, a record is generated per (2) (in the example per item). In that case, the root node of the source structure should correspond to the item element.
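The grouping controlled by (4) and (5) can be pictured with the following Python sketch. It is a simplified model under assumptions, not the parser's implementation: the function make_chunks and its parameters chunk_tag and items_per_chunk are hypothetical names that stand for the entries in (4) and (5).

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical sample data with three item elements.
SAMPLE = """<invoice>
  <positions>
    <item><pos>1</pos></item>
    <item><pos>2</pos></item>
    <item><pos>3</pos></item>
  </positions>
</invoice>"""


def make_chunks(items, chunk_tag, items_per_chunk):
    """Group items under artificial root elements named chunk_tag."""
    chunks = []
    for start in range(0, len(items), items_per_chunk):
        chunk = ET.Element(chunk_tag)
        # Each chunk receives at most items_per_chunk item elements.
        chunk.extend(items[start:start + items_per_chunk])
        chunks.append(chunk)
    return chunks


items = ET.fromstring(SAMPLE).findall(".//item")
chunks = make_chunks(items, chunk_tag="chunk", items_per_chunk=2)
sizes = [len(chunk) for chunk in chunks]
print(sizes)
```

With three item elements and two items per chunk, the last chunk holds only the remaining item, which corresponds to the two records shown for the example file.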

If a virtual chunk element is used in (4), the source structure needs an additional root node that corresponds to the chunk element (here chunk). This node receives a match code that corresponds to the entry in (4), so that the virtual chunk element is parsed (here Equals=chunk). Setter identifiers can be defined as usual, but they are already created if the source structure is generated automatically. Use the file for_structure_EN.xml for this purpose. The complete profile is available as Profile-XML_V4_EN.pak. The matching structure is shown below.



images/download/attachments/73599441/596.png

Copy Attributes and Elements Redundantly into Each Record


Suppose that we use element item as the XML tag for record for our source file above. The attributes of the element invoice and all data in element header will then be outside the parsed area. As of version V4, the attributes of the real root element can be transferred to every record generated, see checkbox (6). Similarly, in (7), you can enter those elements that are actually outside item but that you want to include in each record. The required adjustment of the profile source structure currently needs to be done manually (already done in our example structure).
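The redundant copying described for (6) and (7) can be modelled with a short Python sketch. It is an illustration under assumptions, not the product's implementation: the function build_record and the path header/address are hypothetical, and the element order in the internal XML generated by the product may differ from the order produced here.

```python
import copy
import xml.etree.ElementTree as ET

# Shortened version of the example file (assumption: structure as shown above).
SAMPLE = """<invoice date="07.03.13" ref="R-0001">
  <header>
    <customer>Lobster</customer>
    <address><city>Starnberg</city></address>
  </header>
  <positions>
    <item type="1"><pos>1</pos></item>
    <item type="0"><pos>2</pos></item>
  </positions>
</invoice>"""


def build_record(root, items, copy_paths, chunk_tag="chunk"):
    """Build one record that also contains data from outside the items."""
    # Corresponds to (6): copy all attributes of the real root element.
    record = ET.Element(chunk_tag, dict(root.attrib))
    # Corresponds to (7): copy the listed elements including their children.
    for path in copy_paths:
        for elem in root.findall(path):
            record.append(copy.deepcopy(elem))
    # Finally add the item elements that belong to this record.
    for item in items:
        record.append(copy.deepcopy(item))
    return record


root = ET.fromstring(SAMPLE)
record = build_record(root, root.findall(".//item"), ["header/address"])
child_tags = [child.tag for child in record]
print(record.attrib, child_tags)
```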

Internal XML Based on the Prior Settings


Internally, based on our prior settings, the input XML looks like this.


Record 1


<?xml version="1.0" encoding="ISO-8859-1"?>
<chunk date="07.03.13" ref="R-0001">
<address>
<name>Lobster GmbH</name>
<street>Münchner Str. 15a</street>
<zip>82319</zip>
<city>Starnberg</city>
</address>
<item type="1" desc="billing">
<pos>1</pos>
<article id="A-001" name="Article 1" price="1050" amount="1" />
<note>Attention- Glas!</note>
</item>
<item type="0" desc="return">
<pos>2</pos>
<article id="A-002" name="Article 2" price="920" amount="2" />
</item>
<footer>
<note code="001">Complete</note>
</footer>
</chunk>


Record 2


<?xml version="1.0" encoding="ISO-8859-1"?>
<chunk date="07.03.13" ref="R-0001">
<address>
<name>Lobster GmbH</name>
<street>Münchner Str. 15a</street>
<zip>82319</zip>
<city>Starnberg</city>
</address>
<item type="1" desc="billing">
<pos>3</pos>
<article id="A-003" name="Article 3" price="90" amount="3" />
<note>See counter</note>
</item>
<footer>
<note code="001">Complete</note>
</footer>
</chunk>