TokenFileSplitter

Configuration file

./conf/samples/sample_splitter.properties

Class name

com.ebd.hub.datawizard.parser.TokenFileSplitter

Description


This preparser inserts separating lines into a CSV or fixed-length input file. The aim is to divide the file into several sections so that the parser creates several records. Since unused records can automatically be swapped to the hard drive, this preparser is mainly used to avoid running out of memory (OutOfMemoryError).

Several parameters control at which lines a separating line is added to the input file. You have to add an additional node as the first node of the source structure, with a match code matching the separating line. With these settings, the regular CSV parser automatically creates a new record for each occurrence of the separating line.

It is guaranteed that the separating line is always the first line in a record (this also applies to the first block). The text of the separating line is set with the parameter header. The parameter rows expects an integer value that defines the number of lines after which the separating line is inserted. If this value is set to 0, the maximum value 2147483647 is used. A very large value therefore effectively produces a single separating line at the start of the file.
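The interplay of header and rows can be illustrated with a minimal sketch. It is not the preparser's actual implementation; the class SplitSketch and the method split are hypothetical names, and the sketch assumes that the separating line is written just before the first line of each block.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only, not part of the product API.
class SplitSketch {

    // Inserts `header` before the first line and again after every `rows` lines.
    static List<String> split(List<String> lines, String header, int rows) {
        // rows=0 is documented to mean the maximum value 2147483647.
        int limit = (rows == 0) ? Integer.MAX_VALUE : rows;
        List<String> out = new ArrayList<>();
        int count = 0; // lines written to the current block
        for (String line : lines) {
            // The separating line opens every block, including the first one.
            if (out.isEmpty() || count >= limit) {
                out.add(header);
                count = 0;
            }
            out.add(line);
            count++;
        }
        return out;
    }
}
```

Under these assumptions, the input lines a, b, c, d, e with header=new! and rows=2 become new!, a, b, new!, c, d, new!, e. With rows=0 the same input yields a single new! line followed by all five lines.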

If the separating line is inserted strictly after every rows lines, subsequent lines of the same record type may be separated. To avoid this problem, you can define a regular expression (parameter expression) that delays the insertion of the separating line until the current line matches it, even if the row count has already been reached. After that, the row count is reset to 1. If rows=0 and an expression is defined, the separating line is not inserted until the current line matches the expression, with the difference that it is inserted after (and not before) the current line.
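The delayed insertion can be sketched as follows. This is a reading of the description above, not the preparser's actual code: the names DelayedSplitSketch and split are hypothetical, the expression is assumed to have to match the whole line, and for rows=0 the sketch assumes that no separating line is written after a matching line at the very end of the input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration only, not part of the product API.
class DelayedSplitSketch {

    static List<String> split(List<String> lines, String header, int rows, String expression) {
        Pattern pattern = Pattern.compile(expression);
        List<String> out = new ArrayList<>();
        int count = 0;          // lines written to the current block
        boolean pending = true; // the first block always opens with the separating line
        for (String line : lines) {
            // Assumption: the expression must match the complete line.
            boolean hit = pattern.matcher(line).matches();
            // rows > 0: once the row count is reached, wait for a matching line
            // and put the separating line BEFORE it.
            if (rows > 0 && count >= rows && hit) {
                pending = true;
            }
            if (pending) {
                out.add(header);
                count = 0;
                pending = false;
            }
            out.add(line);
            count++; // after an insertion the matching line is row 1 of the new block
            // rows = 0: the separating line goes AFTER the matching line,
            // which here means before the next line, if there is one.
            if (rows == 0 && hit) {
                pending = true;
            }
        }
        return out;
    }
}
```

With rows=2, header=new! and expression=H.*, the input H1, p, p, p, H2, p, H3 becomes new!, H1, p, p, p, new!, H2, p, new!, H3: the row count is reached at the third line, but the split waits for H2. With rows=0 and expression=END, the input a, END, b, END becomes new!, a, END, new!, b, END.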

It may be desirable to consider only certain lines rather than all of them. For this, you can define a regular expression in the parameter filter. A line appears in the output and increases the line count only if it matches the expression.
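How the filter interacts with the line count can be shown with a short sketch. The names FilterSketch and split are hypothetical, and the sketch assumes that the filter expression must match the whole line; it is an illustration of the described behaviour, not the preparser's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration only, not part of the product API.
class FilterSketch {

    static List<String> split(List<String> lines, String header, int rows, String filter) {
        Pattern keep = Pattern.compile(filter);
        int limit = (rows == 0) ? Integer.MAX_VALUE : rows;
        List<String> out = new ArrayList<>();
        int count = 0; // counts only lines that passed the filter
        for (String line : lines) {
            // Assumption: the filter must match the complete line.
            if (!keep.matcher(line).matches()) {
                continue; // dropped: neither output nor counted
            }
            if (out.isEmpty() || count >= limit) {
                out.add(header);
                count = 0;
            }
            out.add(line);
            count++;
        }
        return out;
    }
}
```

With filter=D;.*, rows=2 and header=new!, the input D;1, #c, D;2, D;3 becomes new!, D;1, D;2, new!, D;3. The line #c is dropped and does not count towards the two rows of the first block.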

Parameters


Parameter     Description

rows          (mandatory) Number of lines after which the separating line is added.

header        (mandatory) Separating line to be added.

expression    Regular expression that delays the adding of the separating line until it matches the current line.

eol           Number defining the end-of-line characters. 0 is interpreted as \n, 1 as \r and all other values as \r\n.

filter        Regular expression that filters the input lines. Lines that do not match the expression are ignored (not output and not counted).
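The mapping of the eol value to line endings can be written down as a small sketch. EolSketch and lineEnding are hypothetical names used only for illustration.

```java
// Hypothetical illustration only, not part of the product API.
class EolSketch {

    // Maps the numeric eol setting to the end-of-line characters.
    static String lineEnding(int eol) {
        switch (eol) {
            case 0:
                return "\n";   // line feed
            case 1:
                return "\r";   // carriage return
            default:
                return "\r\n"; // all other values
        }
    }
}
```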

Example file


sample_splitter.properties
#
# sample file for TokenFileSplitter
#
# Supported keys are: rows, header, expression, eol
#
# rows = amount of rows that are combined for one record
# header = line that will be inserted to indicate a new record
# expression = empty, or a regular expression that the current line must match before a new record is created (in addition to rows)
# eol=end of line (0=\n, 1 = \r, all other settings will be used for \r\n)
#
rows=10
header=new!