ClusterFileDbPreparser
Configuration file |
./conf/samples/sample_ClusterFileDbPreparser.properties |
Class name |
com.ebd.hub.datawizard.parser.ClusterFileDbPreparser |
Description
This preparser sorts and groups vast amounts of input data, before parsing, by using a temporary database. Each job uses its own database to prevent interference. The preparser works in two steps.
Step 1
Input data is read line by line. A regular expression extracts the key from each line. If the key is found, the line is inserted into a database table "data". This table contains two columns, "name" for the key and "value" for the entire line.
Step 2
An SQL select statement retrieves all the lines from the database and sends them to the output stream of the preparser. In order to sort, or group the data rows by their keys, use the SQL clauses "ORDER BY" and/or "GROUP BY". Optionally, a separation line can be added (similar to the TokenFileSplitter) to allow the splitting of large amounts of data into several records.
The preparser expects a properties file containing the following parameters.
expression |
(mandatory) A regular expression selecting a specified section within each row of data, which serves as the key. |
stmt |
SQL statement for retrieving the ordered and/or grouped rows from the database. The field names name and value must be specified explicitly in the select statement. Default: SELECT name,value FROM data ORDER BY name |
header |
Text for the separation line to support parsing into multiple records. This separation line is inserted as the first line and is repeated each maxrows rows. |
maxrows |
Number of rows, after which a new separation line is inserted (default: before each group). When grouping by the key, the separation line is only inserted when the number of rows is reached and after the complete group was output. |
groupnumber |
(optional) Defines which group (from the left) is the key, if parameter expression contains more than one group. Default: 1 |
additional_groups |
This parameter is able to define a list of group numbers (separated by ,), for which a separate column will be created in the temporary database. The groups must be defined in the regular expression. The columns are given the names field1, field2, etc. |
verbose |
(optional) Activates the debug output in the log. Careful: The setting verbose=true may create a vast number of log entries, which might slow down the conversion significantly. Default: false |
Examples
Example 1
Suppose a CSV file with the field delimiter | contains the sorting/grouping key in the fourth column of each row.
expression=^.*?\\|.*?\\|.*?\\|(.*?)\\|.*$
The expression in brackets references the key. In order to extract the key subsequently from the regular expression, all quantifiers have to be marked as 'non-greedy'. This is done by the trailing question mark. Here, the quantifier * (for any or even no repetition) must be marked by a trailing question mark. Quantifiers after the key may be 'non-greedy'.
Note: To sort results with the same key by a second or even more criteria, the regular expression must contain several groups (.*?). In this case, it is possible to specify which group defines the key and which groups should be used for additional sorting, or grouping (parameters groupnumber and additional_groups). See the following example.
Example 2
Suppose a CSV file with the field delimiter | contains the key to be sorted by in the first column. Additional sort fields are the first and fourth column.
expression=^(.*?)\\|(.*?)\\|(.*?)\\|(.*?)\\|.*$
All columns up to the fifth column have to be defined as (.*?). To determine that the first column, thus the fourth group, is the key field, the parameter groupnumber=1 must be defined. The additional sort criteria ought to be the first column, thus the first group, and after that the fourth column, thus the fourth group. For that to happen, the parameter additional_groups=1,4 has to be defined. This way the fields field1 (for column 1) and field2 (for column 4) will be created in the table.
The stmt is: SELECT name,value FROM data ORDER BY field1,field2
Example file
expression=^(.*?)\\|(.*?)\\|(.*?)\\|(.*?)\\|.*$
stmt=select name,value from data order by field1,field2
#header=separator_line
#maxrows=0
groupnumber=1
additional_groups=1,4
#verbose=true
Example profile
In the example profile below, we would like to sort the following, simplified input data by article number and customer number (in 'real life', of course, these would be much more extensive).
Simply import the profile, test it and look at the configuration file of the preparser.
#Article_number|Description_product|Class|Customer_no|Special_price
12061_16|Melon Water|FRAGUM_S2|100033|31.2
12061_16|Melon Water|FRAGUM_S6|100097|27.6
12061_16|Melon Water|FRAGUM_S2|100194|31.2
12061_16|Melon Water|FRAGUM_S2|100206|31.2
12061_16|Melon Water|FRAGUM_S6|100284|27.6
12055_12|Melon Charentais|FRAGUM_S2|100033|22.8
12055_12|Melon Charentais|FRAGUM_S6|100097|21.6
12055_12|Melon Charentais|FRAGUM_S2|100194|22.8
12055_12|Melon Charentais|FRAGUM_S2|100206|22.8
12055_12|Melon Charentais|FRAGUM_S6|100284|21.6
12055_12|Melon Charentais|FRAGUM_S2|100332|22.8
12055_12|Melon Charentais|FRAGUM_S5|100364|21.6
12061_16|Melon Water|FRAGUM_S2|100332|31.2
12061_16|Melon Water|FRAGUM_S5|100364|27.6
12061_16|Melon Water|FRAGUM_S7|100392|25.2
12061_16|Melon Water|FRAGUM_S7|100446|25.2
12061_16|Melon Water|FRAGUM_S2|100667|31.2
12061_16|Melon Water|FRAGUM_S2|100768|31.2
12061_16|Melon Water|FRAGUM_S6|100777|27.6
12055_12|Melon Charentais|FRAGUM_S7|100392|19.8
12055_12|Melon Charentais|FRAGUM_S7|100446|19.8
12055_12|Melon Charentais|FRAGUM_S2|100667|22.8
12055_12|Melon Charentais|FRAGUM_S2|100768|22.8
12055_12|Melon Charentais|FRAGUM_S6|100777|21.6
12055_12|Melon Charentais|FRAGUM_S2|100780|22.8
Profile: Profile-Import_specialprice.pak
Input data: special_price.csv