HashMergePreparser
Configuration file | sample_HashMergePreparser.properties and sql_input_sample.properties (read in with Cron Job → File if mode=sql).
Class name | com.ebd.hub.datawizard.parser.HashMergePreparser
Description
This preparser compares data from a CSV file or a database table and checks whether the data was modified, with the help of a temporary database. For this purpose, a checksum (hash value) is calculated over an entire line of data for a specified key and compared to the existing entry (also key and hash value) in a database table. If the key does not exist yet, the data is labelled as new. If a hash value for the key already exists, the two hash values are compared; if the data has been modified, it is labelled as updated. Otherwise, the data is not passed on to the mapping and is deleted from the file (respectively from the result set of the database SELECT).
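The classification logic described above can be pictured as follows. This is a minimal Python sketch under assumptions, not the preparser's actual implementation; the function name and data shapes are made up for illustration:

```python
import hashlib

def classify(rows, key_index, stored_hashes, hash_method="md5"):
    """Label each row as new ('N') or updated ('U'); drop unchanged rows.

    rows          -- list of rows, each a list of column values
    key_index     -- index of the key column (cf. idColumnIndex)
    stored_hashes -- dict mapping key -> previously stored hash value
    hash_method   -- algorithm name (cf. hashMethod, e.g. MD5, SHA1)
    """
    result = []
    for row in rows:
        key = row[key_index]
        digest = hashlib.new(hash_method, ",".join(row).encode("utf8")).hexdigest()
        if key not in stored_hashes:
            result.append(("N", digest, row))   # key unknown: new
        elif stored_hashes[key] != digest:
            result.append(("U", digest, row))   # key known, data changed: updated
        # unchanged rows are dropped (not passed on to the mapping)
    return result
```

In the real preparser, the stored hashes are fetched via hashSelectQuery and maintained via hashInsertStatement/hashUpdateStatement instead of a dictionary.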
Parameters
The following parameters are available for configuration.
Parameter | Description
debug | (optional) If true, a detailed log is written. Default: false
delimiter | (optional) Defines the delimiter for the CSV file. Default: ,
encoding | (optional) Sets the encoding (for reading and writing). Default: utf8
excludeColumnsForHash | (optional) Columns that will be ignored when generating the hash value (index starting at 0). Example: excludeColumnsForHash=3,5,7
hashAlias | The database alias (as defined in configuration file database.xml) for the hash table.
hasHeadLine | (optional) Indicates whether the CSV file has a header line. Default: true
hashInsertStatement | The INSERT statement for inserting new hash values.
hashMethod | (optional) Defines the hash calculation method (e.g. MD5, SHA1). Default: MD5
hashSelectQuery | The SQL SELECT statement used to check the stored hash values (see below).
hashSelectQueryBatchSize | Number of lines written in one record (see useFileSplitter).
hashUpdateStatement | The UPDATE statement for updating existing hash values.
idColumnIndex | (important for mode csv) Defines which column of the CSV file contains the key. If mode=sql, the property sourceIdColumn in configuration file sql_input_sample.properties is used instead.
includeNew | (optional) If true, new entries are passed on to the mapping. Default: true
includeOriginal | (optional) If true, unchanged entries are also passed on to the mapping. Default: true
includeUpdated | (optional) If true, updated entries are passed on to the mapping. Default: true
mode | Allowed values are csv and sql (see below). The sample configuration file additionally lists auto, which tries sql mode first, then csv mode.
quote | (optional) Sets the quote character. Default: "
splitHeaderToken | (optional) The separating line that is inserted if useFileSplitter=true. Default: header!
useFileSplitter | (optional) If true, a separating line (parameter splitHeaderToken) is inserted after a certain number of lines (parameter hashSelectQueryBatchSize) to divide the file into several records. Default: true
writeHeadLine | (optional) Defines whether an existing header line is written to the created file. Default: true
hashSelectQuery
The SQL select statement to check the stored hash values.
Important note: The statement must be created using the IN operator (see example file below).
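Because the statement uses the IN operator, the stored hashes can be fetched in batches of keys rather than one query per line. A sketch under assumptions (the real statement uses the placeholder @0:s@ as in the example file below; the generic ? placeholders here are for illustration only):

```python
def in_batches(keys, batch_size=100):
    """Yield (query, key_batch) pairs, one IN query per batch of at most
    batch_size keys (cf. hashSelectQueryBatchSize)."""
    for i in range(0, len(keys), batch_size):
        batch = keys[i:i + batch_size]
        placeholders = ", ".join("?" * len(batch))
        query = ("SELECT sendungsid, hashvalue FROM shipmentstatus_hashvalue "
                 "WHERE sendungsid IN ( " + placeholders + " )")
        yield query, batch
```

For 250 keys and a batch size of 100, this produces three queries with 100, 100, and 50 keys.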
CSV/SQL mode
Determines whether the data is read in by an SQL SELECT statement or from a CSV file. If mode=csv, the parameter idColumnIndex has to be set to define which column contains the key value. If mode=sql, the property sourceIdColumn of configuration file sql_input_sample.properties is used for that purpose. Note that with mode=sql, the database table from which the data is selected (here shipmentstatus) already has to exist and contain the data to be checked.
Note: If the Cron Job API class DefaultFileSQLCron is used, mode=csv has to be set.
sql_input_sample.properties
This file has to be configured if the data to be checked is provided by an SQL SELECT statement from an existing database (mode=sql). In that case, remember to provide the file sql_input_sample.properties (in addition to the configuration file sample_HashMergePreparser.properties) via Cron Job → File.
Result of the preparser
As a result, two additional columns, action (values N=new, U=updated) and hash, are created and put in front of the existing columns of the CSV file or the SQL result set (here from the table shipmentstatus). These additional columns must be added to the source structure. With parameters includeNew=false and includeUpdated=false, the adding of those additional columns can be prevented.
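Assembling the output can be sketched as follows: the two additional columns are prepended to each line, and with useFileSplitter=true the separating line (splitHeaderToken) is inserted for every batch of hashSelectQueryBatchSize lines. A hypothetical sketch, not the actual implementation:

```python
def build_output(labelled_rows, batch_size=100, split_token="header!", delimiter=","):
    """Build the preparser output text.

    labelled_rows -- iterable of (action, hash, row) tuples, where action is
                     'N' or 'U' and row is a list of the original columns
    """
    lines = []
    for i, (action, digest, row) in enumerate(labelled_rows):
        if i % batch_size == 0:
            lines.append(split_token)   # start a new record (useFileSplitter)
        lines.append(delimiter.join([action, digest] + row))
    return "\n".join(lines)
```

For three labelled rows and batch_size=2, the output contains the split token before the first and before the third line.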
Create table for hash values
CREATE TABLE shipmentstatus_hashvalue (
    sendungsid VARCHAR(20) NOT NULL,
    hashvalue VARCHAR(100) NOT NULL,
    PRIMARY KEY (sendungsid)
)
COLLATE = 'utf8_general_ci'
ENGINE = InnoDB
ROW_FORMAT = DEFAULT
Note: For MySQL.
Examples of the configuration files
###################################################
# hash calculation method like MD5, SHA1, etc.
# default: MD5
hashMethod=MD5
###################################################
###################################################
# identifies if head line should be written, if available
#
# default=true
writeHeadLine=true
###################################################
###################################################
#include N lines in output csv
includeNew=true
#include U lines in output csv
includeUpdated=true
#exclude O lines from output csv
includeOriginal=false
###################################################
###################################################
# io encoding
# default: utf8
encoding=utf8
###################################################
###################################################
# log debug messages
# default=false
debug=false
###################################################
###################################################
# fetch mode: csv|sql|auto
#
# csv mode:
# Input is a CSV File which will be parsed
# sql mode:
# Input is a Properties File with sql config
# auto mode:
# tries sql mode first, then csv mode
# required
mode=sql
###################################################
###################################################
# csv i/o configuration
#
# Delimiter character
# default=,
delimiter=,
#
# Quote character
# default="
# quote=
#
# useFileSplitter
# insert splitHeaderToken before each 100 lines (hashSelectQueryBatchSize)
# default=true
useFileSplitter=true
#
# splitHeaderToken
# default=header!
splitHeaderToken=header!
###################################################
###################################################
# Hash storage settings
###################################################
# Datawizard database alias
# required
hashAlias=test
#
###################################################
# SQL select query to fetch the stored hash
# value by given id
# required, has to return the key and the stored hash value
hashSelectQueryBatchSize=100
hashSelectQuery=SELECT sendungsid, hashvalue FROM shipmentstatus_hashvalue WHERE sendungsid in ( @0:s@ )
###################################################
#
###################################################
# insert hash sql statement with 2 parameters
# not required
hashInsertStatement=INSERT INTO shipmentstatus_hashvalue ( sendungsid, hashvalue ) VALUES ( @0:s@, @1:s@ )
###################################################
#
###################################################
# update hash sql statement with 2 parameters
# not required
hashUpdateStatement=UPDATE shipmentstatus_hashvalue SET hashvalue = @1:s@ WHERE sendungsid = @0:s@
###################################################
###################################################
# index of the id Column in csv mode
# required in csv input mode
idColumnIndex=1
###################################################
# identifies if csv source has a head line
# default=true
hasHeadLine=true
###################################################
###################################################
# sql mode configuration file
###################################################
###################################################
# Specifies the count of rows in memory
# while fetching data
# default=100
pageSize=100
###################################################
###################################################
# Datawizard database alias
# required
sourceAlias=test
###################################################
###################################################
# result id column name
sourceIdColumn=sendungsid
###################################################
###################################################
# the sql query to fetch input data
# required
sourceQuery=SELECT * FROM shipmentstatus
###################################################