HashMergePreparser

Configuration file

sample_HashMergePreparser.properties and sql_input_sample.properties (read in with Cron Job → File if mode=sql).

Class name

com.ebd.hub.datawizard.parser.HashMergePreparser

Description


This preparser compares data from a CSV file or a database table against previously seen data, using a temporary database, to detect whether the data was modified. For each line of data, a checksum (hash value) is computed and stored under a specified key, then compared to the existing entry (key and hash value) in a database table. If the key does not exist yet, the data is labelled as new. If a hash value for the key already exists, the two hash values are compared: if they differ, the data is labelled as updated. Otherwise, the data is not passed on to the mapping and is removed from the file (respectively from the result set of the database select).
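The decision logic described above can be sketched as follows. This is an illustrative model only, not the actual Java implementation; the function name classify and the in-memory dictionary standing in for the hash table are assumptions for demonstration.

```python
import hashlib

def classify(key, line, stored_hashes, hash_method="md5"):
    """Return 'N' (new), 'U' (updated) or None (unchanged, dropped).

    stored_hashes is a dict standing in for the hash table; in the real
    preparser the lookups become the hashSelectQuery and the writes
    become hashInsertStatement / hashUpdateStatement.
    """
    digest = hashlib.new(hash_method, line.encode("utf-8")).hexdigest()
    if key not in stored_hashes:
        stored_hashes[key] = digest   # would become an INSERT
        return "N"
    if stored_hashes[key] != digest:
        stored_hashes[key] = digest   # would become an UPDATE
        return "U"
    return None                       # unchanged: filtered out

stored = {}
print(classify("4711", "4711,delivered", stored))  # 'N': key unseen
print(classify("4711", "4711,delivered", stored))  # None: unchanged
print(classify("4711", "4711,returned", stored))   # 'U': content changed
```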

Parameters


The following parameters are available for configuration.


Parameter

Description

debug

(optional) If true, a detailed log will be written. Default: false

delimiter

(optional) Defines the delimiter for the CSV file. Default: ,

encoding

(optional) Sets the encoding (for reading and writing). Default: utf8

excludeColumnsForHash

(optional) Columns can be specified that will be ignored when generating the hash value (index starting at 0). Example: excludeColumnsForHash=3,5,7

hashAlias

The database alias (as defined in configuration file database.xml) for the hash table.

hasHeadLine

(optional) Indicates whether the CSV file has a header line. Default: true

hashInsertStatement

The INSERT statement to insert new hash values.

hashMethod

(optional) Defines the hash calculation method (e.g. MD5, SHA1). Default: MD5

hashSelectQuery

The SQL select statement (see below).

hashSelectQueryBatchSize

Number of lines written in one record; also used as the batch size for the IN clause of hashSelectQuery.

hashUpdateStatement

The UPDATE statement to update existing hash values.

idColumnIndex

(important for mode csv) Defines in which column of the CSV file the key is located. If mode=sql, the property sourceIdColumn in configuration file sql_input_sample.properties is used for that purpose.

includeNew

(optional) If true, new entries will be passed on to the mapping. Default: true

includeOriginal

(optional) If true, unchanged entries will also be passed on to the mapping. Default: true

includeUpdated

(optional) If true, updated entries will be passed on to the mapping. Default: true

mode

Allowed values are csv, sql and auto (see below).

quote

(optional) Sets the quote character. Default: "

splitHeaderToken

(optional) This separating line will be inserted if useFileSplitter=true. Default: header!

useFileSplitter

(optional) If true, inserts a separating line (parameter splitHeaderToken) after a certain number of lines (parameter hashSelectQueryBatchSize), to divide the file into several records. Default: true

writeHeadLine

(optional) Defines if an existing headline will be written in the created file. Default: true


hashSelectQuery


The SQL select statement to check the stored hash values.

Important note: The statement must be created using the IN operator (see example file below).

CSV/SQL mode


Determines whether the data is read in via an SQL select statement or from a CSV file. If mode=csv, the parameter idColumnIndex has to be set to define which column contains the key value. If mode=sql, the property sourceIdColumn of configuration file sql_input_sample.properties is used for that purpose. Of course, if mode=sql is used, the database table from which the data is selected (here shipmentstatus) must already exist and, if applicable, contain data.

Note: If the Cron Job API class DefaultFileSQLCron is used, mode=csv has to be set.

sql_input_sample.properties


This file has to be configured if the data to be checked is provided by an SQL select statement from an existing database (mode=sql). In that case, remember to provide the file sql_input_sample.properties (in addition to the configuration file sample_HashMergePreparser.properties) via Cron Job → File.

Result of the preparser


As a result, two additional columns, action (values N=new, U=updated) and hash, are created and placed in front of the existing columns of the CSV file or the SQL result set (here from the table shipmentstatus). These additional columns must be added to the source structure. With parameters includeNew=false and includeUpdated=false, the adding of these columns can be prevented.
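The resulting line layout can be sketched as follows. This is an illustrative model (the function name emit is an assumption), showing the two extra columns prepended to an unchanged source line.

```python
import hashlib

def emit(action, line, delimiter=","):
    """Prepend the action ('N' or 'U') and the line's hash to the line."""
    digest = hashlib.md5(line.encode("utf-8")).hexdigest()
    return delimiter.join([action, digest, line])

row = emit("N", "4711,delivered,2024-01-05")
print(row)  # N,<md5 of the line>,4711,delivered,2024-01-05
```

The source structure therefore has to expect action and hash as the first two fields, followed by the original columns.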

Create table for hash values


createTable.sql
CREATE TABLE shipmentstatus_hashvalue (
sendungsid VARCHAR(20) NOT NULL,
hashvalue VARCHAR(100) NOT NULL,
PRIMARY KEY (sendungsid)
)
COLLATE='utf8_general_ci'
ENGINE=InnoDB
ROW_FORMAT=DEFAULT

Note: This statement is written for MySQL.

Examples of the configuration files


sample_HashMergePreparser.properties
###################################################
# hash calculation method like MD5, SHA1, etc.
# default: MD5
hashMethod=MD5
###################################################
###################################################
# identifies if head line should be written, if available
#
# default=true
writeHeadLine=true
###################################################
###################################################
#include N lines in output csv
includeNew=true
#include U lines in output csv
includeUpdated=true
#exclude O lines from output csv
includeOriginal=false
###################################################
###################################################
# io encoding
# default: utf8
encoding=utf8
###################################################
###################################################
# log debug messages
# default=false
debug=false
###################################################
###################################################
# fetch mode: csv|sql|auto
#
# csv mode:
# Input is a CSV File which will be parsed
# sql mode:
# Input is a Properties File with sql config
# auto mode:
# tries sql mode first, then csv mode
# required
mode=sql
###################################################
###################################################
# csv i/o configuration
#
# Delimiter character
# default=,
delimiter=,
#
# Quote character
# default="
# quote=
#
# useFileSplitter
# insert splitHeaderToken before each 100 lines (hashSelectQueryBatchSize)
# default=true
useFileSplitter=true
#
# splitHeaderToken
# default=header!
splitHeaderToken=header!
###################################################
###################################################
# Hash storage settings
###################################################
# Datawizard database alias
# required
hashAlias=test
#
###################################################
# SQL select query to fetch the stored hash
# value by given id
# required, has to return the id and hash value columns
hashSelectQueryBatchSize=100
hashSelectQuery=SELECT sendungsid, hashvalue FROM shipmentstatus_hashvalue WHERE sendungsid in ( @0:s@ )
###################################################
#
###################################################
# insert hash sql statement with 2 parameters
# not required
hashInsertStatement=INSERT INTO shipmentstatus_hashvalue ( sendungsid, hashvalue ) VALUES ( @0:s@, @1:s@ )
###################################################
#
###################################################
# update hash sql statement with 2 parameters
# not required
hashUpdateStatement=UPDATE shipmentstatus_hashvalue SET hashvalue = @1:s@ WHERE sendungsid = @0:s@
###################################################
###################################################
# index of the id Column in csv mode
# required in csv input mode
idColumnIndex=1
###################################################
# identifies if csv source has a head line
# default=true
hasHeadLine=true
###################################################
sql_input_sample.properties
###################################################
# sql mode configuration file
###################################################
###################################################
# Specifies the count of rows in memory
# while fetching data
# default=100
pageSize=100
###################################################
###################################################
# Datawizard database alias
# required
sourceAlias=test
###################################################
###################################################
# result id column name
sourceIdColumn=sendungsid
###################################################
###################################################
# the sql query to fetch input data
# required
sourceQuery=SELECT * FROM shipmentstatus
###################################################