HashMergePreparser
Configuration file | sample_HashMergePreparser.properties and sql_input_sample.properties (read in with Cron Job → File if mode=sql).
Class name | com.ebd.hub.datawizard.parser.HashMergePreparser
Description
This preparser compares data from a CSV file or a database table and checks whether the data was modified, with the help of a temporary database. For this purpose, a checksum (hash value) is calculated over an entire line of data for a specified key and compared to the existing entry (also key and hash value) in a database table. If the key does not exist yet, the data is labelled as new. If a hash value for the key already exists, the two hash values are compared; if the data has been modified, it is labelled as updated. Otherwise, the data is not passed on to the mapping and is deleted from the file (respectively from the result set of the database SELECT).
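The classification logic described above can be pictured as follows. This is a minimal Python sketch under assumptions, not the preparser's actual implementation; the function name and data shapes are made up for illustration:

```python
import hashlib

def classify(rows, key_index, stored_hashes, hash_method="md5"):
    """Label each row as new ('N') or updated ('U'); drop unchanged rows.

    rows          -- list of rows, each a list of column values
    key_index     -- index of the key column (cf. idColumnIndex)
    stored_hashes -- dict mapping key -> previously stored hash value
    hash_method   -- algorithm name (cf. hashMethod, e.g. MD5, SHA1)
    """
    result = []
    for row in rows:
        key = row[key_index]
        digest = hashlib.new(hash_method, ",".join(row).encode("utf8")).hexdigest()
        if key not in stored_hashes:
            result.append(("N", digest, row))   # key unknown: new
        elif stored_hashes[key] != digest:
            result.append(("U", digest, row))   # key known, data changed: updated
        # unchanged rows are dropped (not passed on to the mapping)
    return result
```

In the real preparser, the stored hashes are fetched via hashSelectQuery and maintained via hashInsertStatement/hashUpdateStatement instead of a dictionary.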
Parameters
The following parameters are available for configuration.
Parameter | Description
debug | (optional) If true, a detailed log is written. Default: false
delimiter | (optional) Defines the delimiter for the CSV file. Default: ,
encoding | (optional) Sets the encoding (for reading and writing). Default: utf8
excludeColumnsForHash | (optional) Columns that will be ignored when generating the hash value (index starting at 0). Example: excludeColumnsForHash=3,5,7
hashAlias | The database alias (as defined in configuration file database.xml) for the hash table.
hasHeadLine | (optional) Indicates whether the CSV file has a header line. Default: true
hashInsertStatement | The INSERT statement for inserting new hash values.
hashMethod | (optional) Defines the hash calculation method (e.g. MD5, SHA1). Default: MD5
hashSelectQuery | The SQL SELECT statement used to check the stored hash values (see below).
hashSelectQueryBatchSize | Number of lines written in one record (see useFileSplitter).
hashUpdateStatement | The UPDATE statement for updating existing hash values.
idColumnIndex | (important for mode csv) Defines which column of the CSV file contains the key. If mode=sql, the property sourceIdColumn in configuration file sql_input_sample.properties is used instead.
includeNew | (optional) If true, new entries are passed on to the mapping. Default: true
includeOriginal | (optional) If true, unchanged entries are also passed on to the mapping. Default: true
includeUpdated | (optional) If true, updated entries are passed on to the mapping. Default: true
mode | Allowed values are csv and sql (see below). The sample configuration file additionally lists auto, which tries sql mode first, then csv mode.
quote | (optional) Sets the quote character. Default: "
splitHeaderToken | (optional) The separating line that is inserted if useFileSplitter=true. Default: header!
useFileSplitter | (optional) If true, a separating line (parameter splitHeaderToken) is inserted after a certain number of lines (parameter hashSelectQueryBatchSize) to divide the file into several records. Default: true
writeHeadLine | (optional) Defines whether an existing header line is written to the created file. Default: true
hashSelectQuery
The SQL select statement to check the stored hash values.
Important note: The statement must be created using the IN operator (see example file below).
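Because the statement uses the IN operator, the stored hashes can be fetched in batches of keys rather than one query per line. A sketch under assumptions (the real statement uses the placeholder @0:s@ as in the example file below; the generic ? placeholders here are for illustration only):

```python
def in_batches(keys, batch_size=100):
    """Yield (query, key_batch) pairs, one IN query per batch of at most
    batch_size keys (cf. hashSelectQueryBatchSize)."""
    for i in range(0, len(keys), batch_size):
        batch = keys[i:i + batch_size]
        placeholders = ", ".join("?" * len(batch))
        query = ("SELECT sendungsid, hashvalue FROM shipmentstatus_hashvalue "
                 "WHERE sendungsid IN ( " + placeholders + " )")
        yield query, batch
```

For 250 keys and a batch size of 100, this produces three queries with 100, 100, and 50 keys.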
CSV/SQL mode
Determines whether the data is read in by an SQL SELECT statement or from a CSV file. If mode=csv, the parameter idColumnIndex has to be set to define which column contains the key value. If mode=sql, the property sourceIdColumn of configuration file sql_input_sample.properties is used for that purpose. Note that with mode=sql, the database table from which the data is selected (here shipmentstatus) already has to exist and contain the data to be checked.
Note: If the Cron Job API class DefaultFileSQLCron is used, mode=csv has to be set.
sql_input_sample.properties
This file has to be configured if the data to be checked is provided by an SQL SELECT statement from an existing database (mode=sql). In that case, remember to provide the file sql_input_sample.properties (in addition to the configuration file sample_HashMergePreparser.properties) via Cron Job → File.
Result of the preparser
As a result, two additional columns, action (values N=new, U=updated) and hash, are created and put in front of the existing columns of the CSV file or the SQL result set (here from the table shipmentstatus). These additional columns must be added to the source structure. With parameters includeNew=false and includeUpdated=false, the adding of those additional columns can be prevented.
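Assembling the output can be sketched as follows: the two additional columns are prepended to each line, and with useFileSplitter=true the separating line (splitHeaderToken) is inserted for every batch of hashSelectQueryBatchSize lines. A hypothetical sketch, not the actual implementation:

```python
def build_output(labelled_rows, batch_size=100, split_token="header!", delimiter=","):
    """Build the preparser output text.

    labelled_rows -- iterable of (action, hash, row) tuples, where action is
                     'N' or 'U' and row is a list of the original columns
    """
    lines = []
    for i, (action, digest, row) in enumerate(labelled_rows):
        if i % batch_size == 0:
            lines.append(split_token)   # start a new record (useFileSplitter)
        lines.append(delimiter.join([action, digest] + row))
    return "\n".join(lines)
```

For three labelled rows and batch_size=2, the output contains the split token before the first and before the third line.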
Create table for hash values
CREATE TABLE shipmentstatus_hashvalue (
    sendungsid VARCHAR(20) NOT NULL,
    hashvalue VARCHAR(100) NOT NULL,
    PRIMARY KEY (sendungsid)
)
COLLATE = 'utf8_general_ci'
ENGINE = InnoDB
ROW_FORMAT = DEFAULT
Note: For MySQL.
Examples of the configuration files
###################################################
# hash calculation method like MD5, SHA1, etc.
# default: MD5
hashMethod=MD5
###################################################
###################################################
# identifies if head line should be written, if available
#
# default=true
writeHeadLine=true
###################################################
###################################################
#include N lines in output csv
includeNew=true
#include U lines in output csv
includeUpdated=true
#exclude O lines from output csv
includeOriginal=false
###################################################
###################################################
# io encoding
# default: utf8
encoding=utf8
###################################################
###################################################
# log debug messages
# default=false
debug=false
###################################################
###################################################
# fetch mode: csv|sql|auto
#
# csv mode:
# Input is a CSV File which will be parsed
# sql mode:
# Input is a Properties File with sql config
# auto mode:
# tries sql mode first, then csv mode
# required
mode=sql
###################################################
###################################################
# csv i/o configuration
#
# Delimiter character
# default=,
delimiter=,
#
# Quote character
# default="
# quote=
#
# useFileSplitter
# insert splitHeaderToken before each 100 lines (hashSelectQueryBatchSize)
# default=true
useFileSplitter=true
#
# splitHeaderToken
# default=header!
splitHeaderToken=header!
###################################################
###################################################
# Hash storage settings
###################################################
# Datawizard database alias
# required
hashAlias=test
#
###################################################
# SQL select query to fetch the stored hash
# value by given id
# required, has to return the key and the stored hash value
hashSelectQueryBatchSize=100
hashSelectQuery=SELECT sendungsid, hashvalue FROM shipmentstatus_hashvalue WHERE sendungsid in ( @0:s@ )
###################################################
#
###################################################
# insert hash sql statement with 2 parameters
# not required
hashInsertStatement=INSERT INTO shipmentstatus_hashvalue ( sendungsid, hashvalue ) VALUES ( @0:s@, @1:s@ )
###################################################
#
###################################################
# update hash sql statement with 2 parameters
# not required
hashUpdateStatement=UPDATE shipmentstatus_hashvalue SET hashvalue = @1:s@ WHERE sendungsid = @0:s@
###################################################
###################################################
# index of the id Column in csv mode
# required in csv input mode
idColumnIndex=1
###################################################
# identifies if csv source has a head line
# default=true
hasHeadLine=true
###################################################
###################################################
# sql mode configuration file
###################################################
###################################################
# Specifies the count of rows in memory
# while fetching data
# default=100
pageSize=100
###################################################
###################################################
# Datawizard database alias
# required
sourceAlias=test
###################################################
###################################################
# result id column name
sourceIdColumn=sendungsid
###################################################
###################################################
# the sql query to fetch input data
# required
sourceQuery=SELECT * FROM shipmentstatus
###################################################