PDFPreParser
Configuration file | ./conf/samples/PDFPreParser.xml
Class name | com.ebd.hub.datawizard.parser.PDFPreParser
Description
Although PDF is not a format that can be processed automatically, this preparser can extract at least some selected information from the file. That information could, for example, be used to determine a specific Response for the file. The preparser cannot be used to extract complex or complete data from a PDF file.
First, the text in the PDF is extracted. All formatting and most of the layout are lost in that process. The resulting plain text is then processed according to the configured rules. The profile itself receives a simple CSV file with key-value pairs, e.g.:
Profilename;PDFTestProfile
CustomerName;Miller
CustomerNo;4711
OrderNo;12345
The first line always contains the key Profilename and the name of the profile. The rest of the content, which can also contain fixed values besides the extracted values, is defined by the user.
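As an illustration, a profile section whose tag names match the keys in the sample above might look like the following sketch. The element names are the ones described on this page. The marker strings "Customer:", "CustNo" and "Order:" and the line range are hypothetical, because they depend entirely on the text extracted from your PDF.

```xml
<?xml version="1.0" encoding="UTF8"?>
<PDFPreParser>
  <Profile>
    <!-- Must match the name of the profile that uses the preparser. -->
    <Name>PDFTestProfile</Name>
    <LineFrom>1</LineFrom>
    <LineTo>20</LineTo>
    <!-- Hypothetical marker strings; adapt them to your extracted text. -->
    <Tag>
      <Name>CustomerName</Name>
      <BeginsAfter>Customer:</BeginsAfter>
      <Words>1</Words>
    </Tag>
    <Tag>
      <Name>CustomerNo</Name>
      <BeginsAfter>CustNo</BeginsAfter>
      <Words>1</Words>
    </Tag>
    <Tag>
      <Name>OrderNo</Name>
      <BeginsAfter>Order:</BeginsAfter>
      <Words>1</Words>
    </Tag>
  </Profile>
</PDFPreParser>
```

Each Tag name becomes a key in the CSV file passed to the profile, and the extracted text becomes its value.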
Structure of the configuration file
See the example file at the end of the page.
The root element
The root element of the configuration file must always be named "PDFPreParser". It must contain one Profile element for each configured profile.
The "Profile" element
This element must contain a subordinate element Name, which assigns this section of the configuration file to a certain profile. If a profile uses the PDFPreParser and cannot find a section with its name in the configuration file, the profile will abort with an error.
The subordinate elements LineFrom and LineTo are also compulsory and define the line range in which values are searched, e.g. from line 10 to 20. Whether empty lines are counted as well can be set in the subordinate element IgnoreEmptyLines. If it is set to true, empty lines are ignored, i.e. not counted.
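A minimal sketch of such a line range, using the values from the sentence above. This is a fragment that belongs inside the root element; the profile name and the Tag are placeholders.

```xml
<Profile>
  <Name>Your_profile_name</Name>
  <!-- Search for values only in lines 10 to 20 of the extracted text. -->
  <LineFrom>10</LineFrom>
  <LineTo>20</LineTo>
  <!-- true: empty lines are ignored, i.e. not counted. -->
  <IgnoreEmptyLines>true</IgnoreEmptyLines>
  <!-- Placeholder tag that processes line 10. -->
  <Tag>
    <Name>TestElement</Name>
    <LineNumber>10</LineNumber>
  </Tag>
</Profile>
```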
Optionally, the element FixValue can be used to define an arbitrary number of fixed values. The optional element Replace replaces a single character with another character, where Old and New define the codes (as integers) of these characters.
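The following fragment sketches both optional elements. The FixValue names and values are illustrative only. The Replace block assumes decimal character codes, as used in the example file at the end of the page (108 for 'l'); here 59 is ';' and 44 is ','.

```xml
<Profile>
  <Name>Your_profile_name</Name>
  <LineFrom>1</LineFrom>
  <LineTo>4</LineTo>
  <!-- Fixed values; the names Source and Version are hypothetical. -->
  <FixValue Name="Source">PDF</FixValue>
  <FixValue Name="Version">1</FixValue>
  <!-- Replace every ';' (code 59) with ',' (code 44). -->
  <Replace>
    <Old>59</Old>
    <New>44</New>
  </Replace>
  <!-- Placeholder tag that processes line 1. -->
  <Tag>
    <Name>TestElement</Name>
    <LineNumber>1</LineNumber>
  </Tag>
</Profile>
```

Replacing semicolons may be worth considering because the result is passed on as a semicolon-separated file. Whether that is necessary in your case is best checked with a test run.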
The Tag elements configure which data is extracted from the PDF and how. A more detailed description is given below.
If the PDF contains form data, it can be read by using the setting WithFormData with the value true. If only the form data should be read, the additional setting OnlyFormData with the value true can be used.
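A sketch of the form data settings. This page does not state where these settings are placed, so the fragment assumes that WithFormData and OnlyFormData are subordinate elements of Profile, like the other settings. Please verify this against your installation.

```xml
<Profile>
  <Name>Your_profile_name</Name>
  <LineFrom>1</LineFrom>
  <LineTo>4</LineTo>
  <!-- Assumption: both settings are child elements of Profile. -->
  <!-- Also read form data contained in the PDF. -->
  <WithFormData>true</WithFormData>
  <!-- Read only the form data. -->
  <OnlyFormData>true</OnlyFormData>
  <!-- Placeholder tag that processes line 1. -->
  <Tag>
    <Name>TestElement</Name>
    <LineNumber>1</LineNumber>
  </Tag>
</Profile>
```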
The "Tag" element
Each data field (e.g. the customer number) is defined by one or more Tag elements with a subordinate element Name defining the name of the tag, e.g. CustomerNo. If there are several Tag elements with this name, the customer number can be found in several ways, which means it can be contained multiple times in the result. This might be useless in the case of the customer number, but an order could, for example, contain several article numbers, which should indeed all be listed. How the data is extracted is defined by the following rules.
LineNumber | Only process the whole line with this number (line number in the initially extracted text). Can be combined with all other extraction rules, but lines not having this number will be ignored, even if another rule fits.
LinesAfter | Only process the specified number of lines after the line in which the defined tag was found. All other lines will be ignored. Can be combined with all other extraction rules. Example: <LinesAfter Tag="CustomerNo">2</LinesAfter> processes the second line after the line in which the tag CustomerNo was found.
BeginsAfter | Use the text after this string. Can be combined with EndsBefore, Characters, and Words.
EndsBefore | Use the text before this string. Can be combined with BeginsAfter, Characters, and Words.
FirstWord | Take the words (separated by blanks) starting at the specified word number (offset: 1). Can be combined with LastWord or Words (LastWord is stronger than Words).
LastWord | Take the words (separated by blanks) until the specified word number (offset: 1). Can be combined with FirstWord or Words (FirstWord is stronger than Words).
Words | Can be combined with any of the rules above. Returns the specified number of words (separated by blanks). Attribute Direction: the first (default) or last (value = last) words.
Characters | Can be combined with LineNumber, BeginsAfter, EndsBefore, and Words. Cannot be combined with FirstWord and LastWord. Returns the specified number of characters. Is stronger than Words, meaning that the specified number of characters is extracted first and then the words of that result are counted. Attribute Direction: the first (default) or last (value = last) characters.
Trim | Normally, extracted values are 'trimmed' (i.e. leading and trailing whitespace is removed). This can be prevented by setting the element Trim to false.
IgnoreCase | If true, case is ignored.
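The rules above can be combined as in the following sketch. All tag names, marker strings and line numbers are hypothetical. The fragment also assumes that IgnoreCase applies to matching the BeginsAfter string, which this page does not state explicitly.

```xml
<Profile>
  <Name>Your_profile_name</Name>
  <LineFrom>1</LineFrom>
  <LineTo>30</LineTo>
  <!-- Last word after the text CustNo, ignoring case. -->
  <Tag>
    <Name>CustomerNo</Name>
    <BeginsAfter>CustNo</BeginsAfter>
    <IgnoreCase>true</IgnoreCase>
    <Words Direction="last">1</Words>
  </Tag>
  <!-- Second line after the line in which the tag CustomerNo was found, untrimmed. -->
  <Tag>
    <Name>Remark</Name>
    <LinesAfter Tag="CustomerNo">2</LinesAfter>
    <Trim>false</Trim>
  </Tag>
  <!-- Two Tag elements with the same name, so ArticleNo can occur twice in the result. -->
  <Tag>
    <Name>ArticleNo</Name>
    <LineNumber>12</LineNumber>
    <BeginsAfter>Article:</BeginsAfter>
    <Words>1</Words>
  </Tag>
  <Tag>
    <Name>ArticleNo</Name>
    <LineNumber>13</LineNumber>
    <BeginsAfter>Article:</BeginsAfter>
    <Words>1</Words>
  </Tag>
</Profile>
```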
Creation of the configuration file
To see the original plain text extracted from the PDF, on which the extraction rules are executed, you can run a test in your profile and have a look at the log. To do so, you need at least a rudimentary configuration file, which can then be extended step by step using this method.
First, we need a basic structure. Please use a regular text editor and make sure to save your file in the specified encoding. In this case, it is UTF8 (always without a BOM).
<?xml version="1.0" encoding="UTF8"?>
<PDFPreParser>
  <Profile>
    <Name>Your_profile_name</Name>
    <LineFrom>1</LineFrom>
    <LineTo>4</LineTo>
  </Profile>
</PDFPreParser>
This file would make the PDFPreParser look at the first 4 lines (not counting empty lines). The rest would be ignored.
It is important to enter the name of your test profile in the element Name and to define at least one Tag element. Otherwise, the profile will abort with an error.
<?xml version="1.0" encoding="UTF8"?>
<PDFPreParser>
  <Profile>
    <Name>Your_profile_name</Name>
    <LineFrom>1</LineFrom>
    <LineTo>4</LineTo>
    <Tag>
      <Name>TestElement</Name>
      <LineNumber>1</LineNumber>
    </Tag>
  </Profile>
</PDFPreParser>
PDFs with different formats that cannot be parsed with the same configuration have to be processed by several profiles with suitable configurations.
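Since the root element contains one Profile element per configured profile, a configuration file serving two profiles could be sketched as follows. The profile names, marker strings and line ranges are hypothetical.

```xml
<?xml version="1.0" encoding="UTF8"?>
<PDFPreParser>
  <!-- Section for a profile that handles one PDF layout. -->
  <Profile>
    <Name>Orders_Layout_A</Name>
    <LineFrom>1</LineFrom>
    <LineTo>10</LineTo>
    <Tag>
      <Name>OrderNo</Name>
      <BeginsAfter>Order:</BeginsAfter>
      <Words>1</Words>
    </Tag>
  </Profile>
  <!-- Section for a second profile that handles a different layout. -->
  <Profile>
    <Name>Orders_Layout_B</Name>
    <LineFrom>5</LineFrom>
    <LineTo>25</LineTo>
    <Tag>
      <Name>OrderNo</Name>
      <LineNumber>7</LineNumber>
      <Words Direction="last">1</Words>
    </Tag>
  </Profile>
</PDFPreParser>
```

Each profile looks up the section whose Name matches its own name, so the two sections can use entirely different rules for the same key.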
Example file
<?xml version="1.0" encoding="UTF8"?>
<PDFPreParser>
  <Profile>
    <Name>Your_profile_name</Name>
    <LineFrom>1</LineFrom>
    <LineTo>4</LineTo>
    <FixValue Name="Fix1">1</FixValue>
    <FixValue Name="Fix2">2</FixValue>
    <FixValue Name="Fix3">3</FixValue>
    <!-- Takes everything up to the second word of line 2.-->
    <Tag>
      <Name>CustomerName</Name>
      <LastWord>2</LastWord>
      <LineNumber>2</LineNumber>
    </Tag>
    <!-- This would replace all 'l' with 'f' and 'r' with 'e'.
    <Replace>
      <Old>108</Old>
      <New>102</New>
    </Replace>
    <Replace>
      <Old>114</Old>
      <New>101</New>
    </Replace>
    -->
    <!--Option 1: Takes the last word after the text CustNo.
    <Tag>
      <Name>CustomerNo</Name>
      <BeginsAfter>CustNo</BeginsAfter>
      <Words Direction="last">1</Words>
    </Tag>
    -->
    <!--Option 2: Takes the last 4 characters after the word CustNo and trims them.
    <Tag>
      <Name>CustomerNo</Name>
      <BeginsAfter>CustNo</BeginsAfter>
      <Trim>true</Trim>
      <Characters Direction="last">4</Characters>
    </Tag>
    -->
    <!--Takes the first word from the third line.
    <Tag>
      <Name>Street</Name>
      <LineNumber>3</LineNumber>
      <Words>1</Words>
    </Tag>
    -->
    <!--Takes the last word from the third line.
    <Tag>
      <Name>StreetNo</Name>
      <LineNumber>3</LineNumber>
      <Words Direction="last">1</Words>
    </Tag>
    -->
    <!--The last word in between City: and Phone:
    <Tag>
      <Name>City</Name>
      <BeginsAfter>City:</BeginsAfter>
      <EndsBefore>Phone:</EndsBefore>
      <Words Direction="last">1</Words>
    </Tag>
    -->
    <!--The first word of the first 10 characters in between City: and Phone:-->
    <!--As a reminder: Characters are executed before words!!
    <Tag>
      <Name>ZIP</Name>
      <BeginsAfter>City:</BeginsAfter>
      <EndsBefore>Phone:</EndsBefore>
      <Characters>10</Characters>
      <Words>1</Words>
    </Tag>
    -->
    <!--The fifth line after CustNo includes additional information, but just in the first 10 words.
    <Tag>
      <Name>AddInfo</Name>
      <LinesAfter Tag="CustNo">5</LinesAfter>
      <Words>10</Words>
    </Tag>
    -->
  </Profile>
</PDFPreParser>