PdfToXMLPreparser

Group

Preparsers

Class Name

com.ebd.hub.datawizard.parser.PdfToXMLPreparser

Function

This preparser is used to extract text and some other information from a PDF document and convert it to an XML file.

Configuration File

PdfToXMLPreparser.properties

Description


The preparser reads a PDF document, extracts its text and, depending on the configuration, additional information (metadata, form data, images), and writes the result to an XML file. The following parameters are available.


onlyFormData=false
withDocumentInformation=true
withFormData=true
withImages=false
#textElementSeparator=:

Parameter

Description

onlyFormData

(optional) Specifies whether only form data should be extracted (and no other text of the PDF). Default: false. Important note: If true, the parameter withFormData must also be true, otherwise no form data is extracted.

withDocumentInformation

(optional) Specifies whether metadata should be extracted. Default: true.

withFormData

(optional) Specifies whether form data should be extracted. Default: true.

withImages

(optional) Specifies whether images should be extracted. Default: false.

textElementSeparator

(optional) Replacement value for PDF control characters. In the sample configuration above, the line is commented out with #, so the parameter is not set.
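
The defaults and the dependency between onlyFormData and withFormData can be illustrated with a small Java sketch. It is not the preparser's implementation. The class PdfToXmlConfigSketch and its method names are hypothetical, and it assumes that the configuration file follows the standard java.util.Properties syntax.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper for illustration only; not part of the product.
final class PdfToXmlConfigSketch {
    final boolean onlyFormData;
    final boolean withDocumentInformation;
    final boolean withFormData;
    final boolean withImages;
    final String textElementSeparator; // null if the line is missing or commented out

    private PdfToXmlConfigSketch(Properties p) {
        // Defaults as documented in the parameter descriptions.
        onlyFormData = Boolean.parseBoolean(p.getProperty("onlyFormData", "false"));
        withDocumentInformation = Boolean.parseBoolean(p.getProperty("withDocumentInformation", "true"));
        withFormData = Boolean.parseBoolean(p.getProperty("withFormData", "true"));
        withImages = Boolean.parseBoolean(p.getProperty("withImages", "false"));
        textElementSeparator = p.getProperty("textElementSeparator");
    }

    // Parses the content of a properties file given as text.
    static PdfToXmlConfigSketch parse(String propertiesText) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new PdfToXmlConfigSketch(p);
    }

    // True for the combination the important note warns about:
    // only form data is requested, but form data extraction is switched off.
    boolean formDataConflict() {
        return onlyFormData && !withFormData;
    }
}
```

A check such as formDataConflict() could be run on a configuration before a profile is deployed, to catch the combination onlyFormData=true with withFormData=false early.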


The structure of the XML file generated by the preparser depends on the parameter values, so the source structure of your profile has to match your settings and the XML file generated for them. The example profiles below already contain a matching source structure.

Examples


The following examples use this simplified PDF file, which contains both 'normal' text and form data: example.pdf

Example 1


First, we extract only the form data from the PDF file, using the following configuration file.


onlyFormData=true
withDocumentInformation=false
withFormData=true
withImages=false
#textElementSeparator=:


Example profile: Profile-PdfToXMLPreparser.pak

Example 2


Now we extract everything (except images), i.e. the form data and the 'normal' text.


onlyFormData=false
withDocumentInformation=true
withFormData=true
withImages=false
#textElementSeparator=:


Example profile: Profile-PdfToXMLPreparser_2.pak

Example 3


As you may have noticed in the second example, the text lines within a PDF page are (for technical reasons) extracted backwards, i.e. from bottom to top.

If the order of the extracted data is important, you can restore it in the profile. In the example profile, look at the additional calculation field sort_field in the destination structure and at the attributes Sort field and Sorting of the destination structure node LineData.

Example profile: Profile-PdfToXMLPreparser_3.pak
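
The idea behind sorting on an additional field can be sketched in Java. This is not how the profile or the preparser works internally. The class LineOrderSketch is hypothetical, and the sketch assumes that the sort field simply holds the position of each line in extraction order, so that sorting it in descending order within a page restores the reading order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration of sorting lines by an extra sort field.
final class LineOrderSketch {

    static final class Line {
        final int page;
        final int sortField; // assumed: position of the line in extraction order
        final String text;

        Line(int page, int sortField, String text) {
            this.page = page;
            this.sortField = sortField;
            this.text = text;
        }
    }

    // Tags the lines of one page (given bottom to top) with their extraction position.
    static List<Line> tag(int page, List<String> extractedBottomToTop) {
        List<Line> tagged = new ArrayList<>();
        for (int i = 0; i < extractedBottomToTop.size(); i++) {
            tagged.add(new Line(page, i, extractedBottomToTop.get(i)));
        }
        return tagged;
    }

    // Orders by page ascending, then by sort field descending (top line first).
    static List<String> readingOrder(List<Line> lines) {
        List<Line> sorted = new ArrayList<>(lines);
        sorted.sort(Comparator.comparingInt((Line l) -> l.page)
                .thenComparing(Comparator.comparingInt((Line l) -> l.sortField).reversed()));
        List<String> texts = new ArrayList<>();
        for (Line l : sorted) {
            texts.add(l.text);
        }
        return texts;
    }
}
```

The page order is left untouched and only the lines within each page are reversed, which matches the statement that the backwards extraction applies within a PDF page.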