PdfToXMLPreparser

Configuration file

PdfToXMLPreparser.properties

Class name

com.ebd.hub.datawizard.parser.PdfToXMLPreparser

Description


This preparser extracts text and other information (form data, document metadata and, optionally, images) from a PDF document and converts it to an XML file.

Parameters


onlyFormData=false
withDocumentInformation=true
withFormData=true
withImages=false
#textElementSeparator=:

Parameter

Description

onlyFormData

(optional) Specifies whether only form data is extracted, without any other text of the PDF. Default: false. Important note: If true, parameter withFormData must also be true, otherwise no form data is extracted.

withDocumentInformation

(optional) Specifies whether metadata should be extracted. Default: true.

withFormData

(optional) Specifies whether form data should be extracted. Default: true.

withImages

(optional) Specifies whether images should be extracted. Default: false.

textElementSeparator

(optional) Value to replace PDF control characters with.
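
The defaults and the dependency between onlyFormData and withFormData can be illustrated with a small Java sketch. It is not the preparser's own code: the class PreparserConfigSketch and its method names are hypothetical, and it assumes the configuration file follows the standard java.util.Properties format, so that a line starting with # is ignored.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper for illustration only; not part of the product.
final class PreparserConfigSketch {
    final boolean onlyFormData;
    final boolean withDocumentInformation;
    final boolean withFormData;
    final boolean withImages;
    final String textElementSeparator; // null if the line is absent or commented out

    private PreparserConfigSketch(Properties p) {
        // Defaults as listed in the parameter table above.
        onlyFormData = Boolean.parseBoolean(p.getProperty("onlyFormData", "false"));
        withDocumentInformation = Boolean.parseBoolean(p.getProperty("withDocumentInformation", "true"));
        withFormData = Boolean.parseBoolean(p.getProperty("withFormData", "true"));
        withImages = Boolean.parseBoolean(p.getProperty("withImages", "false"));
        textElementSeparator = p.getProperty("textElementSeparator");
    }

    // Parses the content of a configuration file (assumed properties format).
    static PreparserConfigSketch parse(String fileContent) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(fileContent));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new PreparserConfigSketch(p);
    }

    // The combination the "Important note" warns about.
    boolean formDataCannotBeExtracted() {
        return onlyFormData && !withFormData;
    }

    public static void main(String[] args) {
        PreparserConfigSketch c = parse("onlyFormData=true\nwithFormData=false\n");
        System.out.println("Problematic combination: " + c.formDataCannotBeExtracted());
    }
}
```

A check of this kind is useful before starting a profile, because the combination onlyFormData=true and withFormData=false is syntactically valid but, according to the note above, yields no form data.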


The structure of the XML file generated by the preparser depends on the parameter values. To create a source structure that matches your settings and the resulting XML file, proceed as follows. This has already been done for the example profiles in this section.

  • Create a new profile.

  • Select this preparser and specify the created configuration file.

  • Select the checkbox "Result of preparser overrides backup file" (in "Main settings/Extensions").

  • Save the profile. It will be set to inactive because there is no source and target structure, but you can ignore that.

  • Start the profile with your PDF file. The profile will create an error, but you can ignore that too.

  • You can now use the input file of the profile job (in the Control Center) to create an XSD file, and then use that XSD file to generate a source structure automatically.

Examples


The examples in this section use the following simplified PDF file (with 'normal' text and form data): example.pdf

Example 1


First, we extract only the form data from the PDF file, using the following configuration file.


onlyFormData=true
withDocumentInformation=false
withFormData=true
withImages=false
#textElementSeparator=:


Example profile: Profile-PdfToXMLPreparser.pak

Example 2


Now we extract everything (except images), i.e. the form data and the 'normal' text.


onlyFormData=false
withDocumentInformation=true
withFormData=true
withImages=false
#textElementSeparator=:


Example profile: Profile-PdfToXMLPreparser_2.pak

Example 3


You may already have noticed in the second example that, within a PDF page, the text lines are extracted backwards (for technical reasons), i.e. from bottom to top.

If the order of the extracted data matters, you can restore it by sorting. In the example profile, see the additional calculation field sort_field in the target structure and the attributes "Sort field" and "Sorting" of target structure node LineData.

Example profile: Profile-PdfToXMLPreparser_3.pak
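
The principle behind such a sort field can be sketched in plain Java. This is an assumption about the general idea, not a description of the example profile: the class LineOrderSketch and its members are hypothetical, and the sketch assumes that sort_field is a running counter assigned in extraction order, which is then sorted in descending order for the lines of one page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only; not taken from the example profile.
final class LineOrderSketch {

    // One extracted line together with its (assumed) running counter.
    static final class Line {
        final int sortField;
        final String text;

        Line(int sortField, String text) {
            this.sortField = sortField;
            this.text = text;
        }
    }

    // Takes the lines of one page in extraction order (bottom to top)
    // and returns them in reading order (top to bottom).
    static List<String> topToBottom(List<String> extractedBottomToTop) {
        List<Line> lines = new ArrayList<>();
        int counter = 0;
        for (String text : extractedBottomToTop) {
            lines.add(new Line(++counter, text)); // number lines as they arrive
        }
        // Descending sort on the counter reverses the extraction order.
        lines.sort(Comparator.comparingInt((Line l) -> l.sortField).reversed());

        List<String> result = new ArrayList<>();
        for (Line line : lines) {
            result.add(line.text);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> extracted = Arrays.asList("Footer", "Body", "Heading");
        System.out.println(topToBottom(extracted));
    }
}
```

For a multi-page document, the counter would have to be sorted per page rather than across the whole document, since only the lines within a page are reversed.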