Examples (ETL/ELT)
Each ETL/ELT pipeline consists of at least one ETL/ELT task.
Each task has a source and a destination.
Regardless of the data source, one row of source data is read in at a time. From each source row, one row of destination data is generated, and this continues until all source data has been processed.
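Purely as an illustration of this processing model (it says nothing about how the tool is implemented internally, and the file names are assumptions), a minimal Python sketch of the row-by-row behaviour could look like this:

import csv

def transform(source_row):
    # Placeholder transformation: here the destination row is simply the source row.
    return source_row

with open("data_in.csv", newline="") as src, open("data_out.csv", "w", newline="") as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for source_row in reader:                     # one source row is read in at a time ...
        writer.writerow(transform(source_row))    # ... and one destination row is generated from it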
Example 1 (ETL/ELT source fields)
Let's assume a pipeline with one task.
We use the following simple CSV file (with a header row) as the data source and also want to write to a CSV file as the destination.
Firstname,Lastname,Points
Robert,Hammer,55
We create a new ETL/ELT pipeline (1) with the name My_pipeline_1_EN and leave all default settings.
Then we create a new ETL/ELT task (2) with the name my_task_1 and jump directly into the source data (3) and set it to type CSV/Text.
There we enter the path (4) to our CSV file and set the checkbox (5) to indicate that our file contains a header row. We enter a comma as the separator (6).
As you can see below (4), the column names have already been recognised because of setting (5).
Remember that one row at a time is read in from the source data. This row is now internally broken down into so-called ETL/ELT source fields. The names of these ETL/ELT source fields are taken from the header row and are $Firstname, $Lastname and $Points.
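Purely as an illustration (the $ names come from the header row, the values from the sample file above), the single data row therefore yields these ETL/ELT source fields:

source_fields = {
    "$Firstname": "Robert",
    "$Lastname": "Hammer",
    "$Points": "55",
}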
Now we jump directly to the destination data (7) to build our destination data row there with our ETL/ELT source fields.
We also enter CSV/Text as the destination data type (8) and enter the name and path to the file to be generated (9).
Then we select our ETL/ELT source fields (10) and move them to the right with (11). With this we have already defined the destination data row to be generated: $Firstname,$Lastname,$Points
Now click on (12). Another dialogue appears.
Leave (13) as shown in the screenshot (we only have one row of data in our source file anyway) and then click on (14).
As you will see in your file system, you have already created your first file.
Later you can change details such as separators, headers and the quoting character in the settings, but we won't dwell on that for now.
Import:
Example 2 (Custom ETL/ELT fields)
In this example, we use the pipeline from example 1 as a basis. There we showed how to build the destination data from the ETL/ELT source fields read in from the source data.
Now we want to show how you can define and use custom ETL/ELT fields. To do this, we switch to the corresponding tab and create a new field via the context menu.
As its value, we want the custom ETL/ELT field to take the value of the ETL/ELT source field Points coming from the source data and add 20 to it (a short sketch of this calculation follows the numbered steps below).
(1) The name of our new custom ETL/ELT field.
(2) We use a function to calculate the field value.
(3) Move the desired function to the right with the arrow, as usual.
(4) There is a specific parameter type for ETL/ELT source fields. You can then select the desired field, here Points.
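Using the sample row from example 1, the calculation behind this custom field can be sketched like this (the variable names are illustrative, not taken from the tool):

source_points = "55"                        # value of the ETL/ELT source field Points
my_custom_field = int(source_points) + 20   # function result used as the field value -> 75
print(my_custom_field)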
Now we switch to the destination data tab. We select our custom ETL/ELT field (5) instead of the ETL/ELT source field Points (6) and move it to the right to use it as the third value of our destination data row (7).
Do a test run as shown in example 1 and see the result.
Import:
Example 3 (ETL/ELT variables)
We again use the pipeline from example 1 as a basis. In the destination data row, however, we now use an ETL/ELT variable instead of an ETL/ELT source field or a custom ETL/ELT field. To do this, we first create a new variable in the corresponding tab and give it a value.
Then we use this variable in our destination data row.
Do a simple test run as shown in example 1 and see the result. Note: In this example, we have given the variable an initial value, but the value of the variable could also be passed by a workflow, for example, when it starts the pipeline.
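Conceptually (the names and the value are illustrative), the variable simply replaces one value in the destination data row:

my_variable = "initial value"            # could also be set by a calling workflow
destination_row = ",".join(["Robert", "Hammer", my_variable])
print(destination_row)                   # -> Robert,Hammer,initial value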
Import:
Example 4 (Child task)
In this example we want to show the interaction between parent and child tasks.
To do this, we use the parent task my_task_1, in which we read in a CSV file with several rows and do not generate any destination data. The child task my_task_1.1 is then executed once for each row of the CSV file. In the child task, we ignore the source data and create one destination data row. In that destination data row, we use the ETL/ELT source fields from the parent task (which we have access to in the child task). In this way we create one destination file per row of the original source file in the parent task.
For the names of the generated files, we use a counter variable that we increment in the conditions of the child task's destination data (see the construct there).
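The following Python lines only sketch the parent/child behaviour described above; the sample rows, file names and counter handling are assumptions, not the tool's actual configuration:

parent_source_rows = [                        # rows read by the parent task my_task_1
    ["Robert", "Hammer", "55"],
    ["Anna", "Example", "40"],
]
counter = 0
for parent_row in parent_source_rows:         # the child task my_task_1.1 runs once per row
    counter += 1                              # counter variable incremented in the child task's conditions
    with open(f"out_{counter}.csv", "w") as f:
        f.write(",".join(parent_row) + "\n")  # the parent's ETL/ELT source fields are accessible here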
Do a simple test run as shown in example 1 and see the result.
Import:
Example 5 (Sibling task, clipboard)
In this example we want to show the interaction between sibling tasks.
In task my_task_1 we read in a CSV file, do not create a destination file, and write all ETL/ELT source fields to the clipboard, because we want to use them in the sibling task my_task_2 but do not have direct access to them there. In the sibling task my_task_2 we then read the clipboard as the source data and create the destination file from it.
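Purely as an illustration of this hand-over (the sample rows and the destination file name are assumptions):

source_rows = [["Robert", "Hammer", "55"],    # rows read by my_task_1
               ["Anna", "Example", "40"],
               ["Max", "Example", "12"]]

clipboard = []                                # my_task_1: no destination file, all fields go to the clipboard
for row in source_rows:
    clipboard.append(row)

with open("data_out.csv", "w") as f:          # my_task_2: the clipboard is read as the source data
    for row in clipboard:                     # and the destination file is created from it
        f.write(",".join(row) + "\n")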
Do a simple test run (with 3 rows) as shown in example 1 and see the result.
Import:
Example 6 (Practical example currency conversion)
In this example, we read in a file with a list of order items and create an identical file from it, but the item prices are converted from Euros to US dollars. The following is the source file.
item_number,price,quantity
102788,17.99,2
116700,4.49,3
210098,23.77,1
(1) In the first task we only fetch the conversion rate from Euros to US Dollars.
(2) To do this, we use a function chain in the conditions. We fetch the conversion rate via an HTTP call, which in this example merely returns a random number. We then store this value in an ETL/ELT variable.
(3) In a second task we then read the input file. We also define a custom ETL/ELT field. The value of this field is the item price read in, multiplied by the conversion rate stored in our ETL/ELT variable.
(4) In the destination data we then use the ETL/ELT source fields, but for the price we use our custom ETL/ELT field instead.
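A minimal Python sketch of the overall logic of this example, assuming a fixed conversion rate in place of the HTTP call and illustrative file names:

import csv

eur_to_usd = 1.10                             # (1)/(2): rate stored in an ETL/ELT variable

with open("orders_in.csv", newline="") as src, open("orders_out.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)              # (3): read the input file row by row
    writer = csv.DictWriter(dst, fieldnames=["item_number", "price", "quantity"])
    writer.writeheader()
    for row in reader:
        # custom ETL/ELT field: item price multiplied by the conversion rate
        row["price"] = f"{float(row['price']) * eur_to_usd:.2f}"
        writer.writerow(row)                  # (4): destination row with the converted price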
Import:
Example 7 (Practical example duplicate handling)
Let's assume we want to read in a CSV file and make sure we remove any duplicates it may contain before writing to any destination (in the example, another CSV file).
ID,Name
1,Peter
2,Paul
1,Peter
3,Michael
As can be seen, there are two entries with the same ID in the sample file. To clean up duplicates of this type, we proceed as follows.
For each row to be written, we define conditions (1) that tell us whether this row contains an ID that has already been written. In the function chain defined there, we return true if the ID does not yet exist in an internal ID list and then add the ID to that list. If the ID already exists in the list, we return false.
If true, the primary data destination (2) is used and we write the row that was read (3) there.
If false, we execute a separate action (4) for this row, which is configured via (5) (see the following screenshot); in other words, the primary data destination is ignored.
In (6) we define the secondary data destination, where we write the data that was read (7) instead.
So our cleaned up data is written into the destination file data_out.csv and duplicates end up in the destination file duplicates_out.csv.
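The duplicate check itself can be sketched in a few lines of Python; the internal ID list and the file handling are simplified illustrations of the function chain and destinations described above:

seen_ids = set()                                      # internal ID list

def is_new_id(row_id):
    """Return True and remember the ID if it has not been written yet, otherwise False."""
    if row_id in seen_ids:
        return False
    seen_ids.add(row_id)
    return True

rows = [("1", "Peter"), ("2", "Paul"), ("1", "Peter"), ("3", "Michael")]
with open("data_out.csv", "w") as primary, open("duplicates_out.csv", "w") as secondary:
    for row_id, name in rows:
        destination = primary if is_new_id(row_id) else secondary   # primary vs. secondary destination
        destination.write(f"{row_id},{name}\n")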
Import:
If we write the data into a database instead, there is a more elegant solution when the ID is the primary key there: attempting to write a row with an ID that already exists in the database causes an SQL error. We can catch this error with the error handling and execute a separate action there in the same way.
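As a rough sketch of that alternative, using SQLite in place of the actual target database and an assumed table layout:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (ID INTEGER PRIMARY KEY, Name TEXT)")

for row_id, name in [(1, "Peter"), (2, "Paul"), (1, "Peter"), (3, "Michael")]:
    try:
        conn.execute("INSERT INTO people (ID, Name) VALUES (?, ?)", (row_id, name))
    except sqlite3.IntegrityError:
        # duplicate primary key -> handled like the separate action in the pipeline
        print(f"duplicate ID {row_id} skipped")
conn.commit()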
Example 8 (Practical example: Direct bulk insert file→ database)
This example shows how to write data directly from a text file into a database. This provides a significant processing speed advantage and can be used when no changes need to be made to the data. The database used in this example is MySQL (for other databases please refer to the documentation of the respective database manufacturer for similar functions). Note: See also https://dev.mysql.com/doc/refman/8.0/en/load-data.html .
(1) ETL/ELT field containing the file path and name.
(2) Use of the ETL/ELT field from (1).
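For orientation only, a raw statement of the kind issued here might look roughly as follows, executed from Python with mysql-connector-python; the connection details, table and column layout are assumptions, and the exact options should be taken from the MySQL documentation linked in this example:

import mysql.connector

conn = mysql.connector.connect(host="localhost", user="etl_user", password="secret",
                               database="etl_db", allow_local_infile=True)
cur = conn.cursor()
cur.execute("""
    LOAD DATA LOCAL INFILE '/path/to/data_in.csv'
    INTO TABLE orders
    FIELDS TERMINATED BY ','
    IGNORE 1 LINES
""")
conn.commit()
conn.close()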
An XML file can be processed analogously. Note: See also https://dev.mysql.com/doc/refman/8.0/en/load-xml.html.