Design and implementation of ETL processes using BPMN and relational algebra
Date
2020-06-13
Authors
Awiti, Judith
Vaisman, Alejandro Ariel
Zimányi, Esteban
Abstract
"Extraction, transformation, and loading (ETL) processes are used to extract data from internal
and external sources of an organization, transform these data, and load them into a data
warehouse. The Business Process Model and Notation (BPMN) has been proposed for
expressing ETL processes at a conceptual level. A different approach is studied in this paper,
where relational algebra (RA), extended with update operations, is used for specifying ETL
processes. In this approach, data tasks in an ETL workflow can be automatically translated
into SQL queries to be executed over a DBMS. To illustrate this study, the paper addresses the
problem of updating Slowly Changing Dimensions (SCDs) with dependencies, that is, the case
when updating an SCD table affects associated SCD tables. Tackling this problem requires
extending the classic RA with update operations. The paper also shows the implementation
of a portion of the TPC-DI benchmark that results from both approaches. Thus, the paper
presents three implementations: (a) an SQL implementation based on the extended RA-based
specification of an ETL process expressed in BPMN4ETL; and (b) two implementations of
workflows that follow from BPMN4ETL, one that uses the Pentaho DI tool, and another one
that uses Talend Open Studio for DI. Experiments over these implementations of the TPC-DI
benchmark for different scale factors were carried out, and are described and discussed in the
paper, showing that the extended RA approach results in more efficient processes than the ones
produced by implementing the BPMN4ETL specification over the mentioned ETL tools. The
reasons for this result are also discussed."