Journal article: Design and implementation of ETL processes using BPMN and relational algebra
Date
2020-06-13
Abstract
"Extraction, transformation, and loading (ETL) processes are used to extract data from internal
and external sources of an organization, transform these data, and load them into a data
warehouse. The Business Process Model and Notation (BPMN) has been proposed for
expressing ETL processes at a conceptual level. A different approach is studied in this paper,
where relational algebra (RA), extended with update operations, is used for specifying ETL
processes. In this approach, data tasks in an ETL workflow can be automatically translated
into SQL queries to be executed over a DBMS. To illustrate this study, the paper addresses the
problem of updating Slowly Changing Dimensions (SCDs) with dependencies, that is, the case
when updating an SCD table affects associated SCD tables. Tackling this problem requires
extending the classic RA with update operations. The paper also shows the implementation
of a portion of the TPC-DI benchmark that results from both approaches. Thus, the paper
presents three implementations: (a) an SQL implementation based on the extended RA-based
specification of an ETL process expressed in BPMN4ETL; and (b) two implementations of
workflows that follow from BPMN4ETL, one that uses the Pentaho DI tool, and another one
that uses Talend Open Studio for DI. Experiments over these implementations of the TPC-DI
benchmark for different scale factors were carried out, and are described and discussed in the
paper, showing that the extended RA approach results in more efficient processes than the ones
produced by implementing the BPMN4ETL specification over the mentioned ETL tools. The
reasons for this result are also discussed."
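
As a rough illustration of what updating a Slowly Changing Dimension through plain SQL can look like, the sketch below applies a Type 2 change (close the current row, insert a new version) to a single dimension table using Python's built-in sqlite3 module. It is not the paper's extended RA, nor its RA-to-SQL translation, and it does not cover dependencies between SCD tables. The table `dim_customer`, its columns, and the helpers `apply_scd2_change` and `build_demo` are hypothetical names chosen for this example.

```python
import sqlite3


def apply_scd2_change(conn, customer_id, new_city, effective_date):
    """Apply a Type 2 change: close the current row, then insert a new version.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    row = conn.execute(
        "SELECT city FROM dim_customer WHERE customer_id = ? AND is_current = 1",
        (customer_id,),
    ).fetchone()
    if row is not None and row[0] == new_city:
        return  # attribute unchanged, so no new version is needed

    # Update step: mark the current version (if any) as historical.
    conn.execute(
        "UPDATE dim_customer SET is_current = 0, end_date = ? "
        "WHERE customer_id = ? AND is_current = 1",
        (effective_date, customer_id),
    )
    # Insert step: add the new current version.
    conn.execute(
        "INSERT INTO dim_customer "
        "(customer_id, city, effective_date, end_date, is_current) "
        "VALUES (?, ?, ?, NULL, 1)",
        (customer_id, new_city, effective_date),
    )


def build_demo():
    """Create an in-memory dimension table and apply two successive changes."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE dim_customer (
               sk INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
               customer_id INTEGER NOT NULL,          -- business key
               city TEXT NOT NULL,
               effective_date TEXT NOT NULL,
               end_date TEXT,
               is_current INTEGER NOT NULL)"""
    )
    apply_scd2_change(conn, 1, "Buenos Aires", "2020-01-01")
    apply_scd2_change(conn, 1, "Brussels", "2020-06-13")
    return conn


if __name__ == "__main__":
    demo = build_demo()
    for record in demo.execute("SELECT * FROM dim_customer ORDER BY sk"):
        print(record)
```

Running the script leaves two rows for the same business key: the closed historical version and the current one. Handling dependent SCD tables, as addressed in the paper, would additionally require propagating the new surrogate key to the tables that reference it.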
Keywords
DATA WAREHOUSES, OLAP, ETL, BPMN