Data Engineering in a Mad Data World


This is the first article in a series on data engineering and its subdivisions. Let’s get straight into the best practices for handling data and the phases it needs to go through before it is used, so that information is neither misused nor incorrect.

A mad data world

In a world where terabytes of data in all kinds of formats, quantities, and qualities are generated every second, data engineers have become vital assets for organizations that aim to be data-driven. To give you an example: health professionals, such as doctors in Europe who are pursuing a master’s degree, have to study best practices in engineering, data science, and programming languages like Python and R. Fortunately, cloud computing solutions such as the Azure big data landscape bring enormous advantages. One of the biggest benefits is lower cost when provisioning cutting-edge technologies, together with easy integration between services.

Data engineers need to keep in mind that each use case may need a different solution. This can range from transporting the data to Azure Blob Storage or Azure Data Lake, to cleaning or normalizing the data in order to simplify it, to enriching it with other data. That other data may even come from a totally different source, such as Databricks or Azure Cosmos DB.
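
To make "normalizing" and "adding other data" concrete, here is a minimal sketch in plain Python. The field names (name, country_code, email) and the countries lookup are hypothetical and only serve as illustration; in a real pipeline the second source could be any other system.

```python
def normalize(record):
    """Trim whitespace, lowercase the keys, and turn blank strings into None."""
    cleaned = {}
    for key, value in record.items():
        if isinstance(value, str):
            value = value.strip() or None
        cleaned[key.strip().lower()] = value
    return cleaned


def enrich(record, lookup, key="country_code"):
    """Add fields from a second source, matched on a shared key."""
    extra = lookup.get(record.get(key), {})
    return {**record, **extra}


# Hypothetical raw record and hypothetical second source.
raw = {" Name ": "  Ana ", "Country_Code": "PT", "Email": "   "}
countries = {"PT": {"country_name": "Portugal"}}

result = enrich(normalize(raw), countries)
print(result)
```

Keeping the two steps as separate functions makes it easier to reuse the normalization when the enrichment source changes.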

Extract, transform, load

One of the best-known terms in the data market is ETL, short for extract, transform, load. These are three database functions that are combined to pull data out of one database and ingest it into another. In short, this is a data engineer’s main responsibility.

  1. Extract is the process of reading data from a database, a file, or a system. In this stage, the data is collected, often from multiple sources of different types.
  2. Transform is the process of converting the extracted data from its previous form into the form it needs to be in so that it can be placed into another database. The result may feed a dashboard in Power BI or a model in Azure Machine Learning Studio.
  3. Load is the process of getting the data into the target database.
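
The three steps above can be sketched with nothing but the Python standard library. This is an illustration under assumptions, not a production pipeline: the CSV content, the sales table, and its columns are made up, and an in-memory SQLite database stands in for the target database.

```python
import csv
import io
import sqlite3

# Hypothetical source data; in practice this would be a file or a database.
SOURCE_CSV = """id,name,amount
1, alice ,10.50
2,BOB,3
3,,7.25
"""


def extract(text):
    """Extract: read the raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: clean names, cast types, and skip rows without a name."""
    out = []
    for row in rows:
        name = row["name"].strip()
        if not name:
            continue
        out.append((int(row["id"]), name.title(), float(row["amount"])))
    return out


def load(rows, conn):
    """Load: write the transformed rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(id INTEGER PRIMARY KEY, name TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(SOURCE_CSV)), conn)
loaded = conn.execute("SELECT id, name, amount FROM sales ORDER BY id").fetchall()
print(loaded)
```

Two of the three source rows reach the target table; the row without a name is dropped in the transform step.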

There are also other cases in which the data is obtained from web servers, IoT devices, images, videos, or streaming data. In these cases, data engineers can make use of technologies such as Azure Event Hubs to transport the data from the IoT devices, and use Databricks to transport the data from the logs to the Azure Data Lake. There the data is stored, and quality, governance, and privacy measures are applied.
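
As an illustration of what a quality measure on incoming device data might look like, the sketch below validates events before they are stored. The event fields (device_id, timestamp, temperature), the sample values, and the accepted temperature range are assumptions chosen for the example; it does not use any Azure API.

```python
from datetime import datetime

# Hypothetical schema for an IoT event.
REQUIRED_FIELDS = ("device_id", "timestamp", "temperature")


def validate(event):
    """Return a list of problems; an empty list means the event passes."""
    problems = [f"missing {f}" for f in REQUIRED_FIELDS if event.get(f) is None]
    if problems:
        return problems
    try:
        datetime.fromisoformat(event["timestamp"])
    except (TypeError, ValueError):
        problems.append("bad timestamp")
    temp = event["temperature"]
    # The -50..150 range is an assumed plausibility check, not a standard.
    if not isinstance(temp, (int, float)) or not -50 <= temp <= 150:
        problems.append("temperature out of range")
    return problems


events = [
    {"device_id": "sensor-1", "timestamp": "2024-01-01T12:00:00", "temperature": 21.5},
    {"device_id": "sensor-2", "timestamp": "not-a-date", "temperature": 20.0},
    {"device_id": None, "timestamp": "2024-01-01T12:00:05", "temperature": 19.0},
    {"device_id": "sensor-3", "timestamp": "2024-01-01T12:00:10", "temperature": 900},
]

accepted = [e for e in events if not validate(e)]
rejected = [(e, validate(e)) for e in events if validate(e)]
print(len(accepted), "accepted,", len(rejected), "rejected")
```

Rejected events are kept together with the reason for rejection, so they can be inspected later rather than silently lost.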

Tip: Azure Functions are now integrated with Azure Data Factory, bringing even more functionality to the ETL process.

To conclude

Data engineers are responsible for understanding the use case, and the data needs to serve that use case. With that in mind, a data engineer should apply the best practices and tools to perform ETL (data extraction, transformation, and loading) without losing sight of quality, integrity, and security. This is where the Azure Data Platform comes in. Next to its wide variety of tools, it also delivers the best in Big Data as a Service in the cloud market.

Read the next article: How to achieve a High Standard of Data Quality



Gustavo Lima

Senior Consultant Data