Ensuring the quality of data and information is vital to any project, whether it is a doctor analyzing test results to reach a diagnosis, or an economist reading the balance sheets of large companies to measure a country’s GDP.
Have you read the previous blog post, “Data Engineering in a Mad Data World”?
We recommend reading that first post to understand the full environment and responsibilities of a data engineer in today’s mad data world.
In recent years, the norms, concepts, and frameworks that help us implement, monitor, and manage data quality have been updated frequently. The main driver is the ever-growing volume of data and, with it, the growing number of ways to measure data quality.
In 1986, we measured data quality along up to eight dimensions:
- Performance
- Features
- Reliability
- Conformance
- Durability
- Serviceability
- Aesthetics
- Perceived Quality
Today, that number has risen to eleven, and several definitions have changed:
- Accuracy
- Completeness
- Consistency
- Precision
- Privacy
- Reasonableness
- Referential Integrity
- Timeliness
- Uniqueness
- Validity
Some authors and experts consider Granularity to be the eleventh dimension, as it becomes vital in cases where data from telephone calls, e-commerce, and financial transactions is used.
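To make these dimensions concrete, the sketch below computes a few of them (completeness, uniqueness, validity, timeliness) over a small pandas DataFrame. The dataset, column names, and e-mail pattern are purely illustrative assumptions, not part of any specific framework:

```python
import re
import pandas as pd

# Hypothetical sample dataset, used only to illustrate the checks below.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4, None],
    "email": ["a@example.com", "b@example", "c@example.com", None, "e@example.com"],
    "created_at": pd.to_datetime(
        ["2024-01-01", "2024-01-02", "2024-01-02", "2024-01-03", "2024-01-04"]
    ),
})

# Completeness: share of non-null values per column.
completeness = df.notna().mean()

# Uniqueness: share of rows carrying a distinct, non-null customer_id.
uniqueness = df["customer_id"].dropna().nunique() / len(df)

# Validity: share of non-null e-mails matching a simple pattern.
email_pattern = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
validity = df["email"].dropna().map(lambda v: bool(email_pattern.match(v))).mean()

# Timeliness: age of the most recent record, as a rough freshness indicator.
timeliness = pd.Timestamp.now() - df["created_at"].max()

print(completeness, uniqueness, validity, timeliness, sep="\n")
```

In practice, each metric would be compared against thresholds agreed with the business and tracked over time rather than printed once.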
How to ensure data quality
A good data engineering and data quality practice is to keep a copy of the raw data stored in the data lake; this copy should always match the source data.
To guarantee the integrity of the data and information in these two processes (copy and storage), the data engineer needs to measure the data according to its type, format, business requirements, the technologies used for transport, and the way it is stored. In doing so, he or she will make use of practically all eleven dimensions. To ensure a high-quality implementation, we at Devoteam recommend end-to-end management that involves everyone with a stake in this chain, both the users and those responsible for the use case.
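One simple way to verify that the raw copy in the data lake still matches the source is to compare row counts and an order-insensitive checksum on both sides. The sketch below is a minimal illustration assuming two pandas DataFrames standing in for the source extract and the lake copy; the frames and column names are hypothetical:

```python
import hashlib
import pandas as pd

def frame_fingerprint(df: pd.DataFrame) -> tuple:
    """Return a (row_count, checksum) pair that is insensitive to row order."""
    # Sort columns and rows so identical content always hashes to the same value.
    cols = sorted(df.columns)
    normalised = df.reindex(cols, axis=1).sort_values(by=cols).reset_index(drop=True)
    digest = hashlib.sha256(
        pd.util.hash_pandas_object(normalised, index=False).values.tobytes()
    ).hexdigest()
    return len(df), digest

# Purely illustrative frames standing in for the source extract and the lake copy.
source_df = pd.DataFrame({"id": [1, 2, 3], "amount": [10.0, 20.0, 30.0]})
lake_df = pd.DataFrame({"id": [3, 2, 1], "amount": [30.0, 20.0, 10.0]})

source_count, source_hash = frame_fingerprint(source_df)
lake_count, lake_hash = frame_fingerprint(lake_df)

if (source_count, source_hash) != (lake_count, lake_hash):
    raise ValueError(
        f"Raw copy drifted from source: {source_count} vs {lake_count} rows, "
        f"checksums {source_hash[:8]}... vs {lake_hash[:8]}..."
    )
```

For large datasets the same idea is usually applied per partition or per ingestion batch rather than over the full table at once.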
With Azure data platform products and solutions, such management is easy to implement and provides a 360° view of the entire process. Monitoring data is collected and consolidated by Azure Metrics and made available to users for analysis through Spotfire dashboards.
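As a rough illustration of how such monitoring figures can be pulled out of Azure programmatically, the sketch below queries a platform metric for a single resource with the azure-monitor-query SDK. The resource ID and metric name are placeholders, and this is only one possible way to feed the consolidated numbers onward to dashboards:

```python
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import MetricsQueryClient, MetricAggregationType

# Placeholder resource ID of the monitored component (e.g. an Event Hubs namespace).
RESOURCE_ID = (
    "/subscriptions/<subscription-id>/resourceGroups/<rg>"
    "/providers/Microsoft.EventHub/namespaces/<namespace>"
)

client = MetricsQueryClient(DefaultAzureCredential())

# Pull the last hour of a metric in 5-minute buckets.
result = client.query_resource(
    RESOURCE_ID,
    metric_names=["IncomingMessages"],
    timespan=timedelta(hours=1),
    granularity=timedelta(minutes=5),
    aggregations=[MetricAggregationType.TOTAL],
)

for metric in result.metrics:
    for series in metric.timeseries:
        for point in series.data:
            print(metric.name, point.timestamp, point.total)
```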
Data can be collected from more than 90 different types of connectors. It is then analyzed, and any inconsistencies or anomalies are reported and fixed before the data is stored, thanks to the different features delivered by Azure Databricks.
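The snippet below sketches what such a quality gate might look like in a Databricks notebook with PySpark: rows failing basic checks are diverted to a quarantine location so that only clean records reach the curated store. The sample data, rules, paths, and Delta format are assumptions made for illustration:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Illustrative raw input; in Databricks this would typically arrive through one
# of the many available connectors (JDBC, files in the lake, Event Hubs, ...).
raw = spark.createDataFrame(
    [(1, "a@example.com", 120.0), (2, None, 80.0), (3, "c@example.com", -5.0)],
    ["order_id", "email", "amount"],
)

# Hypothetical quality rules: mandatory e-mail and non-negative amounts.
is_valid = F.col("email").isNotNull() & (F.col("amount") >= 0)

clean = raw.filter(is_valid)
quarantine = raw.filter(~is_valid)

# Only clean rows continue to the curated zone; anomalies are kept for review.
clean.write.mode("append").format("delta").save("/mnt/curated/orders")
quarantine.write.mode("append").format("delta").save("/mnt/quarantine/orders")
```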
When it comes to measuring data quality in real-time transactions, our consultants have experience ingesting data with Azure Event Hubs integrated with Azure Monitor to track its integrity, consistency, and accuracy. Spotfire dashboards then extract the information needed for a specific use case.
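For real-time transactions, a consumer along the following lines could validate each event as it arrives from Azure Event Hubs, using the azure-eventhub SDK. The connection string, hub name, and validation rules are placeholders, and the handling of invalid events is only sketched:

```python
from azure.eventhub import EventHubConsumerClient

CONNECTION_STR = "<event-hubs-namespace-connection-string>"  # placeholder
EVENTHUB_NAME = "transactions"                               # placeholder

def on_event(partition_context, event):
    payload = event.body_as_json()
    # Minimal per-event checks: required field present and amount well-formed.
    problems = []
    if "transaction_id" not in payload:
        problems.append("missing transaction_id")
    if not isinstance(payload.get("amount"), (int, float)):
        problems.append("amount is not numeric")
    if problems:
        # In a real pipeline, invalid events would be routed to a dead-letter
        # store and surfaced as a custom metric in Azure Monitor.
        print(f"Invalid event on partition {partition_context.partition_id}: {problems}")
    partition_context.update_checkpoint(event)

client = EventHubConsumerClient.from_connection_string(
    CONNECTION_STR, consumer_group="$Default", eventhub_name=EVENTHUB_NAME
)

with client:
    # Read from the beginning of each partition; blocks until interrupted.
    client.receive(on_event=on_event, starting_position="-1")
```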
In the next article, we will talk about privacy and see how Azure Data Factory also helps to ensure data masking or disposal.
For further reading:
MUST-READ: Implementation and roll-out of a Pan European Data Platform for Supply Chain
BLOG-POST: The role of data analytics in the Sales and Operational Planning
REFERENCE CASE: Building a BigQuery Data Warehouse for a disruptive employment agency