During a discussion with a prospective partner, I was asked the following question: “You have a tight deadline approaching. Which would you choose: deliver on time, or deliver with high quality?” But the real question here is: “Why would I have to choose when I could have both – have my cake and eat it too?” This conversation was the foundation of the presentation we shared at the Dataworks Conference European Edition 2019 in Barcelona. Watch the entire presentation right here.
NiFi Design Automation at a major telecom company
After almost a year of development on a recent data integration project at a major telecom company, the client’s data acquisition backlog still keeps growing, while the project itself is nearing its finish.
Where are the Backlog Items Coming From?
The data acquisition system in question is Apache NiFi. To find a solution, it is very useful to analyze our NiFi journey over the past year. The primary questions:
- Where does the effort go?
- What are the major sources of backlog items preventing us from moving to the support phase?
While performing the review, we identified four major scenarios as sources of the growing backlog.
Scenario 1: Requirements Change
Here is an example of requirements changes that required significant rework:
- Change the type of the data stream sink (stream target):
  - HDFS to Azure BLOB
  - Azure BLOB to Azure Data Lake Store Gen 1 (now Gen 2)
  - Write to multiple streams at the same time
- Change the stream source type:
  - Switch from a file-based (HDFS) legacy source to a file-based (SFTP) source; the compression method, serialization format and directory structure all changed.
  - Switch from a file-based (SFTP), batch-oriented source to a real-time, message-based one (Kafka).
- All data at rest must be anonymized.
Scenario 2: Onboard New Application
Typically, in an enterprise environment, resources are shared between applications. This is very often the case with data acquisition systems such as NiFi, Informatica, DataStage, etc. In this scenario the following challenges arise:
- Multiple teams share the same environment stack.
- Modern data analytics requires an agile, iterative approach. The need for rapid access to the data is very important to enable problem analysis and solution definition.
- Waiting months for a feature/change to be implemented is no longer acceptable. Original requirements and the need might change significantly during that time.
Scenario 3: Technology Evolution
The fast pace of technology change requires that we keep our data pipelines up to date:
- Incompatibilities between versions. This is a very common problem with open-source products, even when it comes to minor version upgrades.
- Retiring technologies: this could be as simple as replacing one processing component with another, or as severe as completely decommissioning an (originally very promising) technology after a couple of years.
- Shifting from on-premises to the cloud. The technology stack is usually different.
Scenario 4: Continuous Refactoring
Over time, we identified different, more efficient solutions to the original problems. Design patterns evolve and conventions change, calling for refactoring. From a technical perspective, refactoring is very important because it reduces complexity and improves flexibility, maintainability and many other “-ilities”. Here are some reasons for refactoring:
- The data processing design changed, which requires changes to the upstream acquisition design.
- Logging and monitoring capabilities are required to answer questions like: “How do you guarantee the data is delivered?”, “Why does the data in the data lake not match the source data?”, etc.
Although very important from a technical perspective, refactoring is usually considered to have low business value. Requests for funding are rejected with responses like “I was expecting this from the very beginning”.
Reading through the scenarios, one can identify multiple problems. An architect, senior software engineer or data engineer would recognize design smells like violation of the DRY (Don’t Repeat Yourself) principle.
The bottom line is that we are trying to optimize cost. This allows us to treat our problem as an optimization problem. A common optimization approach is to:
- Identify the root cause(s) of the problem. We call these issues.
- Quantify the contribution of each issue to the problem by assigning a “significance” metric, e.g. execution duration, throughput, etc.
- Prioritize the issue list. A simple yet effective technique is to order the list by metric significance.
- Start with the issue with the highest priority: analyze and resolve it.
- Re-evaluate the problem. Is it solved? Is the performance within an acceptable range? If not, repeat the steps above until satisfied. In some cases no further optimization is possible.
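The prioritization steps above can be sketched in a few lines of Python. The issue names and significance figures below are illustrative assumptions, not measured project data:

```python
# Hypothetical issue list with an assigned "significance" metric per issue.
issues = [
    {"name": "requirements change", "significance": 34},
    {"name": "new application onboarding", "significance": 21},
    {"name": "technology evolution", "significance": 17},
    {"name": "continuous refactoring", "significance": 45},
]

# Prioritize: order the list by the significance metric, highest first.
backlog = sorted(issues, key=lambda issue: issue["significance"], reverse=True)

# Start with the issue with the highest priority.
top_issue = backlog[0]["name"]
```

In practice the significance metric would come from actual measurements (effort spent, throughput, duration), but the ordering step stays the same.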
In this case we could define the problem as “The data acquisition backlog keeps growing over time”. The issue with the highest significance is that a stream, once created, quickly turns into legacy, requiring significant effort for continuous refactoring/re-design. This refactoring/re-design doesn’t add business value, which makes it very difficult or even impossible to secure the necessary budget.
How could we eliminate, or at least significantly reduce, the additional redesign/refactoring effort? Answering this question could resolve our most significant issue.
Automation for NiFi Flow Design Optimization
During the design of every application, data modelling is applied. But what if we required a formal definition of the data model?
- The data description could be stored in a machine-readable format, e.g. JSON or a relational database.
- Pipeline steps could be templatized.
- The data pipeline could be generated automatically from the description.
Here is an example of steps and corresponding templates:
- Acquire from HDFS
- Acquire from SFTP
- Acquire from Kafka
- Concatenate JSON
- Concatenate XML
- Store to HDFS
- Store to Azure Blob
- Store to ADLS
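A generator might resolve the templates above from a stream description as sketched below. The lookup tables and the `template_chain` function are assumptions for illustration, not the actual NiFi Builder code:

```python
# Hypothetical lookup tables mirroring the template list above.
ACQUIRE = {"hdfs": "Acquire from HDFS", "sftp": "Acquire from SFTP", "kafka": "Acquire from Kafka"}
CONCAT = {"json": "Concatenate JSON", "xml": "Concatenate XML"}
STORE = {"hdfs": "Store to HDFS", "blob": "Store to Azure Blob", "adls": "Store to ADLS"}

def template_chain(description):
    """Resolve the ordered template chain for one stream description."""
    steps = [ACQUIRE[description["source"]["type"]]]
    fmt = description["source"].get("format")
    if fmt in CONCAT:  # concatenation applies only to formats we know how to merge
        steps.append(CONCAT[fmt])
    steps.append(STORE[description["sink"]["type"]])
    return steps

chain = template_chain({"source": {"type": "sftp", "format": "json"},
                        "sink": {"type": "adls"}})
```

Switching a stream from HDFS to Kafka, or from Blob to ADLS, then only changes the description; the generator re-emits the pipeline from the updated template chain.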
For the customer project, NiFi Builder, the automated NiFi flow designer, was implemented in Python. NiFi Builder manages the NiFi flow through direct NiFi API calls. This is a proprietary implementation.
After the Barcelona conference, a new open-source implementation, NiPyBuilder, was started. Although functionally compatible with the proprietary NiFi Builder, it is not simply a NiFi Builder clone; it is a completely different implementation:
- It uses NiPyAPI by Dan Chaffelson to access the NiFi REST API.
- It has a modular design.
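A minimal sketch of the NiPyAPI approach: creating a dedicated process group per stream on the root canvas. The function name, URL and canvas coordinates are illustrative assumptions, and a live NiFi instance is required for the calls to succeed:

```python
def create_stream_group(stream_name, nifi_api_url="http://localhost:8080/nifi-api"):
    """Create a dedicated process group for one stream on the NiFi root canvas."""
    # Imported lazily so the sketch can be read without a NiFi/NiPyAPI installation.
    import nipyapi

    nipyapi.config.nifi_config.host = nifi_api_url
    root_id = nipyapi.canvas.get_root_pg_id()
    root_pg = nipyapi.canvas.get_process_group(root_id, identifier_type="id")
    return nipyapi.canvas.create_process_group(root_pg, stream_name, location=(400.0, 200.0))
```

A builder would then instantiate the step templates inside that group and connect them, all driven by the stream description rather than by manual canvas work.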
The implemented solution not only solves the original problem, but also improves the overall quality of the enterprise data integration ecosystem:
- In practice, it eliminates the refactoring/re-design (backlog tail) effort.
- It supports multiple target platforms, e.g. NiFi, Azure Data Factory, DataStage, etc.
- It significantly reduces data-source-to-data-lake time.
- It automates data onboarding, the first step toward DataOps.
- It improves data governance capabilities and the ability to integrate with the enterprise data catalog/dictionary.
- It keeps the technology up to date.
- It reduces data acquisition stream design waste by more than 90%.
- It reduces the need for specialized pipeline design skills.
- It enforces coding standards and conventions.