This is the fourth in a series of blog posts on Data Platforms for Manufacturing. Throughout this series, I will try to explain my own learning process and how we are applying these technologies in an integrated offering at Critical Manufacturing.
For reasons that I started exploring in the first post of this series, many companies seem to be strategically lost among the various digitization initiatives and are experiencing severe difficulties in achieving results from their investments, particularly in Data Analytics and IoT.
Contrary to what happened in the previous industrial revolutions, manufacturing has lagged in implementing the base technologies underlying this data transformation. Significant amounts of generated data are being stored with no immediate plans for use: this is dark data. Edge solutions will contribute decisively to the convergence of IT and OT, although, to be effective, they have to consider a number of requirements related to architecture, automation and central management.
Today’s post talks about Ingestion, Data Lakes and Message Brokering.
Not so long ago, production systems were essentially transactional. MRP or MES systems executed transactions based on data that was stored in relational databases.
As databases kept growing, companies realized that data analysis was the means to improve the performance of their factories, and that the analysis needed to be as close to real time as possible to be useful for decision-making. These analytical systems, however, increased the load on the databases of the transactional systems that were the backbone of the business.
As the need for data analysis grew, and the database loads continued to expand, industrial companies began to change the paradigm and in the early 2000’s created Data Warehouses. The data would be loaded in batch mode from the transactional databases by ETL tools (Extract, Transform and Load). In addition to extraction, the objective was to standardize data from different systems into a single representation and single structure within the Warehouse.
Before being loaded, the data is filtered and cleansed to generate “information” that is stored using OLAP (Online Analytical Processing) technology. The data is pre-calculated and aggregated, which makes analysis faster. OLAP databases are organized into one or more cubes with dimensions (e.g. time and date), created to facilitate data analysis.
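The idea behind OLAP pre-aggregation can be sketched in a few lines of plain Python. The fact rows and dimension names below are invented for illustration; the point is that aggregates along every combination of dimensions are computed once, up front, so that queries become simple lookups instead of scans:

```python
from collections import defaultdict
from itertools import product

# Raw fact rows: (month, machine, units_produced) - invented sample data.
facts = [
    ("2023-01", "M1", 120),
    ("2023-01", "M2", 80),
    ("2023-02", "M1", 150),
    ("2023-02", "M2", 90),
]

# Build a tiny "cube": pre-aggregate units along every combination of
# the two dimensions, using None as the rolled-up "all" member.
cube = defaultdict(int)
for month, machine, units in facts:
    for m, mach in product((month, None), (machine, None)):
        cube[(m, mach)] += units

# Queries against the cube are direct lookups, not scans of the raw facts.
print(cube[("2023-01", None)])  # total for Jan across machines -> 200
print(cube[(None, "M1")])       # total for M1 across months -> 270
print(cube[(None, None)])       # grand total -> 440
```

Real OLAP engines do essentially this at scale, trading storage and load time for query speed.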
As the data sources grew in size and variety, things got more complicated, and the Data Warehouse implementation projects became extremely long and prohibitively expensive, since they implied a normalization of all business data.
At the same time, the need to include unstructured data also increased. Photos or multimedia files, data maps, test results, etc., do not have a predefined data model, and it was difficult to handle these data types using traditional database systems.
In addition, as businesses grew to rely on their data warehouses, they faced increasing scale challenges. The underlying technology was designed to grow vertically on single machines, which became increasingly larger and more expensive as the businesses scaled, adding pressure to reduce the amounts of data stored. Eventually, projects to build an enterprise-wide Data Warehouse would routinely take 18 to 24 months before it was ready to use.
Companies searching for cheaper ways to store their data found an open source framework called Hadoop, which through a distributed file system called Hadoop Distributed File System (HDFS) and a batch data processing framework called MapReduce, allowed the use of commodity hardware to store large volumes of data and thus significantly reduce costs.
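MapReduce's programming model is usually illustrated with the classic word-count example. The sketch below mimics the map, shuffle and reduce phases in plain single-machine Python; Hadoop's contribution was to distribute exactly these phases across hundreds of commodity servers:

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (key, value) pair for every word in the input split.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(grouped):
    # Reduce: combine all the values emitted for the same key.
    return {word: sum(counts) for word, counts in grouped.items()}

documents = ["the lake stores data", "the data flows"]

# Shuffle: group the intermediate pairs by key before reducing.
grouped = defaultdict(list)
for doc in documents:
    for word, count in map_phase(doc):
        grouped[word].append(count)

print(reduce_phase(grouped))
# {'the': 2, 'lake': 1, 'stores': 1, 'data': 2, 'flows': 1}
```

Because the map and reduce functions are independent per key, the framework can run them in parallel on whichever node holds each block of data in HDFS.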
Hadoop provided the foundation for a ‘data lake’ architecture. In this new paradigm, there’s a clear separation between storage and processing. Storage is achieved through a centrally stored location based on a distributed file system, which encompasses structured, unstructured and semi-structured data – the data lake.
There are different data processing engines optimized for different types of processing which connect to the data lake, retrieve the necessary data for queries, and compute the answers to those queries.
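The separation of storage and processing can be sketched very simply: the "lake" is just raw files in a shared location, and a processing engine is a separate piece of code that connects to it, retrieves what it needs and computes an answer. The file name, schema and query below are invented for illustration:

```python
import json
import os
import tempfile

# "Data lake": a shared directory of raw files in any format
# (here, JSON lines written exactly as the source produced them).
lake = tempfile.mkdtemp()
with open(os.path.join(lake, "sensor_readings.jsonl"), "w") as f:
    for row in [{"machine": "M1", "temp": 71},
                {"machine": "M2", "temp": 68},
                {"machine": "M1", "temp": 75}]:
        f.write(json.dumps(row) + "\n")

# "Processing engine": independent code that reads from the lake
# and answers a query; storage knows nothing about this logic.
def max_temp(lake_path, machine):
    temps = []
    with open(os.path.join(lake_path, "sensor_readings.jsonl")) as f:
        for line in f:
            row = json.loads(line)
            if row["machine"] == machine:
                temps.append(row["temp"])
    return max(temps)

print(max_temp(lake, "M1"))  # -> 75
```

Different engines (batch, SQL, machine learning) can each read the same files, which is why one central store can serve many workloads.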
With Hadoop, a highly scalable storage platform was achieved: data can be stored durably and cheaply by distributing it across hundreds of inexpensive servers that operate in parallel with very little overhead, scaling almost linearly.
Data lake architectures have solved many problems and are used in most private clouds, but they do not solve all problems. Although they are cheap and scalable, they are not particularly fast at either writing data or reading it back for queries.
Apache Kafka’s Contribution
The Apache Kafka Project started within LinkedIn, later becoming an open source project. LinkedIn needed a solution to track activity events generated in their website such as pageviews, likes, reshares, keywords typed in a search query, ads presented, etc., which were obviously critical to monitor and generate insights around user engagement and overall relevance of their products. As they didn’t find any commercial solution that would fulfill their requirements, they decided to create their own.
The solution they created is fault-tolerant and persists data in a durable and immutable way, indefinitely if needed. Unlike data lakes, it offers high throughput, coping with thousands of messages per second at very low latency – in the range of milliseconds.
Kafka runs as a cluster on one or more servers that can span multiple data centers, which store streams of records in categories called topics. Scalability is achieved by adding additional nodes. Kafka also leverages capabilities like replication and partitioning.
Kafka replaces traditional message brokers with producer APIs, which allow applications to publish a stream of records to one or more topics, and consumer APIs, which allow applications to subscribe to one or more topics and process the stream of records created by the producers. But unlike classic message brokers, Kafka is natively persistent.
In a normal message queue system, it is not advisable to store data: reading a message removes it from the queue, and such systems scale poorly once data accumulates beyond memory size. Kafka doesn't have these problems. Messages are persisted to disk and replicated for fault tolerance. This was done primarily to guarantee that the system wouldn't collapse if data were produced at a much higher pace than it was read: messages are simply stored until the consumers are able to read them.
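The contrast with a classic queue can be shown with a minimal in-memory model of Kafka's log semantics (a toy illustration, not the real client API): each topic is an append-only list, and each consumer tracks its own offset, so reading never removes a message and independent consumers can replay the same stream from different points.

```python
class TopicLog:
    """Toy append-only log modelling a single Kafka topic partition."""

    def __init__(self):
        self._records = []  # records are retained, never popped

    def produce(self, record):
        # Appending assigns the record a position in the log.
        self._records.append(record)
        return len(self._records) - 1  # the record's offset

    def consume(self, offset):
        # Reading returns records from an offset without deleting them.
        return self._records[offset:]

log = TopicLog()
for event in ("pageview", "like", "reshare"):
    log.produce(event)

# Two independent consumers, each with its own offset.
consumer_a = log.consume(0)  # reads the full stream
consumer_b = log.consume(2)  # joined later, sees only "reshare"

print(consumer_a)  # ['pageview', 'like', 'reshare']
print(consumer_b)  # ['reshare']
```

Because consumption is just a read at an offset, a slow consumer never blocks producers, and the same events can feed analytics, monitoring and archiving in parallel.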
Kafka should be looked at as a log system for events coming from any source (equipment data, MES, ERP, etc.), which then replicates the data into the next element of the data platform architecture: stream processing. Kafka is one of the most widely adopted open source projects and has many similarities to a data lake: it is a storage and transfer system for real-time streams.
In Flow Data Rivers
Water analogies abound in the new data management architectures. After the Data Lake comes the Data River. Why? Data sources from a variety of locations are like streams flowing together to form the main Data River: an end-to-end streaming analytics stack that ingests real-time data and stores it. Once a company is ready to tap into possible insights to improve a given process, it builds a dam, executing complex queries that access not just the stored historical data, but also brand-new data as it is generated.
This will be discussed in the next post of this series, dedicated to Processing and Serving.
Should you want to continue reading this and other Industry 4.0 posts, subscribe here and receive notifications of new posts by email.