This is the second in a series of blog posts on Data Platforms for Manufacturing. Throughout this series, I will try to explain my own learning process and how we are applying these technologies in an integrated offering at Critical Manufacturing.

For reasons that I started exploring in the first post of this series, many companies seem to be strategically lost among the various digitization initiatives and struggle to achieve results from the investments they have made, particularly in Data Analytics and IoT.

Contrary to what happened in the previous industrial revolutions, manufacturing has lagged in implementing the base technologies underlying this data transformation. Before getting into the data platform topic, there’s one related aspect worth exploring: Dark Data.

Shedding Some Light on Dark Data

IoT (broadly speaking) generates tremendous amounts of data, and that volume will keep growing at an extremely high pace. Forecasts abound and the numbers are stratospheric. As an example, Statista estimates that the amount of data generated by connected IoT devices will reach 79.4 zettabytes in 2025. Moreover, the two categories that will see the fastest data growth are industry and automotive.

Any beneficial technological revolution or trend unfortunately also has its negative side effects. Over the past few years, a significant number of solutions, tools and platforms have been created that allowed companies to store and process an explosion of generated data with the intent to extract an increasing amount of information that could be used for analysis, predictions and other useful insights.

But the problem is that most of the data that companies produce (and manufacturers are no exception) isn’t really useful. In fact, most of the data generated in a manufacturing environment by machines, materials, transportation systems, warehouses and employees in the form of logs or database records may never be used or seen again after being stored!

This data, which has no immediate value and does not translate into useful information, even when orchestrated and organized, is called dark data. Gartner defines it as “the information assets organizations collect, process and store during regular business activities, but generally fail to use for other purposes (for example, analytics, business relationships and direct monetizing). Similar to dark matter in physics, dark data often comprises most organizations’ universe of information assets.”

Another interesting aspect is that most data becomes irrelevant if not processed fast enough. This is called “perishable insights” and calls for quick action to escape this fate through a technology called streaming analytics – hold this thought for now as we’ll talk about this in greater detail in one of the subsequent posts of this series. For now, just consider that according to IBM, 60% of data loses its true value within milliseconds.

A significant portion of this data remains on the dark side during its entire existence. For most companies, there is usually an enormous amount of dark data, which is never stored or analyzed, with up to 65% of dark data being hidden within machines, networks, and people. One of several reasons is that a lot of such data is unstructured – “content that does not conform to a specific, pre-defined data model. It tends to be the human-generated and people-oriented content that does not fit neatly into database tables.”

Storing Dark Data

However, companies still store a significant amount of data, even if they don’t have an immediate use for it. According to Rahul Telang, professor at the Heinz College of Information Systems and Public Policy at Carnegie Mellon University (USA), dark data represents about 90% of the data companies store (that is, of the data actually stored, not of all the data produced).

A quantity of data of this magnitude does not go unnoticed. Despite the falling prices of hardware and storage systems, saving this data is extremely expensive. The reason is that, in most cases, it is not simply stored in raw format: it is orchestrated, cleansed, transformed and organized. So, why do companies keep this data?

While dark data may never be used or be useful, companies still keep a significant portion of it for two reasons:

  • To minimize risks and liabilities primarily related to regulatory compliance but also to litigation. Especially in industries with strong regulatory requirements (such as food & beverage, healthcare or semiconductor), noncompliance can jeopardize business continuity. Digitizing paper records stored in physical warehouses, even with a high percentage of dark data, is a step towards optimization & efficiency.
  • To safeguard the future possibility of utilizing this asset to generate useful insights, even if today this is not of immediate value. In fact, most manufacturers are increasingly aware that the data they produce is not only valuable today for more or less immediate decisions but will soon be able to feed machine learning algorithms that can result in predictive models for behavior and performance that can bring them additional competitive advantages.

While the first reason is essentially pragmatic, the second is forward-looking. However, as the amount of dark data keeps increasing, the cost of storing it must come down. One of the ways to do this is quite obvious, although frequently ignored: when saving dark data, it is not necessary to carry out any transformation, orchestration or curation. It is data in a “reserve mode” for possible future use. Since it has no immediate requirements for storage performance or analysis, it can be saved in less elaborate and less expensive formats, as long as the storage is reliable and the data can be retrieved when necessary.

This is why solutions that allow storing events in raw format are so appealing. The key aspect is the ability to reliably and inexpensively store data, forever if necessary, and extract it at any time. As we’ll see later, Apache Kafka is one of those solutions.
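To make the idea of raw, "reserve mode" storage more concrete, here is a minimal sketch in Python. It is not Kafka and not how any particular product works; it only illustrates the principle of appending events untouched to a cheap, compressed, append-only file and replaying them later. The function names, the file name and the event fields are all hypothetical.

```python
import gzip
import json
import os
import tempfile
from typing import Any, Dict, Iterator


def append_raw_event(path: str, event: Dict[str, Any]) -> None:
    """Append one event, unmodified, as a single JSON line.

    No cleansing, transformation or schema is applied: the event is
    stored exactly as it was produced (assumption: it is JSON-serializable).
    """
    # Append mode adds a new gzip member to the file; Python's gzip
    # reader treats consecutive members as one continuous stream.
    with gzip.open(path, "at", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")


def replay_events(path: str) -> Iterator[Dict[str, Any]]:
    """Read the stored events back, in the order they were written."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)


if __name__ == "__main__":
    # Hypothetical machine events with different shapes; none is reshaped.
    log_path = os.path.join(tempfile.mkdtemp(), "events.jsonl.gz")
    append_raw_event(log_path, {"machine": "M1", "temp_c": 71.5})
    append_raw_event(log_path, {"machine": "M2", "msg": "door open"})
    for stored in replay_events(log_path):
        print(stored)
```

The design choice worth noting is that nothing is decided at write time: events with different shapes sit side by side, and any structure is imposed only when (and if) someone replays the file for analysis.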

But for now, the central point to realize about dark data is that it does not have to stay dark. The moment dark data is used to gain insights, it becomes actionable and, by definition, is no longer dark.

Quite frequently, manufacturers are simply not aware that their dark data exists. So, the first step is to raise awareness of it and of the opportunities that can come from harvesting it. Then, the platform that will support dark data analytics should be put in place.

As the series progresses, we’ll start exploring the main components of a manufacturing data platform. In the next post of this series, we’ll discuss edge solutions.