Leveraging process historian data: context required
The critical step before process data can be transformed into information is adding context, a challenge unique to manufacturing datasets.
Over the years, interest and innovation in time series data storage have grown, with offerings that include historians, open source and data lake options, and cloud-based services. The plethora of choices ensures industrial manufacturers will find a data management option that fits their needs. Whatever the priorities (data governance, data consolidation, security, analytics or a cloud-first initiative), organisations will have many good choices for where to store data.
At the same time, if an organisation is planning to consolidate their manufacturing data in an enterprise historian or data lake they may find they do not accrue the same benefits as they do with other data types. In fact, if they use expectations based on experience with relational data as a justification for aggregating their manufacturing data, they will be disappointed with the results. With some types of data, consolidation in a single system provides advantages for analytics and insights as compared to distributed datasets, but it’s not the same with time series data. Whether it’s a data historian, lake, platform, pond, puddle or silo, time series data won’t necessarily yield better insights just because it’s all in one place.
To understand this, let’s consider a scenario where centralising data does benefit the user: relational data. Relational data has keys that work as handles to the data, ie, tables, fields and column names, so aggregating or centralising data yields more possible relationships among the tables, fields and databases. This isn’t new; business intelligence solutions gained significant traction with this approach starting in the 1990s. Today, data storage is so inexpensive that vendors can offer platforms providing ‘any-to-any’ indexes, enabling complete self-service for a business analyst.
Another example is platforms that index all data contained within a semi-structured dataset: for example, web pages, machine logs and various forms of ‘digital exhaust’. Two variations of this approach are used by Google and ‘document’ NoSQL databases such as MongoDB. The idea is that the structure of the data doesn’t have to be consistent or defined in advance as in a relational table. Instead, a schema is overlaid on the fly, or after the fact, which enables the user to work with any ‘handle’ created by the index. Again, this means the more data is centralised and indexed the better. Users get to see more insights across larger datasets and the data is pre-indexed or organised and ready to work.
Keep your data where it is
With structured (relational) and semi-structured (log files, web pages) strategies as success stories for centralising data, it’s easy to see why one could assume consolidating time series data into one place might yield equal benefits to end users, but it doesn’t. Providers of IT-centric data solutions may try to convince themselves their centralisation models apply to time series data, but the attempt fails, like trying to climb a greasy flagpole: it doesn’t work without handles.
Why is this? Time series data simply doesn’t lend itself to pre-processing the way structured data (relationships) or semi-structured data (indexes) do. There are no ‘handles’ in a time series signal, so there is no way to add value by pre-processing the data for analytics. This is a key issue for engineers working with the data: at the time they do their analysis, they have to find a way to integrate ‘what am I measuring’ (the sensor data) with ‘what am I doing’ (what an asset or process is doing at the time) and even ‘what part of the data is important to me?’
As an example of the challenges in working with time series data, let’s consider a simple time series dataset that has sensor data recorded every second for a year, or roughly 31.5 million samples, in the form timestamp:value.
Most likely, the user doesn’t want all the signal data for their analysis — instead they want to identify time periods of interest within the signal. For example, perhaps the user needs handles to periods of time within the data for analysis defined by:
- time period, such as by day, by shift, by Tuesdays, by weekdays vs weekends, etc;
- asset state: on, off, warm up, shutdown, etc;
- a calculation: time periods when the second derivative of a moving average is negative;
- context in a manufacturing application like an MES, such as when a plant or process line is creating a certain product;
- data from a business application, for example when energy price is greater than x;
- a multidimensional combination of any or all of these time periods of interest (like where they overlap, or where they don’t);
- an event, for example, if a user wants to see data for the 90-minute period prior to an alarm.
In other words, time periods of interest are when a defined condition is true, and the rest of the data can be ignored for the analysis (Figure 1).
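The idea of a time period of interest can be sketched in a few lines of Python. This is a minimal illustration, not any particular product's method: the function names `periods_where` and `window_before`, the sample values and the threshold of 50 are all assumptions made for the example. The first function turns a signal plus a condition into a list of (start, end) periods, and the second builds the window that precedes an event such as an alarm.

```python
from datetime import datetime, timedelta


def periods_where(samples, condition):
    """Return (start, end) pairs covering runs of consecutive samples
    whose value satisfies condition.

    samples: list of (timestamp, value) tuples sorted by timestamp.
    Each period ends at the last sample that still met the condition.
    """
    periods = []
    start = last = None
    for ts, value in samples:
        if condition(value):
            if start is None:
                start = ts          # a new period opens here
            last = ts
        elif start is not None:
            periods.append((start, last))
            start = None            # the period has closed
    if start is not None:           # signal ended while condition was true
        periods.append((start, last))
    return periods


def window_before(event_time, minutes):
    """Return the (start, end) period covering the minutes before an event."""
    return (event_time - timedelta(minutes=minutes), event_time)


# Hypothetical one-second samples from a single sensor.
t0 = datetime(2024, 1, 1, 0, 0, 0)
values = [10, 55, 60, 20, 70, 80, 90, 15]
samples = [(t0 + timedelta(seconds=i), v) for i, v in enumerate(values)]

# Periods when the (assumed) condition "value above 50" is true.
running = periods_where(samples, lambda v: v > 50)

# The 90-minute period prior to a hypothetical alarm at noon.
alarm = datetime(2024, 1, 1, 12, 0, 0)
pre_alarm = window_before(alarm, 90)
```

The condition is passed in as a function because, as the list above shows, the definition of 'interesting' changes with every analysis; only the signal itself is fixed.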
Two things stand out from this example. First, even with a simple example of one year of data from one signal, it’s obvious there are an infinite number of ways the signal might be sliced or consumed for analytics purposes. And, since there are so many possible options, the actual choosing of the time periods of interest can only be done at analytics time when the user’s intent is clear and the relevant time segments may be identified. In addition, this example is just one signal. Imagine production environments of 20,000 to 70,000 signals such as pharmaceutical or chemical plants, oil refineries with over 100,000 signals, or enterprise roll-ups of sensor data that include millions of signals.
Second, while the examples above use the term ‘defined by…’ to describe the time periods of interest, we can also call this ‘contextualisation’. In industry, contextualisation generally refers to merging data from different data sources, like a batch system, signal and quality metric coming together in a data model. But in the examples above, the context can come from anywhere: a measurement, another application or simply the user’s expertise and intuition.
Contextualisation, at analytics time and in the hands of the engineer, is what transforms time series data from a squiggly line in a control chart into data objects of interest for analysis, and every form that context can take should be included in the definition of the term (see Figure 2).
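When context comes from more than one source, combining it amounts to set operations on time periods. The sketch below shows one such operation, the overlap of two lists of periods. The function name `intersect` and both input lists (periods of high temperature taken from a signal, and a product run taken from an MES) are hypothetical and chosen only for illustration.

```python
from datetime import datetime


def intersect(a, b):
    """Return the overlap of two lists of (start, end) periods.

    Both lists are assumed to be sorted and free of internal overlaps.
    """
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:             # the two periods really overlap
            out.append((start, end))
        # Move past whichever period finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def hour(h):
    """Build a timestamp on an arbitrary example day."""
    return datetime(2024, 1, 1, h)


# Context from a measurement: periods when temperature was high.
high_temp = [(hour(1), hour(3)), (hour(5), hour(8))]
# Context from another application: when product A was being made.
product_a = [(hour(2), hour(6))]

high_temp_during_a = intersect(high_temp, product_a)
```

Here the result is two periods, 02:00 to 03:00 and 05:00 to 06:00. The periods where the conditions do not overlap could be derived in the same style with a difference operation.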
Finally, it is important to remember that any analysis of time series data involves sampled analog signals and therefore has to deal rigorously with the challenges of interpolation and calculus, something that IT data consolidation/aggregation efforts typically don’t address. The ability to align signals with different sample rates from different data sources in different time zones, spanning daylight saving time or other changes, is an absolute requirement before the relevant time periods can be defined.
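A rough sketch of what alignment involves is shown below, under simplifying assumptions: each source's UTC offset is known and passed in explicitly, and linear interpolation is an acceptable model of the signal between samples. The helper names `to_utc_seconds` and `interpolate` and all sample values are invented for the example.

```python
from datetime import datetime, timedelta, timezone


def to_utc_seconds(local_dt, utc_offset_hours):
    """Convert a naive local timestamp to seconds since the Unix epoch.

    The offset must be the one in force when the sample was taken, so a
    source that observes daylight saving time needs a different offset in
    summer and winter.
    """
    tz = timezone(timedelta(hours=utc_offset_hours))
    return local_dt.replace(tzinfo=tz).timestamp()


def interpolate(samples, t):
    """Linearly interpolate a value at time t.

    samples: list of (time, value) tuples sorted by time.
    Returns None when t lies outside the sampled range (no extrapolation).
    """
    if not samples or t < samples[0][0] or t > samples[-1][0]:
        return None
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        if t0 <= t <= t1:
            if t1 == t0:
                return v0
            return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
    return samples[0][1]            # only reached for a single sample


# Signal A: every 10 s, timestamps stored in local time at UTC+1.
signal_a = [
    (to_utc_seconds(datetime(2024, 3, 1, 13, 0, s), 1), v)
    for s, v in [(0, 100.0), (10, 110.0), (20, 120.0)]
]
# Signal B: every 4 s, timestamps already stored in UTC.
b_times = [to_utc_seconds(datetime(2024, 3, 1, 12, 0, s), 0) for s in (0, 4, 8, 12)]

# Signal A resampled onto signal B's timestamps.
a_on_b_grid = [interpolate(signal_a, t) for t in b_times]
```

In practice the offsets would usually come from a time zone database, for example Python's standard `zoneinfo` module, rather than being hard-coded. Linear interpolation is also only one choice: a discrete state signal such as on/off would normally hold its last value instead.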
Exceptions to the rule
There are exceptions to analytics-time context, because some work products do not require dynamic contextualisation. A standard report or KPI, such as a weekly status summary or OEE score, has its time periods of interest defined in advance and then reused in every run of the analysis.
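A fixed report of this kind is simple precisely because its time periods never change. As a minimal sketch, assuming a weekly summary is just the mean of a signal per ISO calendar week (the function name `weekly_means` and the data are hypothetical):

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def weekly_means(samples):
    """Average a signal per ISO week.

    samples: iterable of (timestamp, value) tuples.
    Returns a dict keyed by (ISO year, ISO week number).
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        iso = ts.isocalendar()      # (ISO year, ISO week, ISO weekday)
        buckets[(iso[0], iso[1])].append(value)
    return {week: mean(vals) for week, vals in sorted(buckets.items())}


samples = [
    (datetime(2024, 1, 1, 8, 0), 10.0),   # Monday of ISO week 1
    (datetime(2024, 1, 3, 8, 0), 20.0),   # same week
    (datetime(2024, 1, 8, 8, 0), 30.0),   # Monday of ISO week 2
]
report = weekly_means(samples)
```

Because the week boundaries are known ahead of time, this aggregation could be computed once and stored, which is the pre-processing that ad hoc investigations cannot rely on.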
On the other hand, if it’s an ad hoc investigation such as root cause analysis or understanding variation in a quality metric, then only engineers at analytics time, when they are doing the work, will be able to define the data dimensions required for analysis.
Knowing what you are looking for in advance, or not, is the difference between fixed reporting solutions using static definitions of time periods (even if they provide a drop-down selection of time ranges, it’s still defined in advance), and high-value dynamic, ad hoc, analytics and investigation efforts.
Process expertise is required to create context
One outcome of the need for analytics-time contextualisation is that many data scientists start at a disadvantage when working with time series data. This is because they can’t run their algorithms until the data is modelled, which means focusing or contextualising the data they are working with first. But who knows the time periods of interest and relevant contextual relationships needed for a particular analysis? The engineer. The experience, expertise and ability to define the relevant data for investigations and analyses are all in the same person (Figure 3).
Contextualisation — and the engineers who understand the assets and processes — is therefore imperative to time series analytics regardless of the data management strategy. Only the engineer, who alone has the expertise and understands the needs of their analysis, will know what they are looking for right at the time of investigation. This includes the ability to rapidly define, assemble and work with the time periods of interest within time series data, including access to related data in manufacturing, business, lab and other systems.
There is ever more attention on, and pressure for, digital transformation and the IT/OT integration needed to provide an integrated view across business and production datasets. It is therefore going to be increasingly important for IT and manufacturing organisations to recognise the importance of contextualisation, regardless of storage strategy, in working with time series data.
The engineer who knows what they are looking for, and can ask for it at analytics time, is going to define and enable the advanced analytics and insights that are the focus of smart industry and Industry 4.0 investments. Therefore, organisations that align the contextualisation requirements of time series analytics with their data strategy will have a higher chance of improving production outcomes through insights.
Where the data is — in a historian or many historians, in a data lake, on-premise or in the cloud — isn’t going to change the required analytics effort. There are great reasons and offerings for data storage and consolidation, on-premise and in the cloud, but the priority for insight is accelerating the expertise of process engineers and experts.