To err is human: using technology to try to solve this problem is equally human

Friday, 11 May, 2012

Humans are wonderful, complex beings. That our very name, Homo sapiens, translates as ‘wise man’ is an indication of our intelligence. We have the ability and capacity to do so much and, along with our advanced language capability, we can reason, problem-solve, introspect and quickly adapt to the conditions around us. But we are not infallible.

Scientists and engineers have, for decades, been trying to mimic the capability of a human being, developing machines, robots and computers that can take on tasks to make our lives better. Huge advances have been made in areas of computer science such as artificial intelligence but we still need people to operate and oversee the systems that run production processes in all market areas.

As humans, however, we are prone to making mistakes. Phrases such as “to err is human” and “nobody’s perfect” are commonplace in our lives. The likelihood of us making the wrong choice varies depending on the conditions around us. Fatigue, illness, environment and even our diet can affect our performance levels. The biggest influences, however, are possibly extreme stress or pressure. These have been proven to be major factors and dramatically increase the chances of us making a mistake.

Now here lies the problem: we rely on system operators to oversee process systems and we particularly need them when something goes wrong. We need them to make the right choices quickly to minimise the impact of a looming situation. It is, however, exactly at this time that they are under the greatest stress. If a control system starts to throw out hundreds of alarms and warnings in quick succession, this can overwhelm or even panic an operator. Hesitation and self-doubt can take hold and it is at these very times we need the operator to make the right decision quickly.

In 2003, ConocoPhillips Marine conducted a study of the initial behaviours that are the root causes of incidents or accidents. It showed that for every 300,000 ‘at-risk’ behaviours there are 3000 near misses, 300 recordable injuries, 30 lost workdays and, ultimately, one fatality. In a control room scenario, if we can maximise the ability of the operator to make the correct decision when called upon, we can maximise human reliability with the aim of reducing the number of at-risk behaviours and ultimately the number of major incidents or fatalities.

Alarms - can you see the forest for the trees?

Alarms are a key feature of a modern plant control system. They can be generated at multiple levels from multiple pieces of instrumentation and even across multiple systems. Unlike the hardwired gauges and alarms of yesteryear, today’s technologies make it easy to create alarms and, often, many are set that are not really needed. Take an example of an alarm configured to trigger when a pump is off. If this pump is a spare pump, this alarm just becomes an extra ‘noise’ to the operator, reducing their ability to observe the important details.

When an operator is sitting in front of an HMI and a plant upset occurs, there is often a flood of alarms in quick succession. This overwhelms the operator’s capability to respond and they may well start to just acknowledge alarms and, with the volume of data presented, miss the critical information they need to make the right decision. It is not a big step, therefore, to realise that reducing the number of alarms reduces the stress on an operator and the unnecessary distraction of data that does not require action. With reduced stress and clearer, critical information to hand, the operator stands a much better chance of making the right decision.

Reducing alarm noise by aggregating alarms - whether it is the chatter of alarms switching on and off as the process is near to a threshold or stale alarms that are continually present - reduces the alarm volume and helps the operator a great deal. The information presented to the operator is, however, also important. Imagine the difference between receiving the following:

1. A high-pressure alarm in a vessel.
2. Information about what has caused the high pressure, what the consequences of this pressure are, the impact of the operator’s reaction time on the severity of the situation and what needs to be done to respond.

In the second scenario, the operator receives not only much more information regarding what is wrong but also detail about what they need to do. This is not to say that the operator wouldn’t know what to do but it does reduce the pressure on them and increase human reliability. In many large industrial accidents the operator has known what to do but was not sure whether action should be taken. A passive endorsement ensures the operator has tacit approval to take action, ie, “You need to shut the unit down”.
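The second scenario can be pictured as a structured alarm record rather than a bare alarm line. The following is a minimal sketch only: the `AlarmMessage` class, its field names and the example tag and values are all hypothetical, chosen for illustration and not taken from any particular alarm system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlarmMessage:
    """Alarm enriched with operator guidance (all field names are hypothetical)."""
    tag: str                   # instrument tag raising the alarm
    condition: str             # what is wrong
    likely_cause: str          # what has caused it
    consequence: str           # what happens if nothing is done
    minutes_to_respond: float  # time available before the situation escalates
    corrective_action: str     # what the operator is endorsed to do

    def render(self) -> str:
        """Format the alarm as a single operator-facing message."""
        return (
            f"{self.tag}: {self.condition}. "
            f"Cause: {self.likely_cause}. "
            f"Consequence: {self.consequence}. "
            f"Respond within {self.minutes_to_respond:g} min. "
            f"Action: {self.corrective_action}."
        )


# Made-up example values, for illustration only.
example = AlarmMessage(
    tag="PT-101",
    condition="High pressure in vessel V-1",
    likely_cause="outlet valve closed",
    consequence="relief valve lifts",
    minutes_to_respond=5,
    corrective_action="open outlet valve or shut the unit down",
)
print(example.render())
```

Keeping cause, consequence, response time and action as separate fields means an HMI could show or hide each one as needed, while the rendered message still carries the endorsement to act.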

Alarms should occur when an operator needs to take action. When reviewing alarm strategies in plant systems, PAS, a company that provides services and software to improve human reliability, has typically seen as few as 10 ‘chattering’ points in the system causing 60% of the alarms. Working with key plant personnel, it is therefore necessary to review and consolidate alarms with the aim of reducing the chances of human error, which could contribute to reduced productivity, equipment damage or a major incident.
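Finding the handful of points responsible for most of the alarm load is essentially a frequency count over the alarm journal. The sketch below assumes the journal can be exported as a simple list of tag names, one per alarm activation; the function name `top_contributors` and the sample tags are hypothetical.

```python
from collections import Counter


def top_contributors(alarm_tags, n=10):
    """Return the n most frequent tags and their combined share of all alarms."""
    counts = Counter(alarm_tags)
    total = sum(counts.values())
    top = counts.most_common(n)
    share = sum(count for _, count in top) / total if total else 0.0
    return top, share


# Made-up journal: one entry per alarm activation.
journal = ["FT-7"] * 50 + ["LT-2"] * 10 + ["PT-1"] * 5 + ["TT-9"] * 35
top, share = top_contributors(journal, n=2)
print(top, f"{share:.0%}")
```

In this made-up journal, two of the four tags account for 85% of the activations, which is the kind of concentration that makes a review of a few chattering points worthwhile.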

Controlling changes

A plant will typically have multiple automation systems. These may have been installed over a period of time as the plant grew and developed. Interactions between alarms and various equipment are not always obvious, and a change to an alarm may, at face value, seem a straightforward decision. But unseen consequences of this change may not be realised until it is too late. Here, technology can really help. Intelligent software that consolidates and supervises all of the alarm systems in a process plant, such as Plant State Suite from PAS, can read in alarm settings and check for any changes - whether in threshold values or in operators inhibiting alarms. Such software can audit against the values that should be present and produce a report showing differences from the documented settings. All operators and managers can then quickly and easily see the current state of the system, and reports showing the current status are part of a formal handover between shifts. Such software helps to highlight changes where alarms may have been inhibited for maintenance and not returned to their correct state, which would otherwise effectively remove a layer of protection for the plant. It can also enforce an alarm strategy by writing the master settings back into the system and provide a documented sign-off process for any changes.
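The audit step described above amounts to comparing live settings with a documented master list. The following is a generic sketch of that comparison and does not represent how Plant State Suite or any other product works; the function name, the dictionary layout and the alarm names are assumptions made for illustration.

```python
def audit_alarm_settings(master, current):
    """Compare live alarm settings with the documented master settings.

    Both arguments map an alarm name to a dict of settings, for example
    {"limit": 80.0, "inhibited": False}. Returns a list of
    (alarm, setting, documented_value, live_value) tuples.
    """
    differences = []
    for alarm, documented in sorted(master.items()):
        live = current.get(alarm)
        if live is None:
            # The alarm is documented but missing from the live system.
            differences.append((alarm, "present", True, False))
            continue
        for setting, expected in sorted(documented.items()):
            actual = live.get(setting)
            if actual != expected:
                differences.append((alarm, setting, expected, actual))
    return differences


# Made-up settings, for illustration only.
master = {
    "TT-9.HI": {"limit": 80.0, "inhibited": False},
    "PT-1.HI": {"limit": 12.5, "inhibited": False},
}
current = {
    "TT-9.HI": {"limit": 95.0, "inhibited": False},  # threshold was raised
    "PT-1.HI": {"limit": 12.5, "inhibited": True},   # left inhibited after maintenance
}
report = audit_alarm_settings(master, current)
for line in report:
    print(line)
```

A report like this could be attached to the shift handover, so an alarm left inhibited after maintenance is visible to the incoming operator.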

Being careful not to peel off layers of protection

Just like the broader safety assessment, the alarm strategy employs layers of protection. You may, for example, have a high-temperature warning alarm indicating a process has moved away from its optimal conditions; a further alarm indicating the temperature has moved outside the normal operating limits; an alarm above this to indicate the temperature is moving outside of the safe operating limits, after which the safety instrumented system (SIS) would typically trip; and a final alarm, should the plant not trip, to indicate the temperature is above the design limits of the system. These ‘onion’ type layers are designed to protect equipment and personnel. In many applications, these levels, or boundaries, may change. For example, reaction vessels with catalysts that activate under certain conditions can make the response dynamic and the alarm limits or trip settings will then need to be adjusted accordingly. In areas such as offshore oil platforms, trip settings will vary depending on the production rate.

Whether a process is static or dynamic in its alarm requirements, comprehensive boundary information for all of the different pieces of equipment may not be readily available to operators through the alarm system in front of them. Also, the impact of a change to boundary settings may not always be clear, with multiple pieces of equipment affected.

Technology can again provide a framework to consolidate multiple sources of information and provide additional checks to ensure process boundaries are where they need to be. Ideally, alarm management software should aggregate all of the alarm settings and other operational boundary information and present it to the operator in the context of a boundary hierarchy - exposing vulnerable areas and providing the operator with the information they need to prevent the system tripping and to keep productivity optimal. By supervising alarms across the plant, it should be possible to automatically detect and report any deviations from this boundary hierarchy, such as an alarm setting that is set higher than a safety trip point. This provides additional assurance that configuration parameters such as alarm limits and instrument ranges stay within the safe operating boundaries of the plant.

Figure 1: Example of an alarm boundary hierarchy.
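A deviation check against the boundary hierarchy can be sketched as an ordering test on the limits for one process variable. The layer names below are hypothetical labels for the layers described earlier (optimal, normal operating, safe operating, trip and design limits), and the check assumes a high-side variable where each layer should sit strictly below the next one out.

```python
# Layers from innermost to outermost for a rising (high-side) variable.
# The names are hypothetical labels, not taken from any product.
LAYERS = ["optimal_high", "normal_high", "safe_high", "trip", "design_limit"]


def hierarchy_violations(limits):
    """Return pairs of adjacent layers whose limits are out of order.

    `limits` maps each layer name in LAYERS to its configured value.
    """
    violations = []
    for inner, outer in zip(LAYERS, LAYERS[1:]):
        if limits[inner] >= limits[outer]:
            violations.append((inner, outer))
    return violations


# Made-up temperature limits: the safe-limit alarm sits above the trip point.
limits = {
    "optimal_high": 150.0,
    "normal_high": 170.0,
    "safe_high": 205.0,
    "trip": 200.0,
    "design_limit": 230.0,
}
print(hierarchy_violations(limits))
```

Here the check reports the `safe_high` and `trip` pair, the case of an alarm set higher than a safety trip point, so the operator would never see that alarm before the trip.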

Action can only be taken if the operator knows it is needed

How information is presented also has a big impact on human reliability. A high-performance HMI (HPHMI) uses best practice in the design of a control interface. Poorly designed HMI graphics have been shown to degrade safety, quality and process efficiency. An HPHMI, with its clear and direct display of critical information, gives an operator a much higher chance of dealing with a process that is moving outside of its optimal limits. Research conducted for the ASM Consortium’s Nova Study in the early 1990s showed that it can give an operator a fivefold increase in the ability to deal with an abnormal situation before an alarm occurs. Furthermore, it can lead to an increase of over 35% in making the right choice in handling an abnormal situation and a reduction of over 40% in the time taken to complete the necessary tasks.

Avoid the incident

Safety and productivity are vital factors in business. Plant control systems today have often evolved over time with multiple vendors’ systems installed at different stages of the plant life cycle. Most of the time the expertise of operators keeps processes running smoothly. When there is the possibility of a major incident, however, they are under the most pressure and mistakes are more prevalent. Using technology in the form of software solutions that give better alarm and boundary information, and better data presentation, will greatly increase the chances of timely action to prevent safety trips, avert major incidents and keep productivity high. To err is human but it is also human to try to solve this problem. The goal is just that: to improve human reliability and improve safety, compliance and profitability.
