Whose Fault? What to Do with Crazy

“Minimizing your exposure to pathology goes a long, long way.” — Dr. Susan Biali Haas

When a sensor faults, it doesn’t stop providing information. It’s just unreliable information. It may be correct, it may not, but it is not to be believed. Like a clock that has stopped but is still correct twice a day, sensors that are in fault are to be treated like a crazy person or a pathological liar.

Four Sensor Architectures

The four most common sensor architectures in the design of safety instrumented functions (SIFs) are one-out-of-one (1oo1), 1oo2, 2oo2, and 2oo3. The first, 1oo1, is most common because it has the fewest components and is the least expensive. The second, 1oo2, is used when it is critical to trip to a safe state during a hazardous condition, even if one of the sensors has failed. The third, 2oo2, is used when a spurious trip has an especially high cost.

The last architecture, 2oo3, provides the advantage of fault tolerance in the case of a single dangerous sensor failure and the advantage of fewer spurious trips in the case of a single “safe” sensor failure; some choose it because they are primarily concerned about safety, but also concerned about spurious trips, while for others, its advantage is that it allows them to reduce spurious trips while also achieving safety objectives. Its disadvantage is that it requires three sensors, instead of one or two, so it is an expensive compromise.
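The voting rule behind all four architectures can be sketched in a few lines. This is an illustration only, not logic solver code: the function name moon_trip and the boolean vote lists are assumptions made for this example.

```python
def moon_trip(votes, m):
    """Return True when at least m of the sensors vote to trip.

    votes is a list of booleans, one per sensor (True = vote to trip).
    The name and signature are hypothetical, chosen for this sketch.
    """
    return sum(votes) >= m


# Hypothetical snapshot: only the first of two sensors sees the trip condition.
one_of_two = [True, False]

print(moon_trip([True], 1))               # 1oo1: the single sensor decides
print(moon_trip(one_of_two, 1))           # 1oo2: one vote is enough
print(moon_trip(one_of_two, 2))           # 2oo2: both votes are required
print(moon_trip([True, False, True], 2))  # 2oo3: any two votes are enough
```

With the same pair of readings, 1oo2 trips and 2oo2 does not, which is the safety versus spurious-trip trade-off described above.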

Sensor Diagnostics

The sensors used in SIFs are getting better and better. The most impressive improvements are in diagnostics, where a system is able to detect if a sensor is “bad”.

Diagnostics allow a logic solver to analyze the signal being received from a sensor and in some cases, determine whether the information is valid or not. When it is valid, or “good”, the data can be believed; data from the sensor where the trip condition is exceeded means that a hazardous condition truly exists and likewise, data from the sensor where the trip condition is not exceeded means that a hazardous condition truly does not exist. On the other hand, when it is not valid, or “bad”, data from the sensor is meaningless, whether it indicates that the trip condition is exceeded or not.

When data from a sensor is known to be “bad” or invalid, there are three possible ways to treat the data from the sensor:

Option A: Ignore the diagnostics and treat the data as though it is “good” or valid. If the sensor wants to vote to trip, then vote to trip. If the sensor says do not trip, then do not vote to trip.
Option B: Because “bad” or invalid data is the same as no data, assume the worst and treat “bad” data from a sensor as a vote to trip.
Option C: Because “bad” or invalid data is the same as no data, ignore the sensor and treat the data as non-voting input.
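The three treatments, called Options A, B, and C in the sections that follow (in the order listed), can be sketched as a mapping from a sensor's reading and its diagnostic status to the vote the logic solver actually counts. The names Treatment and effective_vote are hypothetical, and using None to mean "no vote" is an assumption of this sketch.

```python
from enum import Enum


class Treatment(Enum):
    A = "believe the reading despite the diagnostics"
    B = "treat the fault as a vote to trip"
    C = "treat the sensor as a non-voting input"


def effective_vote(reading_says_trip, is_bad, treatment):
    """Return True (vote to trip), False (vote not to trip), or None (no vote)."""
    if not is_bad:
        # Good data is believed as-is, whatever treatment is configured.
        return reading_says_trip
    if treatment is Treatment.A:
        return reading_says_trip  # diagnostics ignored
    if treatment is Treatment.B:
        return True               # assume the worst
    return None                   # Option C: the sensor does not vote


# A faulted sensor whose reading happens to say "do not trip":
for option in Treatment:
    print(option.name, effective_vote(False, True, option))
```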

Which Treatment?

Process safety professionals will sometimes look at the three treatments and gravitate toward Option B. After all, it’s conservative. If you don’t know what is going on, go to the safe state. For a 1oo1 sensor architecture, this is the right approach. In an architecture with only one sensor, losing faith in that sensor—even as it continues to share readings—means that you don’t know what is going on. The right thing to do then is to ignore the sensor, whether it says things are safe or hazardous, and treat its fault as a vote to trip.

For the other sensor architectures, however, a fault of one sensor does not mean that you no longer have any idea what is going on. Treating that first fault as a vote to trip may not be the right thing to do.

1oo2

In a 1oo2 sensor architecture, one vote to trip is enough to cause a trip. Because there are two sensors, there is still valid data available, even when one sensor faults. If that fault is treated as a vote to trip—Option B—that single fault means a trip. This completely ignores the valid information coming from the second sensor. Option B in a 1oo2 sensor architecture essentially ignores the sane and lets the crazy decide.

Instead, the first fault in a 1oo2 sensor architecture should follow Option C and ignore the faulted sensor. That leaves the valid sensor to decide in what amounts to a 1oo1 sensor architecture. While 1oo1 is not as safe as 1oo2, it is much better than crazy.

2oo2

In a 2oo2 sensor architecture, it takes both sensors voting to trip to cause a trip. Even so, because there are two sensors, valid data is still available when one sensor faults. If that fault is ignored (Option C), then there can never be two votes to trip. Again, this completely ignores the valid information coming from the second sensor. Option C in a 2oo2 sensor architecture ignores the sane and lets the crazy decide.

Instead, the first fault in a 2oo2 sensor architecture should follow Option B and consider the first faulted sensor as a vote to trip. That leaves the valid sensor to decide, again in what amounts to a 1oo1 sensor architecture. Much better than crazy.
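The 1oo2 and 2oo2 cases can be checked with a small sketch. It assumes, as the discussion above does, that the number of votes required stays fixed and that only the faulted sensor's contribution changes with the treatment. The function trips and its arguments are hypothetical names for this example.

```python
def trips(required_votes, sensors, fault_treatment):
    """Decide whether the voter trips.

    sensors: list of (reading_says_trip, is_faulted) pairs.
    fault_treatment: "B" counts a faulted sensor as a vote to trip;
                     "C" counts it as no vote at all.
    """
    votes = 0
    for reading_says_trip, is_faulted in sensors:
        if is_faulted:
            votes += 1 if fault_treatment == "B" else 0
        elif reading_says_trip:
            votes += 1
    return votes >= required_votes


# Sensor 1 has faulted (its reading is irrelevant); sensor 2 is healthy.
healthy_safe = [(False, True), (False, False)]
healthy_hazard = [(False, True), (True, False)]

# 1oo2: Option B trips on the fault alone; Option C lets the healthy sensor decide.
print(trips(1, healthy_safe, "B"))    # True, a spurious trip
print(trips(1, healthy_safe, "C"))    # False
print(trips(1, healthy_hazard, "C"))  # True

# 2oo2: Option C can never reach two votes; Option B lets the healthy sensor decide.
print(trips(2, healthy_hazard, "C"))  # False, a missed trip
print(trips(2, healthy_hazard, "B"))  # True
print(trips(2, healthy_safe, "B"))    # False
```

In both architectures, the recommended treatment is the one where the outcome follows the healthy sensor's reading.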

2oo3

In a 2oo3 sensor architecture, it takes two sensors voting to trip to cause a trip. Because there are three sensors, there are two valid votes even when one sensor faults. This means you have a choice about how to treat that first fault. Option B, treating it as a vote to trip, means that only one more vote is needed to cause a trip. One vote out of the two good sensors is a 1oo2 architecture on the good sensors. Option C, ignoring the faulted sensor, means that both of the remaining sensors must vote to trip to get the two votes necessary. Two votes out of the two good sensors is a 2oo2 architecture on the good sensors. So, 1oo2 or 2oo2? They are both legitimate architectures, so which is right?

Remember that 2oo3 architectures are an expensive compromise between safety and availability. One of those features, however, is likely to be the greater concern. If you chose a 2oo3 architecture because you are primarily concerned about safety, but also about spurious trips, then choose Option B as the treatment for the first sensor fault. On the other hand, if you chose a 2oo3 architecture to reduce spurious trips while also achieving safety objectives, then choose Option C as the treatment for the first sensor fault.
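The way a 2oo3 voter degrades after its first fault can be verified by enumerating every combination of the two remaining good sensors. This is a sketch under the assumption that the two-vote threshold stays fixed; the function name trips_2oo3_one_fault is hypothetical.

```python
from itertools import product


def trips_2oo3_one_fault(good_a, good_b, fault_treatment):
    """2oo3 voter with one sensor faulted and two good sensors remaining.

    fault_treatment "B" counts the faulted sensor as a vote to trip;
    "C" counts it as no vote. Two votes are still required to trip.
    """
    fault_vote = 1 if fault_treatment == "B" else 0
    return int(good_a) + int(good_b) + fault_vote >= 2


for a, b in product([False, True], repeat=2):
    option_b = trips_2oo3_one_fault(a, b, "B")
    option_c = trips_2oo3_one_fault(a, b, "C")
    # Option B behaves as 1oo2 on the good sensors, Option C as 2oo2.
    assert option_b == (a or b)
    assert option_c == (a and b)
    print(a, b, option_b, option_c)
```

The two treatments differ only when exactly one good sensor votes to trip: Option B trips and Option C does not, which is the choice between favoring safety and favoring availability.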

Never Choose Option A

You’ll notice that Option A—ignoring the diagnostics and treating the faulted sensor as though it is a source of good information—is never the recommended option. That is because it is better to decide based on no information at all than to rely on the unreliable. Even though a clock that is stopped is absolutely correct, to the nanosecond, exactly twice a day, it should never be relied on to make decisions about the time.

You’ll also notice that there is nothing said about what to do with a second fault, or in the case of 2oo3 architectures, a third fault. It’s not that they cannot happen, but that things should never be allowed to go on long enough for them to happen. When you discover that your source of information has faulted—that it has gone crazy or begun lying—then the only thing to do is manage it until you can fix it or replace it. And don’t take too long about it. You should fix it or replace it as soon as you possibly can. Safety depends on getting good information.

Author

Mike Schmidt

With a career in the CPI that began in 1977 with Union Carbide, Mike was profoundly impacted by the 1984 tragedy in Bhopal and has been working on process safety ever since.
