Article: IoT downtime: How to accept and identify blips, blocks and bombs

By TJ Butler, Chief Software Architect, Mesh Systems

Article originally posted on TechTarget's Internet of Things Agenda blog

The first thing you need to do when designing an IoT system is to get everyone on board with the fact that it won’t work — at least every once in a while. Even if you’ve spent the time and expense to achieve several 9s of reliability for the stuff you build, there will likely be a dependency — somewhere in the chain — that is simply out of your control and causes some sort of IoT downtime.

IoT systems are often enabled by several supporting services on the back-end. Third-party services such as network providers, messaging platforms and even the core infrastructure of the Internet add dependencies between the sensors in the field and the dashboards they feed. While these supporting services are convenient and often necessary, they are managed by someone else and are consequently a risk. Maybe an IT person unknowingly changes the Wi-Fi configuration. Or maybe an ISP blacklists traffic from your devices, or there’s a widespread cellular outage. It’s important to identify dependencies, minimize the risk where possible and set expectations accordingly. The larger the system scales, the more distributed it becomes and the more likely it is that the spaces between “your stuff” will hiccup and you’ll have to deal with it. This means you have to plan for failure and handle it well.

“Handle it” doesn’t stop with technology. It’s important — arguably more important — to set the business expectation that IoT downtime could be a natural occurrence. It should spark conversation around expectations and the cost/benefit of making something truly highly available. You have to weigh the cost of support issues against the ROI of the opportunity at large. When margins are thin, even a single truck roll could kill the ROI for several units. It’s definitely not easy to quantify, but having those discussions early on will go a long way to minimizing the overall impact of issues.

Here’s a real scenario: An IoT system reported that several devices were not communicating after a routine firmware upgrade. The team determined that most of the devices were working as expected, but several devices were indeed offline. After a long troubleshooting session, they found the outage was coincidentally timed with a widespread DNS issue caused by a totally unrelated event which prevented the devices from communicating for several hours. Of course, the customer wanted to fix it immediately, but they had no way to communicate with the devices until the public DNS service was restored; the customer simply had to wait it out. Ironically, not long before, the same DNS service enabled them to avoid a previous outage by quickly pointing all of the devices to a failover system. Point being, dependencies aren’t inherently bad, but you should understand the tradeoffs.

Identify types of IoT downtime outages: Blips, blocks, bombs

A good practice is to walk through the entire system and identify potential types of outages. This could be as simple as walking through a block architecture and asking, “What happens if this block stops working?”

Ask yourself questions in this format: if this block fails,

· Can units still be shipped or produced?
· Would it prevent someone from doing their job in the field?
· Can units still send data?
· Can we still communicate with the devices?
· Are we dropping data?
· Will it impact the accuracy or quality of the data?
· How does this outage affect billing?
· Will we have to visit the site?

Asking these types of questions will likely identify situations that fall into the category of blip, block or bomb.

BLIPS

Blips are the most common errors associated with cloud computing, typically called “transient errors.” They are short, on the order of seconds or minutes. You should plan for blips as if they are a common occurrence; implementing retry logic is usually all that is required. Blips often happen when a service is busy and access to it is temporarily throttled. Another example is regional network congestion. From a UI perspective, a little feedback can go a long way: let the user know you hit a blip and that you’re still trying.
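To make the retry idea concrete, here is a minimal sketch (my illustration, not code from the article) of retry logic with exponential backoff and jitter in Python; send_fn and payload are hypothetical stand-ins for whatever transport your devices actually use.

import random
import time

def send_with_retry(send_fn, payload, max_attempts=5, base_delay=1.0):
    """Retry a flaky call with exponential backoff and jitter.

    send_fn and payload are hypothetical placeholders for the real
    transport (HTTP, MQTT, etc.); any exception is treated as transient here.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return send_fn(payload)
        except Exception as err:  # in practice, catch the transport's transient error type
            if attempt == max_attempts:
                raise  # the blip didn't clear; treat it as a block instead
            # Back off exponentially, with jitter so retries don't synchronize.
            delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.5)
            print(f"Transient failure ({err}); retrying in {delay:.1f}s")
            time.sleep(delay)

The jitter matters: if every device retries on the same schedule, the retries themselves can pile up into the kind of reconnect storm described under bombs below.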

BLOCKS

A block is typically infrequent, but a significant step up in severity. Blocks last longer, on the order of minutes or hours. Think of a cellular outage or the DNS issue described earlier. In these cases, retries fail and you’ll need to implement something like a pipe and filter pattern to queue up work that is waiting for the block to clear. Another common way to handle a block is to store and forward. These methods allow the stuff on either side of the block to continue normally and minimize data loss and downtime. Blocks are extremely disruptive and potentially costly if left unhandled. For example, a service technician might not be able to do their job because the onsite workflow depends on a system that is not available.
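As an illustration of the store-and-forward idea (a sketch under assumed names, not the author’s implementation), the class below buffers readings locally while the upstream service is blocked and drains the buffer once uploads succeed again; upload_fn is a placeholder for the real transport.

import collections

class StoreAndForward:
    """Buffer readings locally while the upstream service is blocked.

    upload_fn is a hypothetical callable that raises on failure;
    max_items caps local storage so the device doesn't run out of memory.
    """
    def __init__(self, upload_fn, max_items=10_000):
        self.upload_fn = upload_fn
        self.buffer = collections.deque(maxlen=max_items)  # oldest readings drop first

    def record(self, reading):
        self.buffer.append(reading)
        self.flush()  # opportunistically try to drain on every new reading

    def flush(self):
        while self.buffer:
            reading = self.buffer[0]
            try:
                self.upload_fn(reading)
            except Exception:
                return  # still blocked; keep the data and try again later
            self.buffer.popleft()  # only discard after a confirmed upload

The key design choice is that data is only discarded after a confirmed upload, so the stuff on the device side of the block keeps working and nothing is silently dropped.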

BOMBS

Bombs can occur when the previous situations persist to the point where they result in a cascade of failures. A bomb is as bad as it gets and usually requires manual intervention to recover. A cascade might look something like this: a cellular outage causes all of the devices in a huge region to reconnect at the same time, creating a denial of service on the servers which, in turn, causes the servers to restart … which starts the cycle all over again. Your goal here is to identify potential bombs and put barriers in place to convert them into blocks or blips where possible.
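One barrier that helps with exactly this cascade (again a hypothetical sketch, not the article’s design) is randomized, capped backoff on reconnect, so a fleet that lost connectivity at the same moment does not all reconnect at the same moment; connect_fn is a placeholder for the real connection routine.

import random
import time

def reconnect_with_jitter(connect_fn, base_delay=5.0, max_delay=600.0):
    """Reconnect after an outage without contributing to a thundering herd.

    connect_fn is a hypothetical callable that returns a connection or raises.
    Each device waits a random, exponentially growing delay, which spreads
    the fleet's reconnect attempts out over time.
    """
    delay = base_delay
    while True:
        time.sleep(random.uniform(0, delay))  # jitter spreads the fleet out
        try:
            return connect_fn()
        except Exception:
            delay = min(delay * 2, max_delay)  # back off further on repeated failure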


In any case, feedback to end users is a huge help and will go a long way towards mitigating the frustration associated with outages and IoT downtime. Don’t leave a user in the dark if you know something isn’t working. Let them know so they can do something else instead of hitting a button over and over.

Taking the time to think through these scenarios will help everyone appreciate how outages impact their part of the business. This is as much an educational and discovery process as anything. Probably the most important suggestion is to give yourself health and diagnostics visibility focused on dependencies. That visibility can be a huge time saver; without it, you’ll end up spending time hunting down something you can’t fix anyway.
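As a small, hypothetical example of dependency-focused visibility, a periodic check like the sketch below can quickly tell you whether a failure is in your stuff or in a dependency such as DNS or an upstream endpoint; the hostnames and ports here are placeholders, not real services.

import socket

# Hypothetical dependency list; replace with your actual hostnames and ports.
DEPENDENCIES = {
    "dns_and_backend": ("example-iot-backend.com", 443),
    "message_broker": ("broker.example.com", 8883),
}

def check_dependencies(timeout=3.0):
    """Return a simple up/down report for each external dependency."""
    report = {}
    for name, (host, port) in DEPENDENCIES.items():
        try:
            # Connecting exercises both DNS resolution and network reachability.
            with socket.create_connection((host, port), timeout=timeout):
                report[name] = "up"
        except OSError as err:
            report[name] = f"down ({err})"
    return report

if __name__ == "__main__":
    for name, status in check_dependencies().items():
        print(f"{name}: {status}")

Running a check like this on a schedule, and surfacing the results next to your device health dashboards, makes it much faster to say “this is a block in a dependency, wait it out” instead of hunting for a bug that isn’t yours.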


About TJ Butler

TJ Butler is the Chief Software Architect at Mesh Systems where he is responsible for the design and development of IoT systems. Mesh Systems provides IoT software and services for product manufacturers.

TJ has helped start and grow several companies during his 20+ years of experience. That entrepreneurial spirit ensures his engineering approach is grounded with a solid business perspective.

TJ’s career began in industrial automation, building manufacturing intelligence systems, which has been a fantastic foundation for IoT. Just prior to Mesh Systems, TJ led the engineering team for Symantec's Risk Automation Suite, which he joined through Symantec's acquisition of Gideon Technologies, where he was the Senior Architect. Butler is named as an inventor on two patents and holds a Computer Science degree from Purdue University.

Friday, March 4, 2016