Technical Debt Risk: Review SWA, the FAA and Twitter Outages

Miranda Rudy-Nguyen

June 15, 2023

How all organizations can learn to spot the warning signs

Until recently, “technical debt” was a term used mostly within IT, specifically by architects, developers, app owners and IT leaders. Thanks to a few high-profile outages at Southwest Airlines, the FAA, and Twitter, technical debt has reached mainstream media outlets, which are reporting on how unchecked technical debt contributed to failures that affected millions of people and, for some organizations, cost billions of dollars and did immeasurable damage to their brand reputation.

While these organizations likely wish their hardships weren’t blasted to the public, perhaps the spotlight will serve as a warning to the thousands of other organizations that could share similar fates if they don’t act soon to address their technical debt. As more organizations shift applications to the cloud to enhance their capabilities, the problem will only increase. 

In this Q&A with Bob Quillin, the chief ecosystem officer at vFunction, we take a deep dive into how technical debt happens, the risks of ignoring it, and how it can be efficiently managed before it leads to major issues.

Q: Can you give me a little background on each of these system failures? Let’s start with Southwest Airlines.

Bob: Southwest Airlines has actually had two failures recently. The most recent issue was a firewall failure. Even the vice president said they never know when a failure is going to happen, and fixes have been slow. This is the definition of technical debt risk.

The first outage impacted tens of thousands of travelers during the peak holiday season. At first glance, you might think it was just an unfortunate coincidence, but technical debt typically is most dangerous when there is stress on the infrastructure, so the timing of this crash wasn’t random.

Over the last few years, Southwest has been called out for its outdated systems that need upgrading. How they interact with crew members and guests is very manual and phone-based. Even the pilots and crew have been saying the systems are antiquated. Most major airlines have fully modernized their business processes, whereas Southwest has not. They knew they had technical debt, but they weren’t addressing it. This scenario is typical of most technical debt issues we see in the marketplace. You keep kicking the can down the road and crossing your fingers. 

When technical debt is cited in both the financial and mainstream press as a reason for high-profile business outages, the business impact becomes more visible, and the IT and engineering teams are no longer the only ones talking about it. When it causes a billion-dollar outage that impacts millions of people, it’s more obvious even to business people outside of IT. It can affect application availability, firewalls, data security, and more. When one card falls, others fall too, and you never know when it’s going to happen or how many systems it will impact.

Q: What about the FAA?

Bob: The FAA failure was an issue around a damaged database file and is a good example of an aging app infrastructure. With an older monolithic architecture like the FAA has, a single issue in one location has a ripple effect all the way down, cascading to a greater issue. Had they broken down their monoliths into microservices, they would have had a more distributed architecture with greater survivability, so one outage wouldn’t cause others to shut down the system. 

The FAA knew they had an outdated application that needed to be modernized, but it was risky to change. Everyone is adding more features and trying to patch it here and there, so one problem causes so many others. 

Q: Is there a way to reduce that risk?

Bob: You have to directly measure and manage technical debt to try to understand the risk — what are the dependency chains, the downstream effects? To stay in front of that you need a technical debt analysis strategy to track architectural drift and monitor how components are dependent and interrelated. Then you can begin isolating where problems occur, and the blast area is smaller. A best practice is if there is a problem, you are able to isolate it to minimize the cascading effect. Southwest Airlines couldn’t handle the scale, but the FAA had one small problem that cascaded into a bigger issue. It’s why so many organizations are moving to a cloud-native architecture.
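To make the idea of dependency chains and blast area concrete, here is a minimal sketch. It is an illustration only, not something described in the interview and not vFunction's method. It assumes you already have a map of which components depend on which; the component names and the `blast_radius` function are hypothetical.

```python
from collections import defaultdict, deque


def blast_radius(depends_on, failed):
    """Return every component that directly or transitively depends on `failed`."""
    # Invert the edges so we can ask "who depends on this component?"
    dependents = defaultdict(set)
    for component, deps in depends_on.items():
        for dep in deps:
            dependents[dep].add(component)

    affected = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for dependent in dependents[current]:
            if dependent != failed and dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected


# Hypothetical system: each key depends on the components in its set.
DEPENDS_ON = {
    "booking": {"database", "auth"},
    "crew_scheduling": {"database"},
    "notifications": {"booking"},
    "reporting": {"notifications"},
    "auth": set(),
    "database": set(),
}

print(sorted(blast_radius(DEPENDS_ON, "database")))
print(sorted(blast_radius(DEPENDS_ON, "reporting")))
```

In this made-up map, a database failure reaches four other components, while a failure in reporting reaches none. The smaller that set is for each component, the smaller the blast area.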

Q: Let’s talk about Twitter. Its outage had less of a catastrophic impact, but it was, at a minimum, an inconvenience for users.

Bob: The Twitter outage was attributed to a coding mistake. There was a lot of public discussion among its engineers that the application has grown dramatically over the years and is slow and hard to change. They prioritized feature velocity over performance, spending a lot of time trying to add more capabilities without fixing the technical debt. We see this mistake across many companies.

Twitter is now trying to make more structural changes, replacing old features with new ones, and realizing the code can’t change as quickly as the new management wants. They are trying to ramp up engineering velocity, but the applications weren’t built for that. 

With a cloud-native architecture, they could add those features more quickly with more agility, but the technical debt they’ve accumulated over the years makes it harder to make changes. They’ve taken on too much technical debt to adopt new features quickly, and the application has just become too brittle. Unfortunately, you can’t take a monolith and turn it into a cloud app magically.

Q: These are examples of technical debt risk at large organizations. Does technical debt apply to smaller companies as well?

Bob: Most definitely. If you take a look at the types of organizations we’ve just discussed, we have a 40-year-old major airline, a government entity that’s slower to modernize but has mission-critical applications, then a newer cloud-unicorn company that you’d think is technically advanced. All three share technical debt issues that formed for different reasons and caused high-profile failures, turning a technical problem into a business problem.

What typically happens is that technical debt is only discussed inside of engineering and only surfaces when something catastrophic happens. All three examples are very visible, but the same problems occur on a smaller scale at probably every company.

Q: How can a company know they have a technical debt problem?

Bob: Technical debt causes many familiar symptoms: a feature that didn’t come out on time, a key customer lost, or a deal lost to a competitor. These are often related to an inability to respond quickly because engineering velocity is dragged down by technical debt. You can see it occurring at a micro level that’s less visible than a total system crash. You lose a deal, a customer, or market share one drip at a time. All of those things can happen because technical debt slows your ability to innovate and keep up with opportunities.

On the flip side, look at what happened to Zoom. Zoom took the pandemic as an opportunity and was able to race ahead of competitors. No one anticipated everyone going virtual. They had the agility to make those changes quickly because they were cloud-native. Other businesses were slower to respond.

What happens when the pandemic effect is over? Can you respond to the next opportunity? All those windows are built upon engineering velocity driving business agility. There is nothing worse for a CTO, senior engineer, or app owner than to have to explain to their CEO or CFO that the company can’t innovate and win because it doesn’t have engineering agility.

Q: So how do organizations typically approach the lack of engineering velocity or business agility?

Bob: Usually, they debate whether they should hire more people or less expensive resources or outsource it. They ignore technical debt and bolt on more and more features to keep trying to move faster. The problem with monoliths is there’s only so fast you can move. Having more people doesn’t always mean you can move faster. You can’t hire enough people or buy big enough machines to keep up. 

The only way to increase velocity and innovate faster is to rearchitect the product. With a monolithic architecture, you have fixed costs in terms of hardware and software infrastructure that can become cost-prohibitive. We have one customer that couldn’t buy a bigger machine because it didn’t exist. Their only option was to break up the monolith into microservices in order to scale. They could then afford to add resources where it helped the business, and they gained efficiency by applying the dollars they had to their infrastructure licensing needs.

Q: Are budgets a significant component here?

Bob: The problem is that companies aren’t addressing technical debt because they don’t want to dedicate the resources for it – time, people, and money. They either need to add more resources or dedicate the time to fix it. Unfortunately, your resource budget isn’t likely to go up and will probably be reduced. So what do you do? 

You can just let things go and keep adding more stuff to it to make it work at the expense of fixing the debt. That works out fine until the rules change. For example, Elon comes in and says we’re going to get rid of this and add this, and then engineers say they can’t make those changes that are required to change the business model that way.

Q: So, there is a cost to carrying technical debt?

Bob: Absolutely. That’s where business planning comes in. You have to look at what technical debt is costing and build a business case to show there is ROI to modernize. How do you break out of this deadly cycle, where technical debt is going up, and innovation is going down? It requires a frank conversation. Before vFunction, there was nothing to build that business case so you could have the conversation.

Q: How does vFunction help build that business case for reducing technical debt risk?

Bob: Our goal is focused on using science and data to analyze your app, determine the most effective way to modernize it, and help you put together a business case. We tell you where to modernize, the reasons and risks, and the upside, including what percentage of your IT budget you are spending on technical debt versus innovation. We can provide those insights in just six months.

Businesses of all sizes have to have the data, analysis and ability to understand what architectural changes they need to make to get that velocity and avoid outages that others are seeing. More importantly, you get the business velocity you need to get into a win-win situation — minimizing catastrophic events and creating a greater velocity.

Q: In the past, it was hard to quantify innovation, but vFunction can do that?

Bob: Yes. Our software puts numbers on what innovation means. Innovation is a goal, but what does your feature backlog look like in terms of features and new capabilities you want to add to your application? How much is that growing over time, and are those features working? 

If you can increase your feature velocity, that will give you a dollar amount on the other side. Will it add $1M to your bottom line? You can build a business case on feature velocity. You can also understand how much an outage would cost, or if you already have one, how fast you can make bug fixes. There is a cost to that. 

There is also a cost to run an app — high-cost hardware, software licensing, and database licensing. All have a compelling, hard dollar cost. You need a business case with a clear view of what you want to do, where you want to do it, and how long it will take, and make sure you can have a clear discussion about business value. 

Most modernization projects that have been successful have this full visibility into the advantages. That said, you have business-critical apps that need to keep running, and you can’t just flip the switch. There are a variety of best practices, like the Strangler Fig Pattern, to keep the monolith alive while you modernize. It’s a risk-averse, programmatic, sequential way to move from an old pattern to a new one without having a drop in services.
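The Strangler Fig Pattern mentioned above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not a production design and not how any particular product implements it: a facade sits in front of the monolith and sends requests to a new service only for the routes that have been migrated. The class name, handlers, and paths are all hypothetical.

```python
class StranglerFacade:
    """Routes each request to a migrated service if one exists, else to the monolith."""

    def __init__(self, legacy_handler):
        self.legacy_handler = legacy_handler
        self.migrated_routes = {}  # path prefix -> new handler

    def migrate(self, prefix, new_handler):
        # Cut one slice of functionality over to its new service.
        self.migrated_routes[prefix] = new_handler

    def handle(self, path):
        # The longest matching prefix wins, so specific routes can move first.
        for prefix in sorted(self.migrated_routes, key=len, reverse=True):
            if path.startswith(prefix):
                return self.migrated_routes[prefix](path)
        # Everything not yet migrated still goes to the monolith.
        return self.legacy_handler(path)


def legacy_monolith(path):
    return f"monolith handled {path}"


def new_booking_service(path):
    return f"booking service handled {path}"


facade = StranglerFacade(legacy_monolith)
before = facade.handle("/booking/123")
facade.migrate("/booking", new_booking_service)
after = facade.handle("/booking/123")
untouched = facade.handle("/reports/daily")
print(before, after, untouched, sep="\n")
```

Callers keep talking to the facade throughout, which is what lets the move happen one route at a time without a drop in service.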

Q: How long does assessing technical debt risk take?

Bob: vFunction Assessment Hub is relatively quick. It typically focuses on a core set of apps you determine are worth modernizing, which can be a handful or hundreds of apps that have business value. Our Assessment Hub is an affordable, efficient and automated way to build the business case, taking less than an hour for one app or a few weeks for a larger application estate.

Q: Once you understand the extent of your technical debt, then what?

Bob: vFunction Modernization Hub analysis is automated, but it involves active interaction with an architect through our Studio UI to refine and refactor the architecture. A process that might take years to complete without vFunction takes only weeks or months with it, with higher-quality results. With Modernization Hub, you have the data and the understanding of whether the architecture and dependencies improve with each change.

Q: What are the costs and time associated with modernizing with Modernization Hub?

Bob: The cost and time are based on the scale of the app, so the Assessment Hub will tell you how long it will take. Some apps have millions of lines of code and tens of thousands of classes, so it takes more time. Our pricing and estimations are based on complexity and the number of classes within the app. With our service extraction capability, it’s a full, end-to-end cycle. We find a major value in visualizing the recommended service topology and refining the architecture from there. 

Q: What is the role of the architect here?

Bob: The architect stays in control, but we guide them. They can decide if they want to split out the services or combine them. We facilitate those decisions and provide guidelines and recommendations, but it’s important to use vFunction as the expert tool that helps them do their job more efficiently and clearly with observability and control on their end.

Q: Is modernization a one-and-done sort of thing?

Bob: It’s not. It’s continuous because there are always changes to the architecture and apps. But vFunction Continuous Modernization helps you baseline your architecture, monitors the metrics you need to track, and detects critical architectural drift. We alert you when something exceeds an expected baseline or threshold — anything that causes a spike in technical debt that needs to be controlled. Then, the architect can go back into the Modernization Hub to fix it. 
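The general idea of comparing current metrics to a baseline and alerting on a threshold can be sketched as follows. This is an assumption-laden illustration, not vFunction's product or API: the metric names, the 10% tolerance, and the `detect_drift` function are hypothetical.

```python
def detect_drift(baseline, current, tolerance=0.10):
    """Return metrics that grew by more than `tolerance` (a fraction) over baseline."""
    alerts = {}
    for name, base_value in baseline.items():
        value = current.get(name)
        if value is None or base_value == 0:
            continue  # nothing to compare against
        change = (value - base_value) / base_value
        if change > tolerance:
            alerts[name] = round(change, 3)
    return alerts


# Hypothetical architecture metrics captured at two points in time.
BASELINE = {"class_count": 1000, "cross_domain_dependencies": 200, "dead_code_classes": 50}
CURRENT = {"class_count": 1050, "cross_domain_dependencies": 260, "dead_code_classes": 50}

print(detect_drift(BASELINE, CURRENT))
```

With these made-up numbers, only the cross-domain dependency count is flagged, because it grew by 30% while the class count grew by 5% and stayed under the tolerance.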

Q: Finally, what’s the ultimate lesson we can learn from the Southwest Airlines, FAA, and Twitter failures?

Bob: The fact that technical debt has worked its way into the business press and everyday conversation is not a good thing. It’s a warning to every business, and now that it’s so public, your business leaders will likely start asking how technical debt is being addressed. 

If you’re not tracking your technical debt, you will miss the warning signs. You’ll start to see slowdowns and glitches, business failures, and failure to meet business expectations. Every application owner is assuming and hoping these issues won’t snowball into a catastrophic failure down the line, but we are seeing more of these happening. 

It’s easy to understand if you think of it in health terms, like catching the early warning signs of a heart attack. If technical debt is truly something that can have a critical effect on your business, and you see warning signs, at least measure, monitor and prepare. You need a physical for your application estate. We are like an EKG, identifying where the problems are and their extent. You don’t want to wait until fixable issues grow into a catastrophe like they did with Southwest Airlines. Act now, and you can proactively manage technical debt and control the risk so that it won’t stop the heart of your operations.

Bob Quillin serves as Chief Ecosystem Officer at vFunction and works closely with customers, helping enterprises move to the cloud faster, smarter, and at scale. His insights have helped dozens of companies successfully modernize their application architecture with a proven strategy and best practices. Learn more at vFunction.com.

