FAA Failure: A Failure in IT Operations and Governance
Blog article by Jeff Hare
In early January, an FAA software failure halted US flight operations for several hours. For a summary of the failure, see Adam Levin’s Bloomberg article “FAA Computer Failure Caused by People Who Damaged Data File”.
According to the article, a “data file” responsible for the chaos was damaged because of a failure to follow government procedures.
I would like to highlight some particularly interesting comments made by Levin. First, “[t]he preliminary indications are that two people working for a contractor introduced errors into the core data used on the system…according to a person familiar with the FAA review.”
Two people working together on this change immediately raises suspicion of collusion, but “[a]gency officials are attempting to determine whether the two people made the changes accidentally or intentionally, and if there was any malicious intent.” As if to pique the interest of auditors and risk advisors alike, “[t]he file or files were altered in spite of rules that prohibit those kinds of changes on a live system.”
First, let’s assume that the article is accurate. The facts stated in the article are:
- Procedures were in place to prevent someone from making “those kinds of changes” in a live system.
- Two people working for a contractor introduced errors into the system.
From these facts we can conclude that contractors are authorized to make changes to Production, but that certain “kinds of changes” are not allowed, and yet users have the ability to make them.
This leads us to ask the following questions:
- Why are contractors provisioned access to make changes in Production that they are not authorized to make?
- Why are contractors allowed to make any changes to Production?
The purpose of the change management process in any organization is, first and foremost, to prevent unauthorized changes in Production that could have negative consequences for the functionality provided by the system. The controls I would expect to support the change management process are:
- First, there should be a separation of duties between the development of a change and the migration of the change into Production. IT management and auditors expect that the developer of a change does NOT have the ability to migrate such change to Production. This is IT governance 101.
- Second, changes being considered for implementation are tested before they are moved into Production. This requires a non-production environment similar to the Production environment to be used for testing by someone independent of the developer.
- Third, logging should be in place to track changes in Production. In a situation such as this, logging would allow the change(s) responsible for the outage to be identified easily and quickly.
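The three controls above can be sketched in code. The following is a minimal, hypothetical illustration (none of these names reflect the FAA's actual systems): a migration routine that refuses a change unless separation of duties holds and independent testing has occurred, and that writes an audit log entry for every change it lets through.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


class ChangeControlError(Exception):
    """Raised when a change violates one of the change-management controls."""


@dataclass
class ChangeRequest:
    """A hypothetical change record carrying who did what."""
    change_id: str
    developer: str
    tester: str
    migrator: str
    tested: bool = False


@dataclass
class ChangeManager:
    audit_log: list = field(default_factory=list)

    def migrate_to_production(self, change: ChangeRequest) -> None:
        # Control 1: separation of duties -- the developer of a change
        # must NOT be the one who migrates it to Production.
        if change.migrator == change.developer:
            raise ChangeControlError("developer cannot migrate their own change")
        # Control 2: the change must be tested, by someone independent
        # of the developer, before it moves to Production.
        if not change.tested or change.tester == change.developer:
            raise ChangeControlError("change must be independently tested first")
        # Control 3: log the Production change so it can be traced later.
        self.audit_log.append({
            "change_id": change.change_id,
            "migrated_by": change.migrator,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })
```

In a real environment these checks live in the deployment pipeline and identity systems rather than application code, but the logic is the same: the control is enforced by the system, not by trusting people to follow procedure.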
Our advice to our clients is to control the key changes in Production through internal staff.
Management needs to have full control and accountability over all changes being made in production. Even where development and support activities are outsourced, management needs internal staff to move such changes into Production themselves. Red flag number one for me is the fact that contract personnel have ANY access to Production allowing them to make changes to data. This is a significant IT governance failure.
Whether ignorantly or intentionally, the Federal Aviation Administration appears to have failed to comply with this governance principle. Blind trust doesn’t prevent users from making “those kinds of changes”.
Beyond its failure to protect the production environment, the FAA made a second colossal error. Access was provided to the contractors beyond what they were authorized to do. Management’s misstep was in the design of access for the contractors. If certain changes were acceptable and other changes were NOT acceptable, then access controls (see NIST role-based access control (RBAC)) were not properly designed and implemented from the start.
Based on this article, we are to believe contractors are authorized to make certain changes in Production, but that they have functional capabilities to make changes above and beyond what their authorization allows. In addition to the absence of an all-important segregation of duties, the FAA also overprovisioned access: contractors were given access to data and files that they had no authorization to change.
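The design principle at issue can be shown in a few lines. This is a minimal sketch in the spirit of the NIST RBAC model (users are assigned roles, roles are granted permissions, and every operation is checked against the union of the user's role permissions); all role, user, and permission names here are illustrative assumptions, not the FAA's.

```python
# Hypothetical role and user assignments. Properly designed, the
# "contractor" role simply never includes the permission to modify
# live data, so the access failure described above cannot occur.
ROLE_PERMISSIONS = {
    "contractor": {"read_production", "submit_change_request"},
    "internal_ops": {"read_production", "migrate_change", "modify_live_data"},
}

USER_ROLES = {
    "contractor_a": {"contractor"},
    "ops_admin": {"internal_ops"},
}


def is_authorized(user: str, permission: str) -> bool:
    """True only if one of the user's roles grants the permission."""
    return any(
        permission in ROLE_PERMISSIONS.get(role, set())
        for role in USER_ROLES.get(user, set())
    )
```

With roles designed this way, `is_authorized("contractor_a", "modify_live_data")` returns `False`: the contractor's capabilities match their authorization exactly, which is precisely what appears not to have been true here.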
The Conspiracy Theorist in me has me questioning the official narrative.
Later that same day, a similar outage happened in Canada. I read an article, “Computer Outage Hits Canadian Flight System Hours After US System Went Down”, based on a tweet from a government official. The following day I read that the postal system in the UK had been taken down as well.
As incompetent as I believe our federal government is in so many areas, I cannot believe that two monumental breaches of IT best practices occurred in an organization so critical to our economy.
Another explanation for the FAA outage is that there were successful cyber-attacks in the US, Canada, and the UK within 36 hours, and our governments did not want to admit it.
Regardless of whether you believe the official narrative or not, the article is an interesting case study that can be used by those in internal audit and IT compliance to discuss these risks with management.