Improve Your Root Cause Analysis
Problem solving begins with an understanding of root causes; failing to understand the root cause makes it impossible to solve problems that arise on the floor.
By Duke Okes, CMC
Owner and CEO
APLOMET/Applied Logical Methods
In many organizations, the process of problem solving is guided by a formal model such as Plan-Do-Check-Act (PDCA), eight disciplines (8-D), or a corrective action process such as that required by ISO 9001 and related industry quality management standards. While these models are useful, they often don't provide sufficient guidance for the most critical step of problem solving: root cause analysis. For example, the 8-D and corrective action models imply that root cause analysis is simply one step. Perhaps this is true for someone who is highly experienced, but most people need some additional guidance.
In effect, root cause analysis is not the same as problem solving. Root cause analysis is only the diagnostic part of the problem-solving process, but it's the part that, if done incorrectly, causes all subsequent actions to have little value. Root cause analysis means finding the specific source(s) that created the problem so that effective action can be taken to prevent recurrence of the situation.
A first issue in problem solving is recognizing that there are different types of problems. One useful classification is whether the problem is creative or analytical. Some problems simply require identifying and selecting among solutions (a creative approach), while others require identifying and resolving root causes (an analytical approach). For example, if someone locks the keys in the car, choosing among options such as calling a locksmith, breaking a window, or taking a taxi home for the extra key will be more beneficial than trying to ascertain the reason for the mistake. If that same vehicle doesn't start one morning, however, only identifying and rectifying the cause is likely to ensure that it will start the following morning.
Another useful classification of problems is whether they are repetitive or a single event. Repetitive problems are those that occur frequently over a period of time, which allows collecting and analyzing data related to each occurrence to detect patterns. Single event problems are those that occur at a single point in time, and require investigation of activities leading up to the event.
There are also different types of root causes that create problems. Some causes may be technical in nature, while others may be due to human decisions and actions. Both types of causes can create the same result, such as an automobile that doesn't start because of a faulty switch, or because the driver left the interior lights on overnight. In organizations it is not unusual to find that what originally appears to be a technical problem (e.g., a machine that suddenly can't hold specs) is actually the result of a human decision (e.g., switching to a different supplier as part of a policy to minimize the cost of purchased materials).
Root cause analysis is the process of drilling down from symptoms, to problem definition, to possible causes, to actual cause(s). Doing so is an iterative process that combines divergent and convergent thinking. Several common process analysis tools can be useful throughout the process, such as:
- Flowchart--Shows the operational steps involved in the process, and where data could be collected in order to identify the major contributors to problems.
- Brainstorming--Provides a mechanism for identifying all possible causes, from which those least likely can be eliminated based on logic, with data collection then focusing on the most likely.
- Logic tree diagram--A tool for breaking down the system/process into functional subsystems/components, identifying the logical cause and effect relationships.
- Run charts--Enable analysis of data over time to look for trends/patterns that may indicate the root cause.
- Histogram--Unlike run charts, histograms group all the data into one distribution, the shape of which might indicate other patterns worth investigating.
- Pareto diagram--Can be used to analyze categorical information on root causes to identify the major contributors.
- Statistical tests--While graphical tools such as run charts and histograms are good ways to analyze data, they are not as sensitive to small differences that might exist between sources of variation. Statistical tests such as the t-test, F-test, analysis of variance (ANOVA), and chi-square test can detect small differences based on desired levels of confidence.
- 5-Whys--Repeatedly asking why combines two activities: building a logic tree diagram, and collecting and analyzing data to support or reject each branch as a contributor to the problem being solved. The cause-and-effect logic that led to the final conclusion can be presented in a pictorial 5-why diagram.
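To make the statistical tests item above concrete, here is a minimal sketch of a two-sample comparison using Welch's form of the t-test, written with the Python standard library only. The function name `welch_t` and the two sets of readings are hypothetical and invented for illustration; they do not come from any case described in this article.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t statistic, approximate degrees of freedom) for two samples.

    Uses Welch's unequal-variance t-test. Each sample needs at least two values.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error contributed by each sample (sample variance / n).
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t, df


# Hypothetical readings from two sources of variation (e.g., two shifts).
shift_1 = [10.1, 10.3, 9.9, 10.2, 10.0]
shift_2 = [10.4, 10.6, 10.5, 10.3, 10.7]

t_stat, dof = welch_t(shift_1, shift_2)
print(f"t = {t_stat:.2f}, df = {dof:.1f}")
```

With these made-up readings the statistic is about -4.0 with roughly 8 degrees of freedom. Its magnitude would then be compared with the critical value from a t table for the desired confidence level (about 2.31 for 95% two-sided confidence at 8 degrees of freedom) to judge whether the difference between the two sources is likely to be real.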
Sometimes solving a problem just requires a little organization. One team trying to reduce downtime of a continuous process line was getting nowhere by brainstorming ways to reduce it. They needed data to indicate which actions would be most beneficial. As a first step, the team flowcharted the process. While seemingly very simple, this step ensures that all players consider the entire system, rather than only the part with which they are familiar.
The team then decided to find which piece of equipment contributed the most downtime (it was not the one most would have expected, since the visibility of problems differs from one machine to another). They collected and analyzed data, which led them to focus on machine B. They then classified the causes of downtime on that machine (the leading cause again was not what most would have expected, since it did not involve maintenance personnel). Changeover time was the major culprit, and the root cause of long changeover time was then identified and addressed. The resulting reduction in downtime yielded annual savings of $360,000.
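The kind of ranking the team performed can be sketched as a simple Pareto tally. The code below is an illustration only: the function name `pareto`, the machine labels other than machine B, and all downtime figures are hypothetical, since the article does not report the team's actual data.

```python
def pareto(minutes_by_category):
    """Rank categories by downtime, largest first, with cumulative percent.

    Returns a list of (category, minutes, cumulative_percent) tuples.
    """
    total = sum(minutes_by_category.values())
    ranked = sorted(
        minutes_by_category.items(), key=lambda item: item[1], reverse=True
    )
    rows = []
    running = 0
    for category, minutes in ranked:
        running += minutes
        rows.append((category, minutes, round(100 * running / total, 1)))
    return rows


# Hypothetical downtime log totals (minutes per month), for illustration only.
downtime = {
    "Machine A": 120,
    "Machine B": 540,
    "Machine C": 90,
    "Machine D": 150,
}

for category, minutes, cumulative in pareto(downtime):
    print(f"{category:10s} {minutes:5d} min  {cumulative:5.1f}% cumulative")
```

The same function could be applied a second time to the downtime causes recorded for the top-ranked machine, mirroring the two-stage drill-down described above.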
In another case, a dial index machine being used to pin two parts together was not able to meet the specification for centrality of the pin relative to the part diameter. Possible causes considered included:
- Parts were loaded incorrectly,
- The drill or ream stations had too much variation,
- The machine rotation index was not accurate enough, or
- Fixtures were not holding the part well enough.
A histogram of centrality data was created, then broken down to look at each fixture. It was surmised that if the problem were caused by loading, machining, or indexing, it would show up randomly across all fixtures. If it were a fixture problem, each fixture might show its own distribution, which in fact was the case. An engineering review of the fixture design determined that tolerance stack-up was too high, and the fixtures were replaced. The resulting reduction in rework and scrap amounted to $500,000 per year.
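The stratification logic in this case, comparing each fixture's distribution against the others, can be sketched numerically. The function names, fixture labels, and centrality readings below are all hypothetical and chosen only to show the pattern; the ratio at the end is a rough screening aid assumed for this sketch, not a formal test.

```python
from collections import defaultdict
from statistics import mean, stdev


def summarize_by_fixture(readings):
    """Group (fixture, value) pairs and return {fixture: (mean, stdev)}.

    Each fixture needs at least two readings for stdev to be defined.
    """
    groups = defaultdict(list)
    for fixture, value in readings:
        groups[fixture].append(value)
    return {f: (mean(vals), stdev(vals)) for f, vals in groups.items()}


def between_to_within_ratio(summary):
    """Spread of the fixture means divided by the average within-fixture spread.

    A large ratio suggests each fixture has its own distribution; a ratio
    near 1 or below suggests the variation is not tied to the fixtures.
    """
    means = [m for m, _ in summary.values()]
    spreads = [s for _, s in summary.values()]
    return stdev(means) / mean(spreads)


# Hypothetical centrality offsets (mm) by fixture, for illustration only.
readings = [
    ("F1", 0.10), ("F1", 0.12), ("F1", 0.11),
    ("F2", 0.30), ("F2", 0.31), ("F2", 0.29),
    ("F3", -0.20), ("F3", -0.19), ("F3", -0.21),
]

summary = summarize_by_fixture(readings)
for fixture, (avg, spread) in sorted(summary.items()):
    print(f"{fixture}: mean={avg:+.3f}  stdev={spread:.3f}")
print(f"between/within ratio: {between_to_within_ratio(summary):.1f}")
```

In this made-up data each fixture is tightly grouped around a different center, which is the signature that would point toward the fixtures rather than loading, machining, or indexing.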
Personnel dealing with a heat-treat process discovered that they were intermittently seeing excess variation from location to location within the oven, and failure of thermocouples was determined to be the cause. Because the problem kept returning even though each failing device had been replaced, however, it was determined that the real root cause had not been found, which led to a search for deeper causes.
A logic tree diagram was created to look at possible causes. One by one each branch of the tree was expanded or eliminated by looking at related processes and conducting interviews. It was finally determined that the organization's policy of purchasing MRO items at the lowest possible cost had led to the purchasing personnel changing thermocouple suppliers. Although the cross-reference part number was the same, the manufacturer of the replacement item did not produce a part that was the same in all ways, which led to early failures. Purchasing and top management were provided with supporting data showing the costs related to failures, and how this compared to a higher price per thermocouple.
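A logic tree of this kind can be represented as a small data structure in which branches are marked as eliminated while the investigation proceeds. The sketch below is an assumption-laden illustration: the class `Cause`, the function `surviving_paths`, and the two eliminated branches are hypothetical, while the surviving chain paraphrases the findings described above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Cause:
    """One node in a logic tree: a possible cause and its sub-causes."""
    label: str
    eliminated: bool = False
    children: List["Cause"] = field(default_factory=list)


def surviving_paths(node, path=()):
    """Return every root-to-leaf chain that has not been eliminated.

    A branch whose children have all been eliminated yields no path.
    """
    if node.eliminated:
        return []
    path = path + (node.label,)
    if not node.children:
        return [path]
    paths = []
    for child in node.children:
        paths.extend(surviving_paths(child, path))
    return paths


# The eliminated branches are hypothetical examples of ruled-out causes.
tree = Cause("Thermocouples fail early", children=[
    Cause("Oven operating conditions changed", eliminated=True),
    Cause("Installation practice changed", eliminated=True),
    Cause("Replacement part differs from original", children=[
        Cause("Thermocouple supplier was changed", children=[
            Cause("Policy: buy MRO items at lowest possible cost"),
        ]),
    ]),
])

for chain in surviving_paths(tree):
    print(" -> because -> ".join(chain))
```

Printing the surviving chain gives a text version of the cause-and-effect sequence that a 5-why diagram would show pictorially.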
Carrying out effective root cause analysis, as other authors have indicated, requires more than a random search. It requires a methodical approach that combines logic with knowledge of process technology. While not every situation will require the actual use of the tools mentioned above, the thinking process behind the tools is required.
The basic steps involved in finding the root cause are:
- Understand the process, including the structure of the system (and subsystems) involved, and how it performs over time.
- Identify all possible sources of errors or variation in the process, and select those sources that require further analysis based on your current understanding of the problem.
- Collect and analyze quantitative and/or qualitative data, and match the findings to the sources identified in the previous step that can actually produce the outcomes observed.
Some testing to confirm that the suspected cause is in fact the contributor may be useful. This can be done by removing the source, replacing it with one believed to not have the same effect, or perhaps trying to make the problem worse by magnifying the contribution of the cause. Of course, issues such as risk to personnel and the system, and costs involved, must be taken into account when preparing to conduct such tests.
Once the root cause has been identified, action must be taken to eliminate it. Again, a combination of creative and analytical thinking is useful. Creative thinking allows coming up with many possible solutions, rather than quickly jumping to one. Analytical thinking allows selection of the best solution, for example by creating a decision matrix that analyzes possible solutions according to factors such as cost, timing, and probability of success.
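A decision matrix of the kind mentioned above can be sketched as a weighted scoring table. Everything specific in this example is assumed for illustration: the function name `rank_solutions`, the candidate solutions, the weights, and the 1-to-5 scores (where a higher score is more favorable, so a cheaper option gets a higher cost score).

```python
def rank_solutions(weights, scores):
    """Return (solution, weighted_total) pairs sorted best first.

    weights: {factor: weight}
    scores:  {solution: {factor: score}}, higher score = more favorable
    """
    totals = {
        solution: sum(weights[factor] * factor_scores[factor] for factor in weights)
        for solution, factor_scores in scores.items()
    }
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical weights: probability of success matters most in this sketch.
weights = {"cost": 3, "timing": 2, "success": 5}

# Hypothetical candidate solutions, scored 1 (poor) to 5 (good) per factor.
scores = {
    "Retrain operators":   {"cost": 4, "timing": 4, "success": 2},
    "Redesign fixture":    {"cost": 2, "timing": 2, "success": 5},
    "Add inspection step": {"cost": 3, "timing": 5, "success": 3},
}

for solution, total in rank_solutions(weights, scores):
    print(f"{total:3d}  {solution}")
```

The ranking depends heavily on the weights chosen, so in practice it is worth checking whether the preferred solution changes when the weights are varied within a reasonable range.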
Implementing the solution requires a combination of project-management and change-management skills. Project management involves identifying what actions should occur, when, and by whom, while change management looks at what potential resistance might be met and how this can be addressed proactively. There's nothing worse than having a great solution that everyone decides they won't adopt.
The final steps involve evaluating the impact of the change to determine whether the problem has in fact been solved. If not, the decisions used to get to this stage of the problem-solving process should be reviewed to determine where an error might have been made. Follow-up on solution effectiveness may be required from both a short-term and long-term perspective, because changed systems often tend to revert to habit. To ensure that the change is standardized, work procedures and training processes should be revised. In some cases selection criteria (for equipment or personnel) may also need to be revised.
In a world of complex systems, where organizations and individuals are willing to tolerate less risk than in the past, improving our abilities to perform root cause analysis will be as important as improving other business processes. To improve root cause analysis, organizations should begin by having a standard model (which itself may be improved over time) that is understood and used by everyone. It's also important, however, to recognize that while the model itself may be linear, problem solving often is not. It is, rather, an iterative process of creative and analytical thinking, combined with knowledge of the technology of the system being improved.
This article was first published in the March 2005 edition of Manufacturing Engineering magazine.