Failure Modes and Effects Analysis – or FMEA for short – is widely used across many industries, often in the design phase of new equipment but also to troubleshoot poorly performing equipment. In this article I will give a detailed overview of FMEAs: where they came from, when to use them and how to conduct one. I’ve also included an easy-to-use FMEA template. This is a long and detailed post, but I will always bring it back to our main focus: how to use FMEAs to improve plant reliability.
What is FMEA?
What is the difference between FMEA and FMECA?
What makes up an FMEA?
Where did FMEA come from?
Who uses FMEA?
FMEA Standards and Guidelines
Different Types of FMEAs
Why do an FMEA?
When to do an FMEA analysis?
What’s a Failure Mode?
What’s a Failure Mechanism?
The Constant Dilemma: Failure Mechanism or Failure Mode
Before You Start – Who You Need For an FMEA
How to Conduct an FMEA Analysis
Problems with the FMEA RPN
A Failure Modes and Effects Analysis (FMEA) is often one of the first steps you would undertake to analyse and improve the reliability of a system or piece of equipment.
During an FMEA you break the selected equipment down into systems, subsystems, assemblies and components and determine how these could fail.
You analyse why the failure would happen and what the consequence would be.
The analysis is completed by assigning preventive or corrective actions to improve reliability.
An FMEA helps you to identify how a piece of equipment might fail. You do this based on experience with similar types of equipment, or in some cases purely on the basis of sound engineering logic.
FMEAs are widely used in the development phase of a product, but they are also used to analyse the failure of existing equipment already in operation. In that case, the FMEA is often used to review and optimise the preventive maintenance program.
An FMECA (Failure Modes, Effects and Criticality Analysis) is an extended FMEA that includes a risk assessment to prioritise the failure modes with the biggest impact.
These failure modes are then reviewed for possible mitigations to reduce the risk.
One method of prioritisation that is often used in FMECA is the Risk Priority Number (RPN). We’ll talk about that later in more detail.
It’s important to realise that an FMEA or FMECA is really an exercise in engineering analysis.
As such, an FMEA must be done in a structured process with the participation of the right subject matter experts. You simply can’t do an FMEA sitting at your desk on your own. Ok, strictly speaking, you could, but it would be a waste of your time.
The main elements of an FMEA are:
And in the case of a Failure Modes, Effects and Criticality Analysis (FMECA) this is expanded to include a risk assessment of the potential failure modes that have been identified:
For the rest of this article, I am going to use FMEA and FMECA interchangeably.
I know that is strictly speaking not correct, but it makes the article so much easier to read.
Before we delve deeper into FMEAs and how to use them, let’s have a quick look at the origin of the FMEA.
Failure Modes and Effects Analysis (FMEA) was developed by the American military in the late 1940s to investigate problems with munitions malfunctioning. As a result of those problems they developed a structured process to eliminate all potential root causes. 1
This was one of the first highly structured, systematic approaches to failure analysis. This first approach was documented in MIL-P-1629 “Procedures for Performing a Failure Mode, Effects and Criticality Analysis” dated November 9, 1949.
The methodology worked well and was later adopted by the nuclear and aerospace industries, including NASA. Apparently, NASA has gone as far as crediting the success of the moon landings to the use of FMEAs.
From there onwards, the often-quoted MIL-STD-1629 was developed by the US Navy.
The FMEA found its way into the private sector, initially through car manufacturer Ford in the 1970s. Failure Modes and Effects Analysis is now widely used in the automotive industry, the energy sector and many more manufacturing industries.
In fact, even white goods manufacturers are nowadays using FMEA analysis during their design process.
And the FMEA has become a cornerstone in most quality management systems.
There are several guidelines and standards that describe how to conduct a failure modes and effects analysis. The standards cover the FMEA process and the FMEA template to use, and provide detailed technical guidance.
Currently, the main standards for FMEAs are:
In the many articles on the internet, both SAE J1739 and MIL-STD-1629A are usually quoted as the main standards for FMEAs. But if you are working in a processing industry like Oil & Gas, Mining, Chemicals, Pharmaceuticals etc., I strongly believe you would be better off using the SAE Aerospace Recommended Practice ARP5580 or the IEC 60812 standard.
There are several approaches to conducting FMEAs. You might hear people talk about a system FMEA, a design FMEA (DFMEA) or a process FMEA (PFMEA).
Each is slightly different and uses different worksheets and templates, but they always come back to the key concept of
If you have read some of my other articles, you’ll know that I like to keep things simple so that we can actually put them to use. When it comes to FMEAs in the world of maintenance and reliability I prefer to think of FMEAs as either:
The functional FMEA is, in my mind, most suitable when you are still early in the lifecycle of your equipment. So when it is still on the drawing board and you may not have a full design, the best approach is to use a functional FMEA.
When you have your equipment already in operation, potentially already for many years, the hardware FMEA can often be the easiest route to go down. You don’t have to worry about writing accurate function statements. Instead you break your equipment down into system, subsystem, components etc. to the level that is useful. And you determine the failure modes from there.
Because hardware FMEAs don’t determine functions they don’t naturally differentiate between design capacity versus what you actually need from your equipment.
And that’s one of the risks associated with hardware FMEAs.
If you don’t accurately determine the function of the equipment and its subsystems you may very well end up preventing failure modes that don’t really matter. Without an accurate function statement you may not be able to accurately assess the impact of a failure on the equipment and the plant as a whole. And you may end up doing preventive maintenance that is simply not worth the effort. We’ve all seen that before.
Quite simply a DFMEA is a Design Failure Modes and Effects Analysis, in other words, an FMEA that is executed during the design phase of a project. It is a tool to design for reliability.
A PFMEA or Process Failure Modes and Effects Analysis is an FMEA that is conducted on a process, i.e. it analyses how a process may fail and what the consequence of that process failure might be, and then identifies potential corrective actions.
Before we delve into the steps of conducting an FMEA and have a look at a FMEA template, let’s be clear about why we would be conducting an FMEA.
As Carl S. Carlson notes in his book Effective FMEAs, 2 the main objective of an FMEA is to improve the design of a system, subsystem, component or even a process.
But in his book, Carlson also points to a number of other reasons why you would want to do an FMEA, for example
Remember the quote “You Can’t Maintain Your Way to Reliability” 3 that we talked about in the 9 Principles of Modern Maintenance?
Well, keeping that principle in mind we can immediately see a use for FMEAs in improving reliability. These would be:
And as we talked about earlier, during the design phase you would want to start with a functional FMEA early in the design process.
Start too late in the design process and you will struggle to influence the design. You’ll end up with a design that has inherent reliability issues, and these defects will either haunt your plant for the rest of its life or you’ll have to spend time and money to remove them once the plant is operational.
It’s much easier and cheaper to prevent these defects from ever occurring by tackling them early in the lifecycle. You do that through a functional FMEA during the design phase.
As you progress during the design you may want to migrate your functional FMEA towards a hardware FMEA to make sure you cover all (important) failure modes.
Many of us have simply inherited the plant we have. And we have inherited all the defects and reliability problems with it. So what to do?
We can use Root Cause Analysis to systematically go after our Bad Actors. We can ensure that when something fails, we fix it and improve it such that it won’t fail again.
But can we get a bit more proactive?
The answer is yes. By using FMEAs on installed equipment that is already operational we can pre-empt failure. We identify the credible failure modes and determine the best method to address them. That could be an optimised PM program or maybe a change to the equipment.
We’ve talked a lot about failure modes, but that’s one of those words in our industry that can cause a lot of confusion and misalignment.
Across industry people mean slightly different things when they talk about a failure mode. Especially when talking about failure modes and failure mechanisms.
So, what is a failure mode? And what is a failure mechanism?
You can google it, read Wikipedia and come up with a host of different definitions. One definition I found for a Failure Mode was “the specific condition causing a functional failure often best described by the condition after failure.” 4
Luckily in my trusted RCM bible by John Moubray I find a more succinct definition of a failure mode:
“Any event that causes a functional failure”
That sounds almost too simple, right?
But it’s not. It’s as simple as it needs to be. Start applying that definition and you’ll soon see the value of its simplicity.
In his book, Moubray expands further on failure modes and concludes they are often best described as a verb + noun statement that describes the physical state of the item. For example, a fractured axle or a deformed axle – both of which are separate failure modes.
An important thing to keep in mind is that the description of the physical state of the item should be as accurate and meaningful as possible.
That means you need to try and avoid verbs like ‘fails’, ‘breaks’ or ‘malfunctions’. They give little or no indication what happened and are not an accurate description of the physical state of the item. As an example, a ‘broken axle’ could be a ‘fractured axle’ or a ‘deformed axle’. As we saw earlier these are two distinct failure modes. And each would require different mitigations.
The Failure Mechanism is then the “defect which is the underlying cause or sequence of causes that lead to a failure mode.”
Or in other words, a failure mechanism is really a failure cause.
A failure mechanism states why the failure mode occurred. A single failure mode may have multiple failure mechanisms (or causes).
An easy way to check if your failure mode and failure cause align and make sense is simply adding the statement “due to” between your failure mode and failure cause (failure mechanism).
As an example, if the failure mode is “bearing seized” and the failure mechanism is “lack of lubrication” the statement becomes “bearing seized due to lack of lubrication”.
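The “due to” check is easy to automate when your failure modes and causes live in a spreadsheet or database. The snippet below is a minimal sketch only: the function name `failure_statement` is made up for illustration, and the second mechanism (“misalignment”) is a hypothetical entry added to show that one failure mode can have several causes.

```python
def failure_statement(failure_mode: str, failure_mechanism: str) -> str:
    """Join a failure mode and its cause with 'due to' so the pair can be read as one sentence."""
    mode = failure_mode.strip()
    cause = failure_mechanism.strip()
    if not mode or not cause:
        raise ValueError("both a failure mode and a failure mechanism are required")
    return f"{mode[0].upper()}{mode[1:]} due to {cause}"


# One failure mode may have several mechanisms, so check each pairing separately.
# "misalignment" is a hypothetical second cause, used for illustration only.
mechanisms = ["lack of lubrication", "misalignment"]
for mechanism in mechanisms:
    print(failure_statement("bearing seized", mechanism))
```

If a generated sentence reads awkwardly, that is usually a sign that the failure mode and the failure cause have been mixed up or are described at different levels of detail.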
Determining to what level of detail you need to go to with your failure modes and failure mechanisms can be a bit of an art.
And that’s because what you would consider a failure cause at a system level could be deemed a failure mode when you go down to subsystem or even component level. In his book RCM II John Moubray gives a good example of this for a pump set, refer to figure 4.7 in Chapter 4 Failure Modes and Effects Analysis (FMEA).
The diagram below summarises this quite nicely: 5
When read together, the failure mode and failure cause statements should however contain enough detail for it to be possible to select an appropriate failure management strategy, but not so much detail that excessive amounts of time are wasted on the analysis process itself. 6
As with many parts of a design process and similar to conducting a Root Cause Analysis, the success of your FMEA depends on who you have involved. It depends on who you have in the room when you’re doing your analysis.
So make sure you pull together a cross-functional team that includes the various engineering disciplines, safety, maintenance, operations but also less obvious disciplines like contracting and procurement.
Just like there are several approaches to FMEAs there are several FMEA templates that support each different approach.
However, the various FMEA templates are very similar. And that makes sense as they all aim to achieve very similar goals.
I’ve created a simple FMEA template that you can download here or by clicking on the banner below.
There is not a single, correct method for conducting an FMEA. And the various standards listed earlier in this article provide good guidance if you can get your hands on one of them.
Below is an outline of how you would go about conducting an FMEA. It is based on the process outlined in IEC 60812 with some simplifications here and there.
Before you start your FMEA you need to make sure you set yourself up for success. You need to map out the various steps of an FMEA and prepare for them. That means you need to:
This is another preparation step, but it’s so critical that I wanted to show it separately from the more general ‘plan and prepare’ step.
This step is all about being very clear on what’s in scope for the FMEA. So before you jump into the depths of your FMEA analysis make sure you reflect on the following and document this ahead of the actual FMEA workshop:
Where are the boundaries of what you’ll analyse?
How deep will you go?
Ideally at this stage you would break down the item you’re planning to analyse into a set of component block diagrams or a set of detailed design drawings.
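The two scoping questions, where the boundaries are and how deep you go, can be sketched as a breakdown tree plus a depth limit. Everything in the example below is an assumption for illustration: the pump set breakdown is hypothetical and `items_at_depth` is not taken from any standard or tool.

```python
from typing import Iterator

# Hypothetical breakdown of a pump set; the names are illustrative only.
BREAKDOWN = {
    "Pump set": {
        "Pump": {"Impeller": {}, "Shaft": {}, "Bearing": {}},
        "Motor": {"Stator": {}, "Rotor": {}},
    }
}


def items_at_depth(tree: dict, max_depth: int, path: tuple = ()) -> Iterator[tuple]:
    """Yield the items to analyse, stopping at max_depth levels
    or at an item that has no further breakdown."""
    for name, children in tree.items():
        current = path + (name,)
        if len(current) == max_depth or not children:
            yield current
        else:
            yield from items_at_depth(children, max_depth, current)


# Analyse down to component level (system > subsystem > component).
for item in items_at_depth(BREAKDOWN, max_depth=3):
    print(" > ".join(item))
```

Calling the same function with `max_depth=2` stops at the pump and the motor, which is one way to make the agreed level of detail explicit before the workshop starts.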
Before delving into the potential failure modes you need to get clarity on the functions, requirements and specifications that apply to the system you are about to analyse.
At the start of the FMEA workshop you should go over this with all participants in the room to make sure everybody is fully aligned on what system is being analysed, to what level of detail, and what the required functions and specifications are.
Now that you are clear on the system you’re analysing and have all the background documentation at hand you should be ready to start identifying potential failure modes.
Using the system, subsystem and component breakdown you prepared earlier, identify all the potential failure modes for each component.
Here, some of the best advice from the most experienced FMEA facilitators is to first work through your FMEA template column by column. That way you can make sure you are happy with the breakdown from system level down to component level and the associated failure modes before you delve into the detailed failure mode analysis and look at failure causes and failure effects. This is especially important when you do a functional FMEA like you do in RCM.
When it comes to determining failure modes, the IEC 60812 standard for FMEAs suggests you consider the following as part of the identification process:
The IEC standard also clearly states that you need to ensure that failure modes are not omitted because of a lack of data. Instead you should keep them in your analysis and document what needs to be done to progress the analysis of these failure modes.
A good practice is to give each of the failure modes in your FMEA a unique code. This failure mode code helps with referencing within the FMEA and it helps with summarizing your analysis.
More importantly you want to bring this failure mode code into your CMMS. This will allow you to track whether this specific failure mode is occurring in your operation. And that allows you to determine the effectiveness of your FMEA and start a continuous improvement loop.
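As a rough sketch of that feedback loop, the example below builds failure mode codes and counts how often each one shows up in a list of work orders. The code format, the equipment tag `P101` and the work order data are all hypothetical; a real CMMS export will look different.

```python
from collections import Counter


def failure_mode_code(system: str, component: str, sequence: int) -> str:
    """Build a code like 'P101-BRG-01' (this format is an assumption, not from any standard)."""
    return f"{system.upper()}-{component.upper()}-{sequence:02d}"


# Failure modes identified in the FMEA, keyed by their unique code.
fmea_codes = {
    failure_mode_code("P101", "BRG", 1): "Bearing seized",
    failure_mode_code("P101", "SHF", 1): "Shaft fractured",
}

# Hypothetical CMMS export: one failure mode code per closed work order.
work_order_codes = ["P101-BRG-01", "P101-BRG-01", "P101-SHF-01", "P101-SEAL-01"]

observed = Counter(work_order_codes)
for code, description in fmea_codes.items():
    print(f"{code} ({description}): {observed[code]} occurrence(s)")

# Codes seen in the field but missing from the FMEA point at gaps in the analysis.
unanalysed = sorted(set(work_order_codes) - set(fmea_codes))
print("Not covered by the FMEA:", unanalysed)
```

The second check is arguably the more valuable one: a failure mode that occurs in the plant but was never analysed is a direct prompt to revisit the FMEA.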
Once you have all failure modes identified you start analysing each failure mode one at a time as shown in the FMEA process map earlier.
A first step in the failure mode analysis is determining the failure effect. A failure effect is the consequence of the failure mode in terms of the operation, function or status of the equipment you’re analysing. A failure effect may be caused by one or more failure modes of one or more components.
In your analysis document the failure effect clearly and comprehensively. Be specific on how the failure mode impacts the operation, function or status of the equipment.
Make sure you consider whether the failure effect is ‘local’, i.e. it only impacts the system or equipment that you’re analysing, or whether the effect has a wider impact.
Does it have a safety impact on the user of the system?
Does it result in an impact on the full plant?
And make sure that you check your logic by re-reading what you’ve documented from failure mode to failure cause and failure effect. Is your narrative clear, coherent, complete and logical? If not, make sure you fix it at this stage.
The failure effect is usually captured in your FMEA template in two columns, one for the ‘Local Effect’ and a separate column for the ‘System Effect’. The use of two columns is recommended as it forces you to evaluate the effect at both levels separately.
Once you have this complete you need to capture the Severity of the failure effect in the FMEA template. The Severity is usually rated on a scale of 1 to 10, where a higher number means a more severe effect. It is this Severity that is used in the calculation of a Risk Priority Number (RPN). We’ll look at the use of the RPN in more detail towards the end of the article.
For each potential failure mode, you need to identify and describe the corresponding failure cause. Because a failure mode may have more than one failure cause, you should try to focus on the dominant failure causes.
When it comes to capturing failure causes it is important that you keep in mind the failure effect of the failure mode. A very severe failure effect may require you to really spend a lot more time on documenting the failure causes than you would for a failure mode with marginal impact.
In practice you start to get some iteration here between the description of the failure mode, the failure cause and the failure effect.
And that’s perfectly ok, just make sure you put your effort where it is justified to do so. Failure modes with limited effects should not be described and analysed in too much detail (to begin with).
Once you’ve landed on the failure cause you need to capture the likelihood, or Occurrence, in the FMEA analysis sheet. Just like the Severity this is usually done on a scale of 1 to 10 (with 1 being very unlikely).
With the failure mode, failure cause and failure effect clearly documented you need to look at what existing controls you have to prevent the failure mode or at least mitigate the effect. This could be built in redundancy or procedural controls like inspecting and testing.
Some FMEA templates will simply have a single column for ‘Controls’. If you follow IEC 60812 you would instead have one column for controls that act as detection methods and a separate column for what are called ‘failure compensating provisions’. These types of controls are design features that prevent or mitigate the failure effect.
Using two columns here would be beneficial as it forces you to formally evaluate both aspects distinctly during the FMEA.
Once you have all detection methods and failure compensating provisions accurately described you would capture what is called the Detection in your FMEA worksheet. Detection is a measure of the likelihood that a failure is detected, using a scale of 1 to 10, with 1 being almost certain and 10 being absolutely uncertain.
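Pulling the columns discussed so far together, one row of an FMEA worksheet could be sketched as a small data structure. This is only an illustration under stated assumptions: the class name `FmeaRow`, the field names and the example pump entry are made up, and your own template may use different columns.

```python
from dataclasses import dataclass, field
from typing import List


def check_rating(name: str, value: int) -> None:
    """Reject ratings outside the 1 to 10 scale used in this article."""
    if not isinstance(value, int) or not 1 <= value <= 10:
        raise ValueError(f"{name} must be a whole number from 1 to 10, got {value!r}")


@dataclass
class FmeaRow:
    item: str
    failure_mode: str
    failure_cause: str
    local_effect: str
    system_effect: str
    severity: int    # higher = more severe
    occurrence: int  # 1 = very unlikely
    detection: int   # 1 = almost certain to be detected, 10 = absolutely uncertain
    detection_methods: List[str] = field(default_factory=list)
    compensating_provisions: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        check_rating("severity", self.severity)
        check_rating("occurrence", self.occurrence)
        check_rating("detection", self.detection)


# Hypothetical example row; the ratings are illustrative, not recommendations.
row = FmeaRow(
    item="Pump bearing",
    failure_mode="Bearing seized",
    failure_cause="Lack of lubrication",
    local_effect="Pump stops",
    system_effect="Loss of cooling water flow",
    severity=7,
    occurrence=4,
    detection=3,
    detection_methods=["Vibration monitoring"],
    compensating_provisions=["Standby pump"],
)
print(row)
```

Validating the ratings when the row is created catches typing errors early, which matters because a single wrong digit changes the risk ranking later on.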
Once you have all this analysed for the failure mode under consideration you will look at the risk associated with this failure mode.
And in an FMECA you would now determine the Risk Priority Number (RPN) for this failure mode as follows:
Risk Priority Number = Severity x Occurrence x Detection
With each factor rated from 1 to 10, the RPN ranges from 1 to 1000, which gives you a quantitative way of assessing the risks in the FMEA analysis. It is important to realise that the RPN is not a continuous scale from 1 to 1000; there are only 120 possible RPN outcomes.
Using the RPN, determine where you need to put your effort and which failure modes will need to be mitigated first.
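The calculation and the resulting ranking can be sketched in a few lines. The failure modes and their ratings below are hypothetical and only serve to show the mechanics.

```python
def rpn(severity: int, occurrence: int, detection: int) -> int:
    """Risk Priority Number = Severity x Occurrence x Detection, each rated 1 to 10."""
    for name, value in (("severity", severity), ("occurrence", occurrence), ("detection", detection)):
        if not 1 <= value <= 10:
            raise ValueError(f"{name} must be between 1 and 10, got {value}")
    return severity * occurrence * detection


# Hypothetical failure modes as (description, severity, occurrence, detection).
failure_modes = [
    ("Bearing seized", 7, 4, 3),
    ("Shaft fractured", 9, 2, 6),
    ("Seal leaking", 4, 6, 2),
]

# Highest RPN first: these are the failure modes to look at first.
ranked = sorted(failure_modes, key=lambda fm: rpn(fm[1], fm[2], fm[3]), reverse=True)
for description, s, o, d in ranked:
    print(f"{description}: RPN = {s} x {o} x {d} = {rpn(s, o, d)}")
```

In this made-up example the shaft fracture ranks first (RPN 108) even though it is the least likely of the three, because its Severity and Detection ratings are high.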
The RPN is intuitive, and many people find it easy to use and helpful in the prioritisation process. But there are definitely some risks associated with the use of the RPN.
Even the IEC 60812 standard highlights the commonly quoted deficiencies with the use of RPN. I won’t go into all of them as some are a bit academic in my view, but the most important problems with the use of RPN are:
Firstly, all three factors are weighted equally in the Risk Priority Number. This means that high severity events with low likelihoods and high detection rates could be overlooked if you took a simplistic view and just prioritised on the RPN score.
In processing industries like Oil & Gas, Petrochemical, Pharmaceutical, Mining etc. these types of events could lead to Process Safety incidents with multiple fatalities. As such, these types of events must be analysed in great detail. Most companies therefore set additional rules around the use of the RPN, requiring additional analysis, mitigation and review of high severity risks.
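One way such an additional rule could look is a severity override: anything at or above a set Severity goes to detailed review regardless of its RPN. The sketch below assumes a Severity threshold of 9 and an RPN threshold of 100; both numbers are placeholders, not values from any standard, and each company sets its own.

```python
SEVERITY_REVIEW_THRESHOLD = 9   # assumed company rule, not from a standard
RPN_REVIEW_THRESHOLD = 100      # assumed company rule, not from a standard


def needs_detailed_review(severity: int, occurrence: int, detection: int) -> bool:
    """Flag a failure mode on high Severity alone, or on a high RPN."""
    rpn = severity * occurrence * detection
    return severity >= SEVERITY_REVIEW_THRESHOLD or rpn >= RPN_REVIEW_THRESHOLD


# High severity, very unlikely, easily detected: RPN is only 20 but it is still flagged.
print(needs_detailed_review(severity=10, occurrence=1, detection=2))  # True
# Moderate on all counts: RPN is 48 and severity is low, so it is not flagged.
print(needs_detailed_review(severity=4, occurrence=6, detection=2))   # False
```

A purely numeric ranking would have placed the first failure mode well below the second, which is exactly the trap described above.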
Secondly, there are several issues around the scale of the Risk Priority Number. As I already mentioned it is not a continuous scale from 1 to 1000; there are only 120 possible outcomes. These outcomes can be very sensitive to small changes: an increase in the Detection rating from 3 to 4 has a much bigger impact when Severity and Occurrence are high than when they are low.
S x O x D = 9 x 9 x 3 = 243
S x O x D = 9 x 9 x 4 = 324
S x O x D = 3 x 4 x 3 = 36
S x O x D = 3 x 4 x 4 = 48
The other issue associated with the scale is that it is not linear, which means that the difference in the RPN may appear negligible when in fact it really isn’t.
S x O x D = 6 x 4 x 2 = 48
S x O x D = 6 x 5 x 2 = 60
The RPN here has only gone up by 25%, when in fact the Occurrence rating has changed from 4 to 5, which in many Occurrence scales used in industry means the event is twice as likely.
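Both scale effects are easy to reproduce. The helper below is a small sketch (the name `rpn_change` is made up) that reports the absolute and the relative change in RPN when one rating moves by a single step; the last case holds Detection at 2 while Occurrence moves from 4 to 5.

```python
def rpn_change(before: tuple, after: tuple) -> tuple:
    """Return (absolute change, percentage change) in RPN between two (S, O, D) ratings."""
    old = before[0] * before[1] * before[2]
    new = after[0] * after[1] * after[2]
    return new - old, (new - old) / old * 100


cases = [
    ((9, 9, 3), (9, 9, 4)),  # Detection 3 -> 4 with high Severity and Occurrence
    ((3, 4, 3), (3, 4, 4)),  # Detection 3 -> 4 with low Severity and Occurrence
    ((6, 4, 2), (6, 5, 2)),  # Occurrence 4 -> 5, Detection held at 2
]
for before, after in cases:
    absolute, relative = rpn_change(before, after)
    print(f"{before} -> {after}: RPN +{absolute} ({relative:.1f}%)")
```

The same one-step change in Detection adds 81 points in the first case and only 12 in the second, even though both are an increase of a third. How the change looks depends entirely on whether you read the RPN in absolute or relative terms, which is another reason not to rely on the number alone.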
So… what does this all mean?
In simple words – be careful with the use of the Risk Priority Number. Use it, but use it wisely and never ever allow someone to take a purely numeric approach to the Risk Priority Number.
The Math Behind The RPN
With the RPN you multiply Severity, Occurrence and Detection, and each has a range between 1 and 10, which means the maximum value of the RPN is 10 x 10 x 10 = 1000.
But that does not mean there are 1000 different possible RPN values. In fact, mathematically there are only 120 different possible outcomes.
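When you calculate the RPN only the total value matters, so for example S x O x D = 4 x 3 x 5 = 60, but so is S x O x D = 5 x 4 x 3 = 60. In other words the order does not matter. Different sets of ratings can also land on the same number (2 x 5 x 6 is 60 as well), and many numbers can’t be made at all (anything divisible by 11, for example).
Count the distinct products you can make from three ratings between 1 and 10 and you end up with just 120 values.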
Now an equally interesting aspect is that the number of combinations is not spread linearly over the scale from 1 to 1000. In fact, 86% of the possible RPN outcomes are below 360! 7
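The figure of 120 can be checked by brute force, as in the sketch below. It enumerates every rating combination, collapses them to distinct RPN values, and offers a helper (the name is made up) to see how those values are spread over the scale. No result is quoted here for the helper, because the answer depends on exactly how you count the outcomes.

```python
from itertools import combinations_with_replacement, product

RATINGS = range(1, 11)  # each of S, O and D is rated 1 to 10

# Every ordered combination of three ratings: 10 x 10 x 10 = 1000 of them.
all_products = [s * o * d for s, o, d in product(RATINGS, repeat=3)]

# Ignoring the order of the three ratings leaves 220 sets of ratings.
unordered_triples = sum(1 for _ in combinations_with_replacement(RATINGS, 3))

# Many of those sets give the same product, leaving 120 distinct RPN values.
distinct_rpns = sorted(set(all_products))


def share_of_distinct_values_below(threshold: int) -> float:
    """Fraction of the distinct RPN values that are below the threshold."""
    below = sum(1 for value in distinct_rpns if value < threshold)
    return below / len(distinct_rpns)


print(len(all_products), unordered_triples, len(distinct_rpns))
print(f"{share_of_distinct_values_below(360):.0%} of the distinct RPN values are below 360")
```

Running the helper with different thresholds shows how unevenly the possible values are distributed, which is worth keeping in mind before you set a single RPN cut-off for action.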
Once you have your prioritised list of failure modes complete with failure causes and failure effects, you need to determine the actions you need to take to reduce the risk profile.
These actions could be a redesign of certain aspects, adding built-in test functionality, or adding inspection or testing procedures to your maintenance regime.
As always with actions, it is important that they are clear, are assigned to a specific owner and are given a due date. You then need to follow up to make sure the actions are indeed closed out.
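A simple action register is enough to support that follow-up. The sketch below is an illustration only: the `Action` fields, the failure mode codes, the owners and the dates are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class Action:
    failure_mode_code: str
    description: str
    owner: str
    due: date
    closed: bool = False


def overdue(actions: List[Action], today: date) -> List[Action]:
    """Return the open actions whose due date has passed."""
    return [action for action in actions if not action.closed and action.due < today]


# Hypothetical actions coming out of an FMEA workshop.
actions = [
    Action("P101-BRG-01", "Add bearing lubrication task to PM route", "Maintenance planner", date(2024, 3, 1)),
    Action("P101-SHF-01", "Review shaft material selection", "Mechanical engineer", date(2024, 6, 1), closed=True),
    Action("P101-BRG-01", "Install vibration monitoring", "Reliability engineer", date(2024, 9, 1)),
]

for action in overdue(actions, today=date(2024, 7, 1)):
    print(f"OVERDUE: {action.description} (owner: {action.owner}, due {action.due})")
```

Linking each action to a failure mode code keeps the trail from analysis to implementation visible, so you can see which risks are still waiting on an open action.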
Once you have closed out an action you would typically show the mitigated Severity, Occurrence and Detection rates in the FMEA complete with the mitigated Risk Priority Number.
As you progress through the FMEA process you need to document it. Don’t wait until everything is done. Instead:
Probably the most commonly used tool for conducting FMEAs would be… Microsoft Excel. To be honest it’s what I have used myself for most of my career and in many cases it is all you need. But it’s important to realise that there are more advanced, dedicated FMEA software solutions out there. And depending on your needs it may be a better choice to go with dedicated FMEA software than to simply plough ahead with a spreadsheet.
Below is an overview of some of the most promising FMEA software packages on the market. Some FMEA software does just that, manage FMEAs, but there are also more advanced solutions that let you manage the full maintenance strategy, including implementation in your CMMS and/or more advanced reliability analyses.
To be clear: I have not used these tools myself so please do your own due diligence before you decide to go ahead with FMEA software instead of a spreadsheet.
In the end, it is the analysis – the collaborative thinking process – that drives the value. Not the software tool.
ASENT FMECA Manager – Raytheon’s premiere Reliability and Maintainability tool suite. Includes a very powerful FMECA tool that combines FMECA, RCM Analysis, and Testability Analysis.
Byteworx – Powerful, cost-effective software for Failure Mode and Effects Analysis. The global choice of the Ford Motor Company, Byteworx FMEA is fully compliant with SAE J-1739 Third Edition.
FMEA-Pro – FMEA / FMECA software from Sphera. An all-in-one software solution that provides corporate consistency and assists with corporate compliance.
Quality Plus – FMEA software from Harpco Systems, Inc. Performs both Design and Process FMEAs.
Reliability Workbench – the Isograph workhorse Reliability Workbench is capable of a large amount of reliability analyses and also contains an FMEA / FMECA tool.
Windchill Risk and Reliability – The Windchill suite by Crimson Quality offers an FMEA (failure mode and effects analysis) and FMECA (including criticality) tool.
XFMEA – FMEA software from ReliaSoft. Provides expert support for all types of Failure Mode and Effects Analysis (FMEA).
Relyence Software – FMEA/FMECA software, available with hosting either in the cloud or installed on your hardware. Supports standard and custom design & process FMEAs, and Mil FMECAs.