FaultAnalysisHowTo - henk52/knowledgesharing GitHub Wiki
FaultAnalysisHowTo
References
- FaultAnalysisTemplate
- sfw02: System Fault Analysis Workshop ST-350, Student guide 2002 Revision E. Sun Microsystems, Inc.
Fault Analysis
See sfw02.1-4
- Accept the customer statement
- Stating the problem clearly: Extract the problem statement from the customer statement.
- Listing facts
- Identify information sources
- Collect relevant error messages
- Identify recent or relevant changes on the system
- Perform controlled comparisons
- Analyze the comparison results
- Determine the magnitude of the problem.
- Fault diagnosis
- List the likely causes
- Prioritize the list of likely causes
- Test the highest-priority cause
- Iterate through the list
- Take a corrective action
Value adding process
- Do not change observed facts to support a hypothesis.
- Clearly identify what is fact and what is hypothesis.
- This is especially important when troubleshooting in a team: make sure everyone knows whether you are stating a fact or a hypothesis. Otherwise someone might take your hypothesis for a fact and not investigate it.
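One way to keep fact and hypothesis apart in shared notes is to tag every statement explicitly. The following is only a minimal sketch; the names `Kind`, `Statement`, and `still_to_verify`, as well as the example notes, are assumptions made for illustration and do not come from sfw02.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    FACT = "fact"
    HYPOTHESIS = "hypothesis"


@dataclass(frozen=True)
class Statement:
    text: str
    kind: Kind
    source: str  # who or what the statement comes from

    def __str__(self) -> str:
        # Put the tag first so nobody can miss it when reading the notes.
        return f"[{self.kind.value.upper()}] {self.text} (source: {self.source})"


def still_to_verify(notes: list[Statement]) -> list[Statement]:
    """Return the statements that are hypotheses and still need investigation."""
    return [note for note in notes if note.kind is Kind.HYPOTHESIS]


# Hypothetical notes from a team troubleshooting session.
notes = [
    Statement("Login fails for all users", Kind.FACT, "observed on console"),
    Statement("The disk is full", Kind.HYPOTHESIS, "guess, not yet checked"),
]
for note in notes:
    print(note)
```

Anything returned by `still_to_verify(notes)` has not been confirmed yet and should not be treated as settled by the rest of the team.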
Detailed Fault Analysis
ProblemStatement
Do not assume the cause of the fault (sfw02.1-5).
State the problem clearly: extract the technical facts from the customer's description.
Also describe what was expected.
Describe the problem in words that an outsider would be able to understand. Don't use jargon.
ListingFacts
IdentifyInformationSources
- FirstObservedBy - Who first observed the fault?
- LocationOfFault - Where is the fault located? Which machine(s)/system(s)?
- FirstObservedDate - When was the fault first observed?
- OtherResources - What other resources might be available:
- manuals
- http://au.sunsolve.sun.com/handbook_pub
- solution provider - e.g. http://www.sunsolve.sun.com
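The information sources listed above can be captured in a small record so that nothing is forgotten while collecting facts. This is a sketch only: the class name `InformationSources`, the snake_case field names, the `missing_fields` helper, and all example values are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class InformationSources:
    """Fields mirror FirstObservedBy, LocationOfFault, FirstObservedDate, OtherResources."""
    first_observed_by: str
    location_of_fault: list[str]
    first_observed_date: date
    other_resources: list[str] = field(default_factory=list)

    def missing_fields(self) -> list[str]:
        """Return the names of mandatory fields that are still empty."""
        missing = []
        if not self.first_observed_by.strip():
            missing.append("first_observed_by")
        if not self.location_of_fault:
            missing.append("location_of_fault")
        return missing


# Hypothetical example values.
sources = InformationSources(
    first_observed_by="night-shift operator",
    location_of_fault=["dbserver01"],
    first_observed_date=date(2002, 3, 14),
    other_resources=["hardware handbook", "solution provider site"],
)
print(sources.missing_fields())
```

An empty result from `missing_fields()` means the mandatory entries are filled in; it says nothing about whether they are correct.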
CollectRelevantErrorMessages
Collect the error messages from (sfw02.1-7):
- Screen
- Log files, e.g. /var/adm/messages
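Collecting messages from a log file can be partly automated. The sketch below filters lines by keyword; the keyword pattern, the function name `collect_error_lines`, and the sample lines are all assumptions made for illustration (the sample lines are invented, not real system output).

```python
import re
from typing import Iterable

# Assumed keywords; adjust them to the messages your system actually emits.
ERROR_PATTERN = re.compile(r"\b(error|warning|panic|fail(ed|ure)?)\b", re.IGNORECASE)


def collect_error_lines(lines: Iterable[str]) -> list[str]:
    """Return the lines that contain one of the assumed error keywords."""
    return [line.rstrip("\n") for line in lines if ERROR_PATTERN.search(line)]


# Invented sample lines used only to demonstrate the filter.
sample_lines = [
    "Mar 14 02:11:07 host1 app: ERROR: write failed\n",
    "Mar 14 02:11:09 host1 app: connection accepted\n",
    "Mar 14 02:12:30 host1 kernel: WARNING: timeout on disk0\n",
]
for hit in collect_error_lines(sample_lines):
    print(hit)
```

To scan a real log, pass an open file instead of the sample list, for example `with open("/var/adm/messages") as log: collect_error_lines(log)`. A keyword filter can miss relevant messages, so treat the result as a starting point and still read the surrounding lines.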
SymptombsAndConditions
- What are the symptoms, and under what conditions do they occur?
RecentChanges
- What has changed recently? List every change, even those that may not seem related to the problem at hand (sfw02.1-8).
ControlledComparison
Compare the faulty system with a similar system that does not display the same faults (sfw02.1-8).
Make the following observations:
- HW/SW
- Hardware
- Operating Environment revisions
- OS version, etc.
- Patch level
- Application versions
- Does the system that is known to work display the same symptoms and conditions as the faulty system?
- What events occurred in the environment that might have contributed to the fault?
For inspiration on gathering the information, see: http://sunsolve.sun.com/search/document.do?assetkey=1-9-82368-1
AnalyzingComparisonResults
Analyze the results generated by ControlledComparison to identify the causes of the system faults (sfw02.1-8).
Guidelines:
- Focus on one set of comparisons at a time.
- State facts only.
- Identify the differences between the faulty system and the non-faulty system.
- Analyze the comparative facts about the systems for any similarities.
TIP: While the similarities between systems do not directly identify the source of the fault, they can help eliminate most of the potential sources. This makes it easier to isolate the possible sources of the system fault (sfw02.1-8).
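The comparison and its analysis can be sketched as a function that splits the observations into differences and similarities. This is an illustrative sketch only: the function name `compare_systems`, the observation keys, and all values are assumptions, not data from a real system.

```python
from typing import Optional

Facts = dict[str, str]


def compare_systems(faulty: Facts, working: Facts):
    """Split observations into differences and similarities between two systems."""
    differences: dict[str, tuple[Optional[str], Optional[str]]] = {}
    similarities: Facts = {}
    for key in sorted(set(faulty) | set(working)):
        faulty_value, working_value = faulty.get(key), working.get(key)
        if faulty_value == working_value:
            similarities[key] = faulty_value
        else:
            # A value of None means the observation is missing for that system.
            differences[key] = (faulty_value, working_value)
    return differences, similarities


# Hypothetical observations for a faulty system and a known-good system.
faulty = {"hardware": "model A", "os_version": "5.8",
          "patch_level": "rev 12", "application": "app 2.1"}
working = {"hardware": "model A", "os_version": "5.8",
           "patch_level": "rev 15", "application": "app 2.1"}

differences, similarities = compare_systems(faulty, working)
print("differences:", differences)
print("similarities:", sorted(similarities))
```

In this made-up example the only difference is the patch level, so hardware, OS version, and application version can be set aside as unlikely sources.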
SizeAndMagnitude
- Identify the size or magnitude of the fault.
- What is the cost of the fault?
- This makes it possible to determine whether multiple systems are involved, which narrows the focus in a large networked environment (sfw02.1-9).
DiagnosisPhase
LikelyCauses
- List the likely causes (sfw02.1-11), then prioritize them.
- Formulate a hypothesis
- A hypothesis states the most probable cause of the fault and is based on the data collected in the analysis phase (sfw02.1-12).
TestVerification
- Choose the test methodology (sfw02.1-13)
- Factual
- Testing is based on past experience and on information gathered in the ListingFacts section. This results in isolating the most probable cause.
- Realistic
- The probable cause must pass an experiment that determines conclusively whether it is the actual cause.
- Result-oriented
- Testing relies on previous experiences.
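The diagnosis phase (list the likely causes, prioritize them, test the highest-priority cause, iterate) can be sketched as a simple loop. This is a sketch under stated assumptions: the names `LikelyCause` and `diagnose`, the numeric `likelihood` score, and the example causes with their stand-in tests are all hypothetical and not taken from sfw02.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class LikelyCause:
    description: str
    likelihood: float          # subjective estimate between 0 and 1 (assumption)
    test: Callable[[], bool]   # returns True if the test confirms this cause


def diagnose(causes: list[LikelyCause]) -> Optional[LikelyCause]:
    """Test causes from most to least likely; return the first confirmed one."""
    for cause in sorted(causes, key=lambda c: c.likelihood, reverse=True):
        if cause.test():
            return cause
    return None  # nothing confirmed: go back and collect more facts


# Hypothetical causes with stand-in tests that simply return a fixed result.
causes = [
    LikelyCause("Loose cable", 0.2, lambda: False),
    LikelyCause("Recently applied patch", 0.6, lambda: False),
    LikelyCause("File system full", 0.4, lambda: True),
]
confirmed = diagnose(causes)
print(confirmed.description if confirmed else "no cause confirmed")
```

In practice each `test` would be a real experiment that conclusively confirms or rules out the cause, and the corrective action is taken only after a cause has been confirmed.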