FaultAnalysisHowTo - henk52/knowledgesharing GitHub Wiki

FaultAnalysisHowTo

References

  • FaultAnalysisTemplate
  • sfw02: System Fault Analysis Workshop ST-350, Student guide 2002 Revision E. Sun Microsystems, Inc.

Fault Analysis

See sfw02.1-4

  1. Accept the customer statement.
  2. State the problem clearly: extract the problem statement from the customer statement.
  3. List the facts
    • Identify information sources
    • Collect relevant error messages
    • Identify recent or relevant changes on the system
    • Perform controlled comparisons
    • Analyze the comparison results
    • Determine the magnitude of the problem
  4. Diagnose the fault
    • List the likely causes
    • Prioritize the list of likely causes
    • Test the highest-priority cause
    • Iterate through the list
  5. Take corrective action

Value adding process

  • Do not change observed facts to support a hypothesis.
  • Clearly identify what is fact and what is hypothesis.
    • This is especially important when troubleshooting in a team: make sure everyone knows whether you are stating a fact or a hypothesis. Otherwise someone might take a hypothesis for a fact and not investigate it.
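One way to keep facts and hypotheses apart in shared troubleshooting notes is to tag every statement explicitly. The sketch below is only an illustration: the names `Kind`, `Statement` and `unverified`, and the example notes, are made up for this page and do not come from sfw02.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    FACT = "fact"              # observed and verifiable
    HYPOTHESIS = "hypothesis"  # suspected, not yet verified


@dataclass(frozen=True)
class Statement:
    kind: Kind
    text: str
    source: str = ""  # where a fact was observed, e.g. a log file


def unverified(statements):
    """Return the statements the team still has to investigate."""
    return [s for s in statements if s.kind is Kind.HYPOTHESIS]


# Hypothetical notes, for illustration only.
notes = [
    Statement(Kind.FACT, "Service stopped answering at 10:42", "monitoring log"),
    Statement(Kind.HYPOTHESIS, "The disk is full"),
]
for s in notes:
    print(f"[{s.kind.value.upper()}] {s.text}")
```

Printing the tag in front of every note makes it hard to mistake a guess for an observation, and `unverified(notes)` gives the list of items that still need checking.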

Detailed Fault Analysis

ProblemStatement

Do not assume the cause of the fault (sfw02.1-5).

To state the problem clearly, extract the technical facts from the customer's description.

Also describe what was expected.

Describe the problem in words that an outsider would be able to understand. Don't use jargon.

ListingFacts

IdentifyInformationSources

  • FirstObservedBy - Who first observed the fault?
  • LocationOfFault - Where is the fault located? Which machine(s)/system(s)?
  • FirstObservedDate - When was the fault first observed?
  • OtherResources - What other resources might be available?
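The four items above can be kept as a small record so that it is obvious which questions are still unanswered. This is a minimal sketch; the class `InformationSources`, the helper `missing_fields` and the example values are assumptions made for illustration.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class InformationSources:
    first_observed_by: Optional[str] = None
    location_of_fault: List[str] = field(default_factory=list)
    first_observed_date: Optional[str] = None
    other_resources: List[str] = field(default_factory=list)


def missing_fields(record):
    """Names of the fields that are still empty (None or an empty list)."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


# Hypothetical values, for illustration only.
record = InformationSources(first_observed_by="operator on duty",
                            location_of_fault=["hostA"])
print(missing_fields(record))
```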

CollectRelevantErrorMessages

Collect the error messages from (sfw02.1-7):

  • Screen
  • Log files, e.g. /var/adm/messages
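Collecting messages from a log file can be partly automated. The sketch below filters lines by keyword; the keyword list in `PATTERN` and the sample log lines are assumptions, so adjust them to what the system in question actually logs.

```python
import re

# Assumed keywords; adjust to what the system actually logs.
PATTERN = re.compile(r"\b(error|warning|panic|fail(ed|ure)?)\b", re.IGNORECASE)


def collect_error_lines(lines):
    """Return (line_number, text) for every line that matches PATTERN."""
    return [(number, line.rstrip("\n"))
            for number, line in enumerate(lines, start=1)
            if PATTERN.search(line)]


def collect_from_file(path):
    """Scan one log file, e.g. /var/adm/messages."""
    with open(path, errors="replace") as handle:
        return collect_error_lines(handle)


# Made-up sample lines, for illustration only.
sample = [
    "Jan  1 10:00:00 hostA demo: service started\n",
    "Jan  1 10:00:05 hostA demo: WARNING: retrying request\n",
    "Jan  1 10:00:09 hostA demo: write failed\n",
]
for number, text in collect_error_lines(sample):
    print(number, text)
```

Keeping the line number next to each message makes it easy to go back to the log and read the surrounding context.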

SymptomsAndConditions

  • What are the symptoms, and under what conditions do they occur?

RecentChanges

  • What has changed recently? List every change, even if it does not seem related to the problem at hand (sfw02.1-8).

ControlledComparison

Compare the faulty system with a similar system that does not display the same faults (sfw02.1-8).

Make the following observations:

  • HW/SW
    • Hardware
  • Operating Environment revisions
    • OS version, etc.
    • Patch level
    • Application versions
  • Does the system that is known to work display the same symptoms and conditions as the faulty system?
  • What events occurred in the environment that might have contributed to the fault?

For inspiration on gathering the information, see: http://sunsolve.sun.com/search/document.do?assetkey=1-9-82368-1
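Some of the observations listed above can be gathered with a short script that is run on both the faulty and the working system. The sketch below uses only Python's standard `platform` module; the function name `gather_facts` and the chosen keys are assumptions, and patch level and application versions are not covered because how to query them depends on the platform.

```python
import platform


def gather_facts():
    """Collect a few hardware and OS facts about the local system."""
    info = platform.uname()
    return {
        "hostname": info.node,
        "os": info.system,
        "os_release": info.release,
        "os_version": info.version,
        "hardware": info.machine,
        "python": platform.python_version(),
    }


for key, value in sorted(gather_facts().items()):
    print(f"{key}: {value}")
```

Running the same script on both systems gives two dictionaries with identical keys, which makes the later comparison mechanical.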

AnalyzingComparisonResults

Analyze the results generated by ControlledComparison to identify the causes of the system faults (sfw02.1-8).

Guidelines:

  • Focus on one set of comparisons at a time.
  • State facts only.
  • Identify the differences between the faulty system and the non-faulty system.
  • Analyze the comparative facts about the systems for any similarities.

TIP: While the similarities between systems do not directly identify the source of the fault, they can help eliminate most of the potential sources. Therefore, you can easily isolate the possible sources of the system fault (sfw02.1-8).
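Splitting the collected facts into differences and similarities can be sketched as follows. The function `compare_facts`, the `MISSING` marker and the example values are assumptions for illustration; the facts are assumed to be plain dictionaries, one per system.

```python
MISSING = "<missing>"  # marker for a fact recorded on only one system


def compare_facts(faulty, working):
    """Split the facts into (differences, similarities)."""
    differences, similarities = {}, {}
    for key in sorted(set(faulty) | set(working)):
        a = faulty.get(key, MISSING)
        b = working.get(key, MISSING)
        if a == b:
            similarities[key] = a    # same on both: helps eliminate causes
        else:
            differences[key] = (a, b)  # (faulty value, working value)
    return differences, similarities


# Hypothetical facts, for illustration only.
faulty = {"os_release": "5.9", "patch_level": "A", "app_version": "2.1"}
working = {"os_release": "5.9", "patch_level": "B", "app_version": "2.1"}

differences, similarities = compare_facts(faulty, working)
print("differences:", differences)
print("similarities:", similarities)
```

In this made-up example only `patch_level` differs, so the OS release and the application version can be set aside and the patch level becomes a candidate cause.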

SizeAndMagnitude

  • Identify the size or magnitude of the fault.
  • What is the cost of the fault?
  • This makes it possible to determine whether multiple systems are involved, which narrows the focus in a large networked environment (sfw02.1-9).

DiagnosisPhase

LikelyCauses

  • List the likely causes (sfw02.1-11), then prioritize them.
  • Formulate a hypothesis
    • A hypothesis states the most probable cause of the fault and is based on the data collected in the analysis phase (sfw02.1-12).

TestVerification

  • Choose the test methodology (sfw02.1-13)
  • Factual
    • Testing is based on past experience and on information gathered in the ListingFacts section. This results in isolating the most probable cause.
  • Realistic
    • The probable cause must pass an experiment that determines conclusively whether it is the actual cause.
  • Result-oriented
    • Testing relies on previous experiences.
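The diagnosis steps (list the likely causes, prioritize them, test the highest-priority cause, iterate through the list) can be sketched as a simple loop. This is an illustration only: the `Cause` class, the `diagnose` function and the example causes with their hard-coded test results are assumptions, and a real test would be an experiment on the system rather than a lambda.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Cause:
    description: str
    priority: int             # lower number = test first
    test: Callable[[], bool]  # returns True if the cause is confirmed


def diagnose(causes: List[Cause]) -> Optional[Cause]:
    """Test causes in priority order; return the first confirmed one."""
    for cause in sorted(causes, key=lambda c: c.priority):
        if cause.test():
            return cause
    return None  # nothing confirmed: go back and collect more facts


# Hypothetical causes with hard-coded test results, for illustration only.
causes = [
    Cause("cable unplugged", 2, lambda: True),
    Cause("disk full", 1, lambda: False),
    Cause("bad patch", 3, lambda: True),
]
found = diagnose(causes)
print(found.description if found else "no cause confirmed")
```

The loop stops at the first confirmed cause, so lower-priority causes are only tested when the more likely ones have been ruled out.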

CorrectiveAction

FinalRepair

ProbableCause