12 Empirical Studies in Software Testing - skylerto/Software-Testing GitHub Wiki
Empirical Studies in Software Testing
Empirical Studies
The word “empirical” describes information gained by experience, observation, or experiment. The central theme of the scientific method is that all evidence must be empirical, meaning that it is gained through observation or experiment. In the scientific method, the word "empirical" refers to the use of a working hypothesis that can be tested using observation and experiment.
Empirical research can be defined as "research based on experimentation or observation (evidence)". Such research is conducted to test a hypothesis.
Empirical studies (use of experience, observation) have become important for software engineering research.
Empirical Software Engineering
Empirical software engineering is a field of research that emphasizes the use of empirical studies of all kinds to accumulate knowledge.
Test theories
Evaluate new processes and tools
Approaches
Survey: interviews or questionnaires
Controlled Experiment: in the laboratory, involves manipulation of variables
Case Study: observational, often in-situ
Empirical Study Approaches - Surveys
Pose questions via interviews or questionnaires
Process: select variables and choose sample, frame questions that relate to variables, collect data, analyze and generalize from data
Code Coverage Criteria
Statement coverage is achieved when all statements in a method have been executed at least once
Decision coverage is computed by considering both branch and individual condition coverage measures
Branch coverage is achieved when every branch from a node is executed at least once
Condition coverage reports the true or false outcome of each condition.
Modified condition/decision coverage extends branch and decision coverage with the requirement that each condition should affect the decision outcome independently
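The differences between these criteria are easiest to see on a tiny example. The following is a minimal sketch, not taken from the study: the function `classify`, its decision `a and b`, and the three checker functions are all hypothetical names introduced here for illustration. It assumes a decision with exactly two conditions and ignores short-circuit evaluation.

```python
def decision(a: bool, b: bool) -> bool:
    """The decision under test, made of two conditions: a and b."""
    return a and b


def classify(a: bool, b: bool) -> str:
    """Toy method: a single `if` with no `else`."""
    result = "reject"
    if decision(a, b):
        result = "accept"
    return result


def branch_coverage(tests) -> bool:
    """Both outcomes (True and False) of the decision are exercised."""
    return {decision(a, b) for a, b in tests} == {True, False}


def condition_coverage(tests) -> bool:
    """Each condition is observed as both True and False."""
    return all({t[i] for t in tests} == {True, False} for i in range(2))


def mcdc(tests) -> bool:
    """Each condition flips the decision while the other one is held fixed."""
    def independent(i):
        return any(
            t1[i] != t2[i] and t1[1 - i] == t2[1 - i]
            and decision(*t1) != decision(*t2)
            for t1 in tests for t2 in tests
        )
    return independent(0) and independent(1)


# Executes every statement of classify, but never takes the False branch.
one_test = [(True, True)]
# Covers both branches and both outcomes of each condition, but not MC/DC.
two_tests = [(True, True), (False, False)]
# Adds the pairs that vary one condition at a time.
three_tests = [(True, True), (False, True), (True, False)]
```

Under these assumptions, `one_test` reaches full statement coverage of `classify` without branch coverage, `two_tests` reaches branch and condition coverage without MC/DC, and `three_tests` satisfies all of the checks.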
A recap on statistical correlation
Correlation coefficients are used to describe relationships among quantitative variables.
The sign (+ or -) indicates the direction of the relationship (positive or inverse), and the magnitude indicates the strength of the relationship (ranging from 0 for no relationship to 1 for a perfectly predictable relationship). The exact cut-off values vary from book to book; one set is:
No correlation: (-0.1, 0.1)
Weak correlation: (0.1, 0.3), (-0.3, -0.1)
Moderate correlation: (0.3, 0.5), (-0.5, -0.3)
Strong correlation: (0.5, 1), (-1, -0.5)
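As a rough illustration of how a coefficient is computed and then mapped onto these ranges, here is a minimal sketch. It uses Pearson's r as one possible coefficient (an assumption; the text above does not say which coefficient is meant), and the sample values are made-up numbers, not results from the study. Boundary values such as exactly 0.3 are assigned to the stronger category, which is also an assumption.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def strength(r):
    """Map a coefficient onto the ranges listed above, ignoring its sign."""
    magnitude = abs(r)
    if magnitude < 0.1:
        return "none"
    if magnitude < 0.3:
        return "weak"
    if magnitude < 0.5:
        return "moderate"
    return "strong"


# Hypothetical data: coverage and effectiveness of four test suites.
coverage = [0.2, 0.4, 0.6, 0.8]
effectiveness = [0.1, 0.3, 0.2, 0.5]

r = pearson_r(coverage, effectiveness)
print(f"r = {r:.2f} ({strength(r)} correlation)")
```

The sign of `r` gives the direction and `strength(r)` gives the category, so a value of -0.4 would be reported as a moderate inverse relationship.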
Study Design
Select a set of (Java) programs to study
Make test suites
Measure test suite coverage
Measure suite effectiveness using mutation testing, treating the result as representative of fault detection effectiveness
Subject programs
Five open source Java programs
Apache POI: API for manipulating Microsoft documents
Closure: JavaScript optimizing compiler
HSQLDB: relational database management system
JFreeChart: library for producing charts
Joda Time: open source replacement for Java Date and Time classes
Generating test suites
Identify all the test methods in a program
Generate new test suites of fixed size by randomly selecting a subset of these methods without replacement
We run these test suites and measure the code coverage using the CodeCover tool
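The suite-generation step can be sketched as follows. This is a minimal illustration, not the study's actual tooling: the function `generate_suites`, its parameters, and the test method names are hypothetical, and the fixed seed is only there to make the sketch reproducible.

```python
import random


def generate_suites(test_methods, suite_size, num_suites, seed=0):
    """Build `num_suites` suites, each a random subset of `suite_size`
    test methods drawn without replacement from `test_methods`."""
    rng = random.Random(seed)
    return [rng.sample(test_methods, suite_size) for _ in range(num_suites)]


# Hypothetical test methods identified in a program.
all_tests = ["testAdd", "testRemove", "testClear", "testSize", "testIterator"]

suites = generate_suites(all_tests, suite_size=3, num_suites=4)
for suite in suites:
    print(suite)
```

Sampling without replacement applies within a suite, so no suite contains the same test method twice, while different suites may share methods. Each generated suite would then be run under a coverage tool.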
Mutation Testing
Faults are introduced into the program by creating many versions of the program called mutants
Each mutant contains a single fault
Test cases are applied to the original program and to the mutant program
The goal is to cause the mutant program to fail, thus demonstrating the effectiveness of the test suite
Mutation testing is used to generate faulty programs in this study
The mutation testing tool is PIT
Mutation Testing Algorithm
Generate program test cases
Run each test case against the original program
If the output is incorrect, the program must be modified and retested
If the output is correct go to the next step ...
Construct mutants using a mutation testing tool
Execute each test case against each alive mutant
If the output of the mutant differs from the output of the original program, the mutant is considered incorrect and is killed: "Good test cases kill the mutants"
Once we find a test case that kills a mutant, we can forget the mutant and keep the test case. The mutant is dead
Two kinds of mutants survive:
Functionally equivalent to the original program: Cannot be killed
Killable: Test cases are insufficient to kill the mutant. New test cases must be created.
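The algorithm above can be sketched in a few lines. This is a toy illustration under stated assumptions: the mutants are written by hand (a tool such as PIT would generate them), the original program is assumed to be correct so that its output serves as the expected result, and all names (`original`, `mutants`, `run_mutation_testing`) are hypothetical.

```python
def original(a, b):
    """Program under test, assumed to be correct."""
    return a + b


# Each mutant contains a single change to the original program.
mutants = {
    "plus_to_minus": lambda a, b: a - b,
    "plus_to_times": lambda a, b: a * b,
    "swap_operands": lambda a, b: b + a,  # equivalent: behaves like the original
}


def run_mutation_testing(original, mutants, test_inputs):
    """Run every test input against every mutant that is still alive.
    Returns (killers, alive): which input killed each dead mutant,
    and the mutants that survived all inputs."""
    alive = dict(mutants)
    killers = {}
    for inputs in test_inputs:
        expected = original(*inputs)
        for name in list(alive):
            if alive[name](*inputs) != expected:
                killers[name] = inputs  # keep the test case, forget the mutant
                del alive[name]
    return killers, alive


# A weak suite: 2 + 2 and 2 * 2 give the same output.
weak_killers, weak_alive = run_mutation_testing(original, mutants, [(2, 2)])
# Adding one more test case kills the remaining killable mutant.
killers, alive = run_mutation_testing(original, mutants, [(2, 2), (1, 2)])
```

With the weak suite, `plus_to_times` survives because the test cases are insufficient, so it is killable. With the extended suite, only `swap_operands` survives, and no test case can kill it because it is functionally equivalent to the original.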
Mutation Coverage Criteria
Mutation Coverage (MC)
For each mutant m, the test requirements (TR) contain a requirement to “kill m”
Mutation score is the percentage of mutants killed
The mutation score for a set of test cases is the percentage of non-equivalent mutants killed by the test data
Mutation score = 100 * D / (N - E), where:
D: Number of dead (killed) mutants
N: Total number of mutants
E: Number of equivalent mutants
A set of test cases is mutation adequate if its mutation score is 100%.
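As a worked example of the formula, the sketch below computes the score for hypothetical counts (the numbers are invented for illustration). The function name `mutation_score` and the error raised when every mutant is equivalent are assumptions of this sketch.

```python
def mutation_score(dead, total, equivalent):
    """Percentage of non-equivalent mutants killed: 100 * D / (N - E)."""
    killable = total - equivalent
    if killable <= 0:
        raise ValueError("no non-equivalent mutants to score against")
    return 100 * dead / killable


# Hypothetical run: N = 50 mutants, E = 5 equivalent, D = 36 killed.
score = mutation_score(dead=36, total=50, equivalent=5)
print(f"Mutation score: {score:.1f}%")  # 100 * 36 / 45 = 80.0

# The suite is mutation adequate only if every non-equivalent mutant is killed.
is_adequate = score == 100
```

Excluding the equivalent mutants from the denominator matters: dividing by all 50 mutants would give 72%, understating the suite's effectiveness, since the 5 equivalent mutants cannot be killed by any test.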
Findings
There is a low to moderate correlation between code coverage metrics and test suite effectiveness
If your code coverage is low, the likelihood of exposing faults is low
Hence, code coverage is useful to identify under-tested parts of a program
However, stronger coverage criteria do not provide greater insight into the effectiveness of the test suites
Hence, code coverage should not be used as a quality target because it is not a good indicator of test suite effectiveness
Potential Threats
What about other programs not written in Java?
What about other coverage metrics (e.g., data flow or concurrency coverage)?
The study assumes that any mutant not killed by the master suite (the original test suite) is an equivalent mutant
This overestimates the number of equivalent mutants
Scale to large programs
Are faults seeded in mutation testing representative of real faults?