[GSoC 2017] Metrics Framework - pmd/pmd GitHub Wiki

Metrics Framework Project (GSoC 2017)

Clément Fournier

GSoC 2017 Final Work Product

Timeline

First coding period (June 1st - June 30th)

End-of-month goal: Making a functional visitor object which collects relevant statistics and the data structure to store them
Deliverables: Tested visitor object + DS
Details:
- Week 1: Finals until the 10th
- Weeks 2 - 3: Implementation of the visitor and data structure
- Week 4 (evaluation): Testing and brushing up

Second coding period (July 1st - July 28th)

End-of-month goal: Setting up the façade of the framework and some metrics. Refactor GodClassRule as proof of concept
Deliverables: Partial but mostly functional framework + refactored GodClassRule
Details:
- Week 1: Implementation of the bare bones of the façade and some metrics (ATFD, WMC, TCC)
- Weeks 2 - 3: Testing and fleshing out of the façade and metrics infrastructure + implementation of the rule tagging system and memoization infrastructure
- Week 4: Refactoring of GodClassRule
- Week 5 (evaluation): Security buffer. If not needed, this will be used to document existing code and add more tests

Final coding period (July 29th - August 29th)

End-of-month goal: Abstracting functionality into pmd-core. Implementing all metrics. Demonstrating the use of the framework with a new rule. Complete optional goals if possible.
Deliverables: Full documented framework + a new rule using the framework
Details:
- Week 1: Abstraction into pmd-core + implementation of the caching system
- Week 2: Writing necessary metrics for the new rule + document metrics creation process
- Week 3: Writing additional rule + document the usage of the framework in rules
- Week 4: Write additional metrics + brush up on everything (documentation + testing)
- Week 5 (evaluation): Security buffer

The project

Context

PMD's Java module already includes the GodClassRule which uses OO metrics (ATFD, WMC, TCC) to detect God Classes. It also includes rules such as CyclomaticComplexityRule which calculates the cyclomatic complexity of methods and reports too high values. However, two main problems impede the development of other metrics-based antipattern detection strategies in PMD:

PMD's rules operate only on single compilation units. As such, they cannot get an overview of the analysed system, which is necessary for the computation of some relational metrics.
The computation of the metrics is currently handled inside the rules. Metric computations therefore cannot be shared between rules, which would induce unnecessary overhead if many rules conditioned violations on the same metrics. It also clutters the rules' code with metric calculation.

This project tries to address these problems through the creation of a dedicated Metrics Framework.

Goals

Main goals

Create a unified and documented framework for the computation of OO metrics in pmd-java.
Abstract the functionality of that framework as much as possible into pmd-core to ease the implementation of similar frameworks for other object-oriented languages.
Use that framework to refactor GodClassRule as a proof of concept. Violation thresholds will be entered by the user as rule properties.
Create at least one additional rule that uses the framework, for example to detect Feature Envy.

Optional goal

Provide custom commands to access metrics in XPath rules.

Impact for PMD

The project will provide programmers who write rules with a straightforward interface to delegate the computation of various OO metrics. The simplicity of the interface is demonstrated on this page.
A whole family of rules could be implemented, tackling such antipatterns as Refused Parent Bequest, Feature Envy, or Tradition Breaker. This would give PMD an edge over similar software (e.g. Checkstyle, inFusion).
Performance constraints on rules that use metrics will be minimized as metrics won’t need to be recomputed several times.
Rules that use metrics would become much clearer, since the actual computation of the metrics would no longer clutter them.

Proposed approach

AST visitor

A visitor object will traverse the AST of the project before rules are applied, and after type resolution has done so itself. The visitor will gather statistics about the entities of the project and their relations, and store them in a dedicated data structure. To keep the stored information as reusable and non-redundant as possible, the visitor will mainly gather what I call method and field signatures. These signatures will comprise the following information:

For methods: the visibility (public, package private, protected, private), the role (constructor, getter or setter, static operation, method), and whether the method is abstract (a boolean);
For fields: the visibility, whether it is static and whether it is final.

These signatures can be used to count methods with e.g. a specific role or visibility using signature filters. Other information (e.g. number of lines of code, number of variables used) can be inferred from the AST without needing to be stored in the data holder, as the computation of a metric on an entity can still make use of the AST node of the entity. Many metrics will mix AST exploration with signature matching, e.g. to find the number of distinct methods called in the measured method (CINT).
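The signatures and signature filters described above could be sketched roughly as follows. All type and member names (Visibility, Role, MethodSignature, FieldSignature, MethodSigFilter) are hypothetical and chosen for illustration only; they are not taken from PMD's code base.

```java
import java.util.Arrays;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

// Hypothetical names throughout: this is a sketch, not PMD's API.
enum Visibility { PUBLIC, PACKAGE, PROTECTED, PRIVATE }

enum Role { CONSTRUCTOR, GETTER_OR_SETTER, STATIC, METHOD }

/** Signature of a method: visibility, role, and whether it is abstract. */
final class MethodSignature {
    final Visibility visibility;
    final Role role;
    final boolean isAbstract;

    MethodSignature(Visibility visibility, Role role, boolean isAbstract) {
        this.visibility = visibility;
        this.role = role;
        this.isAbstract = isAbstract;
    }
}

/** Signature of a field: visibility, whether it is static, whether it is final. */
final class FieldSignature {
    final Visibility visibility;
    final boolean isStatic;
    final boolean isFinal;

    FieldSignature(Visibility visibility, boolean isStatic, boolean isFinal) {
        this.visibility = visibility;
        this.isStatic = isStatic;
        this.isFinal = isFinal;
    }
}

/** Accepts method signatures whose visibility and role are both in the allowed sets. */
final class MethodSigFilter {
    private final Set<Visibility> visibilities;
    private final Set<Role> roles;

    MethodSigFilter(EnumSet<Visibility> visibilities, EnumSet<Role> roles) {
        this.visibilities = EnumSet.copyOf(visibilities);
        this.roles = EnumSet.copyOf(roles);
    }

    boolean covers(MethodSignature sig) {
        return visibilities.contains(sig.visibility) && roles.contains(sig.role);
    }

    /** Counts the signatures covered by this filter. */
    long count(List<MethodSignature> signatures) {
        return signatures.stream().filter(this::covers).count();
    }
}

final class SignatureDemo {
    public static void main(String[] args) {
        List<MethodSignature> sigs = Arrays.asList(
            new MethodSignature(Visibility.PUBLIC, Role.GETTER_OR_SETTER, false),
            new MethodSignature(Visibility.PUBLIC, Role.METHOD, false),
            new MethodSignature(Visibility.PRIVATE, Role.METHOD, false));
        MethodSigFilter publicAccessors = new MethodSigFilter(
            EnumSet.of(Visibility.PUBLIC), EnumSet.of(Role.GETTER_OR_SETTER));
        System.out.println("Public accessors: " + publicAccessors.count(sigs));
    }
}
```

A metric that counts, say, public accessors of a class would then only need such a filter and the list of signatures gathered by the visitor, without re-exploring the AST.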

The rationale is that the goal of the AST exploration is to give the AST nodes being measured some insight into entities defined outside their compilation unit. After studying the metrics and the type of information they may use, I gather that the main information they need about those entities is what’s contained in those signatures, which means there’s no need to store much more.

The data structure itself will reflect the package structure of the analysed project and can be queried from within metrics to look for specific signatures, or overridden definitions for example.
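One possible shape for such a structure is a tree of package nodes, each holding its subpackages and classes. The sketch below is only an assumption about how this might look: PackageStats, ClassStats, and their methods are hypothetical names, signatures are reduced to plain strings to keep the example short, and the synchronization discussed later is left out.

```java
import java.util.HashMap;
import java.util.Map;

/** Statistics about one class (hypothetical): its declared methods and their signatures. */
final class ClassStats {
    // Key: method name with parameter types, e.g. "size()". Value: a signature summary.
    private final Map<String, String> methods = new HashMap<>();

    void addMethod(String nameAndParams, String signature) {
        methods.put(nameAndParams, signature);
    }

    boolean declares(String nameAndParams) {
        return methods.containsKey(nameAndParams);
    }

    String signatureOf(String nameAndParams) {
        return methods.get(nameAndParams);
    }
}

/** One node of the package tree (hypothetical). The root represents the default package. */
final class PackageStats {
    private final Map<String, PackageStats> subPackages = new HashMap<>();
    private final Map<String, ClassStats> classes = new HashMap<>();

    /**
     * Finds the stats of a class from its qualified name, e.g. "org.foo.Bar".
     * Missing nodes are created only if create is true; otherwise null is returned.
     */
    ClassStats classStats(String qualifiedName, boolean create) {
        String[] parts = qualifiedName.split("\\.");
        PackageStats current = this;
        // Walk down the package segments (all parts except the last one).
        for (int i = 0; i < parts.length - 1; i++) {
            PackageStats next = current.subPackages.get(parts[i]);
            if (next == null) {
                if (!create) {
                    return null;
                }
                next = new PackageStats();
                current.subPackages.put(parts[i], next);
            }
            current = next;
        }
        String simpleName = parts[parts.length - 1];
        ClassStats stats = current.classes.get(simpleName);
        if (stats == null && create) {
            stats = new ClassStats();
            current.classes.put(simpleName, stats);
        }
        return stats;
    }
}

final class PackageStatsDemo {
    public static void main(String[] args) {
        PackageStats root = new PackageStats();
        // The visitor would register signatures while traversing each file.
        root.classStats("org.foo.Bar", true).addMethod("size()", "public method");
        // A metric can later ask whether a class outside its compilation unit declares a method.
        System.out.println(root.classStats("org.foo.Bar", false).declares("size()"));
    }
}
```

Looking for an overridden definition would then amount to querying the ClassStats of each supertype for the same method key.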

Interface with rules

The rules can request a specific metric to be computed from the statistics gathered at the previous stage. Some metrics will be fully computed beforehand (CC and CM) because they require a second pass on the AST to find usages. Others could be computed beforehand because other metrics depend on them. This will need to be sorted out later. Regardless of how and when a metric is calculated, the interface with rules is uniform.

The computed results will be provided through a façade class. A priori, this façade class and the statistics holder object are language independent and could be implemented directly in pmd-core, then extended by language specific implementations.
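To illustrate what a uniform interface through a façade might look like from a rule's point of view, here is a minimal sketch. Every name in it (MethodNode, MethodMetric, MethodMetricKey, MetricsFacade) is an assumption made for this example, MethodNode is a stand-in for a real AST node, and the two metrics are deliberately trivial.

```java
/** Stand-in for an AST method node (hypothetical); a real node would expose the subtree. */
final class MethodNode {
    final String name;
    final int decisionPoints;
    final int linesOfCode;

    MethodNode(String name, int decisionPoints, int linesOfCode) {
        this.name = name;
        this.decisionPoints = decisionPoints;
        this.linesOfCode = linesOfCode;
    }
}

/** A metric that can be computed on a method node. */
interface MethodMetric {
    double computeFor(MethodNode node);
}

/** Keys that rules use to name the metric they want (hypothetical set). */
enum MethodMetricKey {
    // Cyclomatic complexity: number of decision points plus one.
    CYCLO(node -> 1 + node.decisionPoints),
    // Lines of code of the method.
    LOC(node -> node.linesOfCode);

    private final MethodMetric calculator;

    MethodMetricKey(MethodMetric calculator) {
        this.calculator = calculator;
    }

    MethodMetric calculator() {
        return calculator;
    }
}

/** Single entry point for rules: they never see how or when a metric is computed. */
final class MetricsFacade {
    private MetricsFacade() { }

    static double get(MethodMetricKey key, MethodNode node) {
        return key.calculator().computeFor(node);
    }
}

final class FacadeDemo {
    public static void main(String[] args) {
        MethodNode node = new MethodNode("parse", 11, 80);
        double threshold = 10; // would be a rule property in practice
        if (MetricsFacade.get(MethodMetricKey.CYCLO, node) > threshold) {
            System.out.println("Violation: " + node.name + " is too complex");
        }
    }
}
```

Because the rule only depends on the key and the façade, precomputation or caching could be introduced behind `get` without touching any rule.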

Performance

Most computations are delayed until requested by a rule, to save unnecessary computations.
Computed results will be memoized to speed up computation should another rule need the same metric.
Moreover, rules that use this framework will be marked, so that the expensive AST exploration will only be done in the case at least one rule needs it in the currently used ruleset.
Care will be taken to make the implementation of the framework thread-safe, to allow for several files to be processed in parallel. In particular, the data structure and memoization infrastructure (which could be contained in the data structure itself) will be synchronized.
The framework will comply with PMD's new caching system. This will allow incremental computation of metrics. Some metrics, however, will need to be recomputed each time, as they do not depend on the definition of the measured method but on its usages. But this could be made incremental as well, i.e., not all files will be reprocessed, but only those which have changed.
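The combination of lazy computation, memoization, and thread safety listed above could be sketched with a concurrent map, as below. This is one possible approach, not the design of the framework: MetricMemoizer and its methods are hypothetical names, and the string key built from the metric name and the qualified entity name is an assumption made for brevity.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.DoubleSupplier;

/** Memoizes metric results so that each (metric, entity) pair is computed at most once. */
final class MetricMemoizer {
    private final ConcurrentMap<String, Double> cache = new ConcurrentHashMap<>();
    // Only here to make the effect of memoization observable in the demo.
    private final AtomicInteger computations = new AtomicInteger();

    /**
     * Returns the cached value if present; otherwise runs the computation,
     * stores its result, and returns it. Nothing is computed until a rule asks.
     */
    double get(String metricName, String qualifiedEntityName, DoubleSupplier computation) {
        String key = metricName + '@' + qualifiedEntityName;
        // computeIfAbsent is atomic on ConcurrentHashMap: the function runs at most once per key.
        return cache.computeIfAbsent(key, k -> {
            computations.incrementAndGet();
            return computation.getAsDouble();
        });
    }

    int computationCount() {
        return computations.get();
    }
}

final class MemoizerDemo {
    public static void main(String[] args) {
        MetricMemoizer memoizer = new MetricMemoizer();
        // Two rules requesting the same metric on the same class.
        double first = memoizer.get("WMC", "org.foo.Bar", () -> 12.0);
        double second = memoizer.get("WMC", "org.foo.Bar", () -> 12.0);
        System.out.println(first + " " + second
            + " computed " + memoizer.computationCount() + " time(s)");
    }
}
```

One limitation of this sketch: the function passed to `computeIfAbsent` must not update other entries of the same map, so a metric that requests another metric through the same memoizer would need a different scheme (for example, computing the dependency before entering the map).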

Git workflow

The project will be developed in a separate development branch (in my own fork). At least during July and August, my work will be merged into the main repository as often as good sense commands. The final deliverables of each month will also be merged at the end of the month.

Limitation: dependency with Type Resolution

Type Resolution plays an important role in the determination of most collaboration-related metrics, because these metrics require the ability to determine where the methods called in measured methods are defined. The only way to make that possible is to enable Type Resolution to resolve the types of method invocations and attribute accesses, which is one of the goals of the Complete Type Resolution for Java project, proposed by Bendegúz Nagy. Some metrics could be written without Type Resolution, but would find a simpler, faster, and more accurate implementation were they to use it.

Due to this dependency, which could not be foreseen, the scope of my project is conditioned on the completion of Bendegúz's project:

Should his project and my own be accepted, my project's scope would include all metrics for certain, and we are ready to adapt our timelines to make that work;
Should only my project be accepted, my project will not include collaboration-related metrics. That would prevent the creation of rules to detect Intensive Coupling, Dispersed Coupling, and Shotgun Surgery until someone fixes Type Resolution. Note that all other metrics could be implemented. Moreover, I'd be ready to work on improving Type Resolution before the start of the coding period, but cannot make promises about my success on that front, as it seems very complex and I've not yet done deep research on this subject.