Project Ideas [Mature] - pmd/pmd GitHub Wiki

This document provides brief descriptions of candidate projects for undergrad students through programs such as Google Summer of Code, or ITBA's final project.

The goal is to clearly define, for each project, its objective, the tasks identified so far, and the expected output.

Please note: This page shows ideas that have matured enough and are ready to be taken on. Inception stage ideas can be found at Project Ideas [Inception] and on the more general Roadmap page.

Older ideas that have already been taken can be found at Taken Ideas Archive & Status.

Helpful info:


Current Status:

We are in the process of maturing the idea list. This is a draft based on ideas carried over from previous years.

Please note that we expect to have only a limited number of seats available for GSoC. We have only two mentors, which limits us to accepting at most two students for the GSoC program.

Table of Contents:

  1. Easy installation packages
  2. Enhanced PMD Regression Tester
  3. Ruleset Builder / Configuration Editor
  4. Ruleset Schema Migration Tool
  5. New Rules
  6. Data Flow Analysis and Control Flow Graph

Easy installation packages

Rationale

PMD could be packaged in various ways. It is currently distributed as a zip archive, which contains PMD including all dependencies and all languages.

Many operating systems have their own package formats and package managers. The goal would be to produce packages that are easier to install, such as a Debian .deb archive or a Windows installer executable (e.g. via the Nullsoft Scriptable Install System), or to provide recipes for Chocolatey, Homebrew, Fink, or Snappy.

SDKman

Alternatively or additionally, we could distribute PMD via SDKman. Using SDKman has these advantages/disadvantages:

  • (+) convenient
  • (+) cross-platform
  • (+) some developers may already be using it
  • (+) no long review to get into approved repos (e.g. Ubuntu)
  • (+) open source
  • (+) supports keeping several parallel versions
  • (-) not OS-native (e.g. .deb)
  • (-) less easy for true newbies
  • (-) no incremental updates

pmd packager tool

Another approach is to provide our own "pmd packager tool". Not every user needs all languages, so there could be tailored packages that include only a few of them. One arrangement could be a package that includes only the languages with full PMD support (i.e. languages that have rules and can analyze source code), in addition to a package that includes all languages (CPD + PMD).

  • Biggest modules: Apex, Scala
  • Example CLI:
pmdp add-module apex
pmdp add-module "https://pmd-contributions.org/languages/php"
pmdp update

Related (old) feature requests:

Existing packages:

Impact

Users get easier access to PMD. Keeping up to date with the latest versions also becomes simpler, even when not using Maven or Gradle.

Expected outcome

Official packages and formulas are made available for at least one (additional) system. Keeping them up to date and published is streamlined in our Maven build process. The deb package would be hosted on Launchpad for at least Debian and Ubuntu; other packages could be hosted for download in our SourceForge space or on Bintray.

Desirable skills

Basic Shell Scripting, Intermediate Maven usage, Multi-OS

What the student will learn

Package management, build automation.

Proposed mentors

Ruleset Builder / Configuration Editor

Rationale

The recommended way of using PMD is to create your own custom rulesets, see Best Practices. However, you currently need to look up rules in the rule index manually (e.g. for Java) and write the ruleset XML file by hand in a text editor.

With a GUI tool, it would become much simpler to create a custom ruleset: Browsing the available rules, reading the description and examples, picking the rule and setting the rule properties as needed.

The PMD Eclipse Plugin has a UI that comes close to such a tool (see Workspace Preferences - Global Rule Management), but it is not very intuitive to use. It can serve as inspiration.

The ruleset builder can also help in validating and migrating old rulesets, e.g. migrating from PMD 5.x to PMD 6, where we moved the rules into categories.

Related work / features:

Impact

Easier usage of PMD as it is intended to be used: with custom rulesets. This lowers the barrier to start using PMD.

Expected outcome

A GUI tool in the form of a standalone web application. The tool lets users browse the built-in rules (the rules that are shipped with PMD itself). The selected rules can be configured via their properties and saved in a new ruleset file that can be used immediately to run PMD.

A stretch goal would be to support custom rules as well. This will require some additional configuration and potentially classloading.

There is one additional challenge to think about: with PMD 7 we are planning to rework the ruleset schema, see Schema Version 7.0.0. The solution should therefore be built on a data model that is independent of the actual schema in use.
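A schema-independent data model could look roughly like the following sketch. All class, record, and method names here are hypothetical, and the property override in `main` is purely illustrative. The model only knows about languages, categories, rules, and properties; the PMD 6 ruleset syntax lives entirely in one writer method, so a writer for a later schema could be added without touching the model. The sketch assumes Java 16+ (records) and omits the `xsi:schemaLocation` attributes for brevity.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical schema-independent model of a rule selection (Java 16+ for records). */
final class RulesetModel {

    /** One selected rule plus the property values the user overrode. */
    record RuleSelection(String language, String category, String ruleName,
                         Map<String, String> properties) { }

    /** A named collection of selected rules. */
    record Ruleset(String name, String description, List<RuleSelection> rules) { }

    /** Escapes characters that are special in XML text and attribute values. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    /** One possible writer: renders the model as a PMD 6 style ruleset. */
    static String toPmd6Xml(Ruleset rs) {
        StringBuilder sb = new StringBuilder("<?xml version=\"1.0\"?>\n");
        sb.append("<ruleset name=\"").append(escape(rs.name()))
          .append("\" xmlns=\"http://pmd.sourceforge.net/ruleset/2.0.0\">\n");
        sb.append("  <description>").append(escape(rs.description()))
          .append("</description>\n");
        for (RuleSelection r : rs.rules()) {
            // PMD 6 references built-in rules as category/<lang>/<category>.xml/<Rule>
            String ref = "category/" + r.language() + "/" + r.category()
                    + ".xml/" + r.ruleName();
            sb.append("  <rule ref=\"").append(escape(ref)).append("\"");
            if (r.properties().isEmpty()) {
                sb.append("/>\n");
                continue;
            }
            sb.append(">\n    <properties>\n");
            r.properties().forEach((k, v) -> sb.append("      <property name=\"")
                    .append(escape(k)).append("\" value=\"").append(escape(v))
                    .append("\"/>\n"));
            sb.append("    </properties>\n  </rule>\n");
        }
        return sb.append("</ruleset>\n").toString();
    }

    public static void main(String[] args) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("methodReportLevel", "15"); // illustrative property override
        Ruleset rs = new Ruleset("demo", "Rules picked in the editor", List.of(
                new RuleSelection("java", "bestpractices", "UnusedPrivateField", Map.of()),
                new RuleSelection("java", "design", "CyclomaticComplexity", props)));
        System.out.print(toPmd6Xml(rs));
    }
}
```

Keeping the writer separate from the records is the point of the sketch: the editor UI would only ever manipulate `Ruleset` and `RuleSelection` values.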

Desirable skills

XML, JavaScript, maybe Java (depending on the web framework).

What the student will learn

GUI Design, Web application

Proposed mentors

Ruleset Schema Migration Tool

Rationale

For major versions of PMD, we might make changes to the ruleset schema that are not entirely backwards compatible. A migration tool would make it easier to use a new PMD version with an old ruleset.

There is already a migration tool for converting pre-6 rulesets to 6.x rulesets: https://github.com/asarkar/pmd-migration-tool

The changes from pre-6 to 6 were mostly reorganizing the rules from rulesets into categories. The changes from 6 to 7 will be a bit more drastic, see Schema Version 7.0.0. While the changes are more drastic, the conversion probably can be solved through "simple" transformations.

The tool could later be integrated into the Ruleset Configuration Editor (see the project above), allowing users to load or import an "old" ruleset.

Impact

Faster adoption of new PMD versions. Simplified code in PMD, since it only needs to deal with one version and there is no compatibility layer included.

Expected outcome

A tool (not necessarily command-line) that takes an existing ruleset as input and recreates the same ruleset in the new format. Any problems, such as missing rules, should be reported.

Additionally, a library that can be used by the Ruleset Configuration Editor for reading old rulesets.
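The core of such a "simple" transformation could be sketched as a lookup from old rule references to new ones, with every unmapped reference collected as a reported problem. The class and method names below are hypothetical, and the mapping table is only an illustrative excerpt for the pre-6 to 6.x move into categories; a real tool would need a complete, generated table and would operate on the parsed XML rather than on plain strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: migrates rule references and reports what it cannot map. */
final class RuleRefMigrator {

    /** Illustrative excerpt of an old-to-new mapping; not a complete table. */
    static final Map<String, String> MOVED = Map.of(
            "rulesets/java/unusedcode.xml/UnusedPrivateField",
            "category/java/bestpractices.xml/UnusedPrivateField",
            "rulesets/java/empty.xml/EmptyCatchBlock",
            "category/java/errorprone.xml/EmptyCatchBlock");

    /** Outcome of a migration: converted references plus human-readable problems. */
    record Result(List<String> migrated, List<String> problems) { }

    static Result migrate(List<String> oldRefs) {
        List<String> migrated = new ArrayList<>();
        List<String> problems = new ArrayList<>();
        for (String ref : oldRefs) {
            String target = MOVED.get(ref);
            if (target != null) {
                migrated.add(target);            // known move
            } else if (ref.startsWith("category/")) {
                migrated.add(ref);               // already in the new layout
            } else {
                problems.add("No known replacement for " + ref);
            }
        }
        return new Result(migrated, problems);
    }

    public static void main(String[] args) {
        Result r = migrate(List.of(
                "rulesets/java/unusedcode.xml/UnusedPrivateField",
                "rulesets/java/legacy.xml/SomeRemovedRule")); // hypothetical reference
        r.migrated().forEach(ref -> System.out.println("<rule ref=\"" + ref + "\"/>"));
        r.problems().forEach(System.err::println);
    }
}
```

Returning the problems alongside the migrated references, instead of failing on the first unknown rule, lets both a command-line front end and the editor present a full report.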

Desirable skills

XML, maybe XSLT

What the student will learn

Data transformation, integration of different tools, compatibility

Proposed mentors

New Rules

Rationale

PMD ships with a lot of rules for Java already, but there is one category that needs improvement: Security.

The project is to implement new rules that find security flaws in Java applications. Well-known security problems are catalogued by OWASP, the Open Web Application Security Project, which publishes the Top 10 list of web application security risks (e.g. SQL injection). New rules can be selected based on this list. OWASP also provides a lot of test cases in its Benchmark Project.

Another project has already started this effort: https://github.com/GDSSecurity/GDS-PMD-Security-Rules (based on an old version of PMD, however).

And there is an updated version of the same here: https://github.com/albfernandez/GDS-PMD-Security-Rules

Unfortunately the license is not completely clear: It's either GPL-2.0 or RPL-1.5.

General ideas for security rules include, for example, checking the (in)correct usage of the Java cryptography APIs (hard-coded passwords, usage of weak algorithms, ...).

Other rule ideas are already tracked in our issue tracker: https://github.com/pmd/pmd/issues?utf8=%E2%9C%93&q=is%3Aissue+is%3Aopen+label%3Aa%3Anew-rule

Rules that were not referenced by any ruleset (and thus have not been used) have been removed with PMD 7, because they were not fully implemented and had no tests. Here is a list of these rules; if any of them seem useful, we might think about reimplementing them:

  • GenericLiteralCheckerRule: rule that matches a (string) literal against a configurable regex. Could be used to implement AvoidUsingHardCodedIP.
  • StringConcatenationRule: Not sure, maybe it was intended to look for patterns like "a" + "b" + "c" + "d" in combination with (for) loops.
  • UselessAssignment: A dataflow-based rule. Detects whether a variable is assigned but not used, e.g. because another value is assigned afterwards. It seems to cover the same cases that the rule DataflowAnomalyAnalysis can detect: DD (definition definition) and DU (definition undefinition).
  • GenericClassCounterRule: A generic rule that can be configured to "count" classes of a certain type based on their name (full name, prefix, or suffix; anything can be matched with a regex) and/or their type.
  • PositionalIteratorRule: Finds while-loops that loop over an iterator and advance the iterator via next more than once, e.g. while (it.hasNext()) { Object o1 = it.next(); Object o2 = it.next(); }.
  • CodeInCommentsRule: This could be implemented with a few regexes that are applied to all comments. The regexes could look for assignments (.*=.*) and method calls (.*\(.*\)).
  • HeaderCommentsRule: Restrictions regarding the legal placement and content of the file header. Could be used to verify/enforce a common license header.
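As a rough illustration of the GenericLiteralCheckerRule idea, the following standalone sketch matches string literals against a configurable regex and reports the ones that match. It deliberately does not use the PMD rule API: the `Literal` record stands in for whatever literal nodes a real rule would visit, and all names as well as the IPv4 pattern are assumptions made for this example (Java 16+ for records).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sketch of a configurable literal checker (not the PMD rule API). */
final class LiteralChecker {

    /** Stand-in for a string literal found in source code. */
    record Literal(int line, String value) { }

    /** Simplistic IPv4 pattern: four dot-separated groups of 1-3 digits. */
    static final String IPV4 = "\\b(?:\\d{1,3}\\.){3}\\d{1,3}\\b";

    private final Pattern pattern;

    LiteralChecker(String regex) {
        this.pattern = Pattern.compile(regex);
    }

    /** Returns the literals that contain a match for the configured regex. */
    List<Literal> check(List<Literal> literals) {
        List<Literal> violations = new ArrayList<>();
        for (Literal lit : literals) {
            if (pattern.matcher(lit.value()).find()) {
                violations.add(lit);
            }
        }
        return violations;
    }

    public static void main(String[] args) {
        LiteralChecker checker = new LiteralChecker(IPV4);
        List<Literal> found = checker.check(List.of(
                new Literal(3, "http://10.0.0.5:8080/api"),
                new Literal(7, "version 1.2.3"),
                new Literal(9, "localhost")));
        for (Literal lit : found) {
            System.out.println("line " + lit.line() + ": hard-coded IP in \"" + lit.value() + "\"");
        }
    }
}
```

With the regex as a rule property, the same class could back an AvoidUsingHardCodedIP-style check or any other literal convention, which is what made the original rule "generic".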

Other ideas:

Impact

PMD can be used to identify security issues.

Expected outcome

New rules including unit tests. If necessary, improvements to PMD itself, in order to implement the rules.

Desirable skills

Java, XPath

What the student will learn

Analyzing source code, understanding security issues

Proposed mentors

Data Flow Analysis and Control Flow Graph

Rationale

Data-flow analysis (DFA) is a useful technique to reason about the static structure of a program. Compilers rely on it to emit and optimise code. In static analysis tools like PMD, it can be used, for example, to:

  • Spot useless assignments:
int a = 2; // the value "2" is useless, it's replaced in every branch of the if.
if (..) {
    a = 3;
} else {
    a = 4;
}
  • Spot common subexpressions:
(a + b) - (a + b)/4 // (a+b) will never change value during the evaluation of the expression, it could be moved somewhere else
  • Spot loop-invariant code, and suggest to move it outside the loop
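The first case can be sketched with a classic backward liveness analysis: an assignment is useless if the assigned variable is not live immediately after it. The code below is a minimal, self-contained illustration on a hand-built graph, not PMD's existing DFA utilities; the `Node` class and its fields are hypothetical, each node defines at most one variable, and a `use(a)` statement after the `if` is added as an assumption so that the two branch assignments are actually read.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch: finds useless assignments via backward liveness analysis. */
final class DeadStoreSketch {

    /** One statement: defines at most one variable and reads some variables. */
    static final class Node {
        final String label;
        final String def;                       // variable written here, or null
        final Set<String> uses;                 // variables read here
        final List<Node> succs = new ArrayList<>();
        final Set<String> liveOut = new HashSet<>();

        Node(String label, String def, String... uses) {
            this.label = label;
            this.def = def;
            this.uses = new HashSet<>(Arrays.asList(uses));
        }
    }

    /** live-in(n) = uses(n) + (live-out(n) - def(n)) */
    static Set<String> liveIn(Node n) {
        Set<String> in = new HashSet<>(n.liveOut);
        if (n.def != null) {
            in.remove(n.def);
        }
        in.addAll(n.uses);
        return in;
    }

    /** Iterates to a fixpoint, then reports definitions that are dead on exit. */
    static List<String> uselessAssignments(List<Node> nodes) {
        boolean changed = true;
        while (changed) {
            changed = false;
            for (Node n : nodes) {
                for (Node s : n.succs) {
                    // live-out(n) is the union of live-in over all successors
                    changed |= n.liveOut.addAll(liveIn(s));
                }
            }
        }
        List<String> result = new ArrayList<>();
        for (Node n : nodes) {
            if (n.def != null && !n.liveOut.contains(n.def)) {
                result.add(n.label);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Node init = new Node("a = 2", "a");
        Node cond = new Node("if (cond)", null, "cond");
        Node thenB = new Node("a = 3", "a");
        Node elseB = new Node("a = 4", "a");
        Node use = new Node("use(a)", null, "a"); // assumed read after the if
        init.succs.add(cond);
        cond.succs.add(thenB);
        cond.succs.add(elseB);
        thenB.succs.add(use);
        elseB.succs.add(use);
        System.out.println(uselessAssignments(List.of(init, cond, thenB, elseB, use)));
        // prints [a = 2]
    }
}
```

The sketch ignores side effects of the assigned expression; a real rule would have to decide whether an initializer such as a method call still counts as useless.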

PMD already has utilities to compute a data-flow graph on Java source. Those utilities are barely used in PMD, because they have several problems as of now:

  • They're out of date: Java features introduced after Java 1.5, such as lambda expressions, try-with-resources and multi-catch blocks, are not supported, and probably the foreach loop isn't either
  • The API is very confusing, unintuitive and under-documented
  • They're very inefficient. The data-flow graph is computed greedily for the entire source file, even though rules may only want to know about a subset of it (and most rules don't even need it).

Open issues with the current implementation are tracked on the label in:data-flow.

The codebase itself is heavily outdated, as it was written before Java 5. We've come to the understanding that it can be replaced entirely and redesigned from scratch, which gives the student a lot of freedom to make their own design decisions.

Impact

  • A reliable DFA would allow new rules to be written and some existing rules to be simplified. For security rules in particular, knowing where a piece of data comes from is important for deciding whether its usage is a security risk (keyword: tainting).
  • A modern and maintainable implementation would allow us to improve on it incrementally as features are added to the Java language (e.g. switch expressions in Java 12).

Expected outcome

Goals

The student should design a subsystem that:

  • Can compute the control-flow graph (CFG) of a block of code
    • This should be done on-demand, i.e. when a user asks for it.
  • Exposes a simple and high-level API to access and query the CFG from rules
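To make these goals more concrete, here is one possible shape for an on-demand CFG, given purely as a sketch. The tiny `Stmt` hierarchy stands in for PMD's Java AST, and every name in it (`Block`, `cfg()`, `Node`, `build`) is hypothetical. The graph is built only when `cfg()` is first called and is cached afterwards, so rules that never ask for it pay nothing. The sketch covers plain statements and if/else only and assumes Java 16+ (records and pattern matching for `instanceof`).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of an on-demand control-flow graph for a block of code. */
final class CfgSketch {

    /** Tiny stand-in for an AST: a plain statement or an if/else. */
    interface Stmt { }
    record Simple(String text) implements Stmt { }
    record If(String cond, List<Stmt> thenBranch, List<Stmt> elseBranch) implements Stmt { }

    /** A CFG node with its outgoing edges. */
    static final class Node {
        final String label;
        final List<Node> succs = new ArrayList<>();
        Node(String label) { this.label = label; }
    }

    /** A block of code whose CFG is built lazily and then cached. */
    static final class Block {
        private final List<Stmt> stmts;
        private Node entry; // stays null until cfg() is first called

        Block(List<Stmt> stmts) { this.stmts = stmts; }

        /** Returns the entry node, building the graph on first access. */
        Node cfg() {
            if (entry == null) {
                entry = new Node("entry");
                link(build(stmts, List.of(entry)), new Node("exit"));
            }
            return entry;
        }
    }

    /** Appends nodes for stmts after preds and returns the new dangling ends. */
    static List<Node> build(List<Stmt> stmts, List<Node> preds) {
        List<Node> ends = preds;
        for (Stmt s : stmts) {
            if (s instanceof If i) {
                Node cond = new Node("if (" + i.cond() + ")");
                link(ends, cond);
                // Both branches start at the condition and merge afterwards.
                List<Node> merged = new ArrayList<>(build(i.thenBranch(), List.of(cond)));
                merged.addAll(build(i.elseBranch(), List.of(cond)));
                ends = merged;
            } else {
                Node n = new Node(((Simple) s).text());
                link(ends, n);
                ends = List.of(n);
            }
        }
        return ends;
    }

    /** Adds an edge from every node in from to the target, skipping duplicates. */
    static void link(List<Node> from, Node to) {
        for (Node f : from) {
            if (!f.succs.contains(to)) {
                f.succs.add(to);
            }
        }
    }

    public static void main(String[] args) {
        Block body = new Block(List.of(
                new Simple("int a = 2"),
                new If("c", List.of(new Simple("a = 3")), List.of(new Simple("a = 4"))),
                new Simple("use(a)")));
        Node first = body.cfg().succs.get(0);
        System.out.println(first.label + " -> " + first.succs.get(0).label);
    }
}
```

A rule would call `cfg()` on the method body it is interested in and walk `succs` from there; loops, exceptions, and early exits are left out here and are where most of the real design work lies.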

The student should at least refactor the rule DataflowAnomalyAnalysis to make use of that new framework. That rule is meant to report useless assignments.

Non-goals

To simplify the problem to solve, we state explicitly that the following are not expected to be tackled:

  • We focus only on local CFA (i.e., within a single method). Interprocedural analysis is a non-goal.
  • We focus only on the analysis of Java source code. Analysing other languages is a non-goal.

Desirable skills

Java

What the student will learn

API design in Java, compiler theory (CFA, DFA)

Proposed mentors
