CodeQL vs. Datalog - chunhualiao/public-docs GitHub Wiki
GitHub CodeQL and Datalog share similarities as both are declarative query languages used for analyzing structured data, but they differ significantly in purpose, expressiveness, and underlying execution models. Here's a comparison:
Similarities:
- Declarative Query Language:
  - Both CodeQL and Datalog follow a declarative approach where users specify what they want to find rather than how to find it.
  - They use logical rules and predicates instead of procedural code.
- Based on Logic Programming:
  - CodeQL is inspired by Datalog, a rule-based logic programming language.
  - Both use Horn clauses (i.e., rules of the form `head :- body.`).
- Fixed-Point Evaluation:
  - Both support recursive queries that rely on fixed-point computations to compute results iteratively.
- Efficient Query Evaluation:
  - They both optimize execution using relational algebra techniques, such as indexing, joins, and materialized views.
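To make the fixed-point idea concrete, here is a minimal Python sketch of naive bottom-up evaluation for a transitive-closure rule. The `edge` facts and the function name are invented for illustration, and real engines use far more optimized strategies; the sketch only shows the "apply the rules until nothing new is derived" loop.

```python
def transitive_closure(edges):
    """Naive bottom-up evaluation of the Datalog program

        reach(X, Y) :- edge(X, Y).
        reach(X, Z) :- reach(X, Y), edge(Y, Z).

    `edges` is a set of (from, to) tuples standing in for edge facts.
    """
    reach = set(edges)  # first rule: every edge is a reach fact
    while True:
        # second rule: join reach with edge on the shared variable Y
        derived = {(x, z) for (x, y) in reach for (y2, z) in edges if y == y2}
        new_facts = derived - reach
        if not new_facts:  # fixed point: another round adds nothing
            return reach
        reach |= new_facts


# Hypothetical call-graph facts: caller -> callee
edges = {("main", "parse"), ("parse", "lex"), ("lex", "read_char")}
closure = transitive_closure(edges)
print(sorted(closure))
```

The loop terminates because the set of possible tuples is finite and `reach` only ever grows, which is the same monotonicity argument that guarantees termination for plain Datalog.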
Differences:
Feature | CodeQL | Datalog |
---|---|---|
Purpose | Designed for code analysis and security auditing, querying ASTs, control flow graphs, and data flow in programs. | General-purpose deductive database language, often used in logic programming, program analysis, and knowledge representation. |
Data Model | Works on program representations (ASTs, control flow graphs, intermediate representations of code). | Works on relational tuples (similar to Prolog but with a database-oriented execution model). |
Syntax | More user-friendly, inspired by SQL with an object-oriented flavor (`from ... where ... select ...`). | Traditional logic programming syntax (`fact :- condition.`). |
Expressiveness | Supports object-oriented features, predicates with parameters, and more complex pattern matching on program structures. | More mathematical and general-purpose, designed for relational queries and recursive computations. |
Use of Recursion | Frequently used to track data flow and control flow across multiple steps in code. | Native recursion support but primarily used for transitive closure queries in database-like structures. |
Practical Application | Used in security vulnerability detection, static analysis, and finding bugs in codebases (GitHub, LGTM, Semmle). | Used in database queries, reasoning systems, and knowledge representation. |
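As a rough illustration of how recursion tracks flow across multiple steps, the Python sketch below propagates "taint" from sources along one-step flow edges and reports which sinks are reached. All relation names and values (`sources`, `sinks`, `steps`, `argv`, and so on) are hypothetical, and this is not how the CodeQL data flow library is implemented; it only mirrors the recursive rules shown in the comments.

```python
def tainted_sinks(sources, sinks, steps):
    """Evaluate, in a semi-naive style, the rules

        flows(X) :- source(X).
        flows(Y) :- flows(X), step(X, Y).
        alert(S) :- flows(S), sink(S).

    Only nodes derived in the previous round (`delta`) are expanded again.
    """
    # Index one-step flow edges by their start node.
    successors = {}
    for start, end in steps:
        successors.setdefault(start, set()).add(end)

    reached = set(sources)
    delta = set(sources)
    while delta:
        candidates = set()
        for node in delta:
            candidates |= successors.get(node, set())
        delta = candidates - reached  # keep only genuinely new facts
        reached |= delta
    return reached & set(sinks)


# Hypothetical facts: user input flows through two copies into a shell call.
sources = {"argv"}
sinks = {"system_arg", "log_msg"}
steps = {("argv", "buf"), ("buf", "cmd"), ("cmd", "system_arg"), ("config", "log_msg")}
alerts = tainted_sinks(sources, sinks, steps)
print(alerts)
```

Here `log_msg` is a sink but is only fed by `config`, which is not a source, so it is not reported.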
Summary
- If you're familiar with Datalog, CodeQL should feel intuitive since it builds on Datalog principles but extends them for code analysis.
- CodeQL is more specialized for program analysis, whereas Datalog is more general-purpose for logical inference and relational queries.
Example: Finding Function Calls in a Codebase
Let’s compare CodeQL and Datalog by writing a query that finds all function calls in a given codebase.
1. CodeQL Example
CodeQL is designed for code analysis, so it provides an object-oriented API over an abstract syntax tree (AST). Here’s a simple query that finds all function calls in a C++ codebase:
import cpp
from FunctionCall call
select call, call.getTarget()
Explanation:
- `import cpp` → Loads the C++ CodeQL standard library.
- `FunctionCall call` → Declares a variable `call` that represents all function call expressions.
- `call.getTarget()` → Retrieves the function being called.
- `select call, call.getTarget()` → Outputs the call and the function being called.
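One way to picture the object-oriented layer is as a thin wrapper over relational tables. The Python sketch below is a toy model under that assumption: the `CALLS` and `CALL_TARGET` tables, their columns, and the `get_target` method are invented for illustration and do not reflect the actual CodeQL database schema or library implementation.

```python
from dataclasses import dataclass

# Hypothetical extracted relations (names and columns are assumptions):
#   CALLS:       (call_id, file, line)
#   CALL_TARGET: (call_id, function_name)
CALLS = {(1, "foo.c", 10), (2, "bar.c", 20)}
CALL_TARGET = {(1, "printf"), (2, "malloc")}


@dataclass(frozen=True)
class FunctionCall:
    """Toy stand-in for a class whose instances are rows of CALLS."""
    call_id: int
    file: str
    line: int

    def get_target(self):
        # A member predicate modeled as a lookup in a second relation,
        # joined on call_id. Returns None when no target is recorded.
        return next((name for cid, name in CALL_TARGET if cid == self.call_id), None)


def all_function_calls():
    # Plays the role of `from FunctionCall call`: range over every row.
    return [FunctionCall(*row) for row in sorted(CALLS)]


# Plays the role of `select call, call.getTarget()`.
results = [(call, call.get_target()) for call in all_function_calls()]
for call, target in results:
    print(f"{call.file}:{call.line} -> {target}")
```

The point of the sketch is that a method call such as `call.getTarget()` can be read as a join between relations, which is why the query remains declarative despite the object-oriented syntax.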
2. Equivalent Datalog Query
Datalog operates on relational facts instead of an object-oriented API. Let’s assume we have facts representing function calls:
% Facts:
call("foo.c", 10, "printf").
call("bar.c", 20, "malloc").
% Rule to extract function calls:
function_call(File, Line, Func) :- call(File, Line, Func).
Explanation:
- The fact `call("foo.c", 10, "printf")` represents a call to `printf` at line 10 in `foo.c`.
- The rule `function_call(File, Line, Func)` simply retrieves function calls.
- Querying `function_call(File, Line, Func)` will return two rows: `foo.c 10 printf` and `bar.c 20 malloc`.
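For readers without a Datalog engine at hand, the same facts, rule, and query can be mimicked with Python sets of tuples. This is only a sketch: the `query` helper and its use of `None` as an unbound variable are conventions invented here, not part of any Datalog system.

```python
# Facts: call(File, Line, Func).
call = {("foo.c", 10, "printf"), ("bar.c", 20, "malloc")}

# Rule: function_call(File, Line, Func) :- call(File, Line, Func).
function_call = {(file, line, func) for (file, line, func) in call}


def query(relation, *pattern):
    """Return the rows of `relation` that match `pattern`, sorted.

    A pattern entry of None behaves like an unbound Datalog variable;
    any other value must equal the row's value in that position.
    """
    return sorted(
        row
        for row in relation
        if len(row) == len(pattern)
        and all(p is None or p == value for p, value in zip(pattern, row))
    )


# function_call(File, Line, Func) with all variables unbound
for row in query(function_call, None, None, None):
    print(*row)

# function_call(File, Line, "malloc") with the third argument bound
print(query(function_call, None, None, "malloc"))
```

Binding an argument in the query narrows the result, much as a `where` clause would in the CodeQL version.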
Key Differences in the Example
Feature | CodeQL | Datalog |
---|---|---|
Data Representation | Works with semantic program representations (AST, control/data flow). | Works with explicit relational facts (e.g., `call(File, Line, Func)`). |
Syntax | SQL-like with an object-oriented API. | Prolog-like relational logic. |
Built-in Analysis | Provides rich APIs (e.g., `call.getTarget()` for function resolution). | Requires manually defining facts and inference rules. |
Use Case | Used for code security analysis and static analysis. | Used for general relational reasoning and deductive databases. |
Conclusion
- CodeQL is built on Datalog-like logic but extends it with object-oriented features and program analysis primitives.
- Datalog is a general-purpose logic language that could be used for similar tasks but lacks built-in code analysis capabilities.