CodeQL - chunhualiao/public-docs GitHub Wiki

CodeQL vs. Datalog

GitHub's CodeQL is a powerful static code analysis tool that allows developers and security researchers to analyze code as if it were data[1][2]. It works by transforming source code into a queryable database, enabling users to write and run queries to identify vulnerabilities, bugs, and other issues in codebases[6].

Key Concepts

CodeQL Database: CodeQL analyzes code through a database that represents the codebase. This database is generated by running CodeQL extractors[3], which observe a build for compiled languages and read the source files directly for interpreted ones.
QL Query Language: CodeQL uses a specialized query language called QL, which allows users to define patterns and rules for code analysis[3]. QL is a declarative, object-oriented language: a query states what to find rather than how to find it, while classes support inheritance, encapsulation, and composition[1].
Query Execution: After creating a CodeQL database, queries are executed against it. These queries can be pre-defined or custom-written to find specific patterns or potential issues in the code[7].

How CodeQL Works

Code Extraction: CodeQL first extracts the code into a relational representation, creating a database of facts about the program[1].
Query Writing: Users write queries using the QL language to ask questions about the code, such as "Show me all function calls to functions called 'eval'"[1].
Analysis: CodeQL executes these queries against the database, identifying patterns and potential issues in the code[7].
Result Interpretation: Query results are interpreted and presented in a meaningful way, often highlighting specific locations in the source code[7].
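The four stages above can be mimicked in miniature. The following Python sketch is only an analogy, not how CodeQL is implemented: it assumes a hypothetical `CallFact` record and a hand-written list of facts standing in for an extracted database, then answers the "calls to eval" question with a simple filter.

```python
from typing import NamedTuple


class CallFact(NamedTuple):
    """One row of a hypothetical 'calls' table (illustrative, not a CodeQL schema)."""
    file: str
    line: int
    callee: str


# Stand-in for an extracted database: these facts are written by hand.
CALLS = [
    CallFact("app.js", 3, "parseInt"),
    CallFact("app.js", 7, "eval"),
    CallFact("util.js", 12, "eval"),
]


def calls_to(facts, name):
    """Query step: return every call fact whose callee has the given name."""
    return [fact for fact in facts if fact.callee == name]


def report(facts):
    """Interpretation step: render each match as a file:line location."""
    return [f"{fact.file}:{fact.line}" for fact in facts]


if __name__ == "__main__":
    for location in report(calls_to(CALLS, "eval")):
        print(location)
```

The point of the analogy is the separation of concerns: the facts are produced once, and any number of questions can then be asked of them without re-reading the source code.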

Use Cases

Vulnerability detection (e.g., SQL injection, cross-site scripting)
Code quality improvement
Open-source security scanning
Incident response and root cause analysis[6]

Getting Started

To begin using CodeQL:

Set up the CodeQL CLI or use GitHub Actions for automation[3].
Create a CodeQL database for your project[3].
Learn the basics of the QL query language[3].
Write and run queries to analyze your code[1].

By turning a codebase into a queryable database, CodeQL lets developers search systematically for potential issues and vulnerabilities instead of relying on manual review alone.

Citations:

CodeQL Toolchain: End-to-End Example

The CodeQL toolchain operates in several stages:

Extracting a Code Database (QLDB)
Querying the Database
Interpreting the Results

The walkthrough below follows the entire CodeQL process with an example, showing the intermediate outputs at each stage.

1. CodeQL Toolchain Overview

Step 1: Code Extraction (QLDB Generation)

Purpose: Convert source code into a structured database.
Tool: codeql database create
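Intermediate Output: A CodeQL database (called QLDB here), containing relational facts about the code (syntax tree, names, types, and related structure). Control-flow and data-flow information is largely derived from these facts by the CodeQL libraries at query time.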

Step 2: Query Execution

Purpose: Run a CodeQL query on the extracted database.
Tool: codeql query run
Intermediate Output: Query results in BQRS (binary query result set) format, which codeql bqrs decode converts to CSV, JSON, or plain text.

Step 3: Results Interpretation

Purpose: Analyze vulnerabilities, performance issues, or refactoring opportunities.
Output: Human-readable insights.
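The stages above map onto a short sequence of CLI invocations. The Python sketch below only assembles and prints the command lines; it does not run them. It assumes the names used later in this walkthrough (`mydb`, `find_function_calls.ql`), a hypothetical intermediate file `result.bqrs`, and `gcc -c main.c` as the build command. Because `codeql query run --output` writes BQRS, the sketch adds a `codeql bqrs decode` step to obtain CSV.

```python
import shlex


def create_db_cmd(db, language, source_root=".", build_command=None):
    """Stage 1: build the argv for `codeql database create`."""
    cmd = ["codeql", "database", "create", db,
           f"--language={language}", f"--source-root={source_root}"]
    if build_command:
        # Compiled languages need a build that the extractor can observe.
        cmd.append(f"--command={build_command}")
    return cmd


def run_query_cmd(db, query, bqrs_path):
    """Stage 2: build the argv for `codeql query run` (writes BQRS)."""
    return ["codeql", "query", "run",
            f"--database={db}", f"--output={bqrs_path}", query]


def decode_cmd(bqrs_path, csv_path):
    """Stage 3: build the argv for `codeql bqrs decode` (BQRS to CSV)."""
    return ["codeql", "bqrs", "decode",
            "--format=csv", f"--output={csv_path}", bqrs_path]


def pipeline(db="mydb", query="find_function_calls.ql"):
    """Return the three commands in the order they would be executed."""
    return [
        create_db_cmd(db, "cpp", build_command="gcc -c main.c"),
        run_query_cmd(db, query, "result.bqrs"),  # result.bqrs is an assumed name
        decode_cmd("result.bqrs", "result.csv"),
    ]


if __name__ == "__main__":
    # Dry run: print shell-quoted commands instead of executing them.
    for cmd in pipeline():
        print(shlex.join(cmd))
```

To execute the steps rather than print them, each list can be passed to `subprocess.run(cmd, check=True)` on a machine where the CodeQL CLI is installed.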

2. Step-by-Step Example

We'll analyze a simple C program to find function calls.

Sample Code (`main.c`)

#include <stdio.h>

void greet() {
    printf("Hello, World!\n");
}

int main() {
    greet();
    return 0;
}

Step 1: Extract CodeQL Database

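Run the following command to extract a CodeQL database. C is a compiled language, so CodeQL needs to observe a build; --command supplies one here (gcc is an assumption; without --command, CodeQL tries to detect a build system automatically):

codeql database create mydb --language=cpp --source-root . --command="gcc -c main.c"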

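This records facts about the program (syntax tree, names, types, and related structure) in mydb/.
Intermediate Output: The database directory typically contains the following (the exact layout varies by CLI version):
- codeql-database.yml → Database metadata such as the language and source location.
- db-cpp/ → The relational dataset: tables of facts about function calls, variables, and AST nodes.
- src.zip → Archived copy of the source files seen during extraction.
- log/ → Build and extraction logs.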

Step 2: Writing a CodeQL Query

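Create a file find_function_calls.ql to find function calls in the code. For import cpp to resolve, the file should sit in a query pack, that is, a directory whose qlpack.yml declares a dependency on the codeql/cpp-all library pack.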

CodeQL Query: `find_function_calls.ql`

import cpp

from FunctionCall call
select call, call.getTarget()

import cpp → Loads the CodeQL standard library for C/C++.
FunctionCall call → Declares a variable call that ranges over every function call in the database.
call.getTarget() → Retrieves the function being called.
select call, call.getTarget() → Outputs each function call and its target.

Step 3: Running the Query

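Run the query on the extracted database. codeql query run writes its results in the binary BQRS format, so a second command decodes them to CSV:

codeql query run --database=mydb --output=result.bqrs find_function_calls.ql
codeql bqrs decode --format=csv --output=result.csv result.bqrs

Intermediate Output: result.bqrs holds the raw query results; result.csv contains the same results as CSV.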

Step 4: Inspecting the Results

After running the query, let's examine the intermediate results.

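Intermediate Output (Function Calls Table)

For the sample program, the results should look roughly like this (the query matches every call, including the call to printf):

| Function Call Location | Target Function |
| --- | --- |
| `main.c:4` | `printf()` |
| `main.c:8` | `greet()` |

This means that printf() is called inside greet() at line 4, and greet() is called inside main() at line 8.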
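Once results are available as CSV, a short script can post-process them. The sketch below assumes a CSV whose header columns are named `call` and `target`; both names and the sample rows are hypothetical, since the actual header depends on the query's select clause and the CLI version. It groups call descriptions by target function.

```python
import csv
import io
from collections import defaultdict


def calls_by_target(csv_text):
    """Map each target function name to the list of calls that reach it."""
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        grouped[row["target"]].append(row["call"])
    return dict(grouped)


# Hypothetical sample data shaped like a decoded result file.
SAMPLE = (
    '"call","target"\n'
    '"call to printf","printf"\n'
    '"call to greet","greet"\n'
)

if __name__ == "__main__":
    for target, calls in calls_by_target(SAMPLE).items():
        print(f"{target}: {len(calls)} call(s)")
```

For a real result file, the same function can be applied to the text read from result.csv after adjusting the column names to match its header.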

3. Visualizing the CodeQL Database

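We can explore the CodeQL database further by querying it:

Check Relations:

The codeql database interpret-results command does not dump raw relations; it formats the results of queries that were already run with codeql database run-queries (as SARIF or CSV) and expects those queries to carry alert metadata. To look at the underlying facts, run a query over the relevant class instead, for example one listing every function defined in main.c:

import cpp

from Function f
where f.getFile().getBaseName() = "main.c"
select f.getName(), f.getFile().getBaseName(), f.getLocation().getStartLine()

Example of the facts this should return for the sample program:

| Function Name | File | Line |
| --- | --- | --- |
| `greet` | `main.c` | `3` |
| `main` | `main.c` | `7` |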

CodeQL internally models code as a database with tables for functions, calls, control flow edges, etc.
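To make the "tables" idea concrete, the Python sketch below models two hypothetical relations for the sample program, one for functions and one for calls, and joins them to recover caller and callee names. The table shapes, ids, and field names are illustrative assumptions and do not reflect CodeQL's real database schema.

```python
from typing import NamedTuple


class Function(NamedTuple):
    """Row of a hypothetical 'functions' table."""
    id: int
    name: str
    file: str


class Call(NamedTuple):
    """Row of a hypothetical 'calls' table, keyed by function ids."""
    id: int
    caller_id: int  # function that contains the call
    callee_id: int  # function being called


# Facts written by hand for the sample main.c.
FUNCTIONS = [
    Function(1, "greet", "main.c"),
    Function(2, "main", "main.c"),
    Function(3, "printf", "stdio.h"),
]

CALLS = [
    Call(10, caller_id=1, callee_id=3),  # greet calls printf
    Call(11, caller_id=2, callee_id=1),  # main calls greet
]


def call_edges(functions, calls):
    """Join calls against functions twice to get (caller, callee) name pairs."""
    by_id = {function.id: function for function in functions}
    return [(by_id[c.caller_id].name, by_id[c.callee_id].name) for c in calls]


if __name__ == "__main__":
    for caller, callee in call_edges(FUNCTIONS, CALLS):
        print(f"{caller} -> {callee}")
```

A QL query over calls and their targets plays the role of this join, with the library classes hiding the ids and table lookups.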

4. Summary of Toolchain

| Step | Command | Intermediate Output |
| --- | --- | --- |
| 1. Extract CodeQL Database | `codeql database create` | Database directory (`mydb/`) with relational facts about the code |
| 2. Write Query | `find_function_calls.ql` | Query file |
| 3. Run Query | `codeql query run` | Results in BQRS format |
| 4. Inspect Results | `codeql bqrs decode` | CSV table of function calls |

Next Steps

Natural follow-ups to this walkthrough include:

A more advanced CodeQL query (e.g., finding security vulnerabilities such as buffer overflows).
More database internals (e.g., control flow graphs, variable tracking).