CodeQL - chunhualiao/public-docs GitHub Wiki

CodeQL vs. Datalog

GitHub's CodeQL is a powerful static code analysis tool that allows developers and security researchers to analyze code as if it were data[1][2]. It works by transforming source code into a queryable database, enabling users to write and run queries to identify vulnerabilities, bugs, and other issues in codebases[6].

Key Concepts

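  1. CodeQL Database: CodeQL analyzes code using a database that represents the codebase. This database is generated by running CodeQL extractors: for compiled languages the extractor observes a build of the code, while for interpreted languages it runs directly on the source[3].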

  2. QL Query Language: CodeQL uses a specialized query language called QL, in which users define patterns and rules for code analysis[3]. QL is a declarative, object-oriented logic language whose semantics are based on Datalog; it supports features like inheritance, encapsulation, and composition[1].

  3. Query Execution: After creating a CodeQL database, queries are executed against it. These queries can be pre-defined or custom-written to find specific patterns or potential issues in the code[7].
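
Because this page's title compares CodeQL with Datalog, it may help to see what Datalog-style evaluation looks like: rules are applied to a table of facts until no new facts appear. The sketch below is not CodeQL. It is a minimal Python illustration, and the `calls` facts and the `reaches` rule are invented for the example.

```python
# Hypothetical facts: (caller, callee) pairs, as an extractor might record them.
calls = {("main", "greet"), ("greet", "printf"), ("main", "cleanup")}


def reaches(call_facts):
    """Compute the transitive closure of the call relation.

    Datalog-style rules being evaluated:
        reaches(a, b) :- calls(a, b).
        reaches(a, c) :- reaches(a, b), calls(b, c).
    """
    result = set(call_facts)
    while True:
        # Apply the recursive rule once to everything derived so far.
        derived = {(a, c) for (a, b) in result for (b2, c) in call_facts if b == b2}
        if derived <= result:  # fixpoint: no new facts were derived
            return result
        result |= derived


closure = reaches(calls)
print(sorted(closure))
```

A QL query with a recursive predicate is evaluated in a comparable way, although the real engine is far more optimized than this loop.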

How CodeQL Works

  1. Code Extraction: CodeQL first extracts the code into a relational representation, creating a database of facts about the program[1].

  2. Query Writing: Users write queries using the QL language to ask questions about the code, such as "Show me all function calls to functions called 'eval'"[1].

  3. Analysis: CodeQL executes these queries against the database, identifying patterns and potential issues in the code[7].

  4. Result Interpretation: Query results are interpreted and presented in a meaningful way, often highlighting specific locations in the source code[7].
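
The four steps above can be sketched in a few lines of Python. This is only an analogy for treating code as data: the `Call` record, the file names, and the line numbers are all invented, and a real CodeQL database holds far richer facts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    """One row of a hypothetical 'function call' table."""
    file: str
    line: int
    callee: str


# Step 1 (extraction): facts an extractor might record; values are invented.
database = [
    Call("app.js", 3, "parseInput"),
    Call("app.js", 7, "eval"),
    Call("util.js", 12, "eval"),
]


def calls_to(db, name):
    """Steps 2-3 (query + analysis): select every call whose callee is `name`."""
    return [c for c in db if c.callee == name]


# Step 4 (interpretation): present results as source locations.
for hit in calls_to(database, "eval"):
    print(f"{hit.file}:{hit.line}: call to {hit.callee}")
```

The function `calls_to` plays the role of the query "show me all function calls to functions called 'eval'"; the loop at the end corresponds to mapping results back to source locations.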

Use Cases

  • Vulnerability detection (e.g., SQL injection, cross-site scripting)
  • Code quality improvement
  • Open-source security scanning
  • Incident response and root cause analysis[6]

Getting Started

To begin using CodeQL:

  1. Set up the CodeQL CLI or use GitHub Actions for automation[3].
  2. Create a CodeQL database for your project[3].
  3. Learn the basics of the QL query language[3].
  4. Write and run queries to analyze your code[1].

CodeQL is a powerful tool that can significantly enhance code security and quality by allowing developers to systematically analyze their codebases for potential issues and vulnerabilities.

CodeQL Toolchain: End-to-End Example

The CodeQL toolchain operates in several stages:

  1. Extracting a Code Database (QLDB)
  2. Querying the Database
  3. Interpreting the Results

I'll walk you through the entire CodeQL process with an example, showing intermediate outputs at each stage.


1. CodeQL Toolchain Overview

Step 1: Code Extraction (QLDB Generation)

  • Purpose: Convert source code into a structured database.
  • Tool: codeql database create
  • Intermediate Output: A QL database (QLDB), containing facts about the code (AST, control flow, data flow, etc.).

Step 2: Query Execution

  • Purpose: Run a CodeQL query on the extracted database.
  • Tool: codeql query run
  • Intermediate Output: Query results in BQRS (binary query result set) format, which codeql bqrs decode can convert to text, CSV, or JSON.

Step 3: Results Interpretation

  • Purpose: Analyze vulnerabilities, performance issues, or refactoring opportunities.
  • Output: Human-readable insights.
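
To make the stages concrete, here is a small Python sketch that assembles the command lines for the pipeline without running them. It assumes the `codeql` CLI is on the PATH; the file names `result.bqrs` and `result.csv` and the helper `pipeline_commands` are illustrative choices. A decode step is included because `codeql query run` stores its results in the binary BQRS format.

```python
import shlex


def pipeline_commands(db, language, query, build_command=None):
    """Return the argv lists for extract -> run -> decode, without running them."""
    create = ["codeql", "database", "create", db,
              f"--language={language}", "--source-root=."]
    if build_command:  # compiled languages need a build for the extractor to observe
        create.append(f"--command={build_command}")
    run = ["codeql", "query", "run", f"--database={db}",
           "--output=result.bqrs", query]
    decode = ["codeql", "bqrs", "decode", "--format=csv",
              "--output=result.csv", "result.bqrs"]
    return [create, run, decode]


for argv in pipeline_commands("mydb", "cpp", "find_function_calls.ql",
                              "gcc main.c -o main"):
    print(shlex.join(argv))
```

To execute the commands rather than print them, each list could be passed to `subprocess.run(argv, check=True)`.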

2. Step-by-Step Example

We'll analyze a simple C program to find function calls.

Sample Code (main.c)

#include <stdio.h>

void greet() {
    printf("Hello, World!\n");
}

int main() {
    greet();
    return 0;
}

Step 1: Extract CodeQL Database

Run the following command to extract a CodeQL database:

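codeql database create mydb --language=cpp --source-root . --command="gcc main.c -o main"
  • C/C++ is a compiled language, so the extractor has to observe a build. The --command option supplies one (gcc is assumed here); without it, CodeQL tries to detect and run a build automatically.
  • This extracts facts about the program into mydb/. Analyses such as data flow are computed from these facts by the CodeQL libraries at query time.
  • Intermediate Output: The extracted database directory typically contains the following (exact layout may vary between CLI versions):
    • codeql-database.yml → Metadata about the database, such as its language.
    • db-cpp/ → The relational data: tables of facts about function calls, variables, and AST nodes.
    • src.zip → Archived copy of the source files seen during extraction.
    • log/ → Extraction logs.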

Step 2: Writing a CodeQL Query

Create a file find_function_calls.ql to find function calls in the code.

CodeQL Query: find_function_calls.ql

import cpp

from FunctionCall call
select call, call.getTarget()
  • import cpp → Loads the CodeQL standard library for C/C++.
  • FunctionCall call → Declares call as a function call.
  • call.getTarget() → Retrieves the function being called.
  • select call, call.getTarget() → Outputs each function call and its target.
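
As a rough mental model of what this query does, the Python sketch below mimics it over hand-written facts. The classes `Function` and `FunctionCall` and the method `get_target` are stand-ins written for this example, not the CodeQL library itself; line numbers are counted from the main.c listing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Function:
    name: str
    file: str  # file that declares the function


@dataclass(frozen=True)
class FunctionCall:
    line: int         # line of the call in main.c
    target: Function  # the function being called

    def get_target(self):
        """Stand-in for the member predicate call.getTarget()."""
        return self.target


# Hand-written facts for main.c (an extractor would produce these automatically).
printf_fn = Function("printf", "stdio.h")
greet_fn = Function("greet", "main.c")
all_calls = [FunctionCall(4, printf_fn), FunctionCall(8, greet_fn)]

# `from FunctionCall call` ranges over every call;
# `select call, call.getTarget()` emits one row per call.
rows = [(f"main.c:{call.line}", call.get_target().name) for call in all_calls]
print(rows)
```

The comprehension reads much like the query: the `for` clause corresponds to `from`, and the tuple corresponds to `select`.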

Step 3: Running the Query

Run the query on the extracted database:

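codeql query run --database=mydb --output=result.bqrs find_function_calls.ql
codeql bqrs decode --format=csv --output=result.csv result.bqrs
  • Intermediate Output: codeql query run writes its results in the binary BQRS format, so the second command decodes result.bqrs into result.csv.
  • For import cpp to resolve, the query file must sit in a query pack, that is, a directory with a qlpack.yml that depends on codeql/cpp-all.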

Step 4: Inspecting the Results

After running the query, let's examine the intermediate results.

Intermediate Output (Function Calls Table)

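| Function Call Location | Target Function |
|------------------------|-----------------|
| main.c:4               | printf()        |
| main.c:8               | greet()         |

  • This means that printf() is called inside greet() at line 4, and greet() is called inside main() at line 8. The query selects every function call, so the call to printf() is reported as well.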
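
Once the results are available as CSV, they can be post-processed with ordinary tooling. The sketch below uses Python's `csv` module on an in-memory sample; the sample text, including its column headers, is an assumption made for illustration and may not match the CLI's actual output byte for byte.

```python
import csv
import io
from collections import Counter

# Hypothetical CSV content standing in for result.csv.
SAMPLE = (
    '"call","target"\n'
    '"call to printf","printf"\n'
    '"call to greet","greet"\n'
)


def load_targets(text):
    """Return the called function names, one per result row."""
    reader = csv.DictReader(io.StringIO(text))
    return [row["target"] for row in reader]


targets = load_targets(SAMPLE)
print(Counter(targets).most_common())
```

For a real file, `open("result.csv", newline="")` would replace the `io.StringIO` wrapper.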

3. Visualizing the CodeQL Database

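The relations are stored in a binary format inside the database directory, so the practical way to explore the CodeQL database is to run queries against it rather than to read its files directly.

Note on codeql database interpret-results:

codeql database interpret-results --format=csv --output=results.csv mydb
  • This command does not dump relations. It converts results already computed by codeql database run-queries into a report (for example CSV or SARIF), and it expects queries that carry alert metadata such as @kind problem.

Illustrative example of the facts the database holds about the functions defined in main.c:

| Function Name | File   | Line |
|---------------|--------|------|
| greet         | main.c | 3    |
| main          | main.c | 7    |
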
  • CodeQL internally models code as a database with tables for functions, calls, control flow edges, etc.
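
The "code as tables" idea can be tried out with any relational database. The sketch below uses Python's built-in `sqlite3` module with a made-up two-table schema; it is an analogy only and does not reflect CodeQL's actual database schema. Line numbers are counted from the main.c listing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schema: one table of functions, one table of call edges.
    CREATE TABLE functions (id INTEGER PRIMARY KEY, name TEXT, file TEXT, line INTEGER);
    CREATE TABLE calls (caller INTEGER, callee INTEGER, line INTEGER);
    INSERT INTO functions VALUES (1, 'greet', 'main.c', 3), (2, 'main', 'main.c', 7);
    INSERT INTO calls VALUES (2, 1, 8);
""")

# Join the call edges with the function table to name both ends of each call.
rows = conn.execute("""
    SELECT caller.name, callee.name, calls.line
    FROM calls
    JOIN functions AS caller ON caller.id = calls.caller
    JOIN functions AS callee ON callee.id = calls.callee
""").fetchall()
print(rows)
```

A CodeQL query such as the one shown earlier performs this kind of join implicitly when it follows `call.getTarget()`.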

4. Summary of Toolchain

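| Step                        | Command / File                             | Intermediate Output                                  |
|-----------------------------|--------------------------------------------|------------------------------------------------------|
| 1. Extract CodeQL Database  | codeql database create                     | Database directory (mydb/) with facts about the code |
| 2. Write Query              | find_function_calls.ql                     | Query file                                           |
| 3. Run Query                | codeql query run, then codeql bqrs decode  | Results in BQRS, decoded to CSV                      |
| 4. Inspect Results          | result.csv                                 | Table of function calls                              |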

Next Steps

Would you like to see:

  1. A more advanced CodeQL query (e.g., finding security vulnerabilities like buffer overflows)?
  2. More database internals (e.g., control flow graphs, variable tracking)?