CodeQL - chunhualiao/public-docs GitHub Wiki
GitHub's CodeQL is a powerful static code analysis tool that allows developers and security researchers to analyze code as if it were data[1][2]. It works by transforming source code into a queryable database, enabling users to write and run queries to identify vulnerabilities, bugs, and other issues in codebases[6].
Key concepts:
- CodeQL Database: CodeQL analyzes code using a database that represents the codebase. This database is generated by compiling the code and running CodeQL extractors[3].
- QL Query Language: CodeQL uses a specialized query language called QL, which allows users to define patterns and rules for code analysis[3]. QL is an object-oriented language that supports features like inheritance, encapsulation, and composition[1].
- Query Execution: After creating a CodeQL database, queries are executed against it. These queries can be pre-defined or custom-written to find specific patterns or potential issues in the code[7].
How it works:
- Code Extraction: CodeQL first extracts the code into a relational representation, creating a database of facts about the program[1].
- Query Writing: Users write queries using the QL language to ask questions about the code, such as "Show me all function calls to functions called 'eval'"[1].
- Analysis: CodeQL executes these queries against the database, identifying patterns and potential issues in the code[7].
- Result Interpretation: Query results are interpreted and presented in a meaningful way, often highlighting specific locations in the source code[7].
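To make the "code as data" idea concrete, here is a minimal Python sketch that stores a few call facts in an in-memory SQLite table and asks the "calls to `eval`" question from the list above. The table name `calls`, its columns, and the sample rows are all hypothetical; real CodeQL databases use a much richer, language-specific schema and are queried with QL rather than SQL.

```python
import sqlite3
from typing import List, Tuple

# Hypothetical, heavily simplified facts: one row per function call.
# (file, line, name of the called function)
CALL_FACTS = [
    ("app.js", 3, "parseInput"),
    ("app.js", 7, "eval"),
    ("util.js", 12, "eval"),
    ("util.js", 20, "log"),
]


def build_database() -> sqlite3.Connection:
    """Load the sample facts into an in-memory relational database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE calls (file TEXT, line INTEGER, target TEXT)")
    conn.executemany("INSERT INTO calls VALUES (?, ?, ?)", CALL_FACTS)
    return conn


def find_calls_to(conn: sqlite3.Connection, name: str) -> List[Tuple[str, int]]:
    """Return (file, line) for every call whose target is `name`."""
    rows = conn.execute(
        "SELECT file, line FROM calls WHERE target = ? ORDER BY file, line",
        (name,),
    )
    return rows.fetchall()


if __name__ == "__main__":
    conn = build_database()
    for file, line in find_calls_to(conn, "eval"):
        print(f"{file}:{line}")
```

The point of the sketch is only the shape of the workflow: facts go in once, and any number of questions can then be asked without re-reading the source.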
Typical use cases:
- Vulnerability detection (e.g., SQL injection, cross-site scripting)
- Code quality improvement
- Open-source security scanning
- Incident response and root cause analysis[6]
To begin using CodeQL:
- Set up the CodeQL CLI or use GitHub Actions for automation[3].
- Create a CodeQL database for your project[3].
- Learn the basics of the QL query language[3].
- Write and run queries to analyze your code[1].
CodeQL is a powerful tool that can significantly enhance code security and quality by allowing developers to systematically analyze their codebases for potential issues and vulnerabilities.
Citations:
- [1] https://github.blog/developer-skills/github/codeql-zero-to-hero-part-2-getting-started-with-codeql/
- [2] https://book.martiandefense.llc/notes/product-security-engineering/sast-sca/codeql-for-beginners
- [3] https://shecancode.io/getting-started-with-githubs-codeql/
- [4] https://learn.microsoft.com/en-us/training/modules/code-scanning-with-github-codeql/
- [5] https://github.com/readme/guides/custom-codeql-queries
- [6] https://infosec-jobs.com/insights/codeql-explained/
- [7] https://codeql.github.com/docs/codeql-overview/about-codeql/
- [8] https://github.blog/developer-skills/github/codeql-zero-to-hero-part-1-the-fundamentals-of-static-analysis-for-vulnerability-research/
The CodeQL toolchain operates in several stages:
- Extracting a CodeQL Database
- Querying the Database
- Interpreting the Results
I'll walk you through the entire CodeQL process with an example, showing intermediate outputs at each stage.
Extracting the database:
- Purpose: Convert source code into a structured database.
- Tool: `codeql database create`
- Intermediate Output: A CodeQL database containing facts about the code (AST, control flow, data flow, etc.).

Querying the database:
- Purpose: Run a CodeQL query on the extracted database.
- Tool: `codeql query run`
- Intermediate Output: Query results in CodeQL's binary BQRS format, which `codeql bqrs decode` can convert to CSV or JSON.

Interpreting the results:
- Purpose: Analyze vulnerabilities, performance issues, or refactoring opportunities.
- Output: Human-readable insights.
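The three stages can be sketched as a small Python helper that only assembles the command lines, without running them. The function name `pipeline_commands` and the file names `results.bqrs` and `results.csv` are assumptions made for illustration. The sketch also assumes that `codeql query run` writes its output in the binary BQRS format, so a `codeql bqrs decode` step is included to produce CSV.

```python
import shlex
from typing import List


def pipeline_commands(db: str, language: str, source_root: str, query: str) -> List[List[str]]:
    """Build (but do not execute) the CodeQL CLI calls for each stage."""
    bqrs = "results.bqrs"  # hypothetical name for the raw result file
    return [
        # Stage 1: extract a database from the source tree.
        ["codeql", "database", "create", db,
         f"--language={language}", f"--source-root={source_root}"],
        # Stage 2: run one query against that database.
        ["codeql", "query", "run", f"--database={db}", f"--output={bqrs}", query],
        # Stage 3: decode the binary results into a readable format.
        ["codeql", "bqrs", "decode", "--format=csv", "--output=results.csv", bqrs],
    ]


if __name__ == "__main__":
    for argv in pipeline_commands("mydb", "cpp", ".", "find_function_calls.ql"):
        print(shlex.join(argv))
```

Each list could be passed to `subprocess.run(argv, check=True)` on a machine where the CodeQL CLI is installed; keeping construction separate from execution makes the commands easy to inspect first.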
We'll analyze a simple C program, saved as `main.c`, to find function calls.

    #include <stdio.h>

    void greet() {
        printf("Hello, World!\n");
    }

    int main() {
        greet();
        return 0;
    }

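Run the following command to extract a CodeQL database:

    codeql database create mydb --language=cpp --source-root .

- This extracts ASTs, control flow graphs (CFGs), and data flow information into `mydb/`. Because C/C++ is a compiled language, CodeQL has to observe a build; if automatic build detection fails, pass the build command explicitly with `--command` (for example, `--command="gcc -c main.c"`).
- Intermediate Output: The extracted database directory. The exact layout varies by CLI version, but it typically contains:
  - `db-cpp/` → Tables of facts about function calls, variables, and AST nodes.
  - `src.zip` → Archived copy of the source code that was analyzed.
  - `log/` → Extraction logs.
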
Create a file `find_function_calls.ql` to find function calls in the code.

    import cpp

    from FunctionCall call
    select call, call.getTarget()

- `import cpp` → Loads the CodeQL standard library for C/C++.
- `FunctionCall call` → Declares `call` as a function call.
- `call.getTarget()` → Retrieves the function being called.
- `select call, call.getTarget()` → Outputs each function call and its target.
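For readers more familiar with imperative code, the query can be read roughly like the Python sketch below. The classes `Function` and `FunctionCall` here are hypothetical stand-ins written for this illustration, not the CodeQL library classes, and the list of calls is filled in by hand from the sample program. The sketch assumes that the `printf` call inside `greet()` is also a function call, so it appears alongside the call to `greet()`.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Function:
    name: str


@dataclass(frozen=True)
class FunctionCall:
    enclosing: Function  # function whose body contains the call
    target: Function     # function being called

    def get_target(self) -> Function:
        """Stand-in for QL's call.getTarget()."""
        return self.target


GREET = Function("greet")
MAIN = Function("main")
PRINTF = Function("printf")

# Every call expression in the sample program, listed by hand.
ALL_CALLS = [
    FunctionCall(enclosing=GREET, target=PRINTF),
    FunctionCall(enclosing=MAIN, target=GREET),
]


def select_calls(calls: Iterable[FunctionCall]) -> List[Tuple[str, str]]:
    """Rough analogue of: from FunctionCall call select call, call.getTarget()."""
    return [(call.enclosing.name, call.get_target().name) for call in calls]


if __name__ == "__main__":
    for caller, callee in select_calls(ALL_CALLS):
        print(f"{caller} calls {callee}")
```

The `from` clause corresponds to iterating over all values of a type, and `select` corresponds to the tuple built for each one. Unlike this loop, QL is declarative, so the evaluation order is up to the engine.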
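Run the query on the extracted database, then decode the results to CSV:

    codeql query run --database=mydb --output=result.bqrs find_function_calls.ql
    codeql bqrs decode --format=csv --output=result.csv result.bqrs

- Intermediate Output: `codeql query run` writes results in the binary BQRS format, so `result.bqrs` holds the raw results and `result.csv` contains the decoded query results.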
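After running the query, let's examine the intermediate results.

Function Call Location | Target Function |
---|---|
main.c:10 | greet() |

- This means that `greet()` is called inside `main()` at line 10. The line number is illustrative; the actual value depends on how `main.c` is laid out. Because the query matches every function call, the real output would also include a row for the `printf` call inside `greet()`.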
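Once results are available as CSV, they can be post-processed with ordinary tooling. The following Python sketch counts how often each target function is called. The sample CSV text, including its header names and the "call to ..." wording in the first column, is a hypothetical stand-in for a decoded result file, and the helper name `targets_by_count` is likewise an assumption.

```python
import csv
import io
from collections import Counter

# Hypothetical decoded results: first column describes the call,
# second column names the target function.
SAMPLE_CSV = '''"call","col1"
"call to greet","greet"
"call to printf","printf"
'''


def targets_by_count(csv_text: str) -> Counter:
    """Count how many result rows mention each target function."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header row
    counts: Counter = Counter()
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or malformed rows
        counts[row[1]] += 1
    return counts


if __name__ == "__main__":
    for target, count in sorted(targets_by_count(SAMPLE_CSV).items()):
        print(f"{target}: {count}")
```

To run it on a real file, read the file's text and pass it to `targets_by_count`; the column positions would need adjusting if the query selects different values.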
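Results of queries run with `codeql database run-queries` can be turned into a report with `codeql database interpret-results`. The command takes the database as a positional argument and requires an output path. It only interprets queries that carry alert metadata such as `@kind problem`, so it does not dump the raw relations:

    codeql database interpret-results --format=csv --output=results.csv mydb

Conceptually, the relations stored in the database look like this (illustrative values):

Function Name | File | Line |
---|---|---|
greet | main.c | 4 |
main | main.c | 8 |

- CodeQL internally models code as a database with tables for functions, calls, control flow edges, etc.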
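One benefit of storing calls as a table is that recursive questions become simple queries. The Python sketch below uses SQLite's `WITH RECURSIVE` to find every function reachable from a starting function through call edges. The `call_edges` table and its rows are hypothetical and written by hand for the sample program. QL supports recursion natively, so SQL is used here only as a familiar stand-in.

```python
import sqlite3
from typing import List

# Hypothetical call-edge facts for the sample program: (caller, callee).
CALL_EDGES = [
    ("main", "greet"),
    ("greet", "printf"),
]


def build_database() -> sqlite3.Connection:
    """Load the call edges into an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE call_edges (caller TEXT, callee TEXT)")
    conn.executemany("INSERT INTO call_edges VALUES (?, ?)", CALL_EDGES)
    return conn


def reachable_from(conn: sqlite3.Connection, root: str) -> List[str]:
    """Return all functions called directly or indirectly from `root`."""
    rows = conn.execute(
        """
        WITH RECURSIVE reach(name) AS (
            SELECT callee FROM call_edges WHERE caller = ?
            UNION
            SELECT e.callee
            FROM call_edges AS e
            JOIN reach AS r ON e.caller = r.name
        )
        SELECT name FROM reach ORDER BY name
        """,
        (root,),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = build_database()
    print(reachable_from(conn, "main"))
```

Using `UNION` rather than `UNION ALL` removes duplicates, which also keeps the recursion from looping forever if the call graph contains cycles.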
Step | Command | Intermediate Output |
---|---|---|
1. Extract CodeQL Database | `codeql database create` | CodeQL database directory with AST, CFG, and data flow facts |
2. Write Query | `find_function_calls.ql` | Query file |
3. Run Query | `codeql query run` | Results in BQRS format, decodable to CSV |
4. Inspect Results | `codeql database interpret-results` | Formatted query results |