03. Processing Source Code Into A Graph - VitalyRomanov/method-embedding GitHub Wiki

Source Code Data Format

Source code should be structured in the following way

source_code_data    
│
└───package1
│   │───source_file_1.py
│   │───source_file_2.py
│   └───subfolder_if_needed
│       │───source_file_3.py
│       └───source_file_4.py
│   
└───package2
    │───source_file_1.py
    └───source_file_2.py

An example of source code data can be found in this repository method-embedding\res\python_testdata\example_code. A package should contain self-sufficient code with its dependencies. Unmet dependencies will be labeled as non-indexed symbol.

Indexing with Docker

To create dataset need to first perform indexing with Sourcetrail. The easiest way to do this is with a docker container

docker run -it -v "/full/path/to/data/folder":/dataset mortiv16/sourcetrail_indexer

Indexing manually with Sourcetrail (alternative option)

This option works onl on Linux. Download a release from Github repo (latest tested version is 2020.1.117). Add Sourcetrail location to PATH

echo 'export PATH=/path/to/Sourcetrail_2020_1_117:$PATH' >> ~/.bashrc

SCT=/path/to/SourceCodeTool_repository
SOURCE_CODE=/path/to/source/code
DATASET_OUTPUT=/path/to/dataset/output
cd $SOURCE_CODE
echo "example\nexample2" > list_of_packages.txt
bash -i $SCT/scripts/data_collection/process_folders.sh < list_of_packages.txt
bash -i $SCT/scripts/data_extraction/process_sourcetrail.sh $SOURCE_CODE

Creating graph

Need to provide a sentencepiece model for subtokenization. Model trained on CodeSearchNet can be downloaded here.

SCT=/path/to/SourceCodeTool_repository
SOURCE_CODE=/path/to/source/code/indexed/with/sourcetrail
DATASET_OUTPUT=/path/to/dataset/output
python $SCT/SourceCodeTools/code/data/sourcetrail/DatasetCreator2.py --bpe_tokenizer sentencepiece_bpe.model --track_offsets --do_extraction $SOURCE_CODE $DATASET_OUTPUT

Creating graph without Sourcetrail index

There is an option to create a graph without creating a Soutrcetrail index. You need to create a DataFrame that stores source code. An example of needed format can be created with

python SourceCodeTools/code/data/ast_graph/build_ast_graph.py path_to_input path_to_output --create_test_data

The output is a DataFrame pickle (written with pandas.to_pickle). path_to_input will be ignored when --create_test_data is set. The output DataFrame has header

id,filecontent,package

Package and id are used to uniquely identify source code. No id can be repeated inside one package. The column package must be present.

Convert source code into a graph by running

python SourceCodeTools/code/data/ast_graph/build_ast_graph.py path_to_input path_to_output --bpe_tokenizer path/to/tokenizer/sentencepiece.model

When bpe_tokenizer is provided, names are subtokenized. All names and subwords are shared between different snippets of code. Files common_nodes.bz2, common_edges.bz2, common_filecontent.bz2, and common_offsets.bz2 will be created in output directory.

Given a small example with two snippets of code

test_code = pd.DataFrame.from_records([
    {"id": 1, "filecontent": "import numpy\nnumpy.array([1,2,3])", "package": "any_name_1"},
    {"id": 2, "filecontent": "from numpy import *\n", "package": "can use the same name here any_name_1"},
])

the output graph is Note that the common subword numpy is shared between two snippets of code.