04. Graph Format Description - VitalyRomanov/method-embedding GitHub Wiki
The graph dataset has the following structure:
graph_dataset
│
├───no_ast
│   ├───common_call_seq.bz2
│   ├───common_edges.bz2
│   ├───common_function_variable_pairs.bz2
│   ├───common_nodes.bz2
│   ├───common_source_graph_bodies.bz2
│   └───node_names.bz2
│
└───with_ast
    ├───common_call_seq.bz2
    ├───common_edges.bz2
    ├───common_function_variable_pairs.bz2
    ├───common_nodes.bz2
    ├───common_source_graph_bodies.bz2
    └───node_names.bz2
`no_ast` contains a graph built from global relationships only. `with_ast` contains a graph with AST nodes and edges. The two main files for building the graph are `common_nodes.bz2` and `common_edges.bz2`.
`common_nodes.bz2` is a pandas DataFrame pickle containing node data. The columns are
id,type,serialized_name,mentioned_in,string
The first column is the global node id, the second is the node type, and the third is the node name. The column `mentioned_in` stores the ids of the functions in which AST nodes appear. The column `string` holds a string representation for some AST nodes.
`common_edges.bz2` is a pandas DataFrame pickle containing edge data. The columns are
id,type,source_node_id,target_node_id,file_id,mentioned_in
As with nodes, the column `mentioned_in` stores the id of the function in which a given edge appears. Grouping edges that share the same `mentioned_in` id yields a separate subgraph for each function.
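The grouping described above can be sketched with pandas as follows. The column names come from the list above; the rows, the edge type names, and the variable names are made up for illustration, and a real run would load `common_edges.bz2` instead of building a toy table.

```python
import pandas as pd

# Toy stand-in for the edge table. Column names follow the documented
# layout; the values (including the edge type names) are hypothetical.
edges = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "type": ["next", "value", "next", "func"],
    "source_node_id": [10, 11, 20, 21],
    "target_node_id": [11, 12, 21, 22],
    "file_id": [1, 1, 1, 1],
    "mentioned_in": [100, 100, 200, 200],
})

# One sub-table of edges per function id found in `mentioned_in`.
subgraphs = {
    function_id: group[["source_node_id", "target_node_id", "type"]]
    for function_id, group in edges.groupby("mentioned_in")
}

for function_id, subgraph in subgraphs.items():
    print(function_id, len(subgraph))
```

Note that `groupby` skips rows whose `mentioned_in` is missing by default, so edges that are not attached to any function would simply not appear in `subgraphs`.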
`common_call_seq.bz2` is a pandas DataFrame pickle containing edges for training the Call Sequence Prediction Objective. These edges show which function is called after a given function.
`common_function_variable_pairs.bz2` is a pandas DataFrame pickle containing data for the Variable Use Objective. Each function is connected to the set of variable names used inside it.
`common_source_graph_bodies.bz2` is a pandas DataFrame pickle containing the ids of functions, their source code, and the list of spans that represent the AST nodes appearing in each function body.
`node_names.bz2` is a pandas DataFrame pickle containing data for the Name Prediction Objective. Functions and variables are connected to their names.
The files are stored as pickled pandas tables (read them with `pandas.read_pickle`) and are probably not portable between platforms. To view the content, convert a table into CSV format:
python SourceCodeTools/code/data/sourcetrail/pandas_format_converter.py common_nodes.bz2 csv
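The same round trip can be sketched directly with pandas. This is an illustration only and does not reproduce what `pandas_format_converter.py` does internally; the toy table, its values, and the temporary file paths are assumptions.

```python
import os
import tempfile

import pandas as pd

# Toy node table with the documented columns; the values are made up.
nodes = pd.DataFrame({
    "id": [0, 1],
    "type": ["module", "function"],
    "serialized_name": ["example", "example.main"],
    "mentioned_in": [None, None],
    "string": [None, None],
})

with tempfile.TemporaryDirectory() as tmp:
    pickle_path = os.path.join(tmp, "common_nodes.bz2")
    csv_path = os.path.join(tmp, "common_nodes.csv")

    # The ".bz2" suffix lets pandas infer bz2 compression on write and read.
    nodes.to_pickle(pickle_path)
    loaded = pd.read_pickle(pickle_path)

    # Plain CSV is easier to inspect and to move between platforms.
    loaded.to_csv(csv_path, index=False)
    with open(csv_path) as f:
        header = f.readline().strip()

print(header)
```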
The Source Code Graph is based on the AST, with several modifications. Some AST nodes have simplified representations. For example, we do not represent strings by their content, but simply as a string constant. We argue that node types play a more important role in program understanding, and this simplification lets us minimize the number of unique nodes. Not all edges present in the AST appear in our Source Code Graph; some are omitted. The table below lists the available AST edges and marks the ones that are used.
| NodeType | Available Edges | Used | Comment |
|---|---|---|---|
| Add | | | |
| And | | | |
| AnnAssign | annotation | x | |
| | simple | | Flag set to 1 when the target is a plain name rather than a complex expression. Do not need this. |
| | target | x | |
| | value | x | |
| Assert | msg | x | |
| | test | x | |
| Assign | targets | x | |
| | type_comment | | Not clear how this is used. |
| | value | x | |
| AsyncFor | body | x | |
| | iter | x | |
| | orelse | x | |
| | target | x | |
| | type_comment | | Not clear how this is used. |
| AsyncFunctionDef | args | x | |
| | body | x | |
| | decorator_list | x | |
| | name | x | |
| | returns | x | |
| | type_comment | | Not clear how this is used. |
| AsyncWith | body | x | |
| | items | x | |
| | type_comment | | Not clear how this is used. |
| Attribute | attr | x | |
| | ctx | | Whether variable in store or load context. Does not improve program understanding. |
| | value | x | |
| AugAssign | op | x | |
| | target | x | |
| | value | x | |
| Await | value | x | |
| BinOp | left | x | |
| | op | x | |
| | right | x | |
| BitAnd | | | |
| BitOr | | | |
| Break | | | |
| Bytes | s | | Converted into a node directly |
| Call | args | x | |
| | func | x | |
| | keywords | x | |
| ClassDef | bases | | Inheritance handled by sourcetrail |
| | body | x | |
| | decorator_list | | |
| | keywords | | Keywords for metaclasses. Feature is rarely used. |
| | name | x | |
| Compare | comparators | x | |
| | left | x | |
| | ops | x | |
| Constant | kind | | Converted into a node directly |
| | value | | |
| Continue | | | |
| Delete | targets | x | |
| Dict | keys | x | |
| | values | x | |
| DictComp | generators | x | |
| | key | x | |
| | value | x | |
| Div | | | |
| Ellipsis | | | |
| Eq | | | |
| ExceptHandler | body | x | |
| | name | | Specific name does not seem to improve understanding of the program. Exception type has more meaning. |
| | type | x | |
| Expr | value | x | |
| ExtSlice | dims | x | |
| FloorDiv | | | |
| For | body | x | |
| | iter | x | |
| | orelse | x | |
| | target | x | |
| | type_comment | | |
| FormattedValue | conversion | | Used internally by interpreter. |
| | format_spec | | Format specification does not improve program understanding. |
| | value | x | |
| FunctionDef | args | x | |
| | body | x | |
| | decorator_list | x | |
| | name | x | |
| | returns | x | |
| | type_comment | | Not clear how this is used. |
| GeneratorExp | elt | x | |
| | generators | x | |
| Global | names | x | |
| Gt | | | |
| GtE | | | |
| If | body | x | |
| | orelse | x | |
| | test | x | |
| IfExp | body | x | |
| | orelse | x | |
| | test | x | |
| Import | names | x | |
| ImportFrom | level | | Not clear how this is used. |
| | module | x | |
| | names | x | |
| In | | | |
| Index | value | x | |
| Invert | | | |
| Is | | | |
| IsNot | | | |
| JoinedStr | values | | Converted into a node directly |
| LShift | | | |
| Lambda | args | | Increases graph complexity, but does not provide improvement for the program understanding. |
| | body | x | |
| List | ctx | | Whether variable in store or load context. Does not improve program understanding. |
| | elts | x | |
| ListComp | elt | x | |
| | generators | x | |
| Lt | | | |
| LtE | | | |
| MatMult | | | |
| Mod | | | |
| Module | body | x | |
| | type_ignores | | |
| Mult | | | |
| Name | ctx | | Converted into a node directly |
| | id | | |
| NameConstant | kind | | Converted into a node directly |
| | value | | |
| Nonlocal | names | x | |
| Not | | | |
| NotEq | | | |
| NotIn | | | |
| Num | n | | Converted into a node directly |
| Or | | | |
| Pass | | | |
| Pow | | | |
| RShift | | | |
| Raise | cause | x | |
| | exc | x | |
| Return | value | x | |
| Set | elts | x | |
| SetComp | elt | x | |
| | generators | x | |
| Slice | lower | x | |
| | step | x | |
| | upper | x | |
| Starred | ctx | | Whether variable in store or load context. Does not improve program understanding. |
| | value | x | |
| Str | s | | Converted into a node directly |
| Sub | | | |
| Subscript | ctx | | Whether variable in store or load context. Does not improve program understanding. |
| | slice | x | |
| | value | x | |
| Try | body | x | |
| | finalbody | x | |
| | handlers | x | |
| | orelse | x | |
| Tuple | ctx | | Whether variable in store or load context. Does not improve program understanding. |
| | elts | x | |
| UAdd | | | |
| USub | | | |
| UnaryOp | op | x | |
| | operand | x | |
| While | body | x | |
| | orelse | x | |
| | test | x | |
| With | body | x | |
| | items | x | |
| | type_comment | | |
| Yield | value | x | |
| YieldFrom | value | x | |
| alias | asname | x | |
| | name | x | |
| arg | annotation | x | |
| | arg | x | |
| | type_comment | | Not clear how this is used. |
| arguments | args | x | |
| | defaults | | Do not include to avoid type annotation hints. |
| | kw_defaults | | Do not include to avoid type annotation hints. |
| | kwarg | x | |
| | kwonlyargs | x | |
| | posonlyargs | x | |
| | vararg | x | |
| comprehension | ifs | x | |
| | is_async | | Does not improve program understanding. |
| | iter | x | |
| | target | x | |
| keyword | arg | x | |
| | value | x | |
| withitem | context_expr | x | |
| | optional_vars | x | |
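
As a rough illustration of how AST edges might be collected while skipping unused fields and hiding literal content, here is a minimal sketch built on Python's standard `ast` module. It is not the project's extraction code: the `Constant_<type>` label format, the `(child, parent, field)` edge direction, and the small `SKIPPED_FIELDS` set (only a subset of the unused edges in the table) are assumptions made for the example.

```python
import ast

# Fields skipped in this sketch; only a small subset of the unused edges.
SKIPPED_FIELDS = {"ctx", "type_comment", "simple"}


def label(node: ast.AST) -> str:
    """Name a node by its type, hiding the content of literals."""
    if isinstance(node, ast.Constant):
        # Hypothetical label format: keep the kind of literal, not its value.
        return "Constant_" + type(node.value).__name__
    return type(node).__name__


def ast_edges(source: str):
    """Return (child_label, parent_label, field_name) triples."""
    edges = []
    for parent in ast.walk(ast.parse(source)):
        for field, value in ast.iter_fields(parent):
            if field in SKIPPED_FIELDS:
                continue
            children = value if isinstance(value, list) else [value]
            for child in children:
                # Plain values such as identifiers or None are not nodes.
                if isinstance(child, ast.AST):
                    edges.append((label(child), label(parent), field))
    return edges


edges = ast_edges('a = "hello"\n')
for edge in edges:
    print(edge)
```

Because every string literal maps to the same label, two assignments with different string values produce identical node labels, which is the effect the simplification aims for.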
Moreover, we introduce additional edges into the AST to increase the connection density of the graph. The additional edges are listed in the table below.
| Node Type | Edge Type | Comment |
|---|---|---|
| Module | defined_in_module | Connects the module node with all expressions and definitions in this module |
| FunctionDef | defined_in_function | Connects the function definition node with all expressions and definitions inside the function's body |
| ClassDef | defined_in_class | Connects the class definition node with all expressions and definitions in this class |
| With | executed_inside_with | Connects the with node with expressions and definitions in the body of the with statement |
| If | executed_if_true | Connects the if node with expressions and definitions in the body of the true branch of the if statement |
| | executed_if_false | Connects the if node with expressions and definitions in the body of the false branch of the if statement |
| For | executed_in_for | Connects the for node with expressions and definitions in the body of the for loop |
| | executed_in_for_orelse | Connects the for node with expressions and definitions in the orelse body of the for loop |
| AsyncFor | executed_in_for | Connects the for node with expressions and definitions in the body of the for loop |
| | executed_in_for_orelse | Connects the for node with expressions and definitions in the orelse body of the for loop |
| While | executed_in_while | Connects the while node with expressions and definitions in the body of the while loop |
| | executed_while_true | Connects the while condition node with expressions and definitions in the body of the while loop |
| Try | executed_in_try | Connects the try node with expressions and definitions in the body of the try statement |
| | executed_in_try_final | Connects the try node with expressions and definitions in the body of the finally clause |
| | executed_in_try_else | Connects the try node with expressions and definitions in the body of the orelse clause |
| | executed_in_try_except | Connects the try node with expressions and definitions in the body of the except clause |
| | executed_with_try_handler | Connects the Exception Handler type node with expressions and definitions in the body of the except clause. Needed to resolve the Exception Handler globally. |
| Expression | next | Connects an expression in a body to the next expression |
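
Two of the additional edge types, `defined_in_function` and `next`, can be sketched with the standard `ast` module as shown below. This is an illustrative approximation, not the project's implementation: the helper names, the `Type@line` node labels, the edge direction, and the restriction to top-level statements of the function body are all assumptions.

```python
import ast


def describe(node: ast.AST) -> str:
    """Hypothetical node label: AST type plus source line number."""
    return f"{type(node).__name__}@{node.lineno}"


def extra_edges(source: str):
    """Collect (source, target, edge_type) triples for function bodies."""
    edges = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.FunctionDef):
            continue
        body = node.body
        for stmt in body:
            # Tie each top-level statement of the body to its function.
            edges.append((describe(stmt), describe(node), "defined_in_function"))
        for current, following in zip(body, body[1:]):
            # Chain consecutive statements in execution order.
            edges.append((describe(current), describe(following), "next"))
    return edges


source = (
    "def main():\n"
    "    a = Number(4)\n"
    "    b = Number(5)\n"
    "    print(a + b)\n"
)
edges = extra_edges(source)
for edge in edges:
    print(edge)
```

The other edge types in the table follow the same pattern: pick the enclosing statement (`If`, `For`, `Try`, and so on) and connect it to the statements in the relevant body list.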
Blue edges represent additional edges that do not appear in the AST.
with open(a) as b:
    do_stuff(b)

a: int = 5

def m():
    pass

[i for i in list if i != 5]

for i in list:
    k = fn(i)
    if k == 4:
        fn2(k)
        break
else:
    fn2(0)

if d is True:
    a = b
else:
    a = c

a = 5 if b is True else 0

try:
    a = b
except Exception as e:
    a = c
else:
    a = d
finally:
    print(a)

def f(a: int = 5):
    return a
def __init__(self, argument: int):
    """
    Initialize. Инициализация
    :param argument:
    """
    self.field = argument

def method2(self) -> str:
    """
    Simple operations.
    Простые операции.
    :return:
    """
    variable1: int = self.field
    variable2: str = str(variable1)
    return variable2

def main():
    a = Number(4)
    b = Number(5)
    print(a+b)

class ExampleClass:
    def __init__(self, argument: int):
        """
        Initialize. Инициализация
        :param argument:
        """
        self.field = argument

    def method1(self) -> str:
        """
        Call another method. Вызов другого метода.
        :return:
        """
        return self.method2()

    def method2(self) -> str:
        """
        Simple operations.
        Простые операции.
        :return:
        """
        variable1: int = self.field
        variable2: str = str(variable1)
        return variable2

from Module import Number

def main():
    a = Number(4)
    b = Number(5)
    print(a+b)

main()