04. Graph Format Description - VitalyRomanov/method-embedding GitHub Wiki

Files

The graph dataset has the following structure

graph_dataset    
│
└───no_ast
│   │───common_call_seq.bz2
│   │───common_edges.bz2
│   │───common_function_variable_pairs.bz2
│   │───common_nodes.bz2
│   │───common_source_graph_bodies.bz2
│   └───node_names.bz2
│   
└───with_ast
    │───common_call_seq.bz2
    │───common_edges.bz2
    │───common_function_variable_pairs.bz2
    │───common_nodes.bz2
    │───common_source_graph_bodies.bz2
    └───node_names.bz2

no_ast contains a graph built from global relationships only. with_ast contains a graph with AST nodes and edges. The two main files for building the graph are common_nodes.bz2 and common_edges.bz2.

common_nodes.bz2

Pandas DataFrame pickle containing node data. The columns are

id,type,serialized_name,mentioned_in,string 

The first column is the global node id, the second is the node type, and the third is the node name. The column mentioned_in stores the ids of the functions in which AST nodes appear. The column string holds the string representation of some AST nodes.

common_edges.bz2

Pandas DataFrame pickle containing edge data. The columns are

id,type,source_node_id,target_node_id,file_id,mentioned_in 

As with nodes, the column mentioned_in stores the id of the function in which a given edge appears. Grouping edges by their mentioned_in id yields the subgraphs of individual functions.
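The grouping by mentioned_in can be sketched with pandas. The small table below is a made-up stand-in for common_edges.bz2: only the column names come from this page, while the values, the edge type names used as sample data, and the helper name function_subgraphs are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the real edge table; all values are invented.
edges = pd.DataFrame({
    "id": [0, 1, 2],
    "type": ["args", "defined_in_function", "defined_in_function"],
    "source_node_id": [2, 3, 4],
    "target_node_id": [3, 0, 1],
    "file_id": [0, 0, 0],
    "mentioned_in": [0, 0, 1],
})


def function_subgraphs(edge_table: pd.DataFrame) -> dict:
    """Map each function id to the edges that were mentioned in it."""
    # Rows without a mentioned_in value (if any) belong to no function.
    with_function = edge_table.dropna(subset=["mentioned_in"])
    return {
        function_id: group.reset_index(drop=True)
        for function_id, group in with_function.groupby("mentioned_in")
    }


subgraphs = function_subgraphs(edges)
for function_id, subgraph in subgraphs.items():
    print(function_id, len(subgraph))
```

With the real dataset, the edges table would instead come from pandas.read_pickle("common_edges.bz2").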

common_call_seq.bz2

Pandas DataFrame pickle containing edges for training Call Sequence Prediction Objective. These edges show which function is called after a given function.

common_function_variable_pairs.bz2

Pandas DataFrame pickle containing data for Variable Use Objective. Each function is connected to a set of variable names used inside this function.

common_source_graph_bodies.bz2

Pandas DataFrame pickle containing function ids, their source code, and the list of spans that represent the AST nodes appearing in the body of each function.

node_names.bz2

Pandas DataFrame pickle containing data used for Name Prediction Objective. Functions and variables are connected to their names.

Converting data to CSV

The files are stored as pickled pandas tables (read them with pandas.read_pickle) and are probably not portable between platforms. The content can be viewed by converting a table to CSV format:

python SourceCodeTools/code/data/sourcetrail/pandas_format_converter.py common_nodes.bz2 csv
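When the converter script is not at hand, roughly the same result can be obtained with pandas alone. The sketch below is not the repository's converter. To stay self-contained it first writes a small made-up table to a temporary common_nodes.bz2, then reads it back with pandas.read_pickle and writes a CSV file next to it. The helper name pickle_to_csv is hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd


def pickle_to_csv(pickle_path: Path) -> Path:
    """Read a pickled DataFrame and write it next to the source as CSV."""
    # pandas infers bz2 compression from the .bz2 extension by default.
    table = pd.read_pickle(pickle_path)
    csv_path = pickle_path.with_suffix(".csv")
    table.to_csv(csv_path, index=False)
    return csv_path


with tempfile.TemporaryDirectory() as tmp:
    # Made-up stand-in for a real dataset file.
    source = Path(tmp) / "common_nodes.bz2"
    pd.DataFrame({"id": [0, 1], "type": ["function", "Name"]}).to_pickle(source)

    csv_path = pickle_to_csv(source)
    csv_text = csv_path.read_text()
    print(csv_text)
```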

AST Graph

The Source Code Graph is based on the AST, with several modifications. Some AST nodes have simplified representations. For example, strings are not represented by their content but simply as a string constant. We argue that node types play a more important role for program understanding, and this simplification minimizes the number of unique nodes. Not all edges present in the AST appear in the Source Code Graph; some are omitted. The list of all edges taken from the AST can be found in the table below.
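To illustrate the simplification of constants, the sketch below names an AST node by its type and, for constants, by the type of the constant rather than its content. It uses only the standard ast module and is not the project's implementation. The function name node_label and the label format are assumptions. It requires Python 3.8 or newer, where literals parse to ast.Constant.

```python
import ast


def node_label(node: ast.AST) -> str:
    """Name a node by its type; constants keep only the type of their value."""
    if isinstance(node, ast.Constant):
        return f"Constant_{type(node.value).__name__}"
    return type(node).__name__


tree = ast.parse('greeting = "hello"\nfarewell = "bye"\ncount = 3')

# Both string literals collapse into the same label.
labels = sorted({node_label(n) for n in ast.walk(tree) if isinstance(n, ast.expr)})
print(labels)
```

Two different string literals end up with one shared label, which is how this kind of simplification keeps the number of unique nodes small.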

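| NodeType | Available Edges | Used | Comment |
|---|---|---|---|
| Add | | | |
| And | | | |
| AnnAssign | annotation | x | |
| | simple | | Flag set to 1 when the target is a plain name rather than a complex expression. Not needed. |
| | target | x | |
| | value | x | |
| Assert | msg | x | |
| | test | x | |
| Assign | targets | x | |
| | type_comment | | Not clear how this is used. |
| | value | x | |
| AsyncFor | body | x | |
| | iter | x | |
| | orelse | x | |
| | target | x | |
| | type_comment | | Not clear how this is used. |
| AsyncFunctionDef | args | x | |
| | body | x | |
| | decorator_list | x | |
| | name | x | |
| | returns | x | |
| | type_comment | | Not clear how this is used. |
| AsyncWith | body | x | |
| | items | x | |
| | type_comment | | Not clear how this is used. |
| Attribute | attr | x | |
| | ctx | | Whether the variable is in store or load context. Does not improve program understanding. |
| | value | x | |
| AugAssign | op | x | |
| | target | x | |
| | value | x | |
| Await | value | x | |
| BinOp | left | x | |
| | op | x | |
| | right | x | |
| BitAnd | | | |
| BitOr | | | |
| Break | | | |
| Bytes | s | | Converted into a node directly |
| Call | args | x | |
| | func | x | |
| | keywords | x | |
| ClassDef | bases | | Inheritance handled by sourcetrail |
| | body | x | |
| | decorator_list | | |
| | keywords | | Keywords for metaclasses. Feature is rarely used. |
| | name | x | |
| Compare | comparators | x | |
| | left | x | |
| | ops | x | |
| Constant | kind | | Converted into a node directly |
| | value | | |
| Continue | | | |
| Delete | targets | x | |
| Dict | keys | x | |
| | values | x | |
| DictComp | generators | x | |
| | key | x | |
| | value | x | |
| Div | | | |
| Ellipsis | | | |
| Eq | | | |
| ExceptHandler | body | x | |
| | name | | The specific name does not seem to improve understanding of the program. The exception type has more meaning. |
| | type | x | |
| Expr | value | x | |
| ExtSlice | dims | x | |
| FloorDiv | | | |
| For | body | x | |
| | iter | x | |
| | orelse | x | |
| | target | x | |
| | type_comment | | |
| FormattedValue | conversion | | Used internally by the interpreter. |
| | format_spec | | Format specification does not improve program understanding. |
| | value | x | |
| FunctionDef | args | x | |
| | body | x | |
| | decorator_list | x | |
| | name | x | |
| | returns | x | |
| | type_comment | | Not clear how this is used. |
| GeneratorExp | elt | x | |
| | generators | x | |
| Global | names | x | |
| Gt | | | |
| GtE | | | |
| If | body | x | |
| | orelse | x | |
| | test | x | |
| IfExp | body | x | |
| | orelse | x | |
| | test | x | |
| Import | names | x | |
| ImportFrom | level | | Not clear how this is used. |
| | module | x | |
| | names | x | |
| In | | | |
| Index | value | x | |
| Invert | | | |
| Is | | | |
| IsNot | | | |
| JoinedStr | values | | Converted into a node directly |
| LShift | | | |
| Lambda | args | | Increases graph complexity but does not improve program understanding. |
| | body | x | |
| List | ctx | | Whether the variable is in store or load context. Does not improve program understanding. |
| | elts | x | |
| ListComp | elt | x | |
| | generators | x | |
| Lt | | | |
| LtE | | | |
| MatMult | | | |
| Mod | | | |
| Module | body | x | |
| | type_ignores | | |
| Mult | | | |
| Name | ctx | | Converted into a node directly |
| | id | | |
| NameConstant | kind | | Converted into a node directly |
| | value | | |
| Nonlocal | names | x | |
| Not | | | |
| NotEq | | | |
| NotIn | | | |
| Num | n | | Converted into a node directly |
| Or | | | |
| Pass | | | |
| Pow | | | |
| RShift | | | |
| Raise | cause | x | |
| | exc | x | |
| Return | value | x | |
| Set | elts | x | |
| SetComp | elt | x | |
| | generators | x | |
| Slice | lower | x | |
| | step | x | |
| | upper | x | |
| Starred | ctx | | Whether the variable is in store or load context. Does not improve program understanding. |
| | value | x | |
| Str | s | | Converted into a node directly |
| Sub | | | |
| Subscript | ctx | | Whether the variable is in store or load context. Does not improve program understanding. |
| | slice | x | |
| | value | x | |
| Try | body | x | |
| | finalbody | x | |
| | handlers | x | |
| | orelse | x | |
| Tuple | ctx | | Whether the variable is in store or load context. Does not improve program understanding. |
| | elts | x | |
| UAdd | | | |
| USub | | | |
| UnaryOp | op | x | |
| | operand | x | |
| While | body | x | |
| | orelse | x | |
| | test | x | |
| With | body | x | |
| | items | x | |
| | type_comment | | |
| Yield | value | x | |
| YieldFrom | value | x | |
| alias | asname | x | |
| | name | x | |
| arg | annotation | x | |
| | arg | x | |
| | type_comment | | Not clear how this is used. |
| arguments | args | x | |
| | defaults | | Do not include to avoid type annotation hints. |
| | kw_defaults | | Do not include to avoid type annotation hints. |
| | kwarg | x | |
| | kwonlyargs | x | |
| | posonlyargs | x | |
| | vararg | x | |
| comprehension | ifs | x | |
| | is_async | | Does not improve program understanding. |
| | iter | x | |
| | target | x | |
| keyword | arg | x | |
| | value | x | |
| withitem | context_expr | x | |
| | optional_vars | x | |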
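A rough way to see how a list of used and omitted fields translates into graph edges is to walk the tree with the standard ast module and emit one (parent type, field, child type) triple per field that is kept. The set OMITTED_FIELDS below is only a small illustrative subset, and ast_edges is a hypothetical helper, not the project's extractor.

```python
import ast

# Illustrative subset of fields that are not turned into edges.
OMITTED_FIELDS = {"ctx", "type_comment", "simple"}


def ast_edges(source: str) -> list:
    """Collect (parent type, field name, child type) triples from the AST."""
    edges = []
    for parent in ast.walk(ast.parse(source)):
        for field, value in ast.iter_fields(parent):
            if field in OMITTED_FIELDS:
                continue
            # A field holds either a single child or a list of children.
            children = value if isinstance(value, list) else [value]
            for child in children:
                if isinstance(child, ast.AST):
                    edges.append((type(parent).__name__, field, type(child).__name__))
    return edges


edges = ast_edges("a: int = 5")
for edge in edges:
    print(edge)
```

Fields that hold plain values, such as the id of a Name node, produce no triple here because only child AST nodes are followed.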

Moreover, we introduce additional edges into the AST to increase the connection density of the graph. The list of additional edges can be found in the table below.

| Node Type | Edge Type | Comment |
|---|---|---|
| Module | defined_in_module | Connects the module node with all expressions and definitions in this module |
| FunctionDef | defined_in_function | Connects the function definition node with all expressions and definitions inside the function's body |
| ClassDef | defined_in_class | Connects the class definition node with all expressions and definitions in this class |
| With | executed_inside_with | Connects the with node with expressions and definitions in the body of the with statement |
| If | executed_if_true | Connects the if node with expressions and definitions in the body of the true branch of the if statement |
| | executed_if_false | Connects the if node with expressions and definitions in the body of the false branch of the if statement |
| For | executed_in_for | Connects the for node with expressions and definitions in the body of the for loop |
| | executed_in_for_orelse | Connects the for node with expressions and definitions in the orelse body of the for loop |
| AsyncFor | executed_in_for | Connects the for node with expressions and definitions in the body of the for loop |
| | executed_in_for_orelse | Connects the for node with expressions and definitions in the orelse body of the for loop |
| While | executed_in_while | Connects the while node with expressions and definitions in the body of the while loop |
| | executed_while_true | Connects the while condition node with expressions and definitions in the body of the while loop |
| Try | executed_in_try | Connects the try node with expressions and definitions in the body of the try statement |
| | executed_in_try_final | Connects the try node with expressions and definitions in the body of the finally clause |
| | executed_in_try_else | Connects the try node with expressions and definitions in the body of the orelse clause |
| | executed_in_try_except | Connects the try node with expressions and definitions in the body of the except clause |
| | executed_with_try_handler | Connects the Exception Handler type node with expressions and definitions in the body of the except clause. Needed to resolve the Exception Handler globally. |
| Expression | next | Connects an expression in a body to the next expression |
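Two of these additional edge types, defined_in_function and next, can be sketched with the standard ast module. This is only an illustration under stated assumptions: nodes are labelled by type and line number, only top-level statements of a function body are linked, and the edge direction (statement to function) is a guess. The helper names label and additional_edges are hypothetical.

```python
import ast


def label(node: ast.AST) -> str:
    """Illustrative node label: AST type plus source line number."""
    return f"{type(node).__name__}@{node.lineno}"


def additional_edges(source: str) -> list:
    """Collect (source, edge type, target) triples for function bodies."""
    edges = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.FunctionDef):
            continue
        # Link every top-level statement of the body to its function.
        for statement in node.body:
            edges.append((label(statement), "defined_in_function", label(node)))
        # Link each statement to the one that follows it in the body.
        for current, following in zip(node.body, node.body[1:]):
            edges.append((label(current), "next", label(following)))
    return edges


code = "def main():\n    a = Number(4)\n    b = Number(5)\n    print(a + b)\n"
extra = additional_edges(code)
for edge in extra:
    print(edge)
```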

Examples of graph built from AST

Blue edges represent additional edges that do not appear in AST.

With statement

with open(a) as b:
   do_stuff(b)

AnnAssign expression

a: int = 5

Function definition with control statement

def m():
   pass

Comprehension

[i for i in list if i != 5]

For loop

for i in list:
   k = fn(i)
   if k == 4:
       fn2(k)
       break
else:
   fn2(0)

If statement

if d is True:
   a = b
else:
   a = c

Assignment with if expression

a = 5 if b is True else 0

Exception

try:
   a = b
except Exception as e:
   a = c
else:
   a = d
finally:
   print(a)

Function definition with type annotation

def f(a: int = 5):
   return a

Method-level graph

def __init__(self, argument: int):
        """
        Initialize. Инициализация
        :param argument:
        """
        self.field = argument

def method2(self) -> str:
        """
        Simple operations.
        Простые операции.
        :return:
        """
        variable1: int = self.field
        variable2: str = str(variable1)
        return variable2

def main():
    a = Number(4)
    b = Number(5)
    print(a+b)

Class-level graph

class ExampleClass:
    def __init__(self, argument: int):
        """
        Initialize. Инициализация
        :param argument:
        """
        self.field = argument

    def method1(self) -> str:
        """
        Call another method. Вызов другого метода.
        :return:
        """
        return self.method2()

    def method2(self) -> str:
        """
        Simple operations.
        Простые операции.
        :return:
        """
        variable1: int = self.field
        variable2: str = str(variable1)
        return variable2

Module-level graph

from Module import Number

def main():
    a = Number(4)
    b = Number(5)
    print(a+b)

main()
