05. Loading Data - VitalyRomanov/method-embedding GitHub Wiki

Reading dataset files

Dataset files are stored as pickled pandas DataFrames. Each file can be read with the unpersist function:

from SourceCodeTools.code.data.file_utils import unpersist
nodes = unpersist("common_nodes.bz2")

To load the files needed to build the graph, use load_data:

from SourceCodeTools.code.data.dataset.reader import load_data
nodes, edges = load_data("path/to/dataset/")

Reading source code aligned with graph nodes

Source code aligned with graph nodes can be read with load_aligned_source_code. It returns a generator that iterates over the rows of common_filecontent.bz2. Assume this file contains

   file_id                         filecontent                                package
0        1  import numpy\nnumpy.array([1,2,3])                             any_name_1
1        2               from numpy import *\n                             any_name_1

Result of reading aligned source code with

from SourceCodeTools.code.data.dataset.reader import load_aligned_source_code
data_path = "path/to/dataset/"  # assumed: the dataset directory that holds common_filecontent.bz2
for tokens, node_tags in load_aligned_source_code(data_path):
    for t, tt in zip(tokens, node_tags):
        print(t, tt, sep="\t")
    print()

is

<s>	O
import	O
Ġn	O
umpy	O
Ċ	O
n	B-1
umpy	L-1
.	O
array	O
([	O
1	U-9
,	O
2	U-9
,	O
3	U-9
])	O
</s>	O

<s>	O
from	O
Ġn	O
umpy	O
Ġimport	O
Ġ*	O
Ċ	O
</s>	O
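
In the output above, Ġ and Ċ are how byte-level BPE tokenizers (the kind used by CodeBERT) render a leading space and a newline, and <s> and </s> mark the start and end of the sequence. The sketch below shows how such a token list could be turned back into the original text. The helper name detokenize is hypothetical and not part of SourceCodeTools, and the mapping is deliberately simplified: it covers only spaces and newlines, not the full byte-to-character table.

```python
from typing import List

SPECIAL_TOKENS = ("<s>", "</s>")


def detokenize(tokens: List[str]) -> str:
    """Rebuild text from byte-level BPE tokens (simplified, hypothetical helper)."""
    # Drop the sequence boundary markers, then glue the subwords together.
    pieces = [t for t in tokens if t not in SPECIAL_TOKENS]
    joined = "".join(pieces)
    # "Ġ" stands for a space and "Ċ" for a newline in byte-level BPE output.
    return joined.replace("Ġ", " ").replace("Ċ", "\n")


tokens = ["<s>", "from", "Ġn", "umpy", "Ġimport", "Ġ*", "Ċ", "</s>"]
text = detokenize(tokens)
print(repr(text))
```

Applied to the second example, this recovers the filecontent value of the second row.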

Note that load_aligned_source_code accepts a tokenizer keyword argument. Supported tokenizers are "codebert" (default) and "spacy". With "spacy", the result is

import	O
numpy	O

	O
numpy	U-1
.	O
array	O
(	O
[	O
1	U-9
,	O
2	U-9
,	O
3	U-9
]	O
)	O

from	O
numpy	O
import	O
*	O

	O

Each token is tagged either with O or with a node id tag. All tags follow the BILUO convention (Begin, In, Last, Unit, Out), and the number in a tag is the id of the corresponding node.
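
As an illustration of how these tags can be consumed, the sketch below merges BILUO-tagged tokens into (text, node id) spans. It is a minimal example, not part of SourceCodeTools: the function name group_biluo_spans is hypothetical, and the input is copied from the "codebert" output shown earlier. Subword pieces are joined without separators, which suits the "codebert" tokens; multi-token spans from "spacy" would lose their whitespace.

```python
from typing import List, Optional, Tuple


def group_biluo_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, int]]:
    """Merge BILUO-tagged tokens into (text, node_id) spans (hypothetical helper)."""
    spans: List[Tuple[str, int]] = []
    current: List[str] = []          # tokens of the span being built
    current_id: Optional[int] = None  # node id of the span being built
    for token, tag in zip(tokens, tags):
        if tag == "O":
            # Outside any node: discard an unfinished span, if there is one.
            current, current_id = [], None
            continue
        prefix, raw_id = tag.split("-", 1)
        node_id = int(raw_id)
        if prefix == "U":
            # Unit: a single token covers the whole node mention.
            spans.append((token, node_id))
            current, current_id = [], None
        elif prefix == "B":
            # Begin: start collecting a multi-token span.
            current, current_id = [token], node_id
        elif prefix in ("I", "L") and node_id == current_id:
            current.append(token)
            if prefix == "L":
                # Last: close the span and emit it.
                spans.append(("".join(current), node_id))
                current, current_id = [], None
    return spans


tokens = ["<s>", "import", "Ġn", "umpy", "Ċ", "n", "umpy", ".", "array",
          "([", "1", ",", "2", ",", "3", "])", "</s>"]
tags = ["O", "O", "O", "O", "O", "B-1", "L-1", "O", "O",
        "O", "U-9", "O", "U-9", "O", "U-9", "O", "O"]
spans = group_biluo_spans(tokens, tags)
print(spans)
```

Here the B-1 and L-1 pieces are merged into one mention of node 1, and each U-9 token becomes its own mention of node 9.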