05. Loading Data - VitalyRomanov/method-embedding GitHub Wiki
Reading dataset files
Files are stored as pickled DataFrames. Each file can be read with the unpersist function:
from SourceCodeTools.code.data.file_utils import unpersist
nodes = unpersist("common_nodes.bz2")
To load the files needed for building the graph, use load_data:
from SourceCodeTools.code.data.dataset.reader import load_data
nodes, edges = load_data("path/to/dataset/")
Reading source code aligned with graph nodes
It is possible to read source code aligned with graph nodes using load_aligned_source_code. It returns a generator that iterates over the rows of common_filecontent.bz2. Assume that the content of this file is:
   file_id  filecontent                          package
0  1        import numpy\nnumpy.array([1,2,3])   any_name_1
1  2        from numpy import *\n                any_name_1
Reading the aligned source code with
from SourceCodeTools.code.data.dataset.reader import load_aligned_source_code
for tokens, node_tags in load_aligned_source_code(data_path):
    # Print one token and its tag per line
    for t, tt in zip(tokens, node_tags):
        print(t, tt, sep="\t")
    # Blank line between files
    print()
prints
<s> O
import O
Ġn O
umpy O
Ċ O
n B-1
umpy L-1
. O
array O
([ O
1 U-9
, O
2 U-9
, O
3 U-9
]) O
</s> O
<s> O
from O
Ġn O
umpy O
Ġimport O
Ġ* O
Ċ O
</s> O
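The "codebert" tokens above carry byte-level BPE markers: Ġ stands for a leading space and Ċ for a newline. The following is a minimal sketch of turning such a token list back into source text. It assumes that `<s>` and `</s>` are the only special tokens and that only these two markers occur (true for the ASCII sample above, not for arbitrary input). The helper name `detokenize` is hypothetical and is not part of SourceCodeTools.

```python
# Special tokens that wrap each file in the "codebert" output (assumption:
# no other special tokens appear).
SPECIAL_TOKENS = {"<s>", "</s>"}


def detokenize(tokens):
    """Join byte-level BPE tokens back into text (hypothetical helper).

    Only the space marker "Ġ" and the newline marker "Ċ" are handled.
    """
    pieces = [t for t in tokens if t not in SPECIAL_TOKENS]
    return "".join(pieces).replace("Ġ", " ").replace("Ċ", "\n")


# Tokens of the first file from the listing above
tokens = ["<s>", "import", "Ġn", "umpy", "Ċ", "n", "umpy", ".", "array",
          "([", "1", ",", "2", ",", "3", "])", "</s>"]
text = detokenize(tokens)
print(text)
```

For the first file this reproduces the filecontent value shown in the table, with the literal \n expanded to a line break.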
Note that load_aligned_source_code has a keyword argument tokenizer. Supported tokenizers are "codebert" (default) and "spacy". The result using "spacy" is:
import O
numpy O
O
numpy U-1
. O
array O
( O
[ O
1 U-9
, O
2 U-9
, O
3 U-9
] O
) O
from O
numpy O
import O
* O
O
Each token is associated with either an O tag or a node id tag. All tags follow the BILUO convention (B: beginning, I: inside, L: last, U: unit, O: outside). The number in the tag is the node id.
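To show how the BILUO tags can be consumed, here is a minimal sketch that groups tagged tokens into (text, node id) pairs. It assumes tags have the form PREFIX-ID with an integer node id, as in the listings above, and that sub-tokens of one span can be joined without a separator. The function name `group_spans` is hypothetical and is not part of SourceCodeTools.

```python
def group_spans(tokens, tags):
    """Collect (text, node_id) pairs from BILUO-tagged tokens (hypothetical helper)."""
    spans = []
    current = []  # tokens of the multi-token span being built
    for token, tag in zip(tokens, tags):
        if tag == "O":
            # Outside any node: drop any unfinished span
            current = []
            continue
        prefix, node_id = tag.split("-", 1)
        if prefix == "U":
            # Single-token span
            spans.append((token, int(node_id)))
            current = []
        elif prefix == "B":
            # First token of a multi-token span
            current = [token]
        elif prefix == "I":
            current.append(token)
        elif prefix == "L":
            # Last token: close the span
            current.append(token)
            spans.append(("".join(current), int(node_id)))
            current = []
    return spans


# Tokens and tags from the second line of the first file ("codebert" output)
tokens = ["n", "umpy", ".", "array", "([", "1", ",", "2", ",", "3", "])"]
tags = ["B-1", "L-1", "O", "O", "O", "U-9", "O", "U-9", "O", "U-9", "O"]
spans = group_spans(tokens, tags)
print(spans)
```

Here the two sub-tokens n and umpy are merged into a single mention of node 1, while each number literal is a separate single-token mention of node 9.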