How to Use GloVe ( Global Vectors for Word Representation ) - beyondnlp/nlp GitHub Wiki
* https://github.com/stanfordnlp/GloVe/tree/master/src
- Specify the input file and the output file
```
#!/bin/bash
CDIR=$(readlink -f $(dirname $(readlink -f ${BASH_SOURCE[0]})))
PDIR=$(readlink -f $(dirname $(readlink -f ${BASH_SOURCE[0]}))/../)
if [ $# -ne 2 ]; then
    echo "usage: $0 <infile> <outfile>"
    exit 1
fi
input=$1
output=$2
VOCAB="vocab.txt"
COOCCUR="cooccurrences.bin"
COOCCUR_SHUF="cooccurrences.bin.shuf"
vecsize=(50 100 200)
for i in "${vecsize[@]}"
do
    iternum=50
    if [ $i -ge 300 ]; then
        iternum=100
    fi
    $CDIR/build/vocab_count -verbose 2 -max-vocab 100000 -min-count 10 < $input > $VOCAB.$i
    $CDIR/build/cooccur -verbose 2 -symmetric 0 -window-size 10 -vocab-file $VOCAB.$i -memory 8.0 -overflow-file tempoverflow < $input > $COOCCUR.$i
    $CDIR/build/shuffle -verbose 2 -memory 8.0 < $COOCCUR.$i > $COOCCUR_SHUF.$i
    $CDIR/build/glove -input-file $COOCCUR_SHUF.$i -vocab-file $VOCAB.$i -save-file $output.$i -verbose 2 -vector-size $i -iter $iternum -threads 16 -alpha 0.75 -x-max 100.0 -eta 0.05 -binary 2 -model 2
done
```
- vocab_count takes a plain-text file as input and counts the frequency of each word.
- It expects a corpus that can be split on whitespace
- (for example, pre-tokenized with the Stanford tokenizer)
- The vocabulary is built using a minimum-frequency cutoff and/or a cap on the total vocabulary size
-verbose <int>
Set verbosity: 0, 1, or 2 (default)
-max-vocab <int>
Upper bound on vocabulary size, i.e. keep the <int> most frequent words. The minimum frequency words are randomly sampled so as to obtain an even distribution over the alphabet.
-min-count <int>
Lower limit such that words which occur fewer than <int> times are discarded.
Example usage:
./vocab_count -verbose 2 -max-vocab 100000 -min-count 10 < corpus.txt > vocab.txt
* example>
* Input: corpus.txt
헐 신고때려
때 존나탈듯 이쁘긴이뽀
하늘의 별들이
너도 같이오나좌
내 맘에 꽃가루가 떠다니나봥
피카츄 라이츄
그녈 가진기분 최고
자꾸 흔들리니
봄향기가 보야
난1 젤조아
* ./vocab_count -verbose 2 -max-vocab 10000 -min-count 1 < input.txt > vocab.txt
* Output: vocab.txt
진짜 2781
아 2673
헐 1873
와 1621
나 1565
난 1460
존나 1404
감사합니다 1233
네 1226
나도 1171
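Conceptually, vocab_count is just whitespace tokenization plus frequency counting with two cutoffs. A rough sketch of the same logic in Python (the corpus below is only a placeholder; this is not the actual C implementation):

```python
from collections import Counter

def vocab_count(lines, max_vocab=10000, min_count=1):
    """Mimic GloVe's vocab_count: whitespace-tokenize, count word
    frequencies, drop words rarer than min_count, and keep at most
    the max_vocab most frequent words, sorted by descending count."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return [(w, c) for w, c in counts.most_common(max_vocab) if c >= min_count]

# Toy corpus standing in for corpus.txt
corpus = ["the cat sat", "the cat ran", "a dog ran"]
for word, count in vocab_count(corpus, max_vocab=3, min_count=2):
    print(word, count)
```

The output mirrors the `word count` lines of vocab.txt shown above.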
* cooccur generates word-word cooccurrence statistics from the corpus
* The user supplies the vocabulary file produced by vocab_count
* It counts how often each (word1, word2) pair cooccurs
* The counting pass emits the data ordered by word,
* so the next step is to shuffle it
* The records are stored in the format below
Looking at the cooccur.c code, it builds an n*n matrix
and computes both 'A * B' and 'B * A'.
Since the cooccurrence of 'A * B' and 'B * A' is the same,
comparing the two strings and always placing the smaller one on the left
would seem to make computing just one of the pair enough.
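That observation can be sketched as follows: if every pair is canonicalized so the smaller element comes first, both directions collapse into a single count (a conceptual sketch only, not the actual cooccur implementation, which also applies distance weighting):

```python
from collections import Counter

def canonical_pair_counts(pairs):
    """Collapse (a, b) and (b, a) into one key by always ordering
    each pair so the smaller element comes first."""
    counts = Counter()
    for a, b in pairs:
        counts[(min(a, b), max(a, b))] += 1
    return counts

print(canonical_pair_counts([("A", "B"), ("B", "A"), ("A", "C")]))
```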
```
// `real` is typedef'd to double by default in the GloVe sources
typedef struct cooccur_rec {
    int word1;
    int word2;
    real val;
} CREC;
```
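Assuming `real` is the default `double`, each record in the binary output is two 4-byte ints followed by an 8-byte float (16 bytes with native alignment). A minimal Python sketch for reading such a file back:

```python
import struct

# CREC layout, assuming `real` == double: int word1, int word2, double val
CREC = struct.Struct("iid")  # 16 bytes per record on typical platforms

def read_cooccurrences(path):
    """Yield (word1, word2, val) triples from a cooccur output file."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CREC.size)
            if not chunk:
                break
            yield CREC.unpack(chunk)
```

Useful for spot-checking cooccurrences.bin; word1 and word2 are 1-based indices into the vocab file.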
-verbose <int>
Set verbosity: 0, 1, or 2 (default)
-symmetric <int>
If <int> = 0, only use left context; if <int> = 1 (default), use left and right
-window-size <int>
Number of context words to the left (and to the right, if symmetric = 1); default 15
-vocab-file <file>
File containing vocabulary (truncated unigram counts, produced by 'vocab_count');
default vocab.txt
-memory <float>
Soft limit for memory consumption, in GB -- based on simple heuristic,
so not extremely accurate; default 4.0
-max-product <int>
Limit the size of dense cooccurrence array by specifying the max product
<int> of the frequency counts of the two cooccurring words.
This value overrides that which is automatically produced by '-memory'.
Typically only needs adjustment for use with very large corpora.
-overflow-length <int>
Limit to length <int> the sparse overflow array,
which buffers cooccurrence data that does not fit in the dense array, before writing to disk.
This value overrides that which is automatically produced by '-memory'.
Typically only needs adjustment for use with very large corpora.
-overflow-file <file>
Filename, excluding extension, for temporary files; default overflow
Example usage:
./cooccur -verbose 2 -symmetric 0 -window-size 10 -vocab-file vocab.txt -memory 8.0 -overflow-file tempoverflow < corpus.txt > cooccurrences.bin
* example
* ./cooccur -verbose 2 -symmetric 0 -window-size 10 -vocab-file vocab.txt -memory 8.0 -overflow-file tempoverflow < corpus.txt > cooccurrence.bin
* 1438704 Jun 29 15:34 cooccurrence.bin ← generated
* shuffle takes the cooccurrences.bin file produced by the cooccur program as input and shuffles the records inside it
-verbose <int>
Set verbosity: 0, 1, or 2 (default)
-memory <float>
Soft limit for memory consumption, in GB; default 4.0
-array-size <int>
Limit to length <int> the buffer which stores chunks of data to shuffle before writing to disk.
This value overrides that which is automatically produced by '-memory'.
-temp-file <file>
Filename, excluding extension, for temporary files; default temp_shuffle
* Example usage: (assuming 'cooccurrence.bin' has been produced by 'cooccur')
* ./shuffle -verbose 2 -memory 8.0 < cooccurrence.bin > cooccurrence.shuf.bin
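The shuffle step simply permutes the CREC records so that training in glove does not see them in word-sorted order. In-memory, the idea reduces to the sketch below (assuming `real` == double; the real `shuffle` tool works in memory-limited chunks plus a merge, which this sketch skips):

```python
import random
import struct

CREC = struct.Struct("iid")  # int word1, int word2, double val

def shuffle_cooccurrences(in_path, out_path, seed=None):
    """Read all CREC records, randomly permute them, write them back.
    Assumes the whole file fits in RAM, unlike the real tool."""
    with open(in_path, "rb") as f:
        data = f.read()
    records = [CREC.unpack_from(data, i) for i in range(0, len(data), CREC.size)]
    random.Random(seed).shuffle(records)
    with open(out_path, "wb") as f:
        for rec in records:
            f.write(CREC.pack(*rec))
```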
* glove trains the GloVe model on the shuffled cooccurrence file
* The user supplies the vocabulary built with vocab_count
-verbose <int>
Set verbosity: 0, 1, or 2 (default)
-vector-size <int>
Dimension of word vector representations (excluding bias term); default 50
-threads <int>
Number of threads; default 8
-iter <int>
Number of training iterations; default 25
-eta <float>
Initial learning rate; default 0.05
-alpha <float>
Parameter in exponent of weighting function; default 0.75
-x-max <float>
Parameter specifying cutoff in weighting function; default 100.0
-binary <int>
Save output in binary format (0: text, 1: binary, 2: both); default 0
-model <int>
Model for word vector output (for text output only); default 2
0: output all data, for both word and context word vectors, including bias terms
1: output word vectors, excluding bias terms
2: output word vectors + context word vectors, excluding bias terms
-input-file <file>
Binary input file of shuffled cooccurrence data (produced by 'cooccur' and 'shuffle'); default cooccurrence.shuf.bin
-vocab-file <file>
File containing vocabulary (truncated unigram counts, produced by 'vocab_count'); default vocab.txt
-save-file <file>
Filename, excluding extension, for word vector output; default vectors
-gradsq-file <file>
Filename, excluding extension, for squared gradient output; default gradsq
-save-gradsq <int>
Save accumulated squared gradients; default 0 (off); ignored if gradsq-file is specified
Example usage:
./glove -input-file cooccurrence.shuf.bin -vocab-file vocab.txt -save-file vectors -gradsq-file gradsq -verbose 2 -vector-size 100 -threads 16 -alpha 0.75 -x-max 100.0 -eta 0.05 -binary 2 -model 2
* example>
```
./glove -input-file cooccurrence.shuf.bin \
        -vocab-file vocab.txt \
        -save-file vectors \
        -gradsq-file gradsq \
        -verbose 2 \
        -vector-size 100 \
        -threads 16 \
        -alpha 0.75 \
        -x-max 100.0 \
        -eta 0.05 \
        -binary 2 \
        -model 2
```
* output>
1652368080 Jun 29 15:37 vectors.bin
1652368080 Jun 29 15:37 gradsq.bin
 978456328 Jun 29 15:38 vectors.txt
1865983020 Jun 29 15:38 gradsq.txt
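The text output vectors.txt stores one word per line followed by its vector components, space-separated. A minimal loader plus cosine similarity, for sanity-checking the trained vectors (file name and words below are placeholders):

```python
import math

def load_vectors(path):
    """Parse GloVe's text output: `word v1 v2 ... vd` per line."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)
```

Typical use: `vecs = load_vectors("vectors.txt.100")`, then rank the vocabulary by `cosine(vecs[query], vecs[w])` to find nearest neighbors.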