GLOVE 사용법 ( Global Vectors for Word Representation ) - beyondnlp/nlp GitHub Wiki

GloVe

* https://github.com/stanfordnlp/GloVe/tree/master/src
  • Specify the input file and the output file

script

#!/bin/bash

CDIR=$(readlink -f $(dirname $(readlink -f ${BASH_SOURCE[0]})))
PDIR=$(readlink -f $(dirname $(readlink -f ${BASH_SOURCE[0]}))/../)

# The script takes two arguments: the input corpus and the output prefix.
if [ $# -ne 2 ]; then
    echo "usage: $0 <infile> <outfile>"
    exit 1
fi

input=$1
output=$2
VOCAB="vocab.txt"
COOCCUR="cooccurrences.bin"
COOCCUR_SHUF="cooccurrences.bin.shuf"

vecsize=(50 100 200)
for i in "${vecsize[@]}"
do
    # Use more training iterations for larger vector sizes.
    iternum=50
    if [ $i -ge 300 ]
    then
        iternum=100
    fi

    $CDIR/build/vocab_count -verbose 2 -max-vocab 100000 -min-count 10 < $input > $VOCAB.$i
    $CDIR/build/cooccur     -verbose 2 -symmetric 0 -window-size 10 -vocab-file $VOCAB.$i -memory 8.0 -overflow-file tempoverflow < $input > $COOCCUR.$i
    $CDIR/build/shuffle -verbose 2 -memory 8.0 < $COOCCUR.$i > $COOCCUR_SHUF.$i
    $CDIR/build/glove  -input-file $COOCCUR_SHUF.$i -vocab-file $VOCAB.$i -save-file $output.$i -verbose 2 -vector-size $i -iter $iternum -threads 16 -alpha 0.75 -x-max 100.0 -eta 0.05 -binary 2 -model 2
done

vocab_count

  • Takes a plain-text file as input and counts the frequency of each word.
  • The corpus must be tokenized so that words are separated by whitespace
  • (e.g. split with the Stanford tokenizer)
  • Builds the vocabulary based on a minimum frequency and/or a maximum vocabulary size
	-verbose <int>
		Set verbosity: 0, 1, or 2 (default)
	-max-vocab <int>
		Upper bound on vocabulary size, i.e. keep the <int> most frequent words. The minimum frequency words are randomly sampled so as to obtain an even distribution over the alphabet.
	-min-count <int>
		Lower limit such that words which occur fewer than <int> times are discarded.

        Example usage:
        ./vocab_count -verbose 2 -max-vocab 100000 -min-count 10 < corpus.txt > vocab.txt
* example>
* Input: corpus.txt
헐 신고때려
때 존나탈듯 이쁘긴이뽀
하늘의 별들이
너도 같이오나좌
내 맘에 꽃가루가 떠다니나봥
피카츄 라이츄
그녈 가진기분 최고
자꾸 흔들리니
봄향기가 보야
난1 젤조아
* ./vocab_count -verbose 2 -max-vocab 10000 -min-count 1 < corpus.txt > vocab.txt
* Output: vocab.txt
진짜 2781
아 2673
헐 1873
와 1621
나 1565
난 1460
존나 1404
감사합니다 1233
네 1226
나도 1171
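
The behavior of vocab_count can be sketched in a few lines of Python. This is an illustrative sketch only: `build_vocab` is a hypothetical name, and the real tool is a streaming C program, but the filtering logic (drop words below -min-count, keep at most -max-vocab entries, sort by descending frequency) is the same idea.

```python
from collections import Counter

def build_vocab(lines, max_vocab=10000, min_count=1):
    """Count whitespace-separated tokens, drop rare ones, keep the top
    max_vocab words by frequency (sketch of what vocab_count does)."""
    counts = Counter(tok for line in lines for tok in line.split())
    kept = [(w, c) for w, c in counts.most_common() if c >= min_count]
    return kept[:max_vocab]

corpus = [
    "the cat sat on the mat",
    "the dog sat",
]
for word, count in build_vocab(corpus, max_vocab=3, min_count=1):
    print(word, count)
```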

cooccur

* Builds word-to-word cooccurrence statistics from the corpus
* The user supplies a vocabulary file produced by vocab_count
* Counts the frequency of each (word1, word2) pair
* Because of how the counting works, the output comes out ordered by keyword
* So the next step is to shuffle it
* The data is stored in the format below

Looking at the cooccur.c code,
it builds an n*n matrix
and computes both 'A * B' and 'B * A'.
If we compared the two strings and always placed the smaller one on the left,
it seems only one of the two would need to be computed,
since the cooccurrence of A with B is the same as that of B with A.
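
The symmetrization idea from the note above can be sketched as follows (`canonical_pair` is a hypothetical helper, not part of GloVe; the real cooccur keeps both orientations because word and context vectors play different roles during training):

```python
from collections import Counter

def canonical_pair(w1, w2):
    """Fold an unordered pair into one key by putting the smaller id first."""
    return (w1, w2) if w1 <= w2 else (w2, w1)

counts = Counter()
pairs = [(3, 7), (7, 3), (2, 5)]  # (word1, word2) cooccurrence events
for a, b in pairs:
    counts[canonical_pair(a, b)] += 1

print(counts[(3, 7)])  # both orientations fold into a single key
```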

  typedef struct cooccur_rec {
      int word1;
      int word2;
      real val;
  } CREC;
	-verbose <int>
		Set verbosity: 0, 1, or 2 (default)
	-symmetric <int>
		If <int> = 0, only use left context; if <int> = 1 (default), use left and right
	-window-size <int>
		Number of context words to the left (and to the right, if symmetric = 1); default 15
	-vocab-file <file>
		File containing vocabulary (truncated unigram counts, produced by 'vocab_count'); 
                default vocab.txt
	-memory <float>
		Soft limit for memory consumption, in GB -- based on simple heuristic, 
                so not extremely accurate; default 4.0
	-max-product <int>
		Limit the size of dense cooccurrence array by specifying the max product 
                <int> of the frequency counts of the two cooccurring words.
		This value overrides that which is automatically produced by '-memory'. 
                Typically only needs adjustment for use with very large corpora.
	-overflow-length <int>
		Limit to length <int> the sparse overflow array, 
                which buffers cooccurrence data that does not fit in the dense array, before writing to disk.
		This value overrides that which is automatically produced by '-memory'. 
                Typically only needs adjustment for use with very large corpora.
	-overflow-file <file>
		Filename, excluding extension, for temporary files; default overflow

        Example usage:
        ./cooccur -verbose 2 -symmetric 0 -window-size 10 -vocab-file vocab.txt -memory 8.0 -overflow-file tempoverflow < corpus.txt > cooccurrences.bin
* example
* ./cooccur -verbose 2 -symmetric 0 -window-size 10 -vocab-file vocab.txt -memory 8.0 -overflow-file tempoverflow < corpus.txt > cooccurrence.bin
* 1438704  Jun 29 15:34 cooccurrence.bin ← created
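
Each record in the binary output has the CREC layout shown above. A minimal sketch of reading it back in Python, assuming GloVe was built with `real` as double (the default) and that native struct alignment matches the C layout; `read_crecs` is a hypothetical helper:

```python
import struct

# CREC = { int word1; int word2; real val; }
# With real = double and native alignment this is the "iid" layout (16 bytes).
# If GloVe was built with single precision, use "iif" instead (assumption).
REC = struct.Struct("iid")

def read_crecs(path):
    """Yield (word1, word2, value) triples from a GloVe cooccurrence file."""
    with open(path, "rb") as f:
        while chunk := f.read(REC.size):
            yield REC.unpack(chunk)
```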

shuffle

* Takes cooccurrences.bin, the output of the cooccur program, as input and shuffles its records
	-verbose <int>
		Set verbosity: 0, 1, or 2 (default)
	-memory <float>
		Soft limit for memory consumption, in GB; default 4.0
	-array-size <int>
		Limit to length <int> the buffer which stores chunks of data to shuffle before writing to disk.
		This value overrides that which is automatically produced by '-memory'.
	-temp-file <file>
		Filename, excluding extension, for temporary files; default temp_shuffle
    * Example usage: (assuming 'cooccurrence.bin' has been produced by 'cooccur')
    * ./shuffle -verbose 2 -memory 8.0 < cooccurrence.bin > cooccurrence.shuf.bin
* example>
*  ./shuffle -verbose 2 -memory 8.0 < cooccurrence.bin > cooccurrence.shuf.bin
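
The core of the shuffle step can be sketched like this. It is a simplification under stated assumptions: the real tool shuffles chunks sized by the -memory/-array-size limit, spills them to temp files, and interleaves them on output, while this sketch only shuffles in-memory chunks.

```python
import random

def shuffle_records(records, chunk_size=4, seed=0):
    """Shuffle a record stream in fixed-size chunks, mimicking a
    memory-limited shuffle (simplified sketch of GloVe's shuffle tool)."""
    rng = random.Random(seed)
    out, buf = [], []
    for rec in records:
        buf.append(rec)
        if len(buf) == chunk_size:
            rng.shuffle(buf)   # randomize the full chunk
            out.extend(buf)
            buf = []
    rng.shuffle(buf)           # flush the final partial chunk
    out.extend(buf)
    return out
```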

glove

* Trains the GloVe model on the cooccurrence file
* The user supplies the vocabulary produced by vocab_count
	-verbose <int>
		Set verbosity: 0, 1, or 2 (default)
	-vector-size <int>
		Dimension of word vector representations (excluding bias term); default 50
	-threads <int>
		Number of threads; default 8
	-iter <int>
		Number of training iterations; default 25
	-eta <float>
		Initial learning rate; default 0.05
	-alpha <float>
		Parameter in exponent of weighting function; default 0.75
	-x-max <float>
		Parameter specifying cutoff in weighting function; default 100.0
	-binary <int>
		Save output in binary format (0: text, 1: binary, 2: both); default 0
	-model <int>
		Model for word vector output (for text output only); default 2
		   0: output all data, for both word and context word vectors, including bias terms
		   1: output word vectors, excluding bias terms
		   2: output word vectors + context word vectors, excluding bias terms
	-input-file <file>
		Binary input file of shuffled cooccurrence data (produced by 'cooccur' and 'shuffle'); default cooccurrence.shuf.bin
	-vocab-file <file>
		File containing vocabulary (truncated unigram counts, produced by 'vocab_count'); default vocab.txt
	-save-file <file>
		Filename, excluding extension, for word vector output; default vectors
	-gradsq-file <file>
		Filename, excluding extension, for squared gradient output; default gradsq
	-save-gradsq <int>
		Save accumulated squared gradients; default 0 (off); ignored if gradsq-file is specified

Example usage:
./glove -input-file cooccurrence.shuf.bin -vocab-file vocab.txt -save-file vectors -gradsq-file gradsq -verbose 2 -vector-size 100 -threads 16 -alpha 0.75 -x-max 100.0 -eta 0.05 -binary 2 -model 2

* example>
./glove -input-file cooccurrences.shuf.bin \
-vocab-file vocab.txt \
-save-file vectors \
-gradsq-file gradsq \
-verbose 2 \
-vector-size 100 \
-threads 16 \
-alpha 0.75 \
-x-max 100.0 \
-eta 0.05 \
-binary 2 \
-model 2
* output>

1652368080 Jun 29 15:37 vectors.bin
1652368080 Jun 29 15:37 gradsq.bin
 978456328 Jun 29 15:38 vectors.txt
1865983020 Jun 29 15:38 gradsq.txt
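
With -binary 2, vectors.txt holds one word per line followed by its float components. A minimal sketch for loading it and comparing two words by cosine similarity (`load_vectors` and `cosine` are hypothetical helper names):

```python
import math

def load_vectors(path):
    """Parse GloVe's text output: each line is 'word v1 v2 ... vN'."""
    vecs = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vecs[parts[0]] = [float(x) for x in parts[1:]]
    return vecs

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)
```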
