usage CRFSUITE - beyondnlp/nlp GitHub Wiki

crfsuite

* http://www.chokkan.org/software/crfsuite/

crfsuite learn [options] [training file]

  • -m model name
  • -g N : split the training data into N groups
  • -x : run cross-validation
  • -p key=val : set an internal property (e.g. a mincount setting)
    • crfsuite learn -H lists the properties that can be set
float feature.minfreq = 0.000000;
The minimum frequency of features.
 
int feature.possible_states = 0;
Force to generate possible state features.
 
int feature.possible_transitions = 0;
Force to generate possible transition features.
 
float c1 = 0.000000;
Coefficient for L1 regularization.
 
float c2 = 1.000000;
Coefficient for L2 regularization.
 
int max_iterations = 2147483647;
The maximum number of iterations for L-BFGS optimization.
 
int num_memories = 6;
The number of limited memories for approximating the inverse hessian matrix.
 
float epsilon = 0.000010;
Epsilon for testing the convergence of the objective.
 
int period = 10;
The duration of iterations to test the stopping criterion.
 
float delta = 0.000010;
The threshold for the stopping criterion; an L-BFGS iteration stops when the
improvement of the log likelihood over the last ${period} iterations is no
greater than this threshold.
 
string linesearch = MoreThuente;
The line search algorithm used in L-BFGS updates:
{   'MoreThuente': More and Thuente's method,
    'Backtracking': Backtracking method with regular Wolfe condition,
    'StrongBacktracking': Backtracking method with strong Wolfe condition
}
 
 
int max_linesearch = 20;
The maximum number of trials for the line search algorithm.
  • -a : training algorithm ( lbfgs, l2sgd, ap, pa, arow )
  • -e M : use only the M-th group for testing and the rest for training; the output shows precision, recall, and f1-score for each class
No model file is created while cross-validation is running.
- can be used to read from stdin.
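The period and delta parameters listed above describe a stopping test on the training objective. The following is a minimal sketch of that test, not CRFsuite's actual code. It assumes the improvement is measured as the relative decrease of the objective between the value `period` iterations ago and the current value; the function name `should_stop` and the list-based history are hypothetical.

```python
def should_stop(objectives, period=10, delta=1e-5):
    """Return True when training should stop under the delta criterion.

    objectives: objective values (e.g. negative log-likelihood), one per
    iteration, oldest first. Assumption: improvement is the relative
    decrease between the value `period` iterations ago and the latest one.
    """
    if len(objectives) <= period:
        # Not enough history yet to compare against.
        return False
    past = objectives[-1 - period]
    current = objectives[-1]
    if current == 0:
        # Nothing left to improve; avoid dividing by zero.
        return True
    improvement = (past - current) / abs(current)
    return improvement <= delta


if __name__ == "__main__":
    still_improving = [100.0 - i for i in range(11)]
    converged = [50.0] * 11
    print(should_stop(still_improving), should_stop(converged))
```

With the defaults shown in the property list, the test only becomes active after more than 10 iterations have been recorded.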

crfsuite tag [options] [test file]

  • -m model name : model to use for testing
  • -t : report the model evaluation results
  • -r : also output the labels found in the test file
  • -p : output the probability of the label sequence
  • -i : output the marginal probability of each item
  • -q : suppress the tagging output in test mode
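The evaluation report includes per-class precision, recall, and f1-score. As a rough illustration of how such item-level figures are derived from reference and predicted labels, here is a minimal sketch. It is not CRFsuite's implementation, and the function name `per_label_scores` is hypothetical.

```python
from collections import Counter


def per_label_scores(reference, predicted):
    """Compute (precision, recall, f1) per label from two label lists."""
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")
    ref_count = Counter(reference)    # items per label in the gold data
    pred_count = Counter(predicted)   # items per label in the model output
    match = Counter(r for r, p in zip(reference, predicted) if r == p)

    scores = {}
    for label in sorted(set(ref_count) | set(pred_count)):
        precision = match[label] / pred_count[label] if pred_count[label] else 0.0
        recall = match[label] / ref_count[label] if ref_count[label] else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f1)
    return scores


if __name__ == "__main__":
    gold = ["B-NP", "I-NP", "I-NP", "B-VP"]
    pred = ["B-NP", "I-NP", "B-NP", "B-VP"]
    for label, (p, r, f) in per_label_scores(gold, pred).items():
        print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```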

crfsuite dump [model]
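To show how the three subcommands fit together, the sketch below assembles their argument lists in Python, ready to pass to subprocess.run. Only flags described on this page are used, with the training algorithm passed as -a as in the CRFsuite manual. The helper names and the file names (train.txt, test.txt, chunk.model) are hypothetical, and nothing is executed here.

```python
def learn_command(train_file, model=None, algorithm=None, groups=None,
                  holdout=None, cross_validate=False, params=None):
    """Build the argv list for `crfsuite learn`."""
    argv = ["crfsuite", "learn"]
    if model:
        argv += ["-m", model]
    if algorithm:
        argv += ["-a", algorithm]
    if groups is not None:
        argv += ["-g", str(groups)]
    if holdout is not None:
        argv += ["-e", str(holdout)]
    if cross_validate:
        argv.append("-x")
    for key, value in (params or {}).items():
        argv += ["-p", f"{key}={value}"]
    argv.append(train_file)
    return argv


def tag_command(test_file, model, evaluate=False, reference=False,
                probability=False, marginal=False, quiet=False):
    """Build the argv list for `crfsuite tag`."""
    argv = ["crfsuite", "tag", "-m", model]
    flags = [("-t", evaluate), ("-r", reference), ("-p", probability),
             ("-i", marginal), ("-q", quiet)]
    argv += [flag for flag, enabled in flags if enabled]
    argv.append(test_file)
    return argv


def dump_command(model):
    """Build the argv list for `crfsuite dump`."""
    return ["crfsuite", "dump", model]


if __name__ == "__main__":
    print(" ".join(learn_command("train.txt", model="chunk.model",
                                 algorithm="lbfgs", params={"c2": 1.0})))
    print(" ".join(tag_command("test.txt", "chunk.model",
                               evaluate=True, quiet=True)))
    print(" ".join(dump_command("chunk.model")))
```

For 10-fold cross-validation, something like learn_command("train.txt", groups=10, cross_validate=True) would apply; no model name is passed because no model file is written in that mode.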

chunking.py input format

Rockwell NNP B-NP
International NNP I-NP
Corp. NNP I-NP
's POS B-NP
Tulsa NNP I-NP
unit NN I-NP
said VBD B-VP
it PRP B-NP
signed VBD B-VP
a DT B-NP

chunking.py output

B-NP    w[0]=Rockwell   w[1]=International  w[2]=Corp.  w[0]|w[1]=Rockwell|International    pos[0]=NNP  pos[1]=NNP  pos[2]=NNP  pos[0]|pos[1]=NNP|NNP   pos[1]|pos[2]=NNP|NNP   pos[0]|pos[1]|pos[2]=NNP|NNP|NNP    __BOS__
I-NP    w[-1]=Rockwell  w[0]=International  w[1]=Corp.  w[2]='s w[-1]|w[0]=Rockwell|International   w[0]|w[1]=International|Corp.   pos[-1]=NNP pos[0]=NNP  pos[1]=NNP  pos[2]=POS  pos[-1]|pos[0]=NNP|NNP  pos[0]|pos[1]=NNP|NNP   pos[1]|pos[2]=NNP|POS   pos[-1]|pos[0]|pos[1]=NNP|NNP|NNP   pos[0]|pos[1]|pos[2]=NNP|NNP|POS
I-NP    w[-2]=Rockwell  w[-1]=International w[0]=Corp.  w[1]='s w[2]=Tulsa  w[-1]|w[0]=International|Corp.  w[0]|w[1]=Corp.|'s    pos[-2]=NNP   pos[-1]=NNP pos[0]=NNP  pos[1]=POS  pos[2]=NNP  pos[-2]|pos[-1]=NNP|NNP pos[-1]|pos[0]=NNP|NNP  pos[0]|pos[1]=NNP|POS   pos[1]|pos[2]=POS|NNP   pos[-2]|pos[-1]|pos[0]=NNP|NNP|NNP  pos[-1]|pos[0]|pos[1]=NNP|NNP|POS   pos[0]|pos[1]|pos[2]=NNP|POS|NNP
B-NP    w[-2]=International w[-1]=Corp. w[0]='s w[1]=Tulsa  w[2]=unit   w[-1]|w[0]=Corp.|'s w[0]|w[1]='s|Tulsa  pos[-2]=NNP pos[-1]=NNP pos[0]=POS  pos[1]=NNP  pos[2]=NN   pos[-2]|pos[-1]=NNP|NNP pos[-1]|pos[0]=NNP|POS  pos[0]|pos[1]=POS|NNP   pos[1]|pos[2]=NNP|NN    pos[-2]|pos[-1]|pos[0]=NNP|NNP|POS  pos[-1]|pos[0]|pos[1]=NNP|POS|NNP   pos[0]|pos[1]|pos[2]=POS|NNP|NN
I-NP    w[-2]=Corp. w[-1]='s    w[0]=Tulsa  w[1]=unit   w[2]=said   w[-1]|w[0]='s|Tulsa w[0]|w[1]=Tulsa|unit    pos[-2]=NNP pos[-1]=POS pos[0]=NNP  pos[1]=NN   pos[2]=VBD  pos[-2]|pos[-1]=NNP|POS pos[-1]|pos[0]=POS|NNP  pos[0]|pos[1]=NNP|NN    pos[1]|pos[2]=NN|VBD    pos[-2]|pos[-1]|pos[0]=NNP|POS|NNP  pos[-1]|pos[0]|pos[1]=POS|NNP|NN    pos[0]|pos[1]|pos[2]=NNP|NN|VBD
I-NP    w[-2]='s    w[-1]=Tulsa w[0]=unit   w[1]=said   w[2]=it w[-1]|w[0]=Tulsa|unit   w[0]|w[1]=unit|said pos[-2]=POS pos[-1]=NNP pos[0]=NN   pos[1]=VBD  pos[2]=PRP  pos[-2]|pos[-1]=POS|NNP pos[-1]|pos[0]=NNP|NN   pos[0]|pos[1]=NN|VBD    pos[1]|pos[2]=VBD|PRP   pos[-2]|pos[-1]|pos[0]=POS|NNP|NN   pos[-1]|pos[0]|pos[1]=NNP|NN|VBD    pos[0]|pos[1]|pos[2]=NN|VBD|PRP
B-VP    w[-2]=Tulsa w[-1]=unit  w[0]=said   w[1]=it w[2]=signed w[-1]|w[0]=unit|said    w[0]|w[1]=said|it   pos[-2]=NNP pos[-1]=NN  pos[0]=VBD  pos[1]=PRP  pos[2]=VBD  pos[-2]|pos[-1]=NNP|NN  pos[-1]|pos[0]=NN|VBD   pos[0]|pos[1]=VBD|PRP   pos[1]|pos[2]=PRP|VBD   pos[-2]|pos[-1]|pos[0]=NNP|NN|VBD   pos[-1]|pos[0]|pos[1]=NN|VBD|PRP    pos[0]|pos[1]|pos[2]=VBD|PRP|VBD
B-NP    w[-2]=unit  w[-1]=said  w[0]=it w[1]=signed w[2]=a  w[-1]|w[0]=said|it  w[0]|w[1]=it|signed pos[-2]=NN  pos[-1]=VBD pos[0]=PRP  pos[1]=VBD  pos[2]=DT   pos[-2]|pos[-1]=NN|VBD  pos[-1]|pos[0]=VBD|PRP  pos[0]|pos[1]=PRP|VBD   pos[1]|pos[2]=VBD|DT    pos[-2]|pos[-1]|pos[0]=NN|VBD|PRP   pos[-1]|pos[0]|pos[1]=VBD|PRP|VBD   pos[0]|pos[1]|pos[2]=PRP|VBD|DT
B-VP    w[-2]=said  w[-1]=it    w[0]=signed w[1]=a  w[2]=tentative  w[-1]|w[0]=it|signed    w[0]|w[1]=signed|a  pos[-2]=VBD pos[-1]=PRP pos[0]=VBD  pos[1]=DT   pos[2]=JJ   pos[-2]|pos[-1]=VBD|PRP pos[-1]|pos[0]=PRP|VBD  pos[0]|pos[1]=VBD|DT    pos[1]|pos[2]=DT|JJ pos[-2]|pos[-1]|pos[0]=VBD|PRP|VBD  pos[-1]|pos[0]|pos[1]=PRP|VBD|DT    pos[0]|pos[1]|pos[2]=VBD|DT|JJ
B-NP    w[-2]=it    w[-1]=signed    w[0]=a  w[1]=tentative  w[2]=agreement  w[-1]|w[0]=signed|a w[0]|w[1]=a|tentative   pos[-2]=PRP pos[-1]=VBD pos[0]=DT   pos[1]=JJ   pos[2]=NN   pos[-2]|pos[-1]=PRP|VBD pos[-1]|pos[0]=VBD|DT   pos[0]|pos[1]=DT|JJ pos[1]|pos[2]=JJ|NN pos[-2]|pos[-1]|pos[0]=PRP|VBD|DT   pos[-1]|pos[0]|pos[1]=VBD|DT|JJ pos[0]|pos[1]|pos[2]=DT|JJ|NN
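The output above can be read as a fixed set of feature templates applied at each token position: word and POS unigrams in a window of -2..2, word bigrams, POS bigrams, and POS trigrams, with any template that reaches outside the sentence skipped. The following is a minimal sketch that reproduces this layout from the three-column input. The template list is inferred from the sample output, not copied from chunking.py, and special characters such as ':' are not escaped here.

```python
# Feature templates inferred from the sample output: (field, offset) tuples.
TEMPLATES = (
    [(("w", i),) for i in range(-2, 3)]
    + [(("w", -1), ("w", 0)), (("w", 0), ("w", 1))]
    + [(("pos", i),) for i in range(-2, 3)]
    + [(("pos", i), ("pos", i + 1)) for i in range(-2, 2)]
    + [(("pos", i), ("pos", i + 1), ("pos", i + 2)) for i in range(-2, 1)]
)


def parse_sentence(lines):
    """Parse 'word POS chunk' lines into a list of token dicts."""
    tokens = []
    for line in lines:
        word, pos, chunk = line.split()
        tokens.append({"w": word, "pos": pos, "y": chunk})
    return tokens


def token_features(tokens, t):
    """Apply every template at position t, skipping out-of-range ones."""
    feats = []
    for template in TEMPLATES:
        positions = [t + offset for _, offset in template]
        if any(i < 0 or i >= len(tokens) for i in positions):
            continue
        name = "|".join(f"{field}[{offset}]" for field, offset in template)
        value = "|".join(tokens[i][field]
                         for (field, _), i in zip(template, positions))
        feats.append(name + "=" + value)
    if t == 0:
        feats.append("__BOS__")   # beginning-of-sentence marker
    if t == len(tokens) - 1:
        feats.append("__EOS__")   # end-of-sentence marker
    return feats


def to_crfsuite_lines(tokens):
    """Return one tab-separated line per token: label, then features."""
    return ["\t".join([tok["y"]] + token_features(tokens, t))
            for t, tok in enumerate(tokens)]


if __name__ == "__main__":
    sample = ["Rockwell NNP B-NP", "International NNP I-NP", "Corp. NNP I-NP"]
    for line in to_crfsuite_lines(parse_sentence(sample)):
        print(line)
```

Fields are joined with tab characters, which is the separator CRFsuite expects between the label and each attribute.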

  • Early on I compiled CRFsuite from source, but training failed with the following errors:
    • L-BFGS terminated with error code (-1001)
    • L-BFGS terminated with error code (-998)
  • These caused a lot of trouble. The cause appeared to be a problem with the installation, and I did not find a fix.
  • In this situation, downloading and using the prebuilt binary distribution works.