Rouzeta - beyondnlp/nlp GitHub Wiki

Rouzeta

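Rouzeta is the project name of a WFST-based morphological analyzer recently released by Dr. Sangho Lee (이상호).
Even just using Rouzeta requires knowing a number of things, and analyzing its internals is a task of a different order.
Downloading Rouzeta and unpacking the archive gives a Rouzeta directory and a Tagger directory; the Rouzeta directory contains the following four files.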

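 korean.lexc      : expresses the transitions between morphemes as an automaton; essentially every morpheme is listed, and the transitions between parts of speech are described as well.
 morphrules.foma  : rule file that removes unnecessary edges from the constructed morpheme automaton
 splithangul.foma : rule file that converts precomposed syllables into their component jamo
 kormoran.script  : script file that builds the final automaton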

Analyzing these files requires basic knowledge of morphological analysis as well as knowledge of lexc, foma, and xfst.
The notes below are listed in no particular order, written down as each topic came up.

Rouzeta Home
- https://shleekr.github.io/
- Only KS C 5601 Hangul can be processed (splithangul.foma, 2,352 characters = ksc5601 + 렛, 앴)
- Every entry registered in the Rouzeta dictionary consists only of syllables within ksc5601 (2,350).

The results below show an irregular conjugation and characters that fall outside ksc5601.

    apply up> 고마웠다
    고맙/irrb/vj었/ep다/ef
    고맙/irrb/vj었/ep다/ex
    apply up> 똠방각하    # anything outside ksc5601 shows up as ???
    ???
    apply up> 펲시
    ???
    apply up> 기욤까네
    기욤/nr까네/nc
    기욤/nr까/nc네/nc
    기욤/nr까/nc네/np
    기욤/nr까/nc네/xn
    기욤/nr까/nc네/dn
    기욤/nr까/nc이/pp네/ef
    기욤/nr까/nc이/pp네/ex
    apply up> 뷀
    ???

# Background knowledge

* foma download url 
* http://slideplayer.com/slide/11006062/
* http://foma.googlecode.com/

* foma notation
  * `[ ]` grouping
  * `?` any symbol
  * `?*` any sequence
  * `a` a single symbol
  * `\a` any symbol except a
  * `\C` any symbol except a consonant, C presumably defined with "define"
  * `.#.` word edge in rule contexts
  * `[a|b]` a or b
  * `[C|.#.]` a consonant or word edge
  * `a*` any number of a symbols
  * `(a)` optionally a
  * `.o.` compose

* http://foma.sourceforge.net/lrec2010/lrec2010handout.pdf
* http://udel.edu/~heinz/classes/2015/608/materials/foma/foma.pdf
* https://www.cs.jhu.edu/~jason/465/PDFSlides/lect17-fsmbuild.pdf 
* http://foma.sourceforge.net/dokuwiki/doku.php?id=wiki:interfacereference
* http://slideplayer.com/slide/10878857/
* FAQ: When to use lexc vs. xfst?
* lexc is optimized for the union operator, so it is the faster choice when taking the union of a large number of entries.

* explain Rouzeta 
* https://github.com/dsindex/rouzeta

##  Kyoto Fst Decoder
* After the automaton is built, it is composed with the input sentence, and kyfd is used to find the minimum distance.
* kyfd : http://www.phontron.com/kyfd/
* Using kyfd requires OpenFst and Xerces-C++ to be installed.
* A detailed description of kyfd will be written later.

## Standard: easy to understand if you think of regular expressions
```
A B    Concatenation
A | B  Union
A & B  Intersection
A*     Kleene star
A+     Kleene plus
$A     “Contains” a string from A
A-B    Subtraction
~A     Complement of A
A.r    Reverse of A
(A)    Optionally A (same as A | 0)
```

## Transducer-related:
```
A:B      Cross-product of A and B
A .o. B  Composition of A and B           # <--- the most frequently used operation
A.i      Invert A                         # <--- reverses the direction of application
A.u      Extract upper side (domain) of A
A.l      Extract lower side (range) of A
A .P. B  Priority union of A and B
```

## Rewrite operations:
```
A -> B                Rewrite strings in A as B
A (->) B              Optionally rewrite A as B
A -> B || C _ D       Conditional rewrite of A as B (between C and D)
[..] -> B || C _ D    Insert a single B between C and D
A -> B , C -> D ,...  Multiple simultaneous rewrites (w/ or w/o contexts)
A -> B ... C          Markup: insert B before and C after A (w/ or w/o contexts)
```
* Reference: https://code.google.com/archive/p/foma/wikis/RegularExpressionReference.wiki
## etc
```
optional replacement (->)
longest-leftmost @->
shortest-leftmost @>
```

## Special symbols:
```
0 or []   Epsilon (the empty string)
?         The “any” symbol
.#.       Word boundary in rewrite rules
[ and ]   Grouping symbols for forcing precedence
" "       Reserved symbols need to be escaped by quotes
```

## example
## Rouzeta top-level ROOT declaration
* ROOT has the following vertices
```
LEXICON Root
     ncLexicon ; ! common noun
     nbLexicon ; ! number
     nrLexicon ; ! proper noun
     ...
```
* acLexicon == conjunctive adverbs
* 고로/ac acNext
* 고로/ac has an edge to acNext, and acNext is defined as follows.


```
LEXICON acLexicon ! conjunctive adverb
고로/ac acNext ;
곧/ac   acNext ;
그나/ac acNext ;
...

LEXICON acNext
   finLexicon ;
   srLexicon ;
   ...
LEXICON srLexicon ! closing quotation mark
%"/sr   srNext ;
%'/sr   srNext ;
%)/sr   srNext ;
%>/sr   srNext ;
%]/sr   srNext ;
%}/sr   srNext ;
’/sr   srNext ;
”/sr   srNext ;
≫/sr   srNext ;
〉/sr   srNext ;
》/sr   srNext ;
」/sr   srNext ;
』/sr   srNext ;
】/sr   srNext ;
〕/sr   srNext ;
＂/sr   srNext ;
＞/sr   srNext ;


LEXICON xvLexicon ! verb-deriving suffix
거리/xv xvNext ;
당하/xv xvNext ;
당허/xv xvNext ;
...

LEXICON naNext
   xvLexicon ;

LEXICON naLexicon ! action common noun
가가대소/na naNext ;
가감/na naNext ;
가격/na naNext ;
가격인하/na naNext ;
...
```
* 가격당하다 => 가격 + 당하다 => naLexicon -> naNext -> xvLexicon
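
The continuation chain above (naLexicon -> naNext -> xvLexicon) can be pictured as a walk over a small graph. The following Python sketch is only an illustrative model, not how foma compiles lexc. The `LEXICONS` table holds a few entries copied from the excerpt, and the `xvNext` continuation to the end marker `#` is an assumption, since the excerpt does not show it.

```python
# Toy model of lexc continuation classes (illustration only, not foma itself).
LEXICONS = {
    "Root": [("", "naLexicon")],
    "naLexicon": [("가격/na", "naNext"), ("가감/na", "naNext")],
    "naNext": [("", "xvLexicon")],
    "xvLexicon": [("당하/xv", "xvNext"), ("거리/xv", "xvNext")],
    # Assumption: xvNext leads straight to the end-of-word marker "#".
    "xvNext": [("", "#")],
}


def generate(lexicon="Root", prefix=""):
    """Yield every tagged string reachable from the given lexicon."""
    if lexicon == "#":
        yield prefix
        return
    for form, continuation in LEXICONS[lexicon]:
        yield from generate(continuation, prefix + form)


if __name__ == "__main__":
    for analysis in generate():
        print(analysis)
```

Each yielded string corresponds to one path through the graph, which is the same idea as one path through the compiled automaton.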


## korean.lexc : describes the relations (edges) between morphemes (vertices)

## morphrules.foma
* NounStringSet 
  * define NounStringSet [ NounSet | %/xn ] ;
  * define NounSet     [ %/na | %/nc | %/nd | %/ni | %/nm | %/nn | %/np | %/nr | %/ns | %/nu ] ;
* Definition of FilterPT0
* Removes the edge between a noun without a coda, such as '사과', and the particle '은'
* That is, the sequence no coda + noun tag + 은/pt is removed
```
! 은/는
! Filter0 : 사과는
define FilterPT0 ~$[ FILLC NounStringSet ㅇ ㅡ %_ㄴ %/pt ] ;
```
* Definition of FilterPT1
* Removes the edge between a noun with a coda, such as '사람', and the particle '는'
* That is, the sequence coda + noun tag + 는/pt is removed
```
! Filter1 : 사람은
define FilterPT1 ~$[ [Coda - FILLC] NounStringSet ㄴ ㅡ %_ㄴ %/pt ] ;

```



## splithangul.foma : jamo decomposition
![split1](https://github.com/beyondnlp/nlp/raw/master/split1)
```
define split        가 -> ㄱ ㅏ %_%_ .o.
                  각 -> ㄱ ㅏ %_ㄱ .o.
                  간 -> ㄱ ㅏ %_ㄴ .o.
                  갇 -> ㄱ ㅏ %_ㄷ ;

regex split;
apply down> 가
ㄱㅏ__
apply down> 나
나
apply down> 다
다
apply down> 각
ㄱㅏ_ㄱ
apply down> 간
ㄱㅏ_ㄴ
```
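
For comparison, the same decomposition can be computed arithmetically from Unicode code points instead of listing one rule per syllable. This is a sketch under stated assumptions: it covers all 11,172 precomposed syllables rather than the 2,352 listed in splithangul.foma, and the output for compound codas such as ㄳ is a guess at the notation, since the excerpt only shows simple codas. The function names are hypothetical.

```python
# Jamo tables in Unicode order for the Hangul Syllables block (U+AC00..U+D7A3).
CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
JUNGSEONG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"
# Index 0 means "no coda"; it is written as "_" so the result ends in "__".
JONGSEONG = ["_"] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")


def split_syllable(ch):
    """Return onset + vowel + '_' + coda for one precomposed syllable."""
    index = ord(ch) - 0xAC00
    if not 0 <= index < 19 * 21 * 28:
        return ch  # not a precomposed Hangul syllable: leave unchanged
    onset, rest = divmod(index, 21 * 28)
    vowel, coda = divmod(rest, 28)
    return CHOSEONG[onset] + JUNGSEONG[vowel] + "_" + JONGSEONG[coda]


def split_text(text):
    """Decompose every syllable in a string."""
    return "".join(split_syllable(ch) for ch in text)


if __name__ == "__main__":
    for syllable in "가각간갇":
        print(syllable, split_syllable(syllable))
```

Unlike the four-rule `split` definition above, which leaves 나 and 다 untouched, this sketch decomposes every syllable it is given.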


## Ending representation
```
%_ㅇ게/ec   ecNext ;
%_ㅇ깨/ec   ecNext ;
%_ㅇ께/ec   ecNext ;
가/ec   ecNext ;
가가/ec ecNext ;
```

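* Entries that start with `%_` begin with a bare coda, which attaches to a preceding syllable that is not yet complete. The `%` itself is the escape character for `_`.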


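## To be verified 1
 * Meaning of '<': in lexc, an entry enclosed in `< >` is a regular expression, so the entry below accepts one or more Alphabet symbols followed by /ne.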
```
LEXICON neLexicon ! English
< Alphabet+ %/ne > neNext ;

```

## FILLC
  * Meaning of %_%_
  * It appears to mark the absence of a coda (e.g. 바보: the syllable '보' has no coda)
```
# morphrules.foma
define FILLC   %_%_   ; ! No-coda

```

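## Handling particles that vary with the presence of a coda
* '사과' has no coda, so only '가' is allowed (i.e. '이' cannot attach).
* '사람' has a coda, so only '이' is allowed (i.e. '가' cannot attach).
* FILLC NounStringSet denotes a noun tag attached to a noun without a coda (e.g. 사과)
* [Coda - FILLC] NounStringSet denotes a noun tag attached to a noun with a coda (e.g. 사람)
  * /ps  <- subject-case particle tag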
```
! 이/가
! FilterPS0 : 사과가
define FilterPS0 ~$[ FILLC NounStringSet ㅇ ㅣ FILLC %/ps ] ;

! FilterPS1 : 사람이
define FilterPS1 ~$[ [Coda - FILLC] NounStringSet ㄱ ㅏ FILLC %/ps ] ;

```
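
The effect of these filters can be mimicked in Python by checking whether the last syllable of the noun has a coda. This is a minimal sketch, not the foma implementation: `has_coda`, `FORBIDDEN`, and `allowed` are hypothetical names, and the table simply lists the four noun/particle combinations that FilterPT0, FilterPT1, FilterPS0, and FilterPS1 remove.

```python
def has_coda(syllable):
    """True if a precomposed Hangul syllable ends in a final consonant."""
    index = ord(syllable) - 0xAC00
    if not 0 <= index < 11172:
        raise ValueError("not a precomposed Hangul syllable: %r" % syllable)
    return index % 28 != 0


# (noun ends in a coda?, particle) pairs that the filters rule out.
FORBIDDEN = {
    (False, "은"),  # FilterPT0: 사과 + 은
    (True, "는"),   # FilterPT1: 사람 + 는
    (False, "이"),  # FilterPS0: 사과 + 이
    (True, "가"),   # FilterPS1: 사람 + 가
}


def allowed(noun, particle):
    """Return True if the particle may follow the noun."""
    return (has_coda(noun[-1]), particle) not in FORBIDDEN


if __name__ == "__main__":
    for noun in ("사과", "사람"):
        for particle in ("이", "가", "은", "는"):
            print(noun, particle, allowed(noun, particle))
```

In the transducer the same decision is made on the decomposed string, where FILLC or a real coda appears directly before the noun tag.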

## apply up && apply down

```
foma[0]: define test 가 나 -> 다 라;
defined test: 578 bytes. 3 states, 12 arcs, Cyclic.
foma[0]: regex test;
578 bytes. 3 states, 12 arcs, Cyclic.
foma[1]: up
apply up> 가나
???
apply up> 다라
다라
...
foma[1]: down
apply down> 가나
다라
apply down> 가 나
가 나

```
![test8](https://github.com/beyondnlp/nlp/raw/master/test8)
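
The two directions can be modelled in Python for this one rule. The sketch below treats `가 나 -> 다 라` as a string replacement for `apply down` and searches for all inputs that would produce a given output for `apply up`. It is an illustration of the relation, not foma's algorithm, and it relies on the fact that neither 가나 nor 다라 can overlap with itself.

```python
from itertools import product

UPPER, LOWER = "가나", "다라"


def apply_down(text):
    """Rewrite every 가나 as 다라 (the upper-to-lower direction)."""
    return text.replace(UPPER, LOWER)


def apply_up(text):
    """Return all upper strings that apply_down maps onto text."""
    parts = text.split(LOWER)
    results = set()
    # Each 다라 in the output may come from 다라 or from 가나 in the input.
    for choice in product((LOWER, UPPER), repeat=len(parts) - 1):
        candidate = parts[0] + "".join(c + p for c, p in zip(choice, parts[1:]))
        if apply_down(candidate) == text:
            results.add(candidate)
    return sorted(results)


if __name__ == "__main__":
    print(apply_down("가나"))
    print(apply_up("가나"))  # empty list, shown as ??? in foma
    print(apply_up("다라"))
```

Under this model `apply up` on 가나 has no result because the rule never leaves 가나 on the lower side, while 다라 maps back to both 가나 and 다라.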


## test1
```
foma[0]: define A %_ㄷ -> %_ㄹ || _ %/irrd %/vb ㅇ ;
defined A: 898 bytes. 7 states, 30 arcs, Cyclic.
foma[0]: regex A ;
898 bytes. 7 states, 30 arcs, Cyclic.
foma[1]: view
```
![test1](https://github.com/beyondnlp/nlp/raw/master/test1)

## test2
```
foma[0]: define A [하나:one] ;
defined A: 235 bytes. 2 states, 1 arc, 1 path.
foma[0]: define B [one:いち] ;
defined B: 235 bytes. 2 states, 1 arc, 1 path.
foma[0]: define C [いち:一] ;
defined C: 235 bytes. 2 states, 1 arc, 1 path.
foma[1]: regex A .o. B .o. C ;
294 bytes. 2 states, 1 arc, 1 path.
foma[1]: view
```
* ![test3](https://github.com/beyondnlp/nlp/raw/master/test3) .o.
* ![test5](https://github.com/beyondnlp/nlp/raw/master/test5) .o. 
* ![test6](https://github.com/beyondnlp/nlp/raw/master/test6)

![test2](https://github.com/beyondnlp/nlp/raw/master/test2)
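
Composition of these three single-pair transducers behaves like chaining lookup tables. A minimal Python sketch, with each transducer modelled as a dictionary from upper string to lower string (a simplification, since a real transducer can relate one input to many outputs):

```python
def compose(first, second):
    """Chain two mappings: keep x -> z whenever x -> y and y -> z."""
    return {x: second[y] for x, y in first.items() if y in second}


A = {"하나": "one"}
B = {"one": "いち"}
C = {"いち": "一"}

ABC = compose(compose(A, B), C)

if __name__ == "__main__":
    print(ABC)
```

The intermediate symbols one and いち disappear from the result, which matches the single arc reported for `A .o. B .o. C`.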

## notation
* define A a -> b || c _ d ;
* define symbol-name current-state -> next-state || preceding-condition _ following-condition
* m+1	다/EC	다/EF	SF	207832	4444
```
foma[0]: define test  다%/EC -> 다%/EF || _@%/SF;
defined test: 524 bytes. 3 states, 10 arcs, Cyclic.
foma[0]: regex test;
524 bytes. 3 states, 10 arcs, Cyclic.

```

![test4](https://github.com/beyondnlp/nlp/raw/master/test4)
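
The general form `a -> b || c _ d` can be approximated in Python with lookaround assertions. This is a sketch under a simplifying assumption: the contexts here are fixed strings, whereas in foma they may be arbitrary regular expressions. The function name is hypothetical.

```python
import re


def conditional_rewrite(text, target, replacement, left, right):
    """Replace target with replacement only between left and right."""
    pattern = (
        "(?<=" + re.escape(left) + ")"
        + re.escape(target)
        + "(?=" + re.escape(right) + ")"
    )
    # A function is used as the replacement so it is inserted literally.
    return re.sub(pattern, lambda match: replacement, text)


if __name__ == "__main__":
    print(conditional_rewrite("cad", "a", "b", "c", "d"))
    print(conditional_rewrite("cab", "a", "b", "c", "d"))
```

The lookarounds inspect the context without consuming it, which mirrors how the context symbols in a rewrite rule are matched but not rewritten.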

## test3
foma> read lexc test3.lexc;

foma> view;
```
Multichar_Symbols   /ac /ad /ai /am         ! adverbs
                  /di /dm /dn /du         ! determiners
                  /ec /ed /ef /en /ep /ex ! endings
                  /it                     ! interjections
                  /na /nc /nd /ni /nm /nn /np /nr /ns /nu /nb ! nominals
                  /pa /pc /pd /po /pp /ps /pt /pv /px /pq /pm ! particles
                  /vb /vi /vj /vx /vn     ! predicates
                  /xa /xj /xn /xv         ! affixes
                  /sc /se /sf /sl /sr /sd /su /sy /so ! symbols
                  /nh /ne /un ! Hanja (cHinese, English)
                  /irrL /irrb /irrd /irrh /irrl /irrs /irru ! irregular conjugation codes
                  %_ㄴ %_ㄹ %_ㅁ %_ㅂ %_ㅅ %_ㅇ ! start with a coda (endings, particles)

Definitions
      Digit       = %0|1|2|3|4|5|6|7|8|9 ;
      Alphabet    = a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z|
                    A|B|C|D|E|F|G|H|I|J|K|L|M|N|O|P|Q|R|S|T|U|V|W|X|Y|Z ;
      Hanja       = 廓 ;

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

LEXICON Root
     ncLexicon ; ! common noun

LEXICON ncLexicon ! common noun
가건물/nc   ncNext ;
네이버/nc   ncNext ;

LEXICON ncNext
   pxpxLexicon ;
   finLexicon ;

LEXICON pxpxLexicon ! auxiliary particle + auxiliary particle
까지/px만/px    pxpxNext ;
만/px도/px  pxpxNext ;
밖에/px도/px    pxpxNext ;
까지/px도/px    pxpxNext ;
조차/px도/px    pxpxNext ;
까지/px나/px    pxpxNext ;
뿐/px만/px  pxpxNext ;
마저/px도/px    pxpxNext ;
밖에/px%_ㄴ/px  pxpxNext ;
만/px으로/pa도/px   pxpxNext ;

LEXICON finLexicon
   # ;

```
![test7](https://github.com/beyondnlp/nlp/raw/master/test7)


## apply ordered rule
* Handle \* by defining a symbol for it
 * 다/EC 다/EF || _ \*/SF
* Understand how foma performs jamo decomposition


## lexc & xfst
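* Entries written in lexc are turned into edges one syllable at a time by default.
* To represent several syllables as a single edge, they must be declared separately in Multichar_Symbols.
![test7](https://github.com/beyondnlp/nlp/raw/master/test7)
* In contrast, when written in xfst, one edge is created per whitespace-separated unit.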
![test8](https://github.com/beyondnlp/nlp/raw/master/test8)
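
How a lexc entry is cut into edge labels can be sketched in Python. The sketch assumes that declared multicharacter symbols are matched longest-first and that everything else becomes one symbol per character; it ignores `%` escapes. `tokenize` is a hypothetical helper, not part of foma.

```python
def tokenize(entry, multichar_symbols):
    """Split an entry into symbols, preferring declared multichar symbols."""
    symbols = sorted(multichar_symbols, key=len, reverse=True)
    tokens = []
    position = 0
    while position < len(entry):
        for symbol in symbols:
            if entry.startswith(symbol, position):
                tokens.append(symbol)
                position += len(symbol)
                break
        else:
            # No declared symbol starts here: emit a single character.
            tokens.append(entry[position])
            position += 1
    return tokens


if __name__ == "__main__":
    print(tokenize("가건물/nc", {"/nc"}))
    print(tokenize("가건물/nc", set()))
```

Without the declaration, the tag `/nc` falls apart into three separate symbols, which is why the tags are listed under Multichar_Symbols.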


* `@` stands for any symbol that is not otherwise represented in the automaton, i.e. the complement of its symbol set.


# sample1
![sample1](https://github.com/beyondnlp/nlp/raw/master/sample1)
```
define n1 다 음 -> daum;
define n2 다 -> ㄷ ㅏ;
define n3 [ ㄷ | ㅏ ];

define n4 n1 .o. n2 ;
regex n4;

```


# sample2
![sample2](https://github.com/beyondnlp/nlp/raw/master/sample2)
## ~$[n3] added
```
define n1 다 음 -> daum;
define n2 다 -> ㄷ ㅏ;
define n3 [ ㄷ | ㅏ ];

define n4 n1 .o. n2 .o. ~$[n3];
regex n4;

```


# sample3
![sample3](https://github.com/beyondnlp/nlp/raw/master/sample3)
## ~$[n3] added
```
define n1 다 음 -> daum;
define n2 다 -> ㄷ ㅏ;
define n3 [ ㄷ  ];

define n4 n1 .o. n2 .o. ~$[n3];
regex n4;

```

# sample4
![sample4](https://github.com/beyondnlp/nlp/raw/master/sample4)
## ~$[n3] added
```
define n1 다 음 -> daum;
define n2 다 -> ㄷ ㅏ;
define n3 [ ㅏ ];

define n4 n1 .o. n2 .o. ~$[n3];
regex n4;

```



# complement 
![comp1](https://github.com/beyondnlp/nlp/raw/master/comp1)
## The complement operation applies only to the output string.
## The result depends on the order of the operations.
```
define n1 다 음 -> daum;
define n2 다 -> ㄷ ㅏ;
define n3 음 -> ㅇㅡㅁ;


define test n1 .o. n2 .o. n3;
regex test;

```

## Automaton after the complement operation
![comp2](https://github.com/beyondnlp/nlp/raw/master/comp2)
```
define n1 다 음 -> daum;
define n2 다 -> ㄷ ㅏ;
define n3 음 -> ㅇㅡㅁ;
define n4 [ ㄷ | ㅏ | ㅇ | %_ | ㅁ ];

define test n1 .o. n2 .o. n3 .o. ~$[n4];
regex test;

```

## Writing the automaton above to a file (: write att > output.txt )
```
0   0   @_IDENTITY_SYMBOL_@ @_IDENTITY_SYMBOL_@
0   0   ㅇㅡㅁ  ㅇㅡㅁ
0   0   daum    daum
0   0   음  ㅇㅡㅁ
0   1   다  daum
1   0   음  @0@
0

```

## Changing the order of the complement operation
![comp3](https://github.com/beyondnlp/nlp/raw/master/comp3)
```
define n1 다 음 -> daum;
define n2 다 -> ㄷ ㅏ;
define n3 음 -> ㅇㅡㅁ;
define n4 [ ㄷ | ㅏ | ㅇ | %_ | ㅁ ];

define test ~$[n4] .o. n1 .o. n2 .o. n3;
regex test;

```
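
Why the position of `~$[...]` matters can be shown with a simplified Python model. Here a single rewrite (다 to ㄷ ㅏ) stands in for the chain of rules, and the filter rejects any symbol sequence that contains a forbidden symbol. Placing the filter last checks the output; placing it first checks the input. All names are hypothetical and the model works on lists of symbols.

```python
FORBIDDEN = {"ㄷ", "ㅏ"}


def rewrite(symbols):
    """Stand-in for the rule chain: replace 다 with the two symbols ㄷ ㅏ."""
    output = []
    for symbol in symbols:
        output.extend(["ㄷ", "ㅏ"] if symbol == "다" else [symbol])
    return output


def passes_filter(symbols):
    """Model of ~$[n]: accept only sequences with no forbidden symbol."""
    return not any(symbol in FORBIDDEN for symbol in symbols)


def filter_last(symbols):
    """rules .o. ~$[n]: the filter sees the output of the rules."""
    output = rewrite(symbols)
    return output if passes_filter(output) else None


def filter_first(symbols):
    """~$[n] .o. rules: the filter sees the input of the rules."""
    return rewrite(symbols) if passes_filter(symbols) else None


if __name__ == "__main__":
    print(filter_last(["다"]))
    print(filter_first(["다"]))
```

With the filter last, the input 다 is rejected because its output contains ㄷ and ㅏ; with the filter first, the same input passes because the filter only sees 다.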

# insert operation test
![insert1](https://github.com/beyondnlp/nlp/raw/master/insert1)
* Use the following form to insert another symbol between symbols under a given condition
## [..] -> B || C _ D Insert a single B between C and D
```
define test [..] -> B || C _ D ;
regex test;
apply down> CDA
CBDA
```
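
The insertion rule corresponds to a zero-width match in an ordinary regular expression. A Python sketch, assuming fixed-string contexts (the function name is hypothetical):

```python
import re


def insert_between(text, inserted, left, right):
    """Insert a string at every position preceded by left and followed by right."""
    pattern = "(?<=" + re.escape(left) + ")(?=" + re.escape(right) + ")"
    return re.sub(pattern, lambda match: inserted, text)


if __name__ == "__main__":
    print(insert_between("CDA", "B", "C", "D"))
```

The pattern matches the empty string between C and D, which plays the role of `[..]` in the foma rule.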


# optional replace operation test
![op_insert1](https://github.com/beyondnlp/nlp/raw/master/op_insert1)
* Optionally produces the B symbol from the A symbol, so both A and B are output
```
define test A (->) B;
regex test;
apply down> A
A
B
```
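
An optional rewrite yields one output per combination of "rewritten" and "left alone" choices. A Python sketch that enumerates them, assuming the target cannot overlap with itself (true for a single symbol such as A); the function name is hypothetical:

```python
from itertools import product


def optional_rewrite(text, target, replacement):
    """Return every result of rewriting any subset of target occurrences."""
    parts = text.split(target)
    outputs = set()
    for choice in product((target, replacement), repeat=len(parts) - 1):
        outputs.add(parts[0] + "".join(c + p for c, p in zip(choice, parts[1:])))
    return sorted(outputs)


if __name__ == "__main__":
    print(optional_rewrite("A", "A", "B"))
    print(optional_rewrite("AA", "A", "B"))
```

An input with n occurrences of the target therefore has up to 2^n outputs, compared with exactly one for the obligatory `A -> B`.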


# replace operation test
![replace1](https://github.com/beyondnlp/nlp/raw/master/replace1)
* Replaces the A symbol with the B symbol
```
define test A -> B;
regex test;
foma[1]: down
apply down> A
B
apply down> AAA
BBB
apply down> BBB
BBB
apply down> BAB
BBB
```