BERT MEETS RELATIONAL DB: LEARNING DEEP CONTEXTUAL REPRESENTATIONS OF RELATIONAL DATABASES - Songwooseok123/Study_Space GitHub Wiki

Paper link: BERT MEETS RELATIONAL DB: LEARNING DEEP CONTEXTUAL REPRESENTATIONS OF RELATIONAL DATABASES (arXiv 2021)

1. RDB & Downstream Tasks

In this paper, we address the problem of learning embeddings of entities on relational databases consisting of multiple tables.

[RDB Downstream Task]

Auto-completion of tables
Missing value prediction
Query processing of relational join queries
Data integration
Join predictions
etc.

✓ Quality of embedding ∝ Downstream tasks

-> We therefore need good low-dimensional representations (embeddings) of RDB entities.

[RDB structure and why embedding is hard]

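An RDB has a schema and is normalized according to it, so relationships among entities are rarely visible within a single table.
- Semantic relationships across tables must be taken into account.
Relationships between columns are hard to learn.
- e.g. l_name: Ji, f_name: Jun-young

Existing NLP deep learning models trained on huge text corpora cannot be used effectively on an RDB, which stores its data as tables.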

2. Related works

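Table2Vec : treats the rows of tables as sentences and trains on them with word2vec.
- The same entity gets the same embedding in whichever column it appears -> an entity should mean something different in a different column, but this is not modeled.
  - e.g. Lee Jung-jae, who is both a director and an actor
EmbDi (SOTA model) : uses random walks, as in graph embedding, to capture relationships between columns.
- It cannot capture FK-PK relationships (semantic relationships between tables).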

3. In this work

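[Proposing the `RELBERT` model]

"We focused on using a BERT-style model to learn the semantic relationships between columns and the semantic relationships between tables."

(A variant of the model is presented for each of two tasks.)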

1. RELBERT-A

Table Autocompletion(or missing value imputation)

2. RELBERT-J

Join Prediction

4. Problem Description

[Notation]

RDB : $D$
Set of tables : $T_\alpha, T_\beta, \dots \in D$
For each table $T_{\bullet}$
- Table attributes (columns) : $T_{\bullet}^{A}, T_{\bullet}^{B}, \dots$
- Each tuple (row) is addressed via its primary key.
Number of columns in a table : $C_{T_\bullet}$
Joined version of two tables $T_{\alpha}$ and $T_{\beta}$
- $T_\alpha\Join_{\tau_i,\tau_j} T_\beta$
- primary-key : $\tau_i\in T_\alpha$
- foreign-key : $\tau_j\in T_\beta$
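
The join notation above can be sketched on toy data. This is only an illustration: the table contents and the helper name `pk_fk_join` are made up here and do not come from the paper.

```python
# Toy tables (hypothetical data): `movies` plays the role of T_alpha with
# primary key "movie_id"; `movies_directors` plays T_beta and refers to it
# through the foreign key "movie_id".
movies = [
    {"movie_id": 1, "name": "A", "year": 1999},
    {"movie_id": 2, "name": "B", "year": 2005},
]
movies_directors = [
    {"director_id": 10, "movie_id": 1},
    {"director_id": 11, "movie_id": 1},
    {"director_id": 12, "movie_id": 2},
]


def pk_fk_join(t_alpha, t_beta, pk, fk):
    """Return the rows of t_alpha joined with t_beta where t_alpha[pk] == t_beta[fk]."""
    index = {row[pk]: row for row in t_alpha}  # primary keys are unique
    joined = []
    for row in t_beta:
        match = index.get(row[fk])
        if match is not None:
            joined.append({**match, **row})
    return joined


joined = pk_fk_join(movies, movies_directors, "movie_id", "movie_id")
```

Each joined row repeats the attributes of its `movies` row, which is why a fully joined table grows quickly as more tables are chained together.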

4.1 Autocompletion

Goal : Get a ranked list of most probable candidates for the masked entity.

4.2 Join Prediction

Goal : Given a tuple of $T_\alpha$, predict the tuples of $T_\beta$ that it can join with (assuming the two tables $T_\alpha$ and $T_\beta$ are joinable on $\tau_i \in T_\alpha$ and $\tau_j \in T_\beta$).

5. Proposed Model

5.1 Entity Embedding

Initialization : every entity in the tables is embedded with pre-trained word2vec.
Column Embedding : entities are embedded separately for each column.

$$P_i^k=\Phi^k(T_i^k),\quad\forall k,i\in T$$

Table Encoding : a self-attention model outputs the embedding vectors of the entities in the input row.

$$o_i^k=\mathrm{TableEncoder}(P_i^k)$$

k : the k-th column, i : the row (sentence) index
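
A minimal sketch of the column-wise embedding idea, assuming one lookup table per column. Random vectors stand in for the pre-trained word2vec initialization, and the function names and toy rows are hypothetical.

```python
import random


def build_column_embeddings(table, dim=4, seed=0):
    """One lookup table per column: {column: {entity: vector}}.

    Assumption: random initialization replaces the word2vec initialization
    described in the notes, purely to keep the sketch dependency-free.
    """
    rng = random.Random(seed)
    embeddings = {}
    for row in table:
        for column, entity in row.items():
            col_table = embeddings.setdefault(column, {})
            if entity not in col_table:
                col_table[entity] = [rng.uniform(-1, 1) for _ in range(dim)]
    return embeddings


def embed_row(row, embeddings):
    """P_i: the per-column vectors of one row (one 'sentence')."""
    return [embeddings[column][entity] for column, entity in row.items()]


# Toy rows: the same name appears once as an actor and once as a director.
table = [
    {"actor": "Lee Jung-jae", "director": "Director A"},
    {"actor": "Actor B", "director": "Lee Jung-jae"},
]
emb = build_column_embeddings(table)
p_0 = embed_row(table[0], emb)
```

Because the lookup is keyed by column first, `emb["actor"]["Lee Jung-jae"]` and `emb["director"]["Lee Jung-jae"]` are different vectors, unlike a single shared vocabulary.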

5.2 MLM

Goal : learn the relationships between columns.

Since each column k has a different embedding space, computed by $\Phi^k$, we determine the candidate entities for the masked entity in the sentence using an output-softmax over the entities in the column of the masked entity. We learn the parameters for the horizontal transformer by optimizing the cross entropy loss function for each sentence in the table T as:

$$\mathcal{L}_{mlm} = \displaystyle\sum_{i=1}^{|T|}{\mathrm{CrossEntropy}(O_i^m,P_i^{mask})}$$

$O_i = \mathrm{TableEncoder}(P_i)$
$O_i^m$ : output embeddings corresponding to the masked index
$P_i^{mask}$ : entities of the masked column
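
The per-column softmax can be sketched as follows. Scoring candidates by a dot product between the masked-position output and each candidate embedding is an assumption of this sketch; the notes only state that the softmax runs over the entities of the masked column.

```python
import math


def column_softmax(output_vec, candidate_vecs):
    """Softmax over scores between the masked-position output and every
    entity embedding of the masked column (dot-product scoring is assumed)."""
    scores = [sum(o * c for o, c in zip(output_vec, vec)) for vec in candidate_vecs]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def masked_cross_entropy(output_vec, candidate_vecs, true_index):
    """Cross entropy of the true entity under the column softmax."""
    probs = column_softmax(output_vec, candidate_vecs)
    return -math.log(probs[true_index])


# Three hypothetical candidate entities in the masked column.
candidates = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
o_m = [2.0, 0.0]  # output embedding at the masked index
probs = column_softmax(o_m, candidates)
loss_right = masked_cross_entropy(o_m, candidates, true_index=0)
loss_wrong = masked_cross_entropy(o_m, candidates, true_index=2)
```

Sorting the candidates by `probs` gives the ranked list that the autocompletion task asks for.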

5.3 NSP

Goal : learn the relationships between tables.
Suppose two tables $T_\alpha$ and $T_\beta$ can be joined.
- primary-key column: $T_\alpha^A$ , foreign-key column: $T_\beta^{\hat{A}}$
If $\tau_i\in T_\alpha^A$ is linked to $\tau_j\in T_\beta^{\hat{A}}$, the latter's row is the next sentence.
The output embedding of the [CLS] token is used.

The loss for NSP:

$$\mathcal{L}_{nsp} = - \displaystyle\sum_{i=1}^{|T_\alpha|}\displaystyle\sum_{\tau_i\in T_\alpha^A}\left[\log(\sigma(v_{\tau_i,\tau_j}))+\log(1-\sigma(v_{\tau_i,\tau_j^{'}}))\right]$$
- $v_{\tau_i,\tau_j}$ : output embedding at the [CLS] token
- $\tau_j^{'}\in T_\beta^{\hat{A}}$ : negative sampled row
- $\tau_j\in T_\beta^{\hat{A}}$ : true sampled row
- $\sigma$ : sigmoid function
- This is a negative-sampling objective function.
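
A minimal numeric sketch of this objective, assuming the [CLS] output has already been reduced to one scalar score per row pair. The function names and the example scores are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def nsp_loss(positive_scores, negative_scores):
    """-sum[ log(sigma(v_pos)) + log(1 - sigma(v_neg)) ] over paired samples.

    positive_scores[i] is the score of a truly linked row pair and
    negative_scores[i] the score of the negative pair sampled for it.
    """
    total = 0.0
    for v_pos, v_neg in zip(positive_scores, negative_scores):
        total += math.log(sigmoid(v_pos)) + math.log(1.0 - sigmoid(v_neg))
    return -total


# Scores that separate linked from unlinked pairs give a small loss ...
good = nsp_loss([3.0, 2.5], [-3.0, -2.0])
# ... and scores that confuse them give a large one.
bad = nsp_loss([-3.0, -2.5], [3.0, 2.0])
```

With uninformative scores of 0 for both pairs, each term contributes $\log 2$, so one positive/negative pair yields a loss of $2\log 2$.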

5.4 RELBERT-A

Goal : predict the missing entity.

Finding candidates for a missing entity requires information from every table.
Getting information from every table would require full denormalization.
- However, computing every FK-PK join across the tables is too costly.
- It may also bundle unrelated columns together and blow up the table size.
To remove this bottleneck, the model learns inter-table relationships through the NSP task instead of actually joining the tables.
1. Pre-training
- Without joining the two tables, rows taken from each table are paired as sentences and the MLM task is run on them.
2. Fine-tuning
- After pre-training, the model is fine-tuned to predict the masked entity by optimizing $\mathcal{L}_A$: $$\mathcal{L}_A = \mathcal{L}_{mlm}+\mathcal{L}_{nsp}$$
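
One way to picture the row pairing is to build BERT-style inputs from linked and unlinked rows without materializing the join. This sketch only constructs inputs and labels. The `column=value` token format, the helper names, and the toy tables are assumptions, not the paper's implementation.

```python
import random


def make_pair_sentence(row_alpha, row_beta, mask_beta_column=None):
    """Concatenate two rows as [CLS] row_alpha [SEP] row_beta,
    optionally masking one entity of the second row."""
    tokens = ["[CLS]"]
    tokens += [f"{col}={val}" for col, val in row_alpha.items()]
    tokens.append("[SEP]")
    for col, val in row_beta.items():
        tokens.append("[MASK]" if col == mask_beta_column else f"{col}={val}")
    return tokens


def sample_pairs(t_alpha, t_beta, pk, fk, rng):
    """Return (row_alpha, row_beta, label) triples: label 1 for a PK-FK
    linked pair, label 0 for a randomly sampled unlinked pair."""
    pairs = []
    for row_a in t_alpha:
        linked = [r for r in t_beta if r[fk] == row_a[pk]]
        unlinked = [r for r in t_beta if r[fk] != row_a[pk]]
        for row_b in linked:
            pairs.append((row_a, row_b, 1))
            if unlinked:
                pairs.append((row_a, rng.choice(unlinked), 0))
    return pairs


# Hypothetical toy tables.
movies = [{"movie_id": 1, "name": "A"}, {"movie_id": 2, "name": "B"}]
movies_directors = [
    {"movie_id": 1, "director_id": 10},
    {"movie_id": 1, "director_id": 11},
    {"movie_id": 2, "director_id": 12},
]
pairs = sample_pairs(movies, movies_directors, "movie_id", "movie_id", random.Random(0))
tokens = make_pair_sentence(movies[0], movies_directors[0], mask_beta_column="director_id")
```

The labels feed the NSP term and the `[MASK]` position feeds the MLM term, so both parts of $\mathcal{L}_A$ can be computed from the same pair.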

5.5 RELBERT-J

Goal : given a tuple of $T_\alpha$, predict the tuples of $T_\beta$ that it can join with.

Pre-training: we run independent masked-language models on each table in the database.
- For tables with related columns $T_\alpha^A$ , $T_\beta^{\hat{A}}$ , the entity embeddings are learned. $$P_\alpha=\mathrm{MLM}(T_\alpha),\quad P_\beta=\mathrm{MLM}(T_\beta)$$
Fine-tuning: the NSP loss is computed to capture the relationships between tables that are linked by a relational join.

$$\mathcal{L}_J=\mathrm{NSP}(P_\alpha,P_\beta,T_\alpha^A,T_\beta^{\hat{A}})$$

6. Experiments

6.1 Setup

Dataset : IMDB, MIMIC
Baseline Model : Table2Vec, EmbDi
Task
- Auto-completion or Missing value prediction
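  - On the IMDB dataset, predict the director of each movie
  - On the MIMIC dataset, predict the DRGCODE of each patient
  - train:validation:test = 70:15:15
- Join prediction
  - On the IMDB dataset, performed on the relationship between the Movies table and the Movies_Directors table (joinable on movie_id)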
Metric
Hitrate@10, MR(Mean Rank), MRR(Mean Reciprocal Rank)
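
The three metrics can be computed from the 1-based rank of the true entity in each ranked candidate list. A small sketch with made-up rankings (the helper names are hypothetical):

```python
def rank_of_truth(ranked_candidates, truth):
    """1-based rank of the true entity in a ranked candidate list."""
    return ranked_candidates.index(truth) + 1


def evaluate(ranked_lists, truths, k=10):
    """Hitrate@k, Mean Rank and Mean Reciprocal Rank over all queries."""
    ranks = [rank_of_truth(r, t) for r, t in zip(ranked_lists, truths)]
    n = len(ranks)
    return {
        "hitrate@k": sum(1 for r in ranks if r <= k) / n,
        "mr": sum(ranks) / n,
        "mrr": sum(1.0 / r for r in ranks) / n,
    }


# Three toy queries whose true answer "a" is ranked 1st, 2nd and 3rd.
ranked_lists = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
truths = ["a", "a", "a"]
metrics = evaluate(ranked_lists, truths, k=2)
```

Higher is better for Hitrate@k and MRR, while lower is better for MR.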

6.2 Auto-completion

Predict a ranked list of candidates for the missing entity.
RELBERT-A far outperforms Table2Vec.
- Embedding per column works better than embedding a whole row.
On the IMDB dataset, RELBERT-A falls slightly behind EmbDi.
- The IMDB dataset carries more information, so EmbDi's random walks work better there.
- However, the random-walk approach is less computationally efficient than RELBERT.
[0~10] -> [movie_id, movies_name, movies_year, movies_rank, actors_last_name, movies_genre, actor_id, role, actors_first_name, directors_id, directors_genre]
In the self-attention heatmap, the model tends to attend to columns of other tables, not only to information from the same table.
Different heads also attend to different information.

6.3 Join Prediction

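No prior work exists to compare against on this task, so the only possible comparison is with the standard DB join.
Neural-network-based join prediction between tables works fairly effectively.
In particular, increasing the number of negative samples improves prediction.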

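7. Conclusion & Takeaways

Unlike prior work that simply embedded with word2vec, this work introduces a BERT-based model.
Column-wise embedding means that per-column information is not lost when a row is fed in as input.
The model does not load a pre-trained model the way BERT is actually used; it only uses the Transformer encoder architecture, so I think it falls short at understanding RDB data.
The experimental results are not very strong, so I think follow-up research needs to continue.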

[Future work]

Train with a vertical Transformer encoder to capture relationships within a column's data.
Embed the SQL actually executed on the DB to learn query-sensitive semantic joins.
