(pytorch) data class

- Dataset class example (nsmc data)
  - Implement `__len__` and `__getitem__`.
  - `__len__` returns the total number of documents.
  - `__getitem__` makes the dataset indexable with `[]`, so the DataLoader can fetch individual items.
- nsmc sample (tab-separated: id, document, label)
```
#id	document	label
9976970	아 더빙.. 진짜 짜증나네요 목소리	0
3819312	흠...포스터보고 초딩영화줄....오버연기조차 가볍지 않구나	1
10265843	너무재밓었다그래서보는것을추천한다	0
9045019	교도소 이야기구먼 ..솔직히 재미는 없다..평점 조정	0
6483659	사이몬페그의 익살스런 연기가 돋보였던 영화!스파이더맨에서 늙어보이기만 했던 커스틴 던스트가 너무나도 이뻐보였다	1
5403919	막 걸음마 뗀 3세부터 초등학교 1학년생인 8살용영화.ㅋㅋㅋ...별반개도 아까움.	0
7797314	원작의 긴장감을 제대로 살려내지못했다.	0
9443947	별 반개도 아깝다 욕나온다 이응경 길용우 연기생활이몇년인지..정말 발로해도 그것보단 낫겟다 납치.감금만반복반복..이드라마는 가족도없다 연기못하는사람만모엿네	0
7156791	액션이 없는데도 재미 있는 몇안되는 영화	1
```
- Dataset class example
  - vocab : the vocabulary object used to encode text (a SentencePiece-style processor, since `encode_as_ids` is called on it)
  - infile : name of the file used for training or testing

```python
import torch

class MovieDataSet(torch.utils.data.Dataset):
    def __init__(self, vocab, infile):
        self.vocab = vocab
        self.labels = []
        self.sentences = []
        with open(infile, "r") as f:
            for line in f:
                if line[0] == '#':          # skip the header line
                    continue
                line = line.rstrip()
                term = line.split("\t")
                ids = vocab.encode_as_ids(term[1])  # document -> token ids
                label = [int(term[2])]
                self.sentences.append(ids)
                self.labels.append(label)
```
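- The call to `vocab.encode_as_ids` suggests the vocab is a SentencePiece processor. A minimal sketch of loading one, assuming a pre-trained model file exists (the file name `kowiki.model` is only a placeholder):

```python
import sentencepiece as spm

vocab = spm.SentencePieceProcessor()
vocab.load("kowiki.model")  # hypothetical model file; train or download separately

print(vocab.encode_as_ids("정말 재미있는 영화"))  # -> list of token ids
```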
- `__len__` : returns the total number of examples (continuing the class body above)

```python
    def __len__(self):
        assert len(self.labels) == len(self.sentences)
        return len(self.labels)
```
- `__getitem__` : makes individual items retrievable by index
  - implements the `[]` operator

```python
    def __getitem__(self, item):
        label = torch.tensor(self.labels[item])
        sent = torch.tensor(self.sentences[item])
        return (label, sent)
```
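- With both methods in place, the dataset behaves like a sequence. A quick sanity check, assuming the vocab above and an NSMC-style data file (the file name is a placeholder):

```python
dataset = MovieDataSet(vocab, "ratings_train.txt")  # hypothetical file name
print(len(dataset))       # total number of examples, via __len__
label, sent = dataset[0]  # the [] operator, via __getitem__
print(label, sent)        # label tensor and tensor of token ids
```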
- Function for pulling data out in batch units (passed to the DataLoader as `collate_fn`)

```python
def movie_collate_fn(inputs):
    # each element of inputs is one (label, sentence) pair from __getitem__
    labels, enc_inputs = list(zip(*inputs))
    # pad the variable-length sentences to the longest one in the batch
    enc_inputs = torch.nn.utils.rnn.pad_sequence(enc_inputs, batch_first=True, padding_value=0)
    batch = [
        torch.stack(labels, dim=0),
        enc_inputs,
    ]
    return batch
```
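- A small illustration of what the collate function produces; the token ids below are made-up values, not real vocab entries:

```python
fake_batch = [
    (torch.tensor([0]), torch.tensor([5, 8, 2])),         # 3 tokens
    (torch.tensor([1]), torch.tensor([7, 3, 9, 4, 6])),   # 5 tokens
]
labels, enc_inputs = movie_collate_fn(fake_batch)
print(labels)      # tensor([[0], [1]]) -- shape (2, 1)
print(enc_inputs)  # shape (2, 5); the first row is padded with 0s
```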
- Instantiating the datasets

```python
train_dataset = MovieDataSet(vocab, train_file)
test_dataset = MovieDataSet(vocab, test_file)
```
- Combining with a DataLoader

```python
train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=batch_size, shuffle=True, collate_fn=movie_collate_fn)
test_loader = torch.utils.data.DataLoader(
    test_dataset, batch_size=batch_size, shuffle=False, collate_fn=movie_collate_fn)
```
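- One batch can be pulled out directly to confirm the shapes; `batch_size` here is whatever was passed to the DataLoader:

```python
labels, enc_inputs = next(iter(train_loader))
print(labels.shape)      # (batch_size, 1)
print(enc_inputs.shape)  # (batch_size, longest sentence in the batch)
```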
- Usage (train_loader)

```python
def train_epoch(config, epoch, model, criterion, optimizer, train_loader):
    for i, value in enumerate(train_loader):
        ...
```
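- The loop body is elided above. A minimal sketch of a typical training step, assuming `model` maps padded input ids to class logits and `criterion` is something like `nn.CrossEntropyLoss` (both interfaces are assumptions, not part of the original):

```python
def train_epoch(config, epoch, model, criterion, optimizer, train_loader):
    model.train()
    losses = []
    for i, value in enumerate(train_loader):
        labels, enc_inputs = value        # batch produced by movie_collate_fn
        optimizer.zero_grad()
        logits = model(enc_inputs)        # assumed model interface: ids -> logits
        loss = criterion(logits, labels.squeeze(-1))  # (batch, 1) -> (batch,)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    return sum(losses) / len(losses)      # average loss over the epoch
```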