Bert - HongkuanZhang/Technique-Notes GitHub Wiki
How to use the Japanese BERT model (Tohoku University)
Some lines in the snippet shown in the image above had problems; the corrected version follows:
import torch
from transformers import BertForMaskedLM, BertJapaneseTokenizer

tokenizer = BertJapaneseTokenizer.from_pretrained('bert-base-japanese-whole-word-masking')
model = BertForMaskedLM.from_pretrained('bert-base-japanese-whole-word-masking')
model.eval()
input_ids = tokenizer.encode(f"外出できない時、家に{tokenizer.mask_token}するしかない。", return_tensors='pt')
masked_index = torch.eq(input_ids, tokenizer.mask_token_id)[0].tolist().index(1)
result = model(input_ids)
answers = result[0][0, masked_index].topk(10)[1].tolist()
for a in answers:
    output = input_ids[0].tolist()
    output[masked_index] = a
    print(tokenizer.decode(output))
Notes on the source code in huggingface's modeling.py
- References
- All of the code below comes from the library source; you only need to call it, not write it yourself.
- The BERT base model is a stack of 12 encoders, and each encoder consists of three sub-layers: a self-attention layer, an intermediate layer, and an output layer.
- Each encoder takes two inputs: the previous layer's output, of shape (batch_size, seq_len, hidden_size), and an attention mask carrying the padding information, of shape (batch_size, 1, 1, seq_len).
Preparing the inputs
- The input sentences first go through the embedding layer. The embedding layer's output is computed as follows:
# The inputs are input_ids, token_type_ids and position_ids, all of shape (batch_size, seq_length).
# input_ids are batches like [[12,45,34,98,...],[...],...] produced by mapping words to ids.
# token_type_ids are batches like [[0,0,0,0,1,1,1,...],[...],...]; 0 marks a token of the first sentence, 1 a token of the second.
# position_ids are batches like [[0,1,2,...,seq_length-1],[...],...]; each number is the token's position in the sentence.
# In practice the model only needs the first two from outside; the position ids are generated for us in the source code:
position_ids = torch.arange(seq_length, dtype=torch.long, device=input_ids.device)
# Reshape to the same shape as input_ids
position_ids = position_ids.unsqueeze(0).expand_as(input_ids)
# If token_type_ids is None, the whole input is treated as sentence A by default (suitable for NER tasks)
if token_type_ids is None:
    token_type_ids = torch.zeros_like(input_ids)
# The embedding layer in the source produces three embeddings, each of shape (batch_size, seq_len, hidden_size)
words_embeddings = self.word_embeddings(input_ids)
position_embeddings = self.position_embeddings(position_ids)
token_type_embeddings = self.token_type_embeddings(token_type_ids)
# Sum the three embeddings
embeddings = words_embeddings + position_embeddings + token_type_embeddings
# The sum is passed through the LayerNorm and dropout layers to obtain the final output, which is returned
embeddings = self.LayerNorm(embeddings)
embeddings = self.dropout(embeddings)
return embeddings
The LayerNorm layer works as follows:
u = x.mean(-1, keepdim=True)
s = (x - u).pow(2).mean(-1, keepdim=True)
x = (x - u) / torch.sqrt(s + self.variance_epsilon)
return self.weight * x + self.bias
# The code above implements the formula (x - u) / standard deviation, where x is a vector and u is the (scalar) mean; the denominator is the standard deviation (see any probability text for the meaning of this formula).
# variance_epsilon is a tiny number whose purpose is to prevent the denominator (the standard deviation) from being 0.
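As a quick sanity check, the same normalization can be sketched in NumPy; `weight`, `bias` and `eps` here are illustrative stand-ins for the module's learned parameters and `variance_epsilon`:

```python
import numpy as np

def layer_norm(x, weight, bias, eps=1e-12):
    # Normalize over the last (hidden) dimension, as BertLayerNorm does
    u = x.mean(-1, keepdims=True)                # per-token mean
    s = ((x - u) ** 2).mean(-1, keepdims=True)   # per-token (biased) variance
    return weight * (x - u) / np.sqrt(s + eps) + bias

x = np.array([[1.0, 2.0, 3.0, 4.0]])             # one token, hidden_size = 4
out = layer_norm(x, weight=np.ones(4), bias=np.zeros(4))
print(out.mean(), out.std())                     # ~0 and ~1 after normalization
```

With `weight = 1` and `bias = 0` the output of each token has zero mean and unit standard deviation, which is exactly what the four lines above compute.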
- Besides the inputs above, the model also needs an input called attention_mask; what we have to provide is:
attention_masks = [[float(i>0) for i in ii] for ii in input_ids]
In the source code this input is converted into extended_attention_mask:
# First, attention_mask is reshaped to (batch_size, 1, 1, to_seq_length) and assigned to extended_attention_mask.
# Then, while in the original mask 1 marks useful information and 0 marks padding, the line below turns this into 0 for useful information and -10000 for padding. (This is done for the later softmax; the reason is explained below.)
extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0
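The two steps can be sketched end to end in NumPy (the ids below are made up; 0 stands for the padding id):

```python
import numpy as np

# Toy batch of ids, 0 = padding
input_ids = np.array([[12, 45, 34, 0, 0],
                      [7,  99,  0, 0, 0]])

# 1.0 for real tokens, 0.0 for padding -- same as the list comprehension above
attention_mask = (input_ids > 0).astype(np.float32)

# Broadcastable shape (batch_size, 1, 1, seq_len), then the 0 / -10000 encoding
extended = attention_mask[:, None, None, :]
extended = (1.0 - extended) * -10000.0

print(extended.shape)      # (2, 1, 1, 5)
print(extended[0, 0, 0])   # 0 for the three real tokens, -10000 for the two pads
```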
Entering the Transformer layers
- After the embedding layer, the output goes into the encoder stack
# Get the output from the embeddings layer, feed it into the encoder, and obtain the final output encoded_layers
embedding_output = self.embeddings(input_ids, token_type_ids)
encoded_layers = self.encoder(embedding_output,
                              extended_attention_mask, output_all_encoded_layers=output_all_encoded_layers)
- Details of BertEncoder (which holds the stack of encoders)
# BertEncoder contains a stack of encoders (12 layers for base, 24 for large in the paper); each encoder is one BertLayer in the code.
# The code below first declares one layer, then puts num_hidden_layers (12 or 24) identical copies of it into the list self.layer.
layer = BertLayer(config)
self.layer = nn.ModuleList([copy.deepcopy(layer) for _ in range(config.num_hidden_layers)])
# Now its forward function.
# hidden_states is embedding_output, of shape (batch_size, seq_len, hidden_size)
# attention_mask: shape (batch_size, 1, 1, seq_len)
# output_all_encoded_layers: output mode -- return every encoder's output, or only the last encoder's
def forward(self, hidden_states, attention_mask, output_all_encoded_layers=True):
    # Feed each encoder's output to the next encoder as input, until all 12 (or 24) layers are done
    all_encoder_layers = []
    # Iterate over all encoders, 12 or 24 in total
    for layer_module in self.layer:
        # Each layer's output hidden_states is also the next layer_module's (BertLayer's) input, which chains the encoders together; the first layer's input is embedding_output
        hidden_states = layer_module(hidden_states, attention_mask)
        # If output_all_encoded_layers == True, append every layer's result to all_encoder_layers
        if output_all_encoded_layers:
            all_encoder_layers.append(hidden_states)
    # If output_all_encoded_layers == False, append only the last layer's output to all_encoder_layers
    if not output_all_encoded_layers:
        all_encoder_layers.append(hidden_states)
    return all_encoder_layers
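The control flow of this forward pass can be mimicked with plain Python functions standing in for layers; a toy sketch, not the real module:

```python
# Each "layer" here is just a function on an int, standing in for a BertLayer
layers = [lambda x, k=k: x + k for k in range(3)]   # a hypothetical 3-layer stack

def run_stack(hidden, output_all_encoded_layers=True):
    all_outputs = []
    for layer in layers:
        hidden = layer(hidden)          # each layer's output feeds the next layer
        if output_all_encoded_layers:
            all_outputs.append(hidden)
    if not output_all_encoded_layers:
        all_outputs.append(hidden)      # keep only the final output
    return all_outputs

print(run_stack(0))          # every intermediate output
print(run_stack(0, False))   # only the last layer's output
```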
- Inside BertLayer, hidden_states passes through three main sub-layers
# 1. BertAttention layer
attention_output = self.attention(hidden_states, attention_mask)
# 2. Intermediate layer
intermediate_output = self.intermediate(attention_output)
# 3. Output layer
layer_output = self.output(intermediate_output, attention_output)
- BertAttention Layer
# BertAttention takes two inputs: one is input_tensor / hidden_states (embedding_output for the first layer), of shape (batch_size, seq_len, hidden_size);
# the other is attention_mask, of shape (batch_size, 1, 1, seq_len).
# The input tensor first goes through a BertSelfAttention layer, then a BertSelfOutput layer, which gives the output:
def forward(self, input_tensor, attention_mask):
    self_output = self.attention(input_tensor, attention_mask)  # BertSelfAttention layer
    attention_output = self.output(self_output, input_tensor)   # BertSelfOutput layer
    return attention_output
############################################################################################################
# The BertSelfAttention layer
# Number of attention heads, 12 in the config
self.num_attention_heads = config.num_attention_heads
# attention_head_size: the size of each head, obtained by dividing the total size (hidden_size, 768) by the number of heads: 768/12 = 64
self.attention_head_size = int(config.hidden_size / config.num_attention_heads)
# all_head_size is the same as hidden_size here (768)
self.all_head_size = self.num_attention_heads * self.attention_head_size
# The query, key and value projections are declared here, each a hidden_size x all_head_size (768x768) matrix
self.query = nn.Linear(config.hidden_size, self.all_head_size)
self.key = nn.Linear(config.hidden_size, self.all_head_size)
self.value = nn.Linear(config.hidden_size, self.all_head_size)
# Dropout layer
self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
# How self-attention is computed
# Input 1, hidden_states: (batch_size, seq_len, word_dim = hidden_size = 768)
# Input 2, attention_mask: (batch_size, 1, 1, seq_len)
def forward(self, hidden_states, attention_mask):
    # Briefly, on the roles of query, key and value: query and key determine the attention weights, and value is summed with those weights to give the attention output.
    # First come plain matrix multiplications (these matrices are trained).
    # Each of the three lines below is (batch_size, seq_len, hidden_size) * (hidden_size, hidden_size),
    # with output shape (batch_size, seq_len, hidden_size)
    mixed_query_layer = self.query(hidden_states)
    mixed_key_layer = self.key(hidden_states)
    mixed_value_layer = self.value(hidden_states)
    # self.transpose_for_scores below reshapes (batch_size, seq_length, hidden_size=768) into
    # (batch_size, num_attention_heads=12, seq_len, attention_head_size=64)
    query_layer = self.transpose_for_scores(mixed_query_layer)
    key_layer = self.transpose_for_scores(mixed_key_layer)
    value_layer = self.transpose_for_scores(mixed_value_layer)
    # The next few lines compute the attention weights.
    # First, query and key are multiplied; the resulting matrix A has shape (batch_size, num_attention_heads, seq_length, seq_length).
    # Looking only at A's last two dimensions, A[i][j] is the influence (attention) weight of word j on word i. Take "I am so handsome" as an example:
    #              I    am   so  handsome
    #   I          3    4   -10    3
    #   am         4    6    9     1
    #   so         2    4    1     2
    #   handsome   3   12    1     0
    # From this table, the weight of "am" on "so" is 4 (A[2][1]).
    # num_attention_heads means there are num_attention_heads such attention computations, hence num_attention_heads such weight matrices.
    attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
    # Apply a simple scaling to the scores
    attention_scores = attention_scores / math.sqrt(self.attention_head_size)
    # The scores then have attention_mask added to them before the softmax that produces the final weights.
    # Recall the mask's properties: shape (batch_size, 1, 1, seq_len), 0 for useful information, -10000 for useless or padding information.
    # So why -10000?
    # Suppose "handsome" in the example above is padding, i.e. useless information, so attention_mask = [0, 0, 0, -10000].
    # After adding it to the matrix above, the last column becomes very small: -9997, -9999, -9998, -10000; the other three columns have 0 added, so their values are unchanged.
    # Softmax is then applied to the sums; the first row, for example, is (3, 4, -10, -9997).
    # After softmax, e to the power of -9997 is close to 0, so the influence of "handsome" on "I" is close to 0.
    # In other words, the point of -10000 is to eliminate the influence of padding tokens on the other tokens.
    attention_scores = attention_scores + attention_mask
    attention_probs = nn.Softmax(dim=-1)(attention_scores)
    # Then pass through a dropout
    attention_probs = self.dropout(attention_probs)
    # Multiply the result by value to take the weighted sum;
    # the output shape is (batch_size, num_attention_heads, seq_len, attention_head_size)
    context_layer = torch.matmul(attention_probs, value_layer)
    # The next three lines turn the (batch_size, num_attention_heads, seq_len, attention_head_size) shape
    # back into (batch_size, seq_len, all_head_size) -- back where we started...
    context_layer = context_layer.permute(0, 2, 1, 3).contiguous()  # permute swaps dimensions 1 and 2; contiguous() is for the view below (permute is usually followed by contiguous before a view)
    new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)  # new_context_layer_shape = torch.Size([batch_size, seq_len, all_head_size])
    context_layer = context_layer.view(*new_context_layer_shape)  # reshape (in my own tests this also works without the star)
    return context_layer
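The masking trick can be checked numerically in NumPy, reusing the score matrix from the "I am so handsome" example above (one head, one batch element):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# The score matrix A from the example above
scores = np.array([[3.0,  4.0, -10.0, 3.0],
                   [4.0,  6.0,   9.0, 1.0],
                   [2.0,  4.0,   1.0, 2.0],
                   [3.0, 12.0,   1.0, 0.0]])

# "handsome" is padding: 0 for real tokens, -10000 for the padded position
mask = np.array([0.0, 0.0, 0.0, -10000.0])

probs = softmax(scores + mask)   # the mask broadcasts over the rows

print(probs[:, -1])              # the padded column gets ~0 probability in every row
print(probs.sum(axis=-1))        # each row still sums to 1
```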
# After self-attention comes the BertSelfOutput layer, which contains three sub-layers:
# 1. a dense (fully connected) layer, 2. a dropout layer, 3. a LayerNorm layer.
# The final output has shape (batch_size, seq_length, hidden_size=768)
def forward(self, hidden_states, input_tensor):
    hidden_states = self.dense(hidden_states)
    hidden_states = self.dropout(hidden_states)
    hidden_states = self.LayerNorm(hidden_states + input_tensor)  # residual connection followed by layer normalization
    return hidden_states
- Intermediate Layer
# After BertAttention comes the BertIntermediate layer, which consists of two sub-layers:
# a dense layer and an activation layer.
# The input has shape (batch_size, seq_len, hidden_size = 768)
def forward(self, hidden_states):
    # [batch_size, seq_length, all_head_size = 768] * [hidden_size, intermediate_size = 4*768] (the setting used in the paper)
    hidden_states = self.dense(hidden_states)
    # The activation function; see the class's __init__ for how it is chosen.
    hidden_states = self.intermediate_act_fn(hidden_states)
    # The returned shape is (batch_size, seq_len, intermediate_size = 4*768)
    return hidden_states
- Output Layer
# Same structure as BertSelfOutput.
# Input: [batch_size, seq_length, intermediate_size = 4*768]
# Output: [batch_size, seq_length, hidden_size = 768]
class BertOutput(nn.Module):
    def __init__(self, config):
        super(BertOutput, self).__init__()
        self.dense = nn.Linear(config.intermediate_size, config.hidden_size)
        self.LayerNorm = BertLayerNorm(config.hidden_size, eps=1e-12)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, hidden_states, input_tensor):
        hidden_states = self.dense(hidden_states)
        hidden_states = self.dropout(hidden_states)
        hidden_states = self.LayerNorm(hidden_states + input_tensor)
        return hidden_states
Collecting the outputs
# Now we turn back to the BertModel module.
# The output we get depends on the output_all_encoded_layers parameter:
# if output_all_encoded_layers == True, we get every encoder layer's output;
# if output_all_encoded_layers == False, we get only the last encoder layer's output.
encoded_layers = self.encoder(embedding_output,
                              extended_attention_mask, output_all_encoded_layers=output_all_encoded_layers)
# Take the last layer's output
sequence_output = encoded_layers[-1]
# The last layer's output goes through the Pooler layer (explained below) to give pooled_output
pooled_output = self.pooler(sequence_output)
if not output_all_encoded_layers:
    encoded_layers = encoded_layers[-1]
return encoded_layers, pooled_output
- Pooler Layer
# As explained above, the pooler's input is the last encoder layer's output, (batch_size, seq_len, hidden_size)
def forward(self, hidden_states):
    # We "pool" the model by simply taking the hidden state corresponding
    # to the first token.
    # Take each sentence's first token and apply a dense layer plus activation.
    # The output can be used for downstream tasks such as classification (taking sentence classification as an example: each sentence's first-token representation serves as the representation of the whole sentence).
    first_token_tensor = hidden_states[:, 0]  # take each sentence's first token's hidden states along the seq_len dimension; output shape (batch_size, hidden_size)
    pooled_output = self.dense(first_token_tensor)
    pooled_output = self.activation(pooled_output)
    return pooled_output
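A shape-level sketch of the pooling step in NumPy; the identity weight and zero bias are hypothetical stand-ins for `self.dense`, and `tanh` is the activation BERT's pooler uses:

```python
import numpy as np

batch_size, seq_len, hidden_size = 2, 4, 3
# Stand-in for the last encoder layer's output
hidden_states = np.arange(batch_size * seq_len * hidden_size, dtype=float).reshape(
    batch_size, seq_len, hidden_size)

# The [CLS] position: each sentence's first token, shape (batch_size, hidden_size)
first_token_tensor = hidden_states[:, 0]
print(first_token_tensor.shape)   # (2, 3)

# Dense + tanh, with hypothetical parameters
W = np.eye(hidden_size)
b = np.zeros(hidden_size)
pooled_output = np.tanh(first_token_tensor @ W + b)
print(pooled_output.shape)        # (2, 3)
```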
Implementing NER with BERT
In transformers, the authors provide the scripts for named entity recognition, along with usage instructions, in /examples/ner; here is a rough summary of how to use them. All operations are performed in the terminal.
Fine-tune BERT on NER task using customized data
Data Preparation
- First we need to prepare three txt files (train, dev, test) in the following format:
# Each line holds a token and its label, separated by a space; sentences are separated by a blank line.
北京 B-LOC
は O
中国 B-LOC
の O
首都 O
です O
。 O
今日 O
は O
...
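Writing this format from already tokenized and labeled sentences is straightforward; a minimal sketch using the sentences from the sample above:

```python
# Hypothetical sentences, already tokenized and labeled
sentences = [
    [("北京", "B-LOC"), ("は", "O"), ("中国", "B-LOC"), ("の", "O"),
     ("首都", "O"), ("です", "O"), ("。", "O")],
    [("今日", "O"), ("は", "O")],
]

# One "token label" line per token, blank line between sentences
lines = []
for sent in sentences:
    for token, label in sent:
        lines.append(f"{token} {label}")
    lines.append("")                     # sentence separator

print("\n".join(lines))
```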
- Then download the three required files
wget "https://raw.githubusercontent.com/stefan-it/fine-tuned-berts-seq/master/scripts/preprocess.py"
wget "https://raw.githubusercontent.com/huggingface/transformers/master/examples/ner/run_ner.py"
wget "https://raw.githubusercontent.com/huggingface/transformers/master/examples/ner/utils_ner.py"
Here preprocess.py is the data preprocessing script. Its jobs are: 1) filtering out some special characters (present in the GermEval 2014 dataset, but usually not in data we build ourselves), and 2) splitting sentences longer than max_len.
- Set the variables
export MAX_LENGTH=128
export BERT_MODEL=bert-base-multilingual-cased
- Preprocess the data
python3 preprocess.py train.txt.tmp $BERT_MODEL $MAX_LENGTH > train.txt
python3 preprocess.py dev.txt.tmp $BERT_MODEL $MAX_LENGTH > dev.txt
python3 preprocess.py test.txt.tmp $BERT_MODEL $MAX_LENGTH > test.txt
- Collect all labels into a file
cat train.txt dev.txt test.txt | cut -d " " -f 2 | grep -v "^$"| sort | uniq > labels.txt
# The generated labels.txt contains all unique labels
Setting the environment variables
export OUTPUT_DIR=germeval-model
export BATCH_SIZE=32
export NUM_EPOCHS=3
export SAVE_STEPS=750
export SEED=1
Start training
python3 run_ner.py --data_dir ./ \
--labels ./labels.txt \
--model_name_or_path $BERT_MODEL \
--output_dir $OUTPUT_DIR \
--max_seq_length $MAX_LENGTH \
--num_train_epochs $NUM_EPOCHS \
--per_gpu_train_batch_size $BATCH_SIZE \
--save_steps $SAVE_STEPS \
--seed $SEED \
--do_train \
--do_eval \
--do_predict
Walking through run_ner.py
# coding=utf-8
# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Fine-tuning the library models for named entity recognition on CoNLL-2003 (Bert or Roberta). """
import logging
import os
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple
import numpy as np
from seqeval.metrics import f1_score, precision_score, recall_score # given the predicted and the gold label lists, these directly return f1, p and r.
from torch import nn
from transformers import (
AutoConfig,
AutoModelForTokenClassification,
AutoTokenizer,
EvalPrediction,
HfArgumentParser,
Trainer,
TrainingArguments,
set_seed,
)
from utils_ner import NerDataset, Split, get_labels
logger = logging.getLogger(__name__)
@dataclass
class ModelArguments:
"""
Sets the model-related arguments: mainly the pretrained model's path/name; if the tokenizer name is not provided, it defaults to the model name.
Arguments pertaining to which model/config/tokenizer we are going to fine-tune from.
"""
model_name_or_path: str = field(
metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models"}
)
config_name: Optional[str] = field(
default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"}
)
tokenizer_name: Optional[str] = field(
default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"}
)
use_fast: bool = field(default=False, metadata={"help": "Set this flag to use fast tokenization."})
# If you want to tweak more attributes on your tokenizer, you should do it in a distinct script,
# or just modify its tokenizer_config.json.
cache_dir: Optional[str] = field(
default=None, metadata={"help": "Where do you want to store the pretrained models downloaded from s3"}
)
@dataclass
class DataTrainingArguments:
"""
Sets the input-data-related arguments: the location of the .txt files, the location of labels.txt, and max_seq_length.
Arguments pertaining to what data we are going to input our model for training and eval.
"""
data_dir: str = field(
metadata={"help": "The input data dir. Should contain the .txt files for a CoNLL-2003-formatted task."}
)
labels: Optional[str] = field(
metadata={"help": "Path to a file containing all labels. If not specified, CoNLL-2003 labels are used."}
)
max_seq_length: int = field(
default=128,
metadata={
"help": "The maximum total input sequence length after tokenization. Sequences longer "
"than this will be truncated, sequences shorter will be padded."
},
)
overwrite_cache: bool = field(
default=False, metadata={"help": "Overwrite the cached training and evaluation sets"}
)
def main():
# The names of the training-related arguments can all be found in training_args.py (very detailed)
# See all possible arguments in src/transformers/training_args.py
# or by passing the --help flag to this script.
# We now keep distinct sets of args, for a cleaner separation of concerns.
parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TrainingArguments)) # build the argument parser
model_args, data_args, training_args = parser.parse_args_into_dataclasses() # parse the three groups of arguments
# Check whether the output directory exists; it must be an empty folder, or overwrite_output_dir must be set to True
if (
os.path.exists(training_args.output_dir)
and os.listdir(training_args.output_dir)
and training_args.do_train
and not training_args.overwrite_output_dir
):
raise ValueError(
f"Output directory ({training_args.output_dir}) already exists and is not empty. Use --overwrite_output_dir to overcome."
)
# Set the logging level (messages at INFO or WARN level and above are printed to the console)
logging.basicConfig(
format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
datefmt="%m/%d/%Y %H:%M:%S",
level=logging.INFO if training_args.local_rank in [-1, 0] else logging.WARN,
)
logger.warning(
"Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s",
training_args.local_rank,
training_args.device,
training_args.n_gpu,
bool(training_args.local_rank != -1),
training_args.fp16,
)
logger.info("Training/evaluation parameters %s", training_args)
# Set seed
set_seed(training_args.seed)
# Prepare CONLL-2003 task
labels = get_labels(data_args.labels) # builds a list like ['B-ORG','I-ORG',...] containing all unique labels.
label_map: Dict[int, str] = {i: label for i, label in enumerate(labels)} # build the id2label dictionary
num_labels = len(labels)
# Load the pretrained model. For Japanese models (e.g. Tohoku University's), from_pretrained needs changes: don't use AutoConfig, and load
# the tokenizer and model with BertJapaneseTokenizer.from_pretrained() and BertForTokenClassification.from_pretrained() respectively; note the imports must change accordingly.
# Load pretrained model and tokenizer
# Distributed training:
# The .from_pretrained methods guarantee that only one local process can concurrently
# download model & vocab.
config = AutoConfig.from_pretrained(
model_args.config_name if model_args.config_name else model_args.model_name_or_path,
num_labels=num_labels,
id2label=label_map,
label2id={label: i for i, label in enumerate(labels)},
cache_dir=model_args.cache_dir,
)
tokenizer = AutoTokenizer.from_pretrained(
model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path,
cache_dir=model_args.cache_dir,
use_fast=model_args.use_fast,
)
model = AutoModelForTokenClassification.from_pretrained(
model_args.model_name_or_path,
from_tf=bool(".ckpt" in model_args.model_name_or_path),
config=config,
cache_dir=model_args.cache_dir,
)
# The NerDataset class is used here to load the training and validation sets.
# NerDataset generates labels for the tokens like this (see utils_ner.py for details):
# tokens: [CLS] is this jack ##son ##ville ? [SEP]
# labels: [pad_token_label_id] 'O' 'O' 'B-LOC' [pad_token_label_id] [pad_token_label_id] 'O' [pad_token_label_id]
# i.e. special tokens such as [CLS] and [SEP], and every sub-token of a tokenized word except the first (##son, ##ville), get the label [pad_token_label_id]; these positions are also ignored later when computing the loss.
train_dataset = (
NerDataset(
data_dir=data_args.data_dir,
tokenizer=tokenizer,
labels=labels,
model_type=config.model_type,
max_seq_length=data_args.max_seq_length,
overwrite_cache=data_args.overwrite_cache,
mode=Split.train, # yields the string 'train'
local_rank=training_args.local_rank,
)
if training_args.do_train
else None
)
eval_dataset = (
NerDataset(
data_dir=data_args.data_dir,
tokenizer=tokenizer,
labels=labels,
model_type=config.model_type,
max_seq_length=data_args.max_seq_length,
overwrite_cache=data_args.overwrite_cache,
mode=Split.dev,
local_rank=training_args.local_rank,
)
if training_args.do_eval
else None
)
# This function extracts, from the predicted and the gold label sequences, the label indices at every position whose gold label is not ignore_index (-100)
def align_predictions(predictions: np.ndarray, label_ids: np.ndarray) -> Tuple[List[int], List[int]]:
preds = np.argmax(predictions, axis=2)
batch_size, seq_len = preds.shape
out_label_list = [[] for _ in range(batch_size)]
preds_list = [[] for _ in range(batch_size)]
for i in range(batch_size):
for j in range(seq_len):
if label_ids[i, j] != nn.CrossEntropyLoss().ignore_index:
out_label_list[i].append(label_map[label_ids[i][j]])
preds_list[i].append(label_map[preds[i][j]])
return preds_list, out_label_list
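A toy run of the same filtering logic, with labels and logits invented for illustration:

```python
import numpy as np

label_map = {0: "O", 1: "B-LOC", 2: "I-LOC"}
ignore_index = -100   # nn.CrossEntropyLoss().ignore_index

# One sentence, 5 positions: made-up logits over the 3 labels
predictions = np.array([[[0.1, 2.0, 0.0],   # [CLS] position (ignored)
                         [3.0, 0.0, 0.0],   # predicted O
                         [0.0, 0.0, 4.0],   # sub-token position (ignored)
                         [1.0, 0.0, 0.0],   # predicted O
                         [2.0, 0.0, 0.0]]]) # [SEP] position (ignored)
# Gold ids; -100 marks [CLS]/[SEP]/sub-token positions
label_ids = np.array([[-100, 1, -100, 0, -100]])

preds = np.argmax(predictions, axis=2)
preds_list, out_label_list = [[]], [[]]
for j in range(label_ids.shape[1]):
    if label_ids[0, j] != ignore_index:
        out_label_list[0].append(label_map[label_ids[0, j]])
        preds_list[0].append(label_map[preds[0, j]])

print(preds_list)       # only the two non-ignored positions survive
print(out_label_list)
```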
# This function evaluates the model's results (f1, p, r); the approach is to gather all generated sequences into one long list and the corresponding gold sequences into another, then compute the three metrics on these two lists
def compute_metrics(p: EvalPrediction) -> Dict:
preds_list, out_label_list = align_predictions(p.predictions, p.label_ids)
return {
"precision": precision_score(out_label_list, preds_list),
"recall": recall_score(out_label_list, preds_list),
"f1": f1_score(out_label_list, preds_list),
}
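Note that seqeval scores at the entity level, not per token: a predicted entity counts as correct only when both its type and its full span match the gold entity. A minimal pure-Python sketch of that idea (not seqeval's actual implementation):

```python
def get_entities(seq):
    """Extract (type, start, end) spans from a BIO label sequence."""
    entities, start, etype = [], None, None
    for i, tag in enumerate(seq + ["O"]):   # sentinel to flush the last span
        if tag.startswith("B-") or tag == "O" or (tag.startswith("I-") and etype != tag[2:]):
            if etype is not None:
                entities.append((etype, start, i - 1))
                etype = None
            if tag.startswith("B-"):
                start, etype = i, tag[2:]
    return entities

gold = ["B-LOC", "O", "B-LOC", "I-LOC", "O"]
pred = ["B-LOC", "O", "B-LOC", "O",     "O"]

g, p = set(get_entities(gold)), set(get_entities(pred))
tp = len(g & p)                               # exactly matching spans
precision = tp / len(p)
recall = tp / len(g)
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)                  # the truncated second entity counts as wrong
```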
# Initialize our Trainer
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
compute_metrics=compute_metrics,
)
# Training
if training_args.do_train:
trainer.train(
model_path=model_args.model_name_or_path if os.path.isdir(model_args.model_name_or_path) else None
)
trainer.save_model()
# For convenience, we also re-save the tokenizer to the same directory,
# so that you can share your model easily on huggingface.co/models =)
if trainer.is_world_master():
tokenizer.save_pretrained(training_args.output_dir)
# Evaluation
results = {}
if training_args.do_eval and training_args.local_rank in [-1, 0]:
logger.info("*** Evaluate ***")
result = trainer.evaluate()
output_eval_file = os.path.join(training_args.output_dir, "eval_results.txt")
with open(output_eval_file, "w") as writer:
logger.info("***** Eval results *****")
for key, value in result.items():
logger.info(" %s = %s", key, value)
writer.write("%s = %s\n" % (key, value))
results.update(result)
# Predict
if training_args.do_predict and training_args.local_rank in [-1, 0]:
test_dataset = NerDataset(
data_dir=data_args.data_dir,
tokenizer=tokenizer,
labels=labels,
model_type=config.model_type,
max_seq_length=data_args.max_seq_length,
overwrite_cache=data_args.overwrite_cache,
mode=Split.test,
local_rank=training_args.local_rank,
)
predictions, label_ids, metrics = trainer.predict(test_dataset)
preds_list, _ = align_predictions(predictions, label_ids)
output_test_results_file = os.path.join(training_args.output_dir, "test_results.txt") # holds the f1, p and r values
with open(output_test_results_file, "w") as writer:
for key, value in metrics.items():
logger.info(" %s = %s", key, value)
writer.write("%s = %s\n" % (key, value))
# Save predictions
output_test_predictions_file = os.path.join(training_args.output_dir, "test_predictions.txt")
with open(output_test_predictions_file, "w") as writer:
with open(os.path.join(data_args.data_dir, "test.txt"), "r") as f:
example_id = 0 # the id of the current example (sentence)
for line in f:
if line.startswith("-DOCSTART-") or line == "" or line == "\n": # time to move on to the next sentence?
writer.write(line)
if not preds_list[example_id]: # see below first: each token pops the first element of the predicted label list, so once every label has been popped,
example_id += 1 # preds_list[example_id] equals [] before moving to the next sentence; this is the condition for incrementing the example (sentence) id by 1
elif preds_list[example_id]:
output_line = line.split()[0] + " " + preds_list[example_id].pop(0) + "\n"
writer.write(output_line)
else:
logger.warning("Maximum sequence length exceeded: No prediction for '%s'.", line.split()[0])
return results
if __name__ == "__main__":
main()
Notes on feeding inputs in a custom format
- In many information extraction tasks we need to compute with BERT's outputs at specific positions. For example, in event extraction or entity-relation extraction we must locate the entities in each example and then process the BERT output vectors at those positions; to judge the relation between entities at two positions, say, the outputs at the two positions are concatenated and fed into a relation classifier.
- Each example therefore needs the position information of its entities of each type, plus a dictionary mapping each entity's original index to its position after tokenization. But the default inputs only cover input_ids, token_type_ids, labels and the like, so how do we pass the model a list of positions and a dict of correspondences?
- Two places need changing. The first is a custom Dataset that, when reading the input features, adds features such as entity_locations and old_new_index_dict.
- The second is a custom class inheriting from DataCollator. In huggingface's code, when the DataLoader loads a Dataset it is actually the DataCollator that decides what shape of input is handed to the DataLoader, so our DataCollator subclass must specify which kinds of data a batch contains and what shape each kind has. The result is a {str: Any} dict, where Any can be a batch of tensors (of shape (b_s, max_len, hidden_dim)) or a chosen data type such as a batch of entity positions (a list of lists) or a batch of entity-index correspondence dicts (a list of dicts).
- The DataLoader then yields one such dict, input, at a time, and
model(**input)
passes the entries to the model by name; the model takes the data from there for further processing.
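The idea can be sketched with a plain collate function. All the names here (entity_locations, old_new_index_dict) are the hypothetical feature names from above; in practice input_ids would be padded into a torch tensor and the class would subclass huggingface's DataCollator:

```python
# Hypothetical per-example features produced by a custom Dataset
examples = [
    {"input_ids": [12, 45, 34], "entity_locations": [(0, 0), (2, 2)],
     "old_new_index_dict": {0: 0, 1: 2}},
    {"input_ids": [7, 99],      "entity_locations": [(1, 1)],
     "old_new_index_dict": {0: 1}},
]

def collate(batch, pad_id=0):
    """Pad input_ids into a rectangular batch; keep the extra features as plain lists."""
    max_len = max(len(ex["input_ids"]) for ex in batch)
    return {
        # in the real collator this would be a torch tensor of shape (b_s, max_len)
        "input_ids": [ex["input_ids"] + [pad_id] * (max_len - len(ex["input_ids"]))
                      for ex in batch],
        "entity_locations": [ex["entity_locations"] for ex in batch],       # list of lists
        "old_new_index_dict": [ex["old_new_index_dict"] for ex in batch],   # list of dicts
    }

batch = collate(examples)
print(batch["input_ids"])          # the second example is padded to length 3
print(batch["entity_locations"])   # non-tensor features pass through untouched
```

A DataLoader built with `collate_fn=collate` would then yield exactly such dicts, ready to be unpacked with `model(**batch)`.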