Bert

How to Use the Japanese BERT Model (Tohoku University)

The original example code had a few problems; the corrected version is as follows.

import torch
from transformers.modeling_bert import BertForMaskedLM
from transformers.tokenization_bert_japanese import BertJapaneseTokenizer

tokenizer = BertJapaneseTokenizer.from_pretrained('bert-base-japanese-whole-word-masking')
model = BertForMaskedLM.from_pretrained('bert-base-japanese-whole-word-masking')
model.eval()

input_ids = tokenizer.encode(f"""外出できない時、家に{tokenizer.mask_token}するしかない。""", return_tensors='pt')
masked_index = torch.eq(input_ids,tokenizer.mask_token_id)[0].tolist().index(1)

result = model(input_ids)
answers = result[0][0,masked_index].topk(10)[1].tolist()

for a in answers:
    output = input_ids[0].tolist()
    output[masked_index] = a
    print(tokenizer.decode(output))
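
Note: the module paths transformers.modeling_bert and transformers.tokenization_bert_japanese only exist in older transformers releases. On transformers 4.x, an equivalent variant (a hedged sketch, assuming the namespaced hub id cl-tohoku/bert-base-japanese-whole-word-masking) is to import the same classes from the package top level:

from transformers import BertForMaskedLM, BertJapaneseTokenizer  # top-level imports in transformers >= 4.x

tokenizer = BertJapaneseTokenizer.from_pretrained('cl-tohoku/bert-base-japanese-whole-word-masking')
model = BertForMaskedLM.from_pretrained('cl-tohoku/bert-base-japanese-whole-word-masking')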

Notes on the BERT source code in Hugging Face's modeling.py

  • All of the code below comes from the library source; you only need to call it, not write it yourself.
  • The BERT model is essentially a stack of 12 encoders, and each encoder consists of three layers: a self-attention layer, an intermediate layer, and an output layer.
  • Each encoder takes two inputs: the former layer's output and an attention mask carrying the padding information. The former layer's output has shape (batch_size, seq_len, hidden_size), and the attention mask has shape (batch_size, 1, 1, seq_len).

Preparing the Inputs

  1. First, the input sentences are embedded. The code producing the embedding layer's output is as follows:
#The inputs are input_ids, token_type_ids and position_ids, all of shape (batch_size, seq_length).
#input_ids are batches of sentences mapped through word2id, e.g. [[12,45,34,98,...],[...],...].
#token_type_ids are batches like [[0,0,0,0,1,1,1,...],[...],...], where 0 marks a token of the first sentence and 1 a token of the second sentence.
#position_ids are batches like [[0,1,2,......,seq_length - 1],[...],...], where each number is the token's position in the sentence.

#In practice only the first two are external inputs; the position ids are generated for us inside the source code
position_ids = torch.arange(seq_length, dtype=torch.long, device=input_ids.device)
#reshape to the same shape as input_ids
position_ids = position_ids.unsqueeze(0).expand_as(input_ids)

#If token_type_ids is None, the whole input is treated as sentence A by default (suitable for NER tasks)
if token_type_ids is None:
    token_type_ids = torch.zeros_like(input_ids)
 
#The embedding layer in the source produces three kinds of embeddings, each of shape (batch_size, seq_len, hidden_size)
words_embeddings = self.word_embeddings(input_ids)
position_embeddings = self.position_embeddings(position_ids)
token_type_embeddings = self.token_type_embeddings(token_type_ids)

#The three embeddings are summed
embeddings = words_embeddings + position_embeddings + token_type_embeddings

#The summed embeddings pass through a LayerNorm layer and a dropout layer; the result is returned
embeddings = self.LayerNorm(embeddings)
embeddings = self.dropout(embeddings)
return embeddings

  The LayerNorm function works as follows:

u = x.mean(-1, keepdim=True)
s = (x - u).pow(2).mean(-1, keepdim=True)
x = (x - u) / torch.sqrt(s + self.variance_epsilon)
return self.weight * x + self.bias
#The code above implements (x - u) / standard deviation, where x is a vector, u is the (scalar) mean, and the denominator is the standard deviation.
#variance_epsilon is a very small number that prevents the denominator (the standard deviation) from being 0.
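
Written as a formula (with gamma = self.weight, beta = self.bias and epsilon = variance_epsilon), this is:

$$\mathrm{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta, \qquad \mu = \mathrm{mean}(x), \quad \sigma^2 = \mathrm{mean}\big((x - \mu)^2\big)$$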
  2. Besides the inputs above, the model also needs an input called attention_mask. The input we need to provide is:
attention_masks = [[float(i>0) for i in ii] for ii in input_ids]

  In the source code, this input is converted into extended_attention_mask:

#First attention_mask is reshaped to (batch_size, 1, 1, to_seq_length) and assigned to extended_attention_mask
#Then, whereas in the original mask 1 marks real tokens and 0 marks padding, the line below turns this into: 0 for real tokens and -10000 for padding. (This is for the later softmax; the reason is explained further below.)
extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0
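
For reference, the reshaping step mentioned above is done in the source roughly like this (a sketch following the shapes described above; the dtype cast is simplified here):

extended_attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)          # (batch_size, 1, 1, seq_len)
extended_attention_mask = extended_attention_mask.to(dtype=torch.float32)   # cast so the arithmetic below works
extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0        # 0 for real tokens, -10000 for padding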

Entering the Transformer Layers

  1. After the embedding layer, the output enters the encoder
#The embedding layer's output is fed into the encoder, which produces the final output encoded_layers
embedding_output = self.embeddings(input_ids, token_type_ids)
encoded_layers = self.encoder(embedding_output,
                              extended_attention_mask, output_all_encoded_layers=output_all_encoded_layers)
  2. Details of BertEncoder (which contains several encoder layers)
#BertEncoder contains several encoder layers (12 for base and 24 for large in the paper); each encoder layer is a BertLayer in the code.

#The code below first declares one layer and then builds num_hidden_layers (12 or 24) identical layers in the list self.layer
layer = BertLayer(config)
self.layer = nn.ModuleList([copy.deepcopy(layer) for _ in range(config.num_hidden_layers)])

#Now the forward function
#hidden_states is embedding_output, of shape (batch_size, seq_len, hidden_size)
#attention_mask: shape (batch_size, 1, 1, seq_len)
#output_all_encoded_layers: output mode, either the output of the last encoder only or the outputs of all encoders

def forward(self, hidden_states, attention_mask, output_all_encoded_layers=True):

#Each encoder's output is fed as input to the next encoder, until all 12 (or 24) layers have been processed
    all_encoder_layers = []
    #iterate over all encoders, 12 or 24 in total
    for layer_module in self.layer:
    #each layer's output hidden_states is also the input of the next layer_module (BertLayer); this chains the encoders together. The first layer's input is embedding_output
        hidden_states = layer_module(hidden_states, attention_mask)
    #if output_all_encoded_layers == True, append every layer's result to all_encoder_layers
        if output_all_encoded_layers:
            all_encoder_layers.append(hidden_states)
    #if output_all_encoded_layers == False, append only the last layer's output to all_encoder_layers
    if not output_all_encoded_layers:
        all_encoder_layers.append(hidden_states)
    return all_encoder_layers
  3. BertLayer: after hidden_states enters a BertLayer, it passes through three sub-layers
#1. BertAttention layer
attention_output = self.attention(hidden_states, attention_mask)
#2. Intermediate layer
intermediate_output = self.intermediate(attention_output)
#3. Output layer
layer_output = self.output(intermediate_output, attention_output)
  • BertAttention Layer
#The attention sub-layer has two inputs: one is the input tensor hidden_states (embedding_output for the first layer), of shape (batch_size, seq_len, hidden_size)
#the other is attention_mask, of shape (batch_size, 1, 1, seq_len)
#Inside BertAttention the input tensor first goes through a BertSelfAttention layer and then a BertSelfOutput layer, which produces the output
def forward(self, input_tensor, attention_mask):
    self_output = self.attention(input_tensor, attention_mask)     #BertSelfAttention layer
    attention_output = self.output(self_output, input_tensor)  #BertSelfOutput layer
    return attention_output
############################################################################################################
#Below is the BertSelfAttention layer
#number of heads, 12 in the code
    self.num_attention_heads = config.num_attention_heads
    #attention_head_size: the size of each head, obtained by dividing the total size (hidden_size, 768) by the number of heads, i.e. 768/12=64
    self.attention_head_size = int(config.hidden_size / config.num_attention_heads)
    #all_head_size equals hidden_size here (768)
    self.all_head_size = self.num_attention_heads * self.attention_head_size
    
    #here the query, key and value matrices are declared, each of size hidden_size * all_head_size (768*768)
    self.query = nn.Linear(config.hidden_size, self.all_head_size)
    self.key = nn.Linear(config.hidden_size, self.all_head_size)
    self.value = nn.Linear(config.hidden_size, self.all_head_size)
    #dropout layer
    self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
 
#Below is how self-attention is computed
#input 1 hidden_states: (batch_size, seq_len, word_dim = hidden_size = 768)
#input 2 attention_mask: (batch_size, 1, 1, seq_len)
    
def forward(self, hidden_states, attention_mask):
    #Briefly: query and key determine the attention weights, and value is weighted-summed with those weights to give the attention output
    #First comes a simple matrix multiplication (these matrices are trainable)
    #each of the three lines below is (batch_size, seq_len, hidden_size)*(hidden_size, hidden_size)
    #the output shape is (batch_size, seq_len, hidden_size)
    mixed_query_layer = self.query(hidden_states)
    mixed_key_layer = self.key(hidden_states)
    mixed_value_layer = self.value(hidden_states)
        
    #self.transpose_for_scores below reshapes (batch_size, seq_length, hidden_size=768) into
    #(batch_size, num_attention_heads=12, seq_len, attention_head_size=64)
    query_layer = self.transpose_for_scores(mixed_query_layer)
    key_layer = self.transpose_for_scores(mixed_key_layer)
    value_layer = self.transpose_for_scores(mixed_value_layer)
 
    #The following four lines of code compute the attention weights.
    #First query and key are multiplied, giving a matrix A of shape (batch_size, num_attention_heads, seq_length, seq_length)
    #Looking only at the last two dimensions, A[i][j] is the influence (attention) weight of the j-th word on the i-th word. Take "I am so handsome" as an example:
    #                                   I    am    so    handsome
    #                               I   3    4     -10    3
    #                               am  4    6     9      1
    #                               so  2    4     1      2
    #                         handsome  3    12    1      0
    #From the table, the influence weight of "am" on "so" is 4 (A[2][1]).
    #num_attention_heads means there are num_attention_heads such heads, and therefore num_attention_heads such weight matrices.
    attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
        
    #the scores are then scaled
    attention_scores = attention_scores / math.sqrt(self.attention_head_size)
        
    #The resulting scores then have attention_mask added to them, and a softmax produces the final weights.
    #Recall the properties of attention_mask: shape (batch_size,1,1,seq_len), 0 for real tokens and -10000 for useless/padding tokens.
    #So why -10000?
    #Suppose "handsome" in the example above is a padding token, i.e. useless information, so attention_mask = [0,0,0,-10000]
    #After adding it to the matrix above, the last column becomes very small: -9997, -9999, -9998, -10000, while the other three columns have 0 added and stay unchanged.
    #The summed values then go through softmax. Take the first row, (3, 4, -10, -9997), as an example:
    #after softmax, e to the power of -9997 is close to 0, so the influence of "handsome" on "I" becomes close to 0
    #so the purpose of -10000 is to remove the influence of padding tokens on the other tokens
    attention_scores = attention_scores + attention_mask
    attention_probs = nn.Softmax(dim=-1)(attention_scores)
          
    # then pass through a dropout layer
    attention_probs = self.dropout(attention_probs)
        
    #the result is multiplied by value to get the weighted sum
    #the output shape is (batch_size, num_attention_heads, seq_len, attention_head_size)
    context_layer = torch.matmul(attention_probs, value_layer)

    #the next three lines convert the shape (batch_size, num_attention_heads, seq_len, attention_head_size) back to
    #(batch_size, seq_len, all_head_size), i.e. back to where we started...
    context_layer = context_layer.permute(0, 2, 1, 3).contiguous() # permute swaps dimensions 1 and 2; contiguous is needed for the view operation below, so permute is usually followed by contiguous before view.
    new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,) # new_context_layer_shape = torch.Size([batch_size, seq_len, all_head_size])
    context_layer = context_layer.view(*new_context_layer_shape) # reshape; I tried it myself and it also works without the star in the parentheses.
    return context_layer
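
Putting the steps above together, BertSelfAttention computes, per head, the standard scaled dot-product attention, with d_k = attention_head_size = 64 and M the extended attention mask (0 for real tokens, -10000 for padding):

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}} + M\right)V$$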

#After self-attention, the data enters the BertSelfOutput layer, which contains three sub-layers:
#1. a fully connected layer, 2. a dropout layer, 3. a layer-norm layer
#the final output has shape (batch_size, seq_length, hidden_size=768)
def forward(self, hidden_states, input_tensor):
    hidden_states = self.dense(hidden_states)
    hidden_states = self.dropout(hidden_states)
    hidden_states = self.LayerNorm(hidden_states + input_tensor) #layer normalization after the residual connection
    return hidden_states
  4. Intermediate Layer
#After BertAttention, the data enters a BertIntermediate layer, which has two sub-layers:
#a fully connected layer and an activation layer
#the input is (batch_size, seq_len, hidden_size = 768)
def forward(self, hidden_states):
    #[batch_size, seq_length, all_head_size = 768] * [hidden_size, intermediate_size = 4*768] (the setting used in the paper)
    hidden_states = self.dense(hidden_states)
    #the activation function; see the class's init method for which one is chosen.
    hidden_states = self.intermediate_act_fn(hidden_states)
    #the returned shape becomes (batch_size, seq_len, intermediate_size=4*768)
    return hidden_states
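
In BERT's default configuration this activation is GELU (hidden_act = "gelu"), which can be written as:

$$\mathrm{GELU}(x) = \frac{x}{2}\left(1 + \mathrm{erf}\!\left(\frac{x}{\sqrt{2}}\right)\right)$$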
  5. Output Layer
#Same structure as the BertSelfOutput layer
#input [batch_size, seq_length, intermediate_size=4*768]
#output [batch_size, seq_length, hidden_size=768]
class BertOutput(nn.Module):
    def __init__(self, config):
        super(BertOutput, self).__init__()
        self.dense = nn.Linear(config.intermediate_size, config.hidden_size)
        self.LayerNorm = BertLayerNorm(config.hidden_size, eps=1e-12)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
 
    def forward(self, hidden_states, input_tensor):
        hidden_states = self.dense(hidden_states)
        hidden_states = self.dropout(hidden_states)
        hidden_states = self.LayerNorm(hidden_states + input_tensor)
        return hidden_states

Collecting the Output

#At this point we turn back to the BertModel module
#The output we get depends on the output_all_encoded_layers argument
#If output_all_encoded_layers==True, we get the outputs of all encoder layers
#If output_all_encoded_layers==False, we get the output of the last encoder layer
encoded_layers = self.encoder(embedding_output,
                              extended_attention_mask, output_all_encoded_layers=output_all_encoded_layers)

#take the last layer's output
sequence_output = encoded_layers[-1]

#the last layer's output goes through the Pooler layer to give pooled_output; this layer is explained below
pooled_output = self.pooler(sequence_output)
if not output_all_encoded_layers:
    encoded_layers = encoded_layers[-1]
return encoded_layers, pooled_output
  • Pooler Layer
#As explained above, the pooler's input is the output of the last encoder layer, (batch_size, seq_len, hidden_size)
def forward(self, hidden_states):
    # We "pool" the model by simply taking the hidden state corresponding
    # to the first token.
        
    #take the first token of every sentence and apply a fully connected layer plus an activation.
    #the resulting output can be used for downstream tasks such as classification (for sentence classification, the representation of each sentence's first token is used as the representation of the whole sentence)
    first_token_tensor = hidden_states[:, 0] #take each sentence's first token's hidden state along the seq_len dimension; output shape is (batch_size, hidden_size)
    pooled_output = self.dense(first_token_tensor)
    pooled_output = self.activation(pooled_output)
    return pooled_output
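
For example, a sentence-classification head on top of pooled_output could look like the following minimal sketch (num_labels is a hypothetical task-specific parameter, not part of the code above):

classifier = nn.Linear(config.hidden_size, num_labels)  # hypothetical classification head
logits = classifier(pooled_output)                      # (batch_size, num_labels)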

Implementing NER with BERT

In transformers, the authors provide the scripts for named entity recognition under /examples/ner together with instructions for using them. Below is a rough summary of that workflow; all commands are run in the terminal.

Fine-tune BERT on the NER task using custom data

Data Preparation

  1. First, prepare three txt files (train, dev, test) in the following format:
# Each line contains a token and its label separated by a space; sentences are separated by a blank line.
北京 B-LOC
は O
中国 B-LOC
の O
首都 O
です O
。 O

今日 O
は O
...
  2. Then download the three required files
wget "https://raw.githubusercontent.com/stefan-it/fine-tuned-berts-seq/master/scripts/preprocess.py"
wget "https://raw.githubusercontent.com/huggingface/transformers/master/examples/ner/run_ner.py"
wget "https://raw.githubusercontent.com/huggingface/transformers/master/examples/ner/utils_ner.py"

Here, preprocess.py is the data preprocessing script. It 1) filters out some special characters (present in the GermEval 2014 dataset; data we create ourselves usually does not contain them), and 2) splits long sentences according to max_len.

  3. Set the variables
export MAX_LENGTH=128
export BERT_MODEL=bert-base-multilingual-cased
  4. Preprocess the data
python3 preprocess.py train.txt.tmp $BERT_MODEL $MAX_LENGTH > train.txt
python3 preprocess.py dev.txt.tmp $BERT_MODEL $MAX_LENGTH > dev.txt
python3 preprocess.py test.txt.tmp $BERT_MODEL $MAX_LENGTH > test.txt
  5. Write all labels into a file
cat train.txt dev.txt test.txt | cut -d " " -f 2 | grep -v "^$"| sort | uniq > labels.txt
# the generated labels.txt contains all unique labels

Setting the Environment Variables

export OUTPUT_DIR=germeval-model
export BATCH_SIZE=32
export NUM_EPOCHS=3
export SAVE_STEPS=750
export SEED=1

Start Training

python3 run_ner.py --data_dir ./ \
--labels ./labels.txt \
--model_name_or_path $BERT_MODEL \
--output_dir $OUTPUT_DIR \
--max_seq_length  $MAX_LENGTH \
--num_train_epochs $NUM_EPOCHS \
--per_gpu_train_batch_size $BATCH_SIZE \
--save_steps $SAVE_STEPS \
--seed $SEED \
--do_train \
--do_eval \
--do_predict

Walking Through run_ner.py

# coding=utf-8
# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Fine-tuning the library models for named entity recognition on CoNLL-2003 (Bert or Roberta). """


import logging
import os
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

import numpy as np
from seqeval.metrics import f1_score, precision_score, recall_score # given the predicted and gold label lists, these directly return the corresponding f1, precision and recall.
from torch import nn

from transformers import (
    AutoConfig,
    AutoModelForTokenClassification,
    AutoTokenizer,
    EvalPrediction,
    HfArgumentParser,
    Trainer,
    TrainingArguments,
    set_seed,
)
from utils_ner import NerDataset, Split, get_labels


logger = logging.getLogger(__name__)


@dataclass
class ModelArguments: 
    """
    Set the model-related arguments, mainly the pretrained model path/name; if no tokenizer name is given, it defaults to the model name.
    Arguments pertaining to which model/config/tokenizer we are going to fine-tune from.
    """

    model_name_or_path: str = field(
        metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models"}
    )
    config_name: Optional[str] = field(
        default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"}
    )
    tokenizer_name: Optional[str] = field(
        default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"}
    )
    use_fast: bool = field(default=False, metadata={"help": "Set this flag to use fast tokenization."})
    # If you want to tweak more attributes on your tokenizer, you should do it in a distinct script,
    # or just modify its tokenizer_config.json.
    cache_dir: Optional[str] = field(
        default=None, metadata={"help": "Where do you want to store the pretrained models downloaded from s3"}
    )


@dataclass
class DataTrainingArguments:
    """
    Set the input-data-related arguments, including the location of the .txt files, the labels file, and max_seq_length.
    Arguments pertaining to what data we are going to input our model for training and eval.
    """

    data_dir: str = field(
        metadata={"help": "The input data dir. Should contain the .txt files for a CoNLL-2003-formatted task."}
    )
    labels: Optional[str] = field(
        metadata={"help": "Path to a file containing all labels. If not specified, CoNLL-2003 labels are used."}
    )
    max_seq_length: int = field(
        default=128,
        metadata={
            "help": "The maximum total input sequence length after tokenization. Sequences longer "
            "than this will be truncated, sequences shorter will be padded."
        },
    )
    overwrite_cache: bool = field(
        default=False, metadata={"help": "Overwrite the cached training and evaluation sets"}
    )


def main():
    # The names of the training-related arguments can be found in training_args.py (very detailed)
    # See all possible arguments in src/transformers/training_args.py
    # or by passing the --help flag to this script.
    # We now keep distinct sets of args, for a cleaner separation of concerns.

    parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TrainingArguments)) # build the argument parser
    model_args, data_args, training_args = parser.parse_args_into_dataclasses() # parse the three kinds of arguments
    
    # Check whether the output directory exists; it must be an empty folder, or overwrite_output_dir must be set to True
    if (
        os.path.exists(training_args.output_dir)
        and os.listdir(training_args.output_dir)
        and training_args.do_train
        and not training_args.overwrite_output_dir
    ):
        raise ValueError(
            f"Output directory ({training_args.output_dir}) already exists and is not empty. Use --overwrite_output_dir to overcome."
        )

    # Set the logging level (messages at WARNING level, or INFO and above, are printed to the console)
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO if training_args.local_rank in [-1, 0] else logging.WARN,
    )
    logger.warning(
        "Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s",
        training_args.local_rank,
        training_args.device,
        training_args.n_gpu,
        bool(training_args.local_rank != -1),
        training_args.fp16,
    )
    logger.info("Training/evaluation parameters %s", training_args)

    # Set seed
    set_seed(training_args.seed)

    # Prepare CONLL-2003 task
    labels = get_labels(data_args.labels) # build the list of all unique labels, e.g. ['B-ORG','I-ORG',....]
    label_map: Dict[int, str] = {i: label for i, label in enumerate(labels)} # build the id2label dict
    num_labels = len(labels)

    # Load the pretrained model. For Japanese models (e.g. Tohoku University's), from_pretrained needs to be changed: do not use AutoConfig,
    # and load the tokenizer and model with BertJapaneseTokenizer.from_pretrained() and BertForTokenClassification.from_pretrained() respectively; the imports must be changed accordingly
    # Load pretrained model and tokenizer
    # Distributed training:
    # The .from_pretrained methods guarantee that only one local process can concurrently
    # download model & vocab.

    config = AutoConfig.from_pretrained(
        model_args.config_name if model_args.config_name else model_args.model_name_or_path,
        num_labels=num_labels,
        id2label=label_map,
        label2id={label: i for i, label in enumerate(labels)},
        cache_dir=model_args.cache_dir,
    )
    tokenizer = AutoTokenizer.from_pretrained(
        model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path,
        cache_dir=model_args.cache_dir,
        use_fast=model_args.use_fast,
    )
    model = AutoModelForTokenClassification.from_pretrained(
        model_args.model_name_or_path,
        from_tf=bool(".ckpt" in model_args.model_name_or_path),
        config=config,
        cache_dir=model_args.cache_dir,
    )

    # Here the NerDataset class loads the training and validation sets
    # NerDataset generates labels for the tokens as follows (see utils_ner.py for details):
    # tokens:   [CLS]                'is' 'this' 'jack'   ##son                ##ville              '?'  [SEP]
    # labels:   [pad_token_label_id] 'O'  'O'    'B-LOC'  [pad_token_label_id] [pad_token_label_id] 'O'  [pad_token_label_id]
    # i.e. special tokens such as [CLS] and [SEP], as well as every sub-token except the first of a tokenized word (##son, ##ville), get the label [pad_token_label_id]; these positions are also ignored later when computing the loss
    train_dataset = (
        NerDataset(
            data_dir=data_args.data_dir,
            tokenizer=tokenizer,
            labels=labels,
            model_type=config.model_type,
            max_seq_length=data_args.max_seq_length,
            overwrite_cache=data_args.overwrite_cache,
            mode=Split.train, # returns the string 'train'
            local_rank=training_args.local_rank,
        )
        if training_args.do_train
        else None
    )
    eval_dataset = (
        NerDataset(
            data_dir=data_args.data_dir,
            tokenizer=tokenizer,
            labels=labels,
            model_type=config.model_type,
            max_seq_length=data_args.max_seq_length,
            overwrite_cache=data_args.overwrite_cache,
            mode=Split.dev,
            local_rank=training_args.local_rank,
        )
        if training_args.do_eval
        else None
    )
    
    # This function extracts, from the predicted label sequences and the true label sequences, the label indices that are not equal to ignore_index (-100)
    def align_predictions(predictions: np.ndarray, label_ids: np.ndarray) -> Tuple[List[int], List[int]]:
        preds = np.argmax(predictions, axis=2)

        batch_size, seq_len = preds.shape

        out_label_list = [[] for _ in range(batch_size)]
        preds_list = [[] for _ in range(batch_size)]

        for i in range(batch_size):
            for j in range(seq_len):
                if label_ids[i, j] != nn.CrossEntropyLoss().ignore_index:
                    out_label_list[i].append(label_map[label_ids[i][j]])
                    preds_list[i].append(label_map[preds[i][j]])

        return preds_list, out_label_list

    # This function evaluates the model's output (f1, precision, recall): all predicted sequences are gathered into one long list, the corresponding gold sequences into another, and the three metrics are computed on these two lists
    def compute_metrics(p: EvalPrediction) -> Dict:
        preds_list, out_label_list = align_predictions(p.predictions, p.label_ids)
        return {
            "precision": precision_score(out_label_list, preds_list),
            "recall": recall_score(out_label_list, preds_list),
            "f1": f1_score(out_label_list, preds_list),
        }

    # Initialize our Trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        compute_metrics=compute_metrics,
    )

    # Training
    if training_args.do_train:
        trainer.train(
            model_path=model_args.model_name_or_path if os.path.isdir(model_args.model_name_or_path) else None
        )
        trainer.save_model()
        # For convenience, we also re-save the tokenizer to the same directory,
        # so that you can share your model easily on huggingface.co/models =)
        if trainer.is_world_master():
            tokenizer.save_pretrained(training_args.output_dir)

    # Evaluation
    results = {}
    if training_args.do_eval and training_args.local_rank in [-1, 0]:
        logger.info("*** Evaluate ***")

        result = trainer.evaluate()

        output_eval_file = os.path.join(training_args.output_dir, "eval_results.txt")
        with open(output_eval_file, "w") as writer:
            logger.info("***** Eval results *****")
            for key, value in result.items():
                logger.info("  %s = %s", key, value)
                writer.write("%s = %s\n" % (key, value))

            results.update(result)

    # Predict
    if training_args.do_predict and training_args.local_rank in [-1, 0]:
        test_dataset = NerDataset(
            data_dir=data_args.data_dir,
            tokenizer=tokenizer,
            labels=labels,
            model_type=config.model_type,
            max_seq_length=data_args.max_seq_length,
            overwrite_cache=data_args.overwrite_cache,
            mode=Split.test,
            local_rank=training_args.local_rank,
        )

        predictions, label_ids, metrics = trainer.predict(test_dataset)
        preds_list, _ = align_predictions(predictions, label_ids)

        output_test_results_file = os.path.join(training_args.output_dir, "test_results.txt") # write the f1, precision and recall values
        with open(output_test_results_file, "w") as writer:
            for key, value in metrics.items():
                logger.info("  %s = %s", key, value)
                writer.write("%s = %s\n" % (key, value))

        # Save predictions
        output_test_predictions_file = os.path.join(training_args.output_dir, "test_predictions.txt")
        with open(output_test_predictions_file, "w") as writer:
            with open(os.path.join(data_args.data_dir, "test.txt"), "r") as f:
                example_id = 0 # id of the current example (sentence)
                for line in f:
                    if line.startswith("-DOCSTART-") or line == "" or line == "\n": # time to move on to the next sentence?
                        writer.write(line)
                        if not preds_list[example_id]: # see the branch below first: each predicted label is popped from the front of the list, so once all labels of a sentence have been popped,
                            example_id += 1            # preds_list[example_id] equals [] before the next sentence starts; this is used as the condition to increment the example (sentence) id by 1
                    elif preds_list[example_id]:
                        output_line = line.split()[0] + " " + preds_list[example_id].pop(0) + "\n"
                        writer.write(output_line)
                    else:
                        logger.warning("Maximum sequence length exceeded: No prediction for '%s'.", line.split()[0])

    return results


if __name__ == "__main__":
    main()

Notes on Feeding Data in a Custom Format

  • In many information extraction tasks we need to compute with BERT outputs at specific positions. For example, in event extraction or entity-relation extraction, we must locate the entities in each example and then process the BERT output vectors at those positions; when classifying the relation between two entities, the outputs at the two positions are concatenated and fed into a relation classifier.
  • Therefore, for every example we need the positions of each type of entity, as well as a dictionary that maps the original entity indices to their positions after tokenization. But the default inputs only include input_ids, token_type_ids, labels and so on, so how do we pass the list of positions and the mapping dict to the model?
  • Two places need to be changed. The first is to define a custom Dataset that adds features such as entity_locations and old_new_index_dict when reading the input features.
  • The second is to create a custom class that inherits from DataCollator, because in the Hugging Face code, when the DataLoader loads the Dataset, it is actually the DataCollator that decides what shape of input is handed to the DataLoader. The custom DataCollator class therefore specifies which kinds of data a batch contains and what shape each kind has. The result is a {str: Any} dictionary, where Any can be a batched tensor (of shape (b_s, max_len, hidden_dim)) or a custom data type such as a batch of entity positions (a list of lists) or a batch of entity index-mapping dicts (a list of dicts). A minimal sketch of such a collator is shown after this list.
  • In this way, the DataLoader yields such a dictionary as input each time; model(**input) passes the entries to the model by name, and the model can then work with them.
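
A minimal sketch of the idea described above (an assumption of how it could be written, not the library's own code; the feature names entity_locations and old_new_index_dict are the hypothetical extra features mentioned above, and the model's forward must accept keyword arguments with the same names):

import torch
from typing import Any, Dict, List

class CustomNerCollator:
    """Collates a list of per-example feature dicts into one batch dict."""

    def __call__(self, features: List[Dict[str, Any]]) -> Dict[str, Any]:
        batch = {
            # tensor-like features are stacked into (batch_size, max_len) tensors
            "input_ids": torch.stack([torch.as_tensor(f["input_ids"]) for f in features]),
            "attention_mask": torch.stack([torch.as_tensor(f["attention_mask"]) for f in features]),
            "labels": torch.stack([torch.as_tensor(f["labels"]) for f in features]),
            # non-tensor features are kept as plain Python objects, one per example
            "entity_locations": [f["entity_locations"] for f in features],      # list of lists
            "old_new_index_dict": [f["old_new_index_dict"] for f in features],  # list of dicts
        }
        return batch

# Usage sketch: loader = DataLoader(dataset, batch_size=..., collate_fn=CustomNerCollator())
# Each batch from the loader is a dict that can be passed to the model with model(**batch).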