
pytorch cheat_list

A running list of small PyTorch operations.

torch==1.7.0

tensor

torch.stack

Turns a List[Tensor] into a single tensor. torch.stack(tensors, dim=0, out=None) concatenates a sequence of tensors along a new dimension.

b = torch.randn(4)
a = torch.stack([b, b], dim = 0) # a.shape = (2,4)
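
As a side-by-side I added for contrast, torch.cat joins along an existing dimension instead of creating a new one:

b = torch.randn(4)
torch.stack([b, b], dim=0).shape  # torch.Size([2, 4]): stacks along a new dimension
torch.cat([b, b], dim=0).shape    # torch.Size([8]): concatenates along an existing dimension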

torch.gather

torch.gather(input, dim, index, *, sparse_grad=False, out=None) → Tensor

The docs describe it as: gathers values along an axis specified by dim.

A use case that makes this click for me: in a batch of variable-length sequences, gathering the last (or k-th from last) element of each row. Padded sequences typically look like inputs = [[1,2,3,0,0], [2,3,4,5,0]], and gather lets you pick out the last real element of each one. Note that the output tensor has the same shape as index.

inputs = torch.tensor([[1, 2, 3, 0, 0],
                       [2, 3, 4, 5, 0]])
index = torch.tensor([[2], [3]], dtype=torch.long)
last_inputs = torch.gather(inputs, 1, index)
"""tensor([[3],
           [5]])"""
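
One way to build that index from the data itself, a small sketch I added (it assumes 0 is the padding value, as in the example above):

lengths = (inputs != 0).sum(dim=1)            # tensor([3, 4])
index = (lengths - 1).unsqueeze(1)            # tensor([[2], [3]])
last_inputs = torch.gather(inputs, 1, index)  # tensor([[3], [5]])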

torch.expand

Expands a tensor to a larger size; the values are broadcast automatically (expand returns a view rather than actually copying memory). Very handy.

a = torch.randint(1, 5, size=(2, 3))
# tensor([[3, 1, 1],
#         [1, 2, 4]])
a = a.unsqueeze(2).expand(2, 3, 3)
"""
tensor([[[3, 3, 3],
         [1, 1, 1],
         [1, 1, 1]],

        [[1, 1, 1],
         [2, 2, 2],
         [4, 4, 4]]])"""
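
Two extra details I find worth noting (my own additions): expand only works on dimensions of size 1, and -1 leaves a dimension unchanged.

b = torch.randn(3, 1)
b.expand(3, 4)    # ok: only the size-1 dimension is broadcast
b.expand(-1, 4)   # same result, -1 keeps that dimension as it is
# b.expand(6, 4)  # RuntimeError: a non-singleton dimension cannot be expanded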

torch.repeat

repeat(*sizes) -> Tensor: tiles the tensor the given number of times along each dimension. It behaves somewhat like an explicit broadcast.

x = torch.tensor([1, 2, 3])  # x.shape = [3]
x.repeat(4, 2)               # result shape = [4, 6]
x.repeat(4, 2, 1)            # result shape = [4, 2, 3]
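
The practical difference from expand, as I understand it: repeat allocates new memory and copies the data, while expand only creates a view over the original storage.

x = torch.tensor([1, 2, 3])
r = x.repeat(4, 2)               # new storage with 24 elements
v = x.unsqueeze(0).expand(4, 3)  # view over the original 3 elements
print(r.storage().size(), v.storage().size())  # 24 3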

torch.nn.functional

F.softmax

F.softmax(input, dim=None) normalizes along dim so that the entries sum to 1 along that dimension. For a multi-dimensional tensor a with softmax over dim=0, this means a.sum(dim=0) equals torch.ones(a.shape[1:]) (in einsum terms, einsum('ijk -> jk', a) is all ones for a 3-D a).

import torch.nn.functional as F
a = torch.randn(4,5)
a = F.softmax(a, dim = 0)
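
A quick sanity check of the claim above: after softmax over dim=0, the entries sum to 1 along that dimension.

print(torch.allclose(a.sum(dim=0), torch.ones(5)))  # True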

torch.nn

I used to rely on Hugging Face's from_pretrained for loading models, but in everyday tasks torch.load actually comes up more often, so here is a note on how I use it.

torch.load & torch.save

Usually we save only the model's parameters (the state dict) rather than the whole model object. If you only need to load part of the parameters, pass strict=False to load_state_dict. Note that torch.load returns a dict here, not a model.

#model ... after training
torch.save(model.state_dict(), cached_file_path)
model_state = torch.load(cached_file_path)
model.load_state_dict(model_state, strict=False)
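
A variant I often need when a checkpoint saved on GPU has to be loaded on a CPU-only machine (cached_file_path is the same hypothetical path as above):

model_state = torch.load(cached_file_path, map_location="cpu")
model.load_state_dict(model_state, strict=False)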

Handy tricks

whole word mask

In BERT and other language models, a piece of text is first tokenized; because of the OOV problem, some English words get split into sub-word tokens, e.g. trying becomes try, ##ing. When we build graphs or otherwise work at word granularity, we need to average the token outputs back into one vector per word. How do we do that?

# encoder_outputs / pos_id / slen come from the surrounding model code;
# pos_id[i] maps every token position to its 1-based word id (0 for padding)
encoder_output = encoder_outputs[i]  # [slen, bert_hidden_size]
word_num = 123                       # number of words in example i
word_index = (torch.arange(word_num) + 1).unsqueeze(1).expand(-1, slen)  # [word_num, slen]
words = pos_id[i].unsqueeze(0).expand(word_num, -1)                      # [word_num, slen]
select_metrix = (word_index == words).float()                            # [word_num, slen]
# average token outputs -> word outputs
word_total_numbers = torch.sum(select_metrix, dim=-1).unsqueeze(-1).expand(-1, slen)  # [word_num, slen]
select_metrix = torch.where(word_total_numbers > 0, select_metrix / word_total_numbers, select_metrix)
x = torch.mm(select_metrix, encoder_output)  # [word_num, bert_hidden_size]
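
A minimal self-contained sketch of the same idea with made-up numbers (2 words, 4 tokens, hidden size 3); the convention that pos_id holds 1-based word ids with 0 for padding is my assumption:

slen, hidden = 4, 3
encoder_output = torch.randn(slen, hidden)  # token-level outputs
pos_id_i = torch.tensor([1, 1, 2, 0])       # tokens 0-1 form word 1, token 2 is word 2, token 3 is padding
word_num = 2

word_index = (torch.arange(word_num) + 1).unsqueeze(1).expand(-1, slen)  # [word_num, slen]
words = pos_id_i.unsqueeze(0).expand(word_num, -1)                       # [word_num, slen]
select_metrix = (word_index == words).float()
counts = select_metrix.sum(dim=-1, keepdim=True).expand(-1, slen)
select_metrix = torch.where(counts > 0, select_metrix / counts, select_metrix)
word_output = torch.mm(select_metrix, encoder_output)                    # [word_num, hidden]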

ALBERT: smaller but slower?

Recently, while taking part in a reading-comprehension competition, I have been testing lots of models and was surprised to find that the current SOTA on reading-comprehension leaderboards is ALBERT, an unassuming model best known for being small. That made me curious about this "small" model, hence these paper notes.

Abstract

It is well known that bigger models do better on downstream tasks, but hardware and GPU memory limits mean a model cannot grow without bound. This paper proposes ALBERT, which beats BERT across the board, outperforming BERT-large with fewer parameters.

What is different

1. Fewer embedding parameters

Mapping from the one-hot vocabulary input to the hidden-size embedding is a $V \times H$ fully connected layer. The trick here is to insert an intermediate hidden layer, so the projection becomes $V \times E + E \times H$ parameters instead. That makes a very large $H$ affordable, e.g. $H = 4096$ for xxlarge.
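
A back-of-the-envelope check using the numbers the paper quotes ($V \approx 30000$, $E = 128$, $H = 4096$ for xxlarge); the arithmetic below is mine:

V, E, H = 30000, 128, 4096
print(V * H)          # 122880000 parameters for a direct V x H embedding
print(V * E + E * H)  # 4364288 parameters after the factorization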

2. Cross-layer parameter sharing

Very simple: the original model is roughly $F(x) = f_n(f_{n-1}(\dots f_1(x)))$, whereas now it becomes $F(x) = f(f(\dots f(x)))$ with a single shared $f$. I keep wondering: even though $f(x)$ is nonlinear, could some other function be fitted to approximate $F(x)$, since it is just $f$ applied over and over? That might be a way to compress ALBERT further. A minimal sketch of the sharing idea follows.
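
This is not the real ALBERT code (which shares by layer groups, as the forward pass quoted further below shows), just a sketch of what cross-layer sharing means: one layer module applied num_layers times.

import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    def __init__(self, hidden_size, num_layers):
        super().__init__()
        # a single block whose parameters are reused at every "layer"
        self.layer = nn.Linear(hidden_size, hidden_size)  # stand-in for one transformer block
        self.num_layers = num_layers

    def forward(self, x):
        for _ in range(self.num_layers):
            x = torch.tanh(self.layer(x))
        return x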

3. SOP

It proposes a new self-supervised objective, SOP (sentence-order prediction). Where BERT predicts whether two sentences are consecutive, ALBERT has to predict whether the order of two consecutive sentences has been swapped.

In the ablations, SOP improves RACE, i.e. the reading-comprehension task, by 2.3 points, which is impressive.

Experiments

I will not go through the experiments in detail; my main takeaways:

  1. Extra in-domain pretraining helps, but out-of-domain pretraining can hurt.
  2. Dropout can be dropped when the model does not overfit; combining batch normalization with dropout may even hurt performance.
  3. The hidden size of 4096 may be the main reason ALBERT is so strong.
  4. With cross-layer sharing the network could in principle be arbitrarily deep, but in the experiments 24 layers is no better than 12, and extremely wide is not better either. This is alchemy-level hyperparameter tuning that is hard to judge by hand.
  5. Intuitively, $f(f(f(\dots f(x))))$ might make the outputs of consecutive layers too similar, but the experiments show it does not. Maybe the embedding layer is already strong enough? Just a guess.

Possible improvements

  1. Sparse-matrix optimizations and modified attention.
  2. Whether SOP generalizes to other settings.

Model walkthrough

The main class

forward passes the input through the embeddings layer and then through the encoder. Note that by default the pooled output takes the first token ([CLS]) of the last hidden layer and runs it through a Linear + Tanh (the pooler and pooler_activation below).

class AlbertModel(AlbertPreTrainedModel):

    config_class = AlbertConfig
    load_tf_weights = load_tf_weights_in_albert
    base_model_prefix = "albert"

    def __init__(self, config, add_pooling_layer=True):
        super().__init__(config)

        self.config = config
        self.embeddings = AlbertEmbeddings(config)
        self.encoder = AlbertTransformer(config)
        if add_pooling_layer:
            self.pooler = nn.Linear(config.hidden_size, config.hidden_size)
            self.pooler_activation = nn.Tanh()
        else:
            self.pooler = None
            self.pooler_activation = None

        self.init_weights()

    def get_input_embeddings(self):
        return self.embeddings.word_embeddings

    def set_input_embeddings(self, value):
        self.embeddings.word_embeddings = value

    def _resize_token_embeddings(self, new_num_tokens):
        old_embeddings = self.embeddings.word_embeddings
        new_embeddings = self._get_resized_embeddings(old_embeddings, new_num_tokens)
        self.embeddings.word_embeddings = new_embeddings
        return self.embeddings.word_embeddings

    def _prune_heads(self, heads_to_prune):
        """Prunes heads of the model.
        heads_to_prune: dict of {layer_num: list of heads to prune in this layer}
        ALBERT has a different architecture in that its layers are shared across groups, which then has inner groups.
        If an ALBERT model has 12 hidden layers and 2 hidden groups, with two inner groups, there
        is a total of 4 different layers.

        These layers are flattened: the indices [0,1] correspond to the two inner groups of the first hidden layer,
        while [2,3] correspond to the two inner groups of the second hidden layer.

        Any layer with in index other than [0,1,2,3] will result in an error.
        See base class PreTrainedModel for more information about head pruning
        """
        for layer, heads in heads_to_prune.items():
            group_idx = int(layer / self.config.inner_group_num)
            inner_group_idx = int(layer - group_idx * self.config.inner_group_num)
            self.encoder.albert_layer_groups[group_idx].albert_layers[inner_group_idx].attention.prune_heads(heads)

    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
    @add_code_sample_docstrings(
        tokenizer_class=_TOKENIZER_FOR_DOC,
        checkpoint="albert-base-v2",
        output_type=BaseModelOutputWithPooling,
        config_class=_CONFIG_FOR_DOC,
    )
    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        output_attentions=None,
        output_hidden_states=None,
        return_dict=None,
    ):
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        if input_ids is not None and inputs_embeds is not None:
            raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
        elif input_ids is not None:
            input_shape = input_ids.size()
        elif inputs_embeds is not None:
            input_shape = inputs_embeds.size()[:-1]
        else:
            raise ValueError("You have to specify either input_ids or inputs_embeds")

        device = input_ids.device if input_ids is not None else inputs_embeds.device

        if attention_mask is None:
            attention_mask = torch.ones(input_shape, device=device)
        if token_type_ids is None:
            token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=device)

        extended_attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)
        extended_attention_mask = extended_attention_mask.to(dtype=self.dtype)  # fp16 compatibility
        extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0
        head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)

        embedding_output = self.embeddings(
            input_ids, position_ids=position_ids, token_type_ids=token_type_ids, inputs_embeds=inputs_embeds
        )
        encoder_outputs = self.encoder(
            embedding_output,
            extended_attention_mask,
            head_mask=head_mask,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        sequence_output = encoder_outputs[0]

        pooled_output = self.pooler_activation(self.pooler(sequence_output[:, 0])) if self.pooler is not None else None

        if not return_dict:
            return (sequence_output, pooled_output) + encoder_outputs[1:]

        return BaseModelOutputWithPooling(
            last_hidden_state=sequence_output,
            pooler_output=pooled_output,
            hidden_states=encoder_outputs.hidden_states,
            attentions=encoder_outputs.attentions,
        )
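
For completeness, a sketch of a typical way to call this class through transformers (exact defaults such as return_dict vary by library version, so I index the outputs, which works either way):

from transformers import AlbertModel, AlbertTokenizer

tokenizer = AlbertTokenizer.from_pretrained("albert-base-v2")
model = AlbertModel.from_pretrained("albert-base-v2")

inputs = tokenizer("ALBERT is smaller but slower?", return_tensors="pt")
outputs = model(**inputs)
sequence_output = outputs[0]  # last hidden states, [1, seq_len, 768] for albert-base-v2
pooled_output = outputs[1]    # [CLS] token after Linear + Tanh, [1, 768]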