HuggingFace Series: the transformers Library

LeoJeshua

transformers tutorial

NLP-Course

Libraries covered:
This course will teach you about natural language processing (NLP) using libraries from the Hugging Face ecosystem:

  • 🤗 transformers
  • 🤗 datasets
  • 🤗 tokenizers
  • 🤗 accelerate
  • as well as the Hugging Face Hub

Course outline:

  • Introduction: Chapters 1-4 (getting started with the transformers library)
    • How Transformer models work
    • How to load and use pretrained models from the Hugging Face Hub with the transformers library
    • Fine-tuning pretrained models
    • Sharing models on the Hub
  • Diving in: Chapters 5-8 (going deeper)
    • Using the datasets library
    • Using the tokenizers library
    • The main NLP tasks
    • How to ask for help
  • Advanced: Chapters 9-12 (beyond NLP)
    • Building and sharing demos on the Hub
    • Transformers can hear
    • Transformers can see
    • Optimizing for production

0.Setup

Install the transformers library:

pip install transformers

Verify the installation:

import transformers
print(transformers.__version__)

1.Transformer Models

1.1 NLP Concepts and Tasks

1.2 The transformers Library

Use pipeline() to load a pretrained model and solve a variety of NLP tasks.

Sentiment Analysis

from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier("I've been waiting for a HuggingFace course my whole life.")
>>> [{'label': 'POSITIVE', 'score': 0.9598047137260437}]
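When no checkpoint is given, the pipeline downloads a default model for the task. Below is a minimal sketch of two common variations: pinning an explicit checkpoint and passing several sentences at once. The checkpoint name (distilbert-base-uncased-finetuned-sst-2-english) is my own pick for illustration, not something specified above.

from transformers import pipeline

# Pin an explicit checkpoint instead of relying on the task's default model
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

# Pipelines also accept a list of inputs and return one result per sentence
classifier([
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
])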

Zero-shot Classification

from transformers import pipeline

classifier = pipeline("zero-shot-classification")
classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)
>>> {'sequence': 'This is a course about the Transformers library',
'labels': ['education', 'business', 'politics'],
'scores': [0.8445963859558105, 0.111976258456707, 0.043427448719739914]}

Text Generation

from transformers import pipeline

generator = pipeline("text-generation")
generator("In this course, we will teach you how to")
>>> [{'generated_text': 'In this course, we will teach you how to understand and use '
'data flow and data interchange when handling user data. We '
'will be working with one or more of the most commonly used '
'data flows — data flows of various types, as seen by the '
'HTTP'}]

Named Entity Recognition (NER)

Returns the words representing persons, organizations, or locations.

from transformers import pipeline

ner = pipeline("ner", grouped_entities=True)
ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")
>>> [{'entity_group': 'PER', 'score': 0.99816, 'word': 'Sylvain', 'start': 11, 'end': 18}, 
{'entity_group': 'ORG', 'score': 0.97960, 'word': 'Hugging Face', 'start': 33, 'end': 45},
{'entity_group': 'LOC', 'score': 0.99321, 'word': 'Brooklyn', 'start': 49, 'end': 57}
]

Fill Mask

from transformers import pipeline

unmasker = pipeline("fill-mask")
unmasker("This course will teach you all about <mask> models.", top_k=2)
>>> [{'sequence': 'This course will teach you all about mathematical models.',
'score': 0.19619831442832947,
'token': 30412,
'token_str': ' mathematical'},
{'sequence': 'This course will teach you all about computational models.',
'score': 0.04052725434303284,
'token': 38163,
'token_str': ' computational'}]

Question Answering

from transformers import pipeline

question_answerer = pipeline("question-answering")
question_answerer(
    question="Where do I work?",
    context="My name is Sylvain and I work at Hugging Face in Brooklyn",
)
>>> {'score': 0.6385916471481323, 'start': 33, 'end': 45, 'answer': 'Hugging Face'}

Summarization

from transformers import pipeline

summarizer = pipeline("summarization")
summarizer(
"""
America has changed dramatically during recent years. Not only has the number of
graduates in traditional engineering disciplines such as mechanical, civil,
electrical, chemical, and aeronautical engineering declined, but in most of
the premier American universities engineering curricula now concentrate on
and encourage largely the study of engineering science. As a result, there
are declining offerings in engineering subjects dealing with infrastructure,
the environment, and related issues, and greater concentration on high
technology subjects, largely supporting increasingly complex scientific
developments. While the latter is important, it should not be at the expense
of more traditional engineering.

Rapidly developing economies such as China and India, as well as other
industrial countries in Europe and Asia, continue to encourage and advance
the teaching of engineering. Both China and India, respectively, graduate
six and eight times as many traditional engineers as does the United States.
Other industrial countries at minimum maintain their output, while America
suffers an increasingly serious decline in the number of engineering graduates
and a lack of well-educated engineers.
"""
)
>>> [{'summary_text': ' America has changed dramatically during recent years . The '
'number of engineering graduates in the U.S. has declined in '
'traditional engineering disciplines such as mechanical, civil '
', electrical, chemical, and aeronautical engineering . Rapidly '
'developing economies such as China and India, as well as other '
'industrial countries in Europe and Asia, continue to encourage '
'and advance engineering .'}]

Translation

from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
translator("Ce cours est produit par Hugging Face.")
>>> [{'translation_text': 'This course is produced by Hugging Face.'}]

1.3 Transformer Architectures

We discussed how Transformer models work at a high level, and talked about the importance of transfer learning and fine-tuning.

A key aspect is that you can use the full architecture or only the encoder or only the decoder, depending on what kind of task you aim to solve. The following table summarizes this:

Model           | Examples                                   | Tasks
Encoder         | ALBERT, BERT, DistilBERT, ELECTRA, RoBERTa | Sentence classification, named entity recognition, extractive question answering
Decoder         | CTRL, GPT, GPT-2, Transformer XL           | Text generation
Encoder-Decoder | BART, T5, Marian, mBART                    | Summarization, translation, generative question answering

Encoder

  • Bi-directional: context from the left and the right | bidirectional attention
  • auto-encoding
  • NLU (Natural Language Understanding) | handles "understanding"
  • e.g. BERT

Decoder

  • Unidirectional: access only to the context on their left (or right!) | unidirectional attention
  • auto-regressive
  • NLG (Natural Language Generation) | handles "generation"
  • e.g. GPT

Encoder-Decoder

  • seq2seq
  • e.g. T5, BART
  • features:
    • Weights are not necessarily shared across the encoder and decoder
    • The input distribution is different from the output distribution
    • The output length is independent of the input length
    • A seq2seq model can be assembled from any combination of encoder and decoder
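To make the mapping concrete, here is a minimal loading sketch with one checkpoint per family; the checkpoint names (distilbert-base-uncased, gpt2, t5-small) are my own picks for illustration, not prescribed by the course.

from transformers import pipeline

# Encoder-only (auto-encoding, BERT family): understanding tasks such as fill-mask
unmasker = pipeline("fill-mask", model="distilbert-base-uncased")

# Decoder-only (auto-regressive, GPT family): text generation
generator = pipeline("text-generation", model="gpt2")

# Encoder-decoder (seq2seq, T5/BART family): summarization, translation, generative QA
summarizer = pipeline("summarization", model="t5-small")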

1.4 Bias and Limitations

Bias and limitations in the training data lead to bias and limitations in the model.
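The course illustrates this with a fill-mask probe; here is a minimal sketch along those lines, assuming the bert-base-uncased checkpoint, whose web-scale pretraining data can surface gendered associations in the top predictions.

from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

# Compare the top predictions for two sentences that differ only in gender
result = unmasker("This man works as a [MASK].")
print([r["token_str"] for r in result])

result = unmasker("This woman works as a [MASK].")
print([r["token_str"] for r in result])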

Quiz

  • Language-model pretraining is usually self-supervised
  • Text summarization is a good fit for a seq2seq model
  • Factors that can introduce model bias include bias in the data, bias in the pretrained model, and flaws in the evaluation metric, among others

2.Using 🤗 transformers

As you saw in Chapter 1, Transformer models are usually very large. With millions to tens of billions of parameters, training and deploying these models is a complicated undertaking. Furthermore, with new models being released almost daily and each having its own implementation, trying them all out is no easy task.

The 🤗 Transformers library was created to solve this problem. Its goal is to provide a single API through which any Transformer model can be loaded, trained, and saved. The library's main features are:

  • Ease of use: downloading, loading, and using a state-of-the-art NLP model for inference takes only two lines of code.
  • Flexibility: at their core, all models are simple PyTorch nn.Module or TensorFlow tf.keras.Model classes and can be handled like any other model in their respective machine learning (ML) frameworks.
  • Simplicity: hardly any abstractions are made across the library. "All in one file" is a core concept: a model's forward pass is defined entirely in a single file, so the code itself is understandable and hackable.

This last feature makes 🤗 Transformers quite different from other ML libraries. The models are not built on modules shared across files; instead, each model has its own layers. Besides making the models more approachable and understandable, this lets you easily experiment on one model without affecting the others.
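A quick way to see the "plain nn.Module" point in practice; the checkpoint name below is just a small example choice on my part.

import torch.nn as nn
from transformers import AutoModel

model = AutoModel.from_pretrained("distilbert-base-uncased")

# The loaded model is an ordinary PyTorch module, so the usual tooling applies
print(isinstance(model, nn.Module))
print(sum(p.numel() for p in model.parameters()))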

This chapter will begin with an end-to-end example in which we use a model and a tokenizer together to replicate the pipeline() function introduced in Chapter 1. Next, we'll discuss the model API: we'll dive into the model and configuration classes and show you how to load a model and how it processes numerical inputs to output predictions.

Then we'll look at the tokenizer API, the other main component of the pipeline() function. Tokenizers take care of the first and last processing steps, handling the conversion from text to numerical inputs for the neural network and the conversion back to text when needed. Finally, we'll show you how to send multiple sentences through a model in a prepared batch, then wrap it all up with a closer look at the high-level tokenizer() function.

Behind the pipeline
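As a preview, here is a minimal sketch of the steps hidden inside pipeline("sentiment-analysis"), assuming the DistilBERT SST-2 checkpoint (my assumption, matching the default used earlier): the tokenizer turns text into tensors, the model turns tensors into logits, and a softmax plus the id2label mapping reproduces the label/score output seen in Chapter 1.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

# Step 1: preprocessing - text to input IDs and attention mask (PyTorch tensors)
inputs = tokenizer(
    "I've been waiting for a HuggingFace course my whole life.",
    return_tensors="pt",
)

# Step 2: the model outputs raw logits
with torch.no_grad():
    logits = model(**inputs).logits

# Step 3: postprocessing - softmax to probabilities, then map indices to labels
probs = torch.nn.functional.softmax(logits, dim=-1)
print(probs)
print(model.config.id2label)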

  • Title: HuggingFace Series: the transformers Library
  • Author: LeoJeshua
  • Created at : 2025-02-14 12:38:14
  • Updated at : 2025-03-10 20:22:55
  • Link: https://leojeshua.github.io/HuggingFace/HuggingFace-Transformers/
  • License: This work is licensed under CC BY-NC-SA 4.0.