4 changes: 2 additions & 2 deletions docs/source/ko/_toctree.yml
@@ -37,8 +37,8 @@
   title: (in translation) Backbones
 - local: in_translation
   title: (in translation) Feature extractors
-- local: in_translation
-  title: (in translation) Processors
+- local: main_classes/processors
+  title: Processors
 - local: tokenizer_summary
   title: Tokenizer summary
 - local: pad_truncation
145 changes: 145 additions & 0 deletions docs/source/ko/main_classes/processors.md
@@ -0,0 +1,145 @@
<!--Copyright 2020 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# Processors[[processors]]

Processors can mean two different things in the Transformers library:
- objects that preprocess inputs for multi-modal models such as [Wav2Vec2](../model_doc/wav2vec2) (speech and text) or [CLIP](../model_doc/clip) (text and vision)
- deprecated objects that were used in older versions of the library to preprocess data for GLUE or SQuAD

## Multi-modal processors[[transformers.ProcessorMixin]]

Every multi-modal model needs an object that encodes or decodes data from several modalities (text, vision, audio, and so on). This is handled by objects called processors, which group together two or more processing objects such as tokenizers (for the text modality), image processors (for vision), and feature extractors (for audio).

These processors inherit from the following base class, which implements the saving and loading functionality:

[[autodoc]] ProcessorMixin
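
To make the grouping idea concrete, here is a minimal, self-contained sketch. It is not the `ProcessorMixin` implementation: `ToyProcessor`, `toy_tokenizer`, and `toy_feature_extractor` are hypothetical names invented for illustration, and saving and loading are left out. It only shows how a single object can route each modality to its own processing component and merge the results.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class ToyProcessor:
    """Hypothetical stand-in for a multi-modal processor (not ProcessorMixin itself)."""

    tokenizer: Callable[[str], List[int]]  # handles the text modality
    feature_extractor: Callable[[List[float]], List[float]]  # handles the audio modality

    def __call__(
        self, text: Optional[str] = None, audio: Optional[List[float]] = None
    ) -> Dict[str, list]:
        if text is None and audio is None:
            raise ValueError("Provide at least one of `text` or `audio`.")
        outputs: Dict[str, list] = {}
        if text is not None:
            outputs["input_ids"] = self.tokenizer(text)
        if audio is not None:
            outputs["input_values"] = self.feature_extractor(audio)
        return outputs


def toy_tokenizer(text: str) -> List[int]:
    # Map each whitespace-separated word to its length; a real tokenizer uses a vocabulary.
    return [len(word) for word in text.split()]


def toy_feature_extractor(samples: List[float]) -> List[float]:
    # Scale the waveform so that its largest absolute value becomes 1.0.
    peak = max(abs(s) for s in samples) or 1.0
    return [s / peak for s in samples]


processor = ToyProcessor(tokenizer=toy_tokenizer, feature_extractor=toy_feature_extractor)
batch = processor(text="hello world", audio=[0.5, -2.0, 1.0])
```

Calling the processor with both modalities returns one dictionary that holds the text inputs and the audio inputs side by side, which is the shape of output a multi-modal model typically consumes.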

## 더 이상 μ‚¬μš©λ˜μ§€ μ•ŠλŠ” ν”„λ‘œμ„Έμ„œ[[transformers.DataProcessor]]

λͺ¨λ“  ν”„λ‘œμ„Έμ„œλŠ” [`~data.processors.utils.DataProcessor`]와 λ™μΌν•œ μ•„ν‚€ν…μ²˜λ₯Ό λ”°λ¦…λ‹ˆλ‹€. ν”„λ‘œμ„Έμ„œλŠ” [`~data.processors.utils.InputExample`]의 λͺ©λ‘μ„ λ°˜ν™˜ν•©λ‹ˆλ‹€. μ΄λŸ¬ν•œ [`~data.processors.utils.InputExample`]은 λͺ¨λΈμ— μž…λ ₯ν•˜κΈ° μœ„ν•΄ [`~data.processors.utils.InputFeatures`]둜 λ³€ν™˜λ  수 μžˆμŠ΅λ‹ˆλ‹€.

[[autodoc]] data.processors.utils.DataProcessor

[[autodoc]] data.processors.utils.InputExample

[[autodoc]] data.processors.utils.InputFeatures
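
The flow from raw rows to examples to features can be sketched in plain Python. Everything below is hypothetical and only mimics the general shape of the pipeline: the `Toy*` classes, the whitespace "tokenizer", and the use of id `0` for both unknown words and padding are assumptions made for illustration, not the library's behavior.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class ToyInputExample:
    guid: str
    text_a: str
    text_b: Optional[str] = None
    label: Optional[str] = None


@dataclass
class ToyInputFeatures:
    input_ids: List[int]
    attention_mask: List[int]
    label: Optional[int] = None


class ToyPairProcessor:
    """Hypothetical processor: turns raw (text_a, text_b, label) rows into examples."""

    def get_labels(self) -> List[str]:
        return ["0", "1"]

    def create_examples(
        self, rows: List[Tuple[str, str, str]], set_type: str
    ) -> List[ToyInputExample]:
        return [
            ToyInputExample(guid=f"{set_type}-{i}", text_a=a, text_b=b, label=label)
            for i, (a, b, label) in enumerate(rows)
        ]


def convert_examples_to_features(
    examples: List[ToyInputExample],
    vocab: Dict[str, int],
    label_list: List[str],
    max_length: int,
) -> List[ToyInputFeatures]:
    label_map = {label: i for i, label in enumerate(label_list)}
    features = []
    for example in examples:
        words = example.text_a.split() + (example.text_b.split() if example.text_b else [])
        # Look up each word, truncate to max_length; 0 doubles as unknown and padding id.
        ids = [vocab.get(word, 0) for word in words][:max_length]
        mask = [1] * len(ids)
        padding = max_length - len(ids)
        features.append(
            ToyInputFeatures(
                input_ids=ids + [0] * padding,
                attention_mask=mask + [0] * padding,
                label=label_map.get(example.label),
            )
        )
    return features


processor = ToyPairProcessor()
examples = processor.create_examples([("a cat", "a dog", "1")], set_type="dev")
features = convert_examples_to_features(
    examples,
    vocab={"a": 1, "cat": 2, "dog": 3},
    label_list=processor.get_labels(),
    max_length=6,
)
```

The examples keep the data human-readable (strings and string labels), while the features hold only fixed-length integer lists, which is why the conversion is a separate step.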

## GLUE[[transformers.glue_convert_examples_to_features]]

[General Language Understanding Evaluation (GLUE)](https://gluebenchmark.com/) is a benchmark that evaluates the performance of models across a diverse set of existing NLU tasks. It was released together with the paper [GLUE: A multi-task benchmark and analysis platform for natural language understanding](https://openreview.net/pdf?id=rJ4km2R5t7).

This library hosts a total of 10 processors for the following tasks: MRPC, MNLI, MNLI (mismatched), CoLA, SST2, STSB, QQP, QNLI, RTE, and WNLI.

Those processors are:

- [`~data.processors.utils.MrpcProcessor`]
- [`~data.processors.utils.MnliProcessor`]
- [`~data.processors.utils.MnliMismatchedProcessor`]
- [`~data.processors.utils.Sst2Processor`]
- [`~data.processors.utils.StsbProcessor`]
- [`~data.processors.utils.QqpProcessor`]
- [`~data.processors.utils.QnliProcessor`]
- [`~data.processors.utils.RteProcessor`]
- [`~data.processors.utils.WnliProcessor`]

Additionally, the following method can be used to load values from a data file and convert them to a list of [`~data.processors.utils.InputExample`].

[[autodoc]] data.processors.glue.glue_convert_examples_to_features


## XNLI[[xnli]]

[The Cross-Lingual NLI Corpus (XNLI)](https://www.nyu.edu/projects/bowman/xnli/) is a benchmark that evaluates the quality of cross-lingual text representations. XNLI is a crowd-sourced dataset based on [*MultiNLI*](http://www.nyu.edu/projects/bowman/multinli/): pairs of text are labeled with textual entailment annotations for 15 different languages (including both high-resource languages such as English and low-resource languages such as Swahili).

It was released together with the paper [XNLI: Evaluating Cross-lingual Sentence Representations](https://huggingface.co/papers/1809.05053).

This library hosts the processor to load the XNLI data:

- [`~data.processors.utils.XnliProcessor`]

Because the gold labels are available for the test set, evaluation is performed on the test set.

An example using these processors is given in the [run_xnli.py](https://github.com/huggingface/transformers/tree/main/examples/pytorch/text-classification/run_xnli.py) script.


## SQuAD[[squad]]

[The Stanford Question Answering Dataset (SQuAD)](https://rajpurkar.github.io/SQuAD-explorer//) is a benchmark that evaluates the performance of models on question answering. Two versions are available, v1.1 and v2.0. The first version (v1.1) was released together with the paper [SQuAD: 100,000+ Questions for Machine Comprehension of Text](https://huggingface.co/papers/1606.05250). The second version (v2.0) was released alongside the paper [Know What You Don't Know: Unanswerable Questions for SQuAD](https://huggingface.co/papers/1806.03822).

This library hosts a processor for each of the two versions:

### Processors[[transformers.data.processors.squad.SquadProcessor]]

Those processors are:

- [`~data.processors.utils.SquadV1Processor`]
- [`~data.processors.utils.SquadV2Processor`]

They both inherit from the abstract class [`~data.processors.utils.SquadProcessor`].

[[autodoc]] data.processors.squad.SquadProcessor
    - all

Additionally, the following method can be used to convert SQuAD examples into [`~data.processors.utils.SquadFeatures`] that can be used as model inputs.

[[autodoc]] data.processors.squad.squad_convert_examples_to_features
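
The `doc_stride` argument controls how a context that is too long for one input is split into overlapping windows. The sketch below is a simplified illustration of that sliding-window idea, not the library's code: `sliding_windows` is a hypothetical helper, and it ignores the room taken by the question and by special tokens, which the real conversion has to account for.

```python
from typing import List, Tuple


def sliding_windows(num_tokens: int, max_window: int, doc_stride: int) -> List[Tuple[int, int]]:
    """Return (start, length) spans that together cover `num_tokens` context tokens."""
    if max_window <= 0 or doc_stride <= 0:
        raise ValueError("max_window and doc_stride must be positive")
    spans = []
    start = 0
    while start < num_tokens:
        # Each window takes as many tokens as fit, up to max_window.
        length = min(max_window, num_tokens - start)
        spans.append((start, length))
        if start + length == num_tokens:
            break  # the last window reaches the end of the context
        # Advance by the stride so that consecutive windows overlap.
        start += min(length, doc_stride)
    return spans


spans = sliding_windows(num_tokens=10, max_window=4, doc_stride=2)
```

With a window of 4 tokens and a stride of 2, each window shares its last two tokens with the start of the next one, so an answer that straddles a window boundary still appears whole in at least one window.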


These processors, as well as the aforementioned method, can be used with files containing the data as well as with the *tensorflow_datasets* package. Examples are given below.


### Example usage[[example-usage]]

Here is an example using the processors as well as the conversion method with data files:

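```python
# Import path assumes a Transformers version that still ships these deprecated helpers.
from transformers import SquadV1Processor, SquadV2Processor, squad_convert_examples_to_features

# The data directories, `tokenizer`, `args`, `evaluate`, and the length settings
# are placeholders that the caller is expected to define.

# Load the dev examples with the V2 processor
processor = SquadV2Processor()
examples = processor.get_dev_examples(squad_v2_data_dir)

# Or load them with the V1 processor (this overwrites the V2 examples above)
processor = SquadV1Processor()
examples = processor.get_dev_examples(squad_v1_data_dir)

# Convert the examples into features that can be fed to the model
features = squad_convert_examples_to_features(
    examples=examples,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length,
    doc_stride=args.doc_stride,
    max_query_length=max_query_length,
    is_training=not evaluate,
)
```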

Using *tensorflow_datasets* is as easy as using a data file:

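```python
import tensorflow_datasets as tfds
# Import path assumes a Transformers version that still ships these deprecated helpers.
from transformers import SquadV1Processor, squad_convert_examples_to_features

# `tokenizer`, `args`, `evaluate`, and the length settings are placeholders defined by the caller.

# tensorflow_datasets only handles SQuAD v1.
tfds_examples = tfds.load("squad")
examples = SquadV1Processor().get_examples_from_dataset(tfds_examples, evaluate=evaluate)

# Convert the examples into features that can be fed to the model
features = squad_convert_examples_to_features(
    examples=examples,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length,
    doc_stride=args.doc_stride,
    max_query_length=max_query_length,
    is_training=not evaluate,
)
```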

Another example using these processors is given in the [run_squad.py](https://github.com/huggingface/transformers/tree/main/examples/legacy/question-answering/run_squad.py) script.