
Commit f14a3ee

pco111 authored and Rocketknight1 committed
Revert changes in docstring
1 parent f39d67a commit f14a3ee

File tree

1 file changed: +50 -65 lines changed

src/transformers/tokenization_utils_base.py

Lines changed: 50 additions & 65 deletions
@@ -3267,22 +3267,41 @@ def pad(
         verbose: bool = True,
     ) -> BatchEncoding:
         """
-        Pad a single encoded input or a batch of encoded inputs up to the maximum length of the batch or up to a
-        given maximum length. Padding side can be specified on the left or on the right.
+        Pad a single encoded input or a batch of encoded inputs up to predefined length or to the max sequence length
+        in the batch.
+
+        Padding side (left/right) padding token ids are defined at the tokenizer level (with `self.padding_side`,
+        `self.pad_token_id` and `self.pad_token_type_id`).
+
+        Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the
+        text followed by a call to the `pad` method to get a padded encoding.
+
+        <Tip>
+
+        If the `encoded_inputs` passed are dictionary of numpy arrays, PyTorch tensors or TensorFlow tensors, the
+        result will use the same type unless you provide a different tensor type with `return_tensors`. In the case of
+        PyTorch tensors, you will lose the specific device of your tensors however.
+
+        </Tip>

         Args:
-            encoded_inputs ([`BatchEncoding`], list of [`BatchEncoding`], `dict[str, list[int]]`, `dict[str, list[list[int]]]` or `list[dict[str, list[int]]]`):
-                Tokenized inputs. Can be a single batch encoding, a list of batch encodings, a dictionary of entries
-                produced by a `tokenizer.encode_plus` or a list of dictionaries from a `tokenizer.batch_encode_plus`.
+            encoded_inputs ([`BatchEncoding`], list of [`BatchEncoding`], `dict[str, list[int]]`, `dict[str, list[list[int]]` or `list[dict[str, list[int]]]`):
+                Tokenized inputs. Can represent one input ([`BatchEncoding`] or `dict[str, list[int]]`) or a batch of
+                tokenized inputs (list of [`BatchEncoding`], *dict[str, list[list[int]]]* or *list[dict[str,
+                list[int]]]*) so you can use this method during preprocessing as well as in a PyTorch Dataloader
+                collate function.
+
+                Instead of `list[int]` you can have tensors (numpy arrays, PyTorch tensors or TensorFlow tensors), see
+                the note above for the return type.
             padding (`bool`, `str` or [`~utils.PaddingStrategy`], *optional*, defaults to `True`):
                 Select a strategy to pad the returned sequences (according to the model's padding side and padding
                 index) among:

-                - `True` or `'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
+                - `True` or `'longest'` (default): Pad to the longest sequence in the batch (or no padding if only a single
                   sequence if provided).
                 - `'max_length'`: Pad to a maximum length specified with the argument `max_length` or to the maximum
                   acceptable input length for the model if that argument is not provided.
-                - `False` or `'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of different
+                - `False` or `'do_not_pad'`: No padding (i.e., can output a batch with sequences of different
                   lengths).
             max_length (`int`, *optional*):
                 Maximum length of the returned list and optionally padding length (see above).
@@ -3292,18 +3311,19 @@ def pad(
                 This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability
                 `>= 7.5` (Volta).
             padding_side (`str`, *optional*):
-                'right' or 'left'. If not set, the `tokenizer.padding_side` will be used.
+                The side on which the model should have padding applied. Should be selected between ['right', 'left'].
+                Default value is picked from the class attribute of the same name.
             return_attention_mask (`bool`, *optional*):
                 Whether to return the attention mask. If left to the default, will return the attention mask according
-                to the specific tokenizer's default.
+                to the specific tokenizer's default, defined by the `return_outputs` attribute.

                 [What are attention masks?](../glossary#attention-mask)
             return_tensors (`str` or [`~utils.TensorType`], *optional*):
                 If set, will return tensors instead of list of python integers. Acceptable values are:

-                - `'tf'`: Return TensorFlow `tf.Tensor` objects.
+                - `'tf'`: Return TensorFlow `tf.constant` objects.
                 - `'pt'`: Return PyTorch `torch.Tensor` objects.
-                - `'np'`: Return NumPy `np.ndarray` objects.
+                - `'np'`: Return Numpy `np.ndarray` objects.
             verbose (`bool`, *optional*, defaults to `True`):
                 Whether or not to print more information and warnings.
         """
@@ -3724,65 +3744,30 @@ def _pad(
         return_attention_mask: Optional[bool] = None,
     ) -> dict:
         """
-        Pad a single encoded input or a batch of encoded inputs up to predefined length or to the max sequence length
-        in the batch.
-
-        Padding side (left/right) padding token ids are defined at the tokenizer level (with `self.padding_side`,
-        `self.pad_token_id` and `self.pad_token_type_id`).
-
-        Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the
-        text followed by a call to the `pad` method to get a padded encoding.
-
-        <Tip>
-
-        If the `encoded_inputs` passed are dictionary of numpy arrays, PyTorch tensors or TensorFlow tensors, the
-        result will use the same type unless you provide a different tensor type with `return_tensors`. In the case of
-        PyTorch tensors, you will lose the specific device of your tensors however.
-
-        </Tip>
+        Pad encoded inputs (on left/right and up to predefined length or max length in the batch)

         Args:
-            encoded_inputs ([`BatchEncoding`], list of [`BatchEncoding`], `dict[str, list[int]]`, `dict[str, list[list[int]]` or `list[dict[str, list[int]]]`):
-                Tokenized inputs. Can represent one input ([`BatchEncoding`] or `dict[str, list[int]]`) or a batch of
-                tokenized inputs (list of [`BatchEncoding`], *dict[str, list[list[int]]]* or *list[dict[str,
-                list[int]]]*) so you can use this method during preprocessing as well as in a PyTorch Dataloader
-                collate function.
-
-                Instead of `list[int]` you can have tensors (numpy arrays, PyTorch tensors or TensorFlow tensors), see
-                the note above for the return type.
-            padding (`bool`, `str` or [`~utils.PaddingStrategy`], *optional*, defaults to `True`):
-                Select a strategy to pad the returned sequences (according to the model's padding side and padding
-                index) among:
-
-                - `True` or `'longest'` (default): Pad to the longest sequence in the batch (or no padding if only a single
-                  sequence if provided).
-                - `'max_length'`: Pad to a maximum length specified with the argument `max_length` or to the maximum
-                  acceptable input length for the model if that argument is not provided.
-                - `False` or `'do_not_pad'`: No padding (i.e., can output a batch with sequences of different
-                  lengths).
-            max_length (`int`, *optional*):
-                Maximum length of the returned list and optionally padding length (see above).
-            pad_to_multiple_of (`int`, *optional*):
-                If set will pad the sequence to a multiple of the provided value.
-
-                This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability
+            encoded_inputs:
+                Dictionary of tokenized inputs (`list[int]`) or batch of tokenized inputs (`list[list[int]]`).
+            max_length: maximum length of the returned list and optionally padding length (see below).
+                Will truncate by taking into account the special tokens.
+            padding_strategy: PaddingStrategy to use for padding.
+
+                - PaddingStrategy.LONGEST Pad to the longest sequence in the batch
+                - PaddingStrategy.MAX_LENGTH: Pad to the max length (default)
+                - PaddingStrategy.DO_NOT_PAD: Do not pad
+                The tokenizer padding sides are defined in `padding_side` argument:
+
+                - 'left': pads on the left of the sequences
+                - 'right': pads on the right of the sequences
+            pad_to_multiple_of: (optional) Integer if set will pad the sequence to a multiple of the provided value.
+                This is especially useful to enable the use of Tensor Core on NVIDIA hardware with compute capability
                 `>= 7.5` (Volta).
-            padding_side (`str`, *optional*):
+            padding_side:
                 The side on which the model should have padding applied. Should be selected between ['right', 'left'].
                 Default value is picked from the class attribute of the same name.
-            return_attention_mask (`bool`, *optional*):
-                Whether to return the attention mask. If left to the default, will return the attention mask according
-                to the specific tokenizer's default, defined by the `return_outputs` attribute.
-
-                [What are attention masks?](../glossary#attention-mask)
-            return_tensors (`str` or [`~utils.TensorType`], *optional*):
-                If set, will return tensors instead of list of python integers. Acceptable values are:
-
-                - `'tf'`: Return TensorFlow `tf.constant` objects.
-                - `'pt'`: Return PyTorch `torch.Tensor` objects.
-                - `'np'`: Return Numpy `np.ndarray` objects.
-            verbose (`bool`, *optional*, defaults to `True`):
-                Whether or not to print more information and warnings.
+            return_attention_mask:
+                (optional) Set to False to avoid returning attention mask (default: set to model specifics)
         """
         # Load from model defaults
         if return_attention_mask is None:
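
The `encoded_inputs` description in the `pad` docstring mentions using the method inside a PyTorch DataLoader collate function, with `_pad` doing the per-sequence work internally. A short sketch of that pattern follows; the texts and checkpoint are illustrative only, not taken from the commit.

    from torch.utils.data import DataLoader
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # example checkpoint

    # A toy map-style dataset: pre-tokenized examples of different lengths.
    examples = [tokenizer(t) for t in ["short text", "a noticeably longer piece of text", "mid-sized input"]]

    def collate_fn(features):
        # Pad the features of each batch to its longest sequence and stack them into tensors.
        return tokenizer.pad(features, padding="longest", return_tensors="pt")

    loader = DataLoader(examples, batch_size=2, collate_fn=collate_fn)
    for batch in loader:
        print(batch["input_ids"].shape)  # e.g. (2, longest_in_batch)
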
