@@ -3267,22 +3267,41 @@ def pad(
verbose: bool = True,
) -> BatchEncoding:
"""
- Pad a single encoded input or a batch of encoded inputs up to the maximum length of the batch or up to a
- given maximum length. Padding side can be specified on the left or on the right.
+ Pad a single encoded input or a batch of encoded inputs up to a predefined length or to the max sequence length
+ in the batch.
+
+ The padding side (left/right) and the padding token ids are defined at the tokenizer level (with `self.padding_side`,
+ `self.pad_token_id` and `self.pad_token_type_id`).
+
+ Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the
+ text followed by a call to the `pad` method to get a padded encoding.
+
+ <Tip>
+
+ If the `encoded_inputs` passed are dictionaries of numpy arrays, PyTorch tensors or TensorFlow tensors, the
+ result will use the same type unless you provide a different tensor type with `return_tensors`. In the case of
+ PyTorch tensors, however, you will lose the specific device of your tensors.
+
+ </Tip>

Args:
- encoded_inputs ([`BatchEncoding`], list of [`BatchEncoding`], `dict[str, list[int]]`, `dict[str, list[list[int]]]` or `list[dict[str, list[int]]]`):
- Tokenized inputs. Can be a single batch encoding, a list of batch encodings, a dictionary of entries
- produced by `tokenizer.encode_plus` or a list of dictionaries from `tokenizer.batch_encode_plus`.
+ encoded_inputs ([`BatchEncoding`], list of [`BatchEncoding`], `dict[str, list[int]]`, `dict[str, list[list[int]]]` or `list[dict[str, list[int]]]`):
+ Tokenized inputs. Can represent one input ([`BatchEncoding`] or `dict[str, list[int]]`) or a batch of
+ tokenized inputs (list of [`BatchEncoding`], *dict[str, list[list[int]]]* or *list[dict[str,
+ list[int]]]*), so you can use this method during preprocessing as well as in a PyTorch DataLoader
+ collate function.
+
+ Instead of `list[int]` you can have tensors (numpy arrays, PyTorch tensors or TensorFlow tensors), see
+ the note above for the return type.
padding (`bool`, `str` or [`~utils.PaddingStrategy`], *optional*, defaults to `True`):
Select a strategy to pad the returned sequences (according to the model's padding side and padding
index) among:

- - `True` or `'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
+ - `True` or `'longest'` (default): Pad to the longest sequence in the batch (or no padding if only a single
sequence is provided).
- `'max_length'`: Pad to a maximum length specified with the argument `max_length` or to the maximum
acceptable input length for the model if that argument is not provided.
- - `False` or `'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of different
+ - `False` or `'do_not_pad'`: No padding (i.e., can output a batch with sequences of different
lengths).
max_length (`int`, *optional*):
Maximum length of the returned list and optionally padding length (see above).
@@ -3292,18 +3311,19 @@ def pad(
This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability
`>= 7.5` (Volta).
padding_side (`str`, *optional*):
- 'right' or 'left'. If not given, the `tokenizer.padding_side` will be used.
+ The side on which the model should have padding applied. Should be selected between ['right', 'left'].
+ Default value is picked from the class attribute of the same name.
return_attention_mask (`bool`, *optional*):
Whether to return the attention mask. If left to the default, will return the attention mask according
- to the specific tokenizer's default.
+ to the specific tokenizer's default, defined by the `return_outputs` attribute.

[What are attention masks?](../glossary#attention-mask)
return_tensors (`str` or [`~utils.TensorType`], *optional*):
If set, will return tensors instead of list of python integers. Acceptable values are:

- - `'tf'`: Return TensorFlow `tf.Tensor` objects.
+ - `'tf'`: Return TensorFlow `tf.constant` objects.
- `'pt'`: Return PyTorch `torch.Tensor` objects.
- - `'np'`: Return NumPy `np.ndarray` objects.
+ - `'np'`: Return Numpy `np.ndarray` objects.
verbose (`bool`, *optional*, defaults to `True`):
Whether or not to print more information and warnings.
"""
@@ -3724,65 +3744,30 @@ def _pad(
return_attention_mask: Optional[bool] = None,
) -> dict:
"""
- Pad a single encoded input or a batch of encoded inputs up to a predefined length or to the max sequence length
- in the batch.
-
- The padding side (left/right) and the padding token ids are defined at the tokenizer level (with `self.padding_side`,
- `self.pad_token_id` and `self.pad_token_type_id`).
-
- Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the
- text followed by a call to the `pad` method to get a padded encoding.
-
- <Tip>
-
- If the `encoded_inputs` passed are dictionaries of numpy arrays, PyTorch tensors or TensorFlow tensors, the
- result will use the same type unless you provide a different tensor type with `return_tensors`. In the case of
- PyTorch tensors, however, you will lose the specific device of your tensors.
-
- </Tip>
+ Pad encoded inputs (on the left/right and up to a predefined length or the max length in the batch).

Args:
- encoded_inputs ([`BatchEncoding`], list of [`BatchEncoding`], `dict[str, list[int]]`, `dict[str, list[list[int]]]` or `list[dict[str, list[int]]]`):
- Tokenized inputs. Can represent one input ([`BatchEncoding`] or `dict[str, list[int]]`) or a batch of
- tokenized inputs (list of [`BatchEncoding`], *dict[str, list[list[int]]]* or *list[dict[str,
- list[int]]]*), so you can use this method during preprocessing as well as in a PyTorch DataLoader
- collate function.
-
- Instead of `list[int]` you can have tensors (numpy arrays, PyTorch tensors or TensorFlow tensors), see
- the note above for the return type.
- padding (`bool`, `str` or [`~utils.PaddingStrategy`], *optional*, defaults to `True`):
- Select a strategy to pad the returned sequences (according to the model's padding side and padding
- index) among:
-
- - `True` or `'longest'` (default): Pad to the longest sequence in the batch (or no padding if only a single
- sequence is provided).
- - `'max_length'`: Pad to a maximum length specified with the argument `max_length` or to the maximum
- acceptable input length for the model if that argument is not provided.
- - `False` or `'do_not_pad'`: No padding (i.e., can output a batch with sequences of different
- lengths).
- max_length (`int`, *optional*):
- Maximum length of the returned list and optionally padding length (see above).
- pad_to_multiple_of (`int`, *optional*):
- If set, will pad the sequence to a multiple of the provided value.
-
- This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability
+ encoded_inputs:
+ Dictionary of tokenized inputs (`list[int]`) or batch of tokenized inputs (`list[list[int]]`).
+ max_length: Maximum length of the returned list and optionally padding length (see below).
+ Will truncate by taking into account the special tokens.
+ padding_strategy: PaddingStrategy to use for padding.
+
+ - PaddingStrategy.LONGEST: Pad to the longest sequence in the batch
+ - PaddingStrategy.MAX_LENGTH: Pad to the max length (default)
+ - PaddingStrategy.DO_NOT_PAD: Do not pad
+ The tokenizer padding side is defined by the `padding_side` argument:
+
+ - 'left': pads on the left of the sequences
+ - 'right': pads on the right of the sequences
+ pad_to_multiple_of: (optional) Integer; if set, will pad the sequence to a multiple of the provided value.
+ This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability
`>= 7.5` (Volta).
- padding_side (`str`, *optional*):
+ padding_side:
The side on which the model should have padding applied. Should be selected between ['right', 'left'].
Default value is picked from the class attribute of the same name.
- return_attention_mask (`bool`, *optional*):
- Whether to return the attention mask. If left to the default, will return the attention mask according
- to the specific tokenizer's default, defined by the `return_outputs` attribute.
-
- [What are attention masks?](../glossary#attention-mask)
- return_tensors (`str` or [`~utils.TensorType`], *optional*):
- If set, will return tensors instead of list of python integers. Acceptable values are:
-
- - `'tf'`: Return TensorFlow `tf.constant` objects.
- - `'pt'`: Return PyTorch `torch.Tensor` objects.
- - `'np'`: Return Numpy `np.ndarray` objects.
- verbose (`bool`, *optional*, defaults to `True`):
- Whether or not to print more information and warnings.
+ return_attention_mask:
+ (optional) Set to False to avoid returning attention mask (default: set to model specifics).
"""
# Load from model defaults
if return_attention_mask is None:
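
For illustration only, here is a simplified sketch of the padding behaviour the `_pad` docstring describes; it is not the library's implementation. It pads a single dict of `input_ids` on the chosen side up to a target length, optionally rounded up to a multiple of `pad_to_multiple_of`, and builds the matching attention mask.

```python
# Simplified, self-contained sketch (not the actual `_pad` implementation).
def pad_one(encoded, max_length, pad_token_id, padding_side="right", pad_to_multiple_of=None):
    input_ids = encoded["input_ids"]

    # Round the target length up to a multiple of `pad_to_multiple_of` if requested.
    if pad_to_multiple_of is not None and max_length % pad_to_multiple_of != 0:
        max_length = ((max_length // pad_to_multiple_of) + 1) * pad_to_multiple_of

    difference = max_length - len(input_ids)
    if difference <= 0:
        # Nothing to pad: every position is a real token.
        return {"input_ids": input_ids, "attention_mask": [1] * len(input_ids)}

    if padding_side == "right":
        return {
            "input_ids": input_ids + [pad_token_id] * difference,
            "attention_mask": [1] * len(input_ids) + [0] * difference,
        }
    # padding_side == "left"
    return {
        "input_ids": [pad_token_id] * difference + input_ids,
        "attention_mask": [0] * difference + [1] * len(input_ids),
    }


print(pad_one({"input_ids": [101, 2023, 102]}, max_length=6, pad_token_id=0))
# {'input_ids': [101, 2023, 102, 0, 0, 0], 'attention_mask': [1, 1, 1, 0, 0, 0]}
```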