Source Resegmenter

source_resegmenter is a Python library for re-segmenting a text into lines in a way that it matches a reference text in another language.

The repository is tested with Python 3.11. It may also work with other Python versions, but compatibility with them is not guaranteed. See the Usage section for instructions on how to use the library and the Installation section for details on how to install it.

Installation

You can install the latest stable version from PyPI:

pip install source_resegmenter

Or, to install from source:

git clone https://github.com/hlt-mt/source_resegmenter.git
cd source_resegmenter
pip install .

For development (with docs and testing tools):

pip install -e ".[dev]"

Usage

This library expects three plain-text (.txt) files:

  1. The source text to be re-segmented, whose division into lines will be adjusted to match that of the reference file;
  2. The reference text, whose line segmentation the source text has to be aligned to;
  3. A backtranslation of the reference text into the source language, aligned line by line with the reference text.
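Since the backtranslation has to be aligned at the line level with the reference, it can be worth checking that the two files have the same number of lines before running the tool. The following is a minimal sketch of such a check, not part of the library: the function names `count_lines` and `check_line_alignment` are hypothetical, and UTF-8 encoding is an assumption.

```python
from pathlib import Path


def count_lines(path):
    """Return the number of lines in a text file (UTF-8 is assumed)."""
    with Path(path).open(encoding="utf-8") as f:
        return sum(1 for _ in f)


def check_line_alignment(reference_path, backtranslation_path):
    """Raise ValueError if the two files differ in line count.

    Hypothetical helper: equal line counts are a necessary condition for
    line-level alignment, not a proof that the lines actually correspond.
    """
    n_ref = count_lines(reference_path)
    n_bt = count_lines(backtranslation_path)
    if n_ref != n_bt:
        raise ValueError(
            f"reference has {n_ref} lines but backtranslation has {n_bt}"
        )
    return n_ref
```

For example, `check_line_alignment("audio_1_ref.it", "mt_audio_1_ref.en")` returns the shared line count or raises if the files are out of step.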

Once these three txt files are available, the tool can be run from the command line as:

source_resegmenter --source-texts asr_audio_1.en --reference-texts audio_1_ref.it \
    --backtranslation-texts mt_audio_1_ref.en --output resegm_audio_1.en
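When several recordings have to be processed, the same command can be assembled programmatically, one invocation per recording. The sketch below only uses the four flags shown in the command above. The helper `build_command`, its parameters, and the file-naming pattern (copied from the example file names) are assumptions made for illustration.

```python
import shlex


def build_command(audio_id, src_lang="en", ref_lang="it"):
    """Return the CLI argument list for one recording.

    Hypothetical helper: file names follow the pattern of the example
    command (asr_<id>.<src>, <id>_ref.<ref>, mt_<id>_ref.<src>).
    """
    return [
        "source_resegmenter",
        "--source-texts", f"asr_{audio_id}.{src_lang}",
        "--reference-texts", f"{audio_id}_ref.{ref_lang}",
        "--backtranslation-texts", f"mt_{audio_id}_ref.{src_lang}",
        "--output", f"resegm_{audio_id}.{src_lang}",
    ]


if __name__ == "__main__":
    for audio_id in ["audio_1", "audio_2"]:
        cmd = build_command(audio_id)
        # Print the command; to execute it, pass cmd to
        # subprocess.run(cmd, check=True) with the package installed.
        print(shlex.join(cmd))
```

Passing the command as a list to `subprocess.run` avoids shell quoting issues if file names contain spaces.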

Contributing

Contributions from interested researchers and developers are greatly appreciated.

You can open an issue to report problems with the code, ask questions, or request features. Pull requests that address any issue are also very welcome.

License

source_resegmenter is licensed under the Apache License, Version 2.0.

Credits

If you use this library, please cite:

@inproceedings{cettolo-et-al-2025-xlr-segmenter,
    title={{How to Evaluate Speech Translation with Source-Aware Neural MT Metrics}},
    author={Cettolo, Mauro and Gaido, Marco and Negri, Matteo and Papi, Sara and Bentivogli, Luisa},
    booktitle = "",
    address = "",
    year={2025}
}
