Commit f98e973

feat: adds extractor for epub (#58)
1 parent 0cbb43b commit f98e973

11 files changed

+459
-281
lines changed

.devcontainer/Dockerfile

Lines changed: 3 additions & 3 deletions
@@ -4,7 +4,7 @@ ARG DEBIAN_FRONTEND=noninteractive
 ARG USER=vscode
 
 RUN DEBIAN_FRONTEND=noninteractive \
-    && apt-get update \
+    && apt-get update \
     && apt-get install -y build-essential --no-install-recommends make \
         ca-certificates \
         git \
@@ -27,7 +27,7 @@ RUN DEBIAN_FRONTEND=noninteractive \
 # Python and poetry installation
 USER $USER
 ARG HOME="/home/$USER"
-ARG PYTHON_VERSION=3.11
+ARG PYTHON_VERSION=3.13
 
 ENV PYENV_ROOT="${HOME}/.pyenv"
 ENV PATH="${PYENV_ROOT}/shims:${PYENV_ROOT}/bin:${HOME}/.local/bin:$PATH"
@@ -40,4 +40,4 @@ RUN echo "done 0" \
     && pyenv global ${PYTHON_VERSION} \
     && echo "done 3" \
     && curl -sSL https://install.python-poetry.org | python3 - \
-    && poetry config virtualenvs.in-project true
+    && poetry config virtualenvs.in-project true

libs/README.md

Lines changed: 3 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1,7 +1,7 @@
11
# RAG Core Libraries
22

33
This directory contains the core libraries of the STACKIT RAG template.
4-
These libraries provide comprehensive document extraction capabilities including support for files (PDF, DOCX, XML), web sources via sitemaps, and Confluence pages.
4+
These libraries provide comprehensive document extraction capabilities including support for files (PDF, DOCX, XML, EPUB), web sources via sitemaps, and Confluence pages.
55
It consists of the following python packages:
66

77
- [`1. Rag Core API`](#1-rag-core-api)
@@ -230,7 +230,8 @@ Technically, all parameters of the `SitemapLoader` from LangChain can be provide
230230
| pdf_extractor | [`extractor_api_lib.extractors.information_extractor.InformationExtractor`](./extractor-api-lib/src/extractor_api_lib/extractors/information_extractor.py) |[`extractor_api_lib.impl.extractors.file_extractors.pdf_extractor.PDFExtractor`](./extractor-api-lib/src/extractor_api_lib/impl/extractors/file_extractors/pdf_extractor.py) | Extractor used for extracting information from PDF documents. |
231231
| ms_docs_extractor | [`extractor_api_lib.extractors.information_extractor.InformationExtractor`](./extractor-api-lib/src/extractor_api_lib/extractors/information_extractor.py) |[`extractor_api_lib.extractors.file_extractors.ms_docs_extractor.MSDocsExtractor`](./extractor-api-lib/src/extractor_api_lib/impl/extractors/file_extractors/ms_docs_extractor.py) | Extractor used for extracting information from Microsoft Documents like *.docx, etc. |
232232
| xml_extractor | [`extractor_api_lib.extractors.information_extractor.InformationExtractor`](./extractor-api-lib/src/extractor_api_lib/extractors/information_extractor.py) | [`extractor_api_lib.extractors.file_extractors.xml_extractor.XMLExtractor`](./extractor-api-lib/src/extractor_api_lib/impl/extractors/file_extractors/xml_extractor.py) | Extractor used for extracting content from XML documents. |
233-
| all_extractors | `dependency_injector.providers.List[extractor_api_lib.extractors.information_extractor.InformationExtractor]` | `dependency_injector.providers.List(pdf_extractor, ms_docs_extractor, xml_extractor)` | List of all available extractors. If you add a new type of extractor you would have to add it to this list. |
233+
| epub_extractor | [`extractor_api_lib.extractors.information_file_extractor.InformationFileExtractor`](./extractor-api-lib/src/extractor_api_lib/extractors/information_file_extractor.py) | [`extractor_api_lib.impl.extractors.file_extractors.epub_extractor.EPUBExtractor`](./extractor-api-lib/src/extractor_api_lib/impl/extractors/file_extractors/epub_extractor.py) | Extractor used for extracting content from EPUB documents. |
234+
| file_extractors | `dependency_injector.providers.List[extractor_api_lib.extractors.information_file_extractor.InformationFileExtractor]` | `dependency_injector.providers.List(pdf_extractor, ms_docs_extractor, xml_extractor, epub_extractor)` | List of all available extractors. If you add a new type of extractor you would have to add it to this list. |
234235
| general_file_extractor | [`extractor_api_lib.api_endpoints.file_extractor.FileExtractor`](./extractor-api-lib/src/extractor_api_lib/api_endpoints/file_extractor.py) |[`extractor_api_lib.impl.api_endpoints.general_file_extractor.GeneralFileExtractor`](./extractor-api-lib/src/extractor_api_lib/impl/api_endpoints/general_file_extractor.py) | Combines multiple file extractors and decides which one to use for the given file format. |
235236
| general_source_extractor | [`extractor_api_lib.api_endpoints.source_extractor.SourceExtractor`](./extractor-api-lib/src/extractor_api_lib/api_endpoints/source_extractor.py) | [`extractor_api_lib.impl.api_endpoints.general_source_extractor.GeneralSourceExtractor`](./extractor-api-lib/src/extractor_api_lib/impl/api_endpoints/general_source_extractor.py) | Implementation of the `/extract_from_source` endpoint. Will decide the correct extractor for the source. |
236237
| confluence_extractor | [`extractor_api_lib.extractors.information_extractor.InformationExtractor`](./extractor-api-lib/src/extractor_api_lib/extractors/information_extractor.py) | [`extractor_api_lib.impl.extractors.confluence_extractor.ConfluenceExtractor`](./extractor-api-lib/src/extractor_api_lib/impl/extractors/confluence_extractor.py) | Implementation of an extractor for the source `confluence`. |
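
The `file_extractors` and `general_file_extractor` rows describe a dispatch pattern: every extractor declares which file types it supports, and the general extractor picks a matching one. A minimal standalone sketch of that idea follows, using only the standard library. `StubExtractor` and `select_extractor` are hypothetical names for illustration, and the real `GeneralFileExtractor` may select its extractor differently.

```python
from dataclasses import dataclass
from enum import Enum


class FileType(str, Enum):
    """Stand-in for the library's FileType enum (hypothetical subset)."""

    PDF = "PDF"
    XML = "XML"
    EPUB = "EPUB"


@dataclass(frozen=True)
class StubExtractor:
    """Hypothetical stand-in exposing only `compatible_file_types`."""

    name: str
    compatible_file_types: tuple[FileType, ...]


def select_extractor(extractors: list[StubExtractor], file_type: FileType) -> StubExtractor:
    """Return the first extractor that declares support for `file_type`."""
    for extractor in extractors:
        if file_type in extractor.compatible_file_types:
            return extractor
    raise ValueError(f"No extractor registered for {file_type.value}")


# Registering a new extractor amounts to appending it to this list.
file_extractors = [
    StubExtractor("pdf_extractor", (FileType.PDF,)),
    StubExtractor("xml_extractor", (FileType.XML,)),
    StubExtractor("epub_extractor", (FileType.EPUB,)),
]

chosen = select_extractor(file_extractors, FileType.EPUB)
print(chosen.name)
```

In this sketch, leaving an extractor out of the list makes `select_extractor` raise for its file type, which is the practical reason the table asks for every new extractor to be added to `file_extractors`.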

libs/extractor-api-lib/poetry.lock

Lines changed: 18 additions & 1 deletion

libs/extractor-api-lib/pyproject.toml

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -100,6 +100,7 @@ markdownify = "^1.1.0"
100100
langchain-core = "0.3.63"
101101
camelot-py = {extras = ["cv"], version = "^1.0.0"}
102102
fake-useragent = "^2.2.0"
103+
pypandoc-binary = "^1.15"
103104

104105
[tool.poetry.group.dev.dependencies]
105106
pytest = "^8.3.5"

libs/extractor-api-lib/src/extractor_api_lib/dependency_container.py

Lines changed: 24 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -3,12 +3,21 @@
33
from dependency_injector.containers import DeclarativeContainer
44
from dependency_injector.providers import Factory, List, Singleton # noqa: WOT001
55

6-
from extractor_api_lib.impl.api_endpoints.general_source_extractor import GeneralSourceExtractor
6+
from extractor_api_lib.impl.api_endpoints.general_file_extractor import (
7+
GeneralFileExtractor,
8+
)
9+
from extractor_api_lib.impl.api_endpoints.general_source_extractor import (
10+
GeneralSourceExtractor,
11+
)
712
from extractor_api_lib.impl.extractors.confluence_extractor import ConfluenceExtractor
8-
from extractor_api_lib.impl.extractors.file_extractors.ms_docs_extractor import MSDocsExtractor
13+
from extractor_api_lib.impl.extractors.file_extractors.epub_extractor import (
14+
EpubExtractor,
15+
)
16+
from extractor_api_lib.impl.extractors.file_extractors.ms_docs_extractor import (
17+
MSDocsExtractor,
18+
)
919
from extractor_api_lib.impl.extractors.file_extractors.pdf_extractor import PDFExtractor
1020
from extractor_api_lib.impl.extractors.file_extractors.xml_extractor import XMLExtractor
11-
from extractor_api_lib.impl.api_endpoints.general_file_extractor import GeneralFileExtractor
1221
from extractor_api_lib.impl.extractors.sitemap_extractor import SitemapExtractor
1322
from extractor_api_lib.impl.file_services.s3_service import S3Service
1423
from extractor_api_lib.impl.mapper.confluence_langchain_document2information_piece import (
@@ -17,7 +26,12 @@
1726
from extractor_api_lib.impl.mapper.internal2external_information_piece import (
1827
Internal2ExternalInformationPiece,
1928
)
20-
from extractor_api_lib.impl.mapper.sitemap_document2information_piece import SitemapLangchainDocument2InformationPiece
29+
from extractor_api_lib.impl.mapper.langchain_document2information_piece import (
30+
LangchainDocument2InformationPiece,
31+
)
32+
from extractor_api_lib.impl.mapper.sitemap_document2information_piece import (
33+
SitemapLangchainDocument2InformationPiece,
34+
)
2135
from extractor_api_lib.impl.settings.pdf_extractor_settings import PDFExtractorSettings
2236
from extractor_api_lib.impl.settings.s3_settings import S3Settings
2337
from extractor_api_lib.impl.table_converter.dataframe2markdown import DataFrame2Markdown
@@ -44,12 +58,15 @@ class DependencyContainer(DeclarativeContainer):
4458
xml_extractor = Singleton(XMLExtractor, file_service)
4559

4660
intern2external = Singleton(Internal2ExternalInformationPiece)
47-
langchain_document2information_piece = Singleton(ConfluenceLangchainDocument2InformationPiece)
61+
confluence_langchain_document2information_piece = Singleton(ConfluenceLangchainDocument2InformationPiece)
62+
langchain_document2information_piece = Singleton(LangchainDocument2InformationPiece)
4863
sitemap_document2information_piece = Singleton(SitemapLangchainDocument2InformationPiece)
49-
file_extractors = List(pdf_extractor, ms_docs_extractor, xml_extractor)
64+
epub_extractor = Singleton(EpubExtractor, file_service, langchain_document2information_piece)
65+
66+
file_extractors = List(pdf_extractor, ms_docs_extractor, xml_extractor, epub_extractor)
5067

5168
general_file_extractor = Singleton(GeneralFileExtractor, file_service, file_extractors, intern2external)
52-
confluence_extractor = Singleton(ConfluenceExtractor, mapper=langchain_document2information_piece)
69+
confluence_extractor = Singleton(ConfluenceExtractor, mapper=confluence_langchain_document2information_piece)
5370

5471
sitemap_extractor = Singleton(
5572
SitemapExtractor,
Lines changed: 73 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,73 @@
1+
"""Module containing the EpubExtractor class."""
2+
3+
import logging
4+
from pathlib import Path
5+
6+
from langchain_community.document_loaders import UnstructuredEPubLoader
7+
8+
from extractor_api_lib.extractors.information_file_extractor import (
9+
InformationFileExtractor,
10+
)
11+
from extractor_api_lib.file_services.file_service import FileService
12+
from extractor_api_lib.impl.mapper.langchain_document2information_piece import (
13+
LangchainDocument2InformationPiece,
14+
)
15+
from extractor_api_lib.impl.types.file_type import FileType
16+
from extractor_api_lib.models.dataclasses.internal_information_piece import (
17+
InternalInformationPiece,
18+
)
19+
20+
logger = logging.getLogger(__name__)
21+
22+
23+
class EpubExtractor(InformationFileExtractor):
24+
"""Extractor for Epub documents using unstructured library."""
25+
26+
def __init__(
27+
self,
28+
file_service: FileService,
29+
mapper: LangchainDocument2InformationPiece,
30+
):
31+
"""Initialize the EpubExtractor.
32+
33+
Parameters
34+
----------
35+
file_service : FileService
36+
Handler for downloading the file to extract content from and upload results to if required.
37+
mapper : LangchainDocument2InformationPiece
38+
An instance of LangchainDocument2InformationPiece used for mapping langchain documents
39+
to information pieces.
40+
"""
41+
super().__init__(file_service=file_service)
42+
self._mapper = mapper
43+
44+
@property
45+
def compatible_file_types(self) -> list[FileType]:
46+
"""
47+
List of compatible file types for the EPUB extractor.
48+
49+
Returns
50+
-------
51+
list[FileType]
52+
A list containing the compatible file types, which in this case is EPUB.
53+
"""
54+
return [FileType.EPUB]
55+
56+
async def aextract_content(self, file_path: Path, name: str) -> list[InternalInformationPiece]:
57+
"""
58+
Extract content from an epub file and processes the elements.
59+
60+
Parameters
61+
----------
62+
file_path : Path
63+
The path to the epub file to be processed.
64+
name : str
65+
Name of the document.
66+
67+
Returns
68+
-------
69+
list[InformationPiece]
70+
A list of processed information pieces extracted from the epub file.
71+
"""
72+
elements = UnstructuredEPubLoader(file_path.as_posix()).load()
73+
return [self._mapper.map_document2informationpiece(document=x, document_name=name) for x in elements]
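
`aextract_content` is declared `async`, but the `load()` call inside it is synchronous, so a large EPUB would occupy the event loop while it is parsed. If that ever matters, one option is to hand the blocking call to a worker thread with `asyncio.to_thread`. The sketch below shows the pattern with a hypothetical `SlowLoader` standing in for `UnstructuredEPubLoader`; it is not part of this commit, and `aextract` and `main` are illustrative names.

```python
import asyncio
import time


class SlowLoader:
    """Hypothetical stand-in for a loader with a blocking load() method."""

    def __init__(self, path: str):
        self._path = path

    def load(self) -> list[str]:
        time.sleep(0.05)  # simulate blocking parse work
        return [f"content of {self._path}"]


async def aextract(path: str) -> list[str]:
    # Run the blocking load() in a worker thread so the event loop stays free.
    return await asyncio.to_thread(SlowLoader(path).load)


async def main() -> list[list[str]]:
    # The two extractions can overlap instead of running back to back.
    return await asyncio.gather(aextract("a.epub"), aextract("b.epub"))


results = asyncio.run(main())
print(results)
```

Whether this is worth doing depends on how the service schedules extraction requests; for small files the direct call in the commit is simpler.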
Lines changed: 12 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,12 @@
1+
"""Module for the LangchainDocument2InformationPiece class."""
2+
3+
from extractor_api_lib.mapper.source_langchain_document2information_piece import (
4+
SourceLangchainDocument2InformationPiece,
5+
)
6+
7+
8+
class LangchainDocument2InformationPiece(SourceLangchainDocument2InformationPiece):
9+
"""A class to map a LangchainDocument to an InformationPiece."""
10+
11+
def _map_meta(self, internal: dict, document_name: str) -> dict:
12+
return internal
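
`LangchainDocument2InformationPiece` overrides only `_map_meta` and returns the metadata unchanged. The base class `SourceLangchainDocument2InformationPiece` is not shown in this diff, so the sketch below assumes a template-method design in which the base class calls `_map_meta` while building each piece. `BaseMapper`, `PassThroughMapper`, `NamingMapper`, and the dictionary shape are all hypothetical.

```python
from abc import ABC, abstractmethod


class BaseMapper(ABC):
    """Hypothetical base: builds a piece and delegates metadata handling."""

    def map_document(self, page_content: str, metadata: dict, document_name: str) -> dict:
        # Copy first so that subclasses cannot mutate the caller's dictionary.
        meta = self._map_meta(dict(metadata), document_name)
        return {"page_content": page_content, "metadata": meta}

    @abstractmethod
    def _map_meta(self, internal: dict, document_name: str) -> dict:
        """Adjust metadata for a specific source type."""


class PassThroughMapper(BaseMapper):
    """Mirrors the commit's mapper: metadata is returned as-is."""

    def _map_meta(self, internal: dict, document_name: str) -> dict:
        return internal


class NamingMapper(BaseMapper):
    """Contrast case: a mapper that records the document name."""

    def _map_meta(self, internal: dict, document_name: str) -> dict:
        internal["document"] = document_name
        return internal


meta = {"source": "LoremIpsum.epub"}
plain = PassThroughMapper().map_document("Lorem ipsum", meta, "LoremIpsum.epub")
named = NamingMapper().map_document("Lorem ipsum", meta, "LoremIpsum.epub")
print(plain["metadata"])
print(named["metadata"])
```

Under this assumption, the pass-through variant is the natural choice for file loaders whose metadata needs no source-specific rewriting.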

libs/extractor-api-lib/src/extractor_api_lib/impl/types/file_type.py

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -11,3 +11,4 @@ class FileType(StrEnum):
1111
DOCX = "DOCX"
1212
PPTX = "PPTX"
1313
XML = "XML"
14+
EPUB = "EPUB"
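
With `EPUB` added to `FileType`, a file with the `.epub` suffix can be resolved to an enum member. How the library derives the type from an uploaded file is not shown in this diff; the sketch below assumes a simple suffix lookup, and `file_type_from_path` is a hypothetical helper. It uses `Enum` with a `str` mixin rather than `StrEnum` so that it also runs on Python versions before 3.11.

```python
from enum import Enum
from pathlib import Path
from typing import Optional


class FileType(str, Enum):
    """Stand-in listing only the members visible in this hunk."""

    DOCX = "DOCX"
    PPTX = "PPTX"
    XML = "XML"
    EPUB = "EPUB"


def file_type_from_path(path: Path) -> Optional[FileType]:
    """Map a file suffix to a FileType member, or None if unsupported."""
    suffix = path.suffix.lstrip(".").upper()
    try:
        return FileType(suffix)  # lookup by enum value
    except ValueError:
        return None


print(file_type_from_path(Path("LoremIpsum.epub")))
print(file_type_from_path(Path("notes.txt")))
```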
Lines changed: 50 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,50 @@
1+
"""Comprehensive test suite for SitemapExtractor class."""
2+
3+
from pathlib import Path
4+
5+
import pytest
6+
7+
from extractor_api_lib.impl.extractors.file_extractors.epub_extractor import (
8+
EpubExtractor,
9+
)
10+
from extractor_api_lib.impl.mapper.langchain_document2information_piece import (
11+
LangchainDocument2InformationPiece,
12+
)
13+
from extractor_api_lib.impl.types.file_type import FileType
14+
from extractor_api_lib.models.content_type import ContentType
15+
16+
17+
class TestEpubExtractor:
18+
"""Test class for EpubExtractor."""
19+
20+
@pytest.fixture
21+
def mapper(self) -> LangchainDocument2InformationPiece:
22+
return LangchainDocument2InformationPiece()
23+
24+
@pytest.fixture
25+
def epub_extractor(self, mock_file_service, mapper):
26+
"""Create a EpubExtractor instance for testing."""
27+
return EpubExtractor(file_service=mock_file_service, mapper=mapper)
28+
29+
def test_init(self, mock_file_service, mapper):
30+
"""Test EpubExtractor initialization."""
31+
extractor = EpubExtractor(file_service=mock_file_service, mapper=mapper)
32+
assert extractor._mapper == mapper
33+
assert extractor._file_service == mock_file_service
34+
35+
def test_file_type(self, epub_extractor):
36+
"""Test that extractor_type returns EPUB."""
37+
assert epub_extractor.compatible_file_types == [FileType.EPUB]
38+
39+
@pytest.mark.asyncio
40+
async def test_extract_content_success(self, epub_extractor):
41+
page_content = "Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam"
42+
43+
test_data_dir = Path(__file__).parent / "test_data"
44+
45+
file_path = test_data_dir / "LoremIpsum.epub"
46+
result = await epub_extractor.aextract_content(file_path, file_path.name)
47+
48+
assert len(result) == 1
49+
assert result[0].type == ContentType.TEXT
50+
assert result[0].page_content == page_content
2.26 KB
Binary file not shown.
