docs: fix broken links (#66)

a-klos · web-flow · commit 0cbb43b6efed · 2025-07-31T10:03:19.000+02:00
This pull request primarily focuses on updating documentation and
correcting file paths to improve clarity and maintain consistency within
the project's `README.md` and `libs/README.md` files. The changes ensure
that the documentation reflects the correct structure and paths for
various components and services.

### Documentation Updates:

* **Corrected URL in `README.md`:** Updated the Tilt guide URL to
include the `.html` file extension for proper navigation. (`README.md`,
[README.mdL41-R41](diffhunk://#diff-b335630551682c19a781afebcf4d07bf978fb1f8ac04c6bf87428ed5106870f5L41-R41))

* **Updated paths in `libs/README.md`:**
- Fixed file paths for `chunker` and `document_extractor` to point to
the correct modules. (`libs/README.md`,
[libs/README.mdL154-R158](diffhunk://#diff-34194a117b05d75d22ca968cdb7d540839dc7a0eb33960fbca668b5a6ade87cbL154-R158))
- Adjusted paths for extractors, converters, and related components
under the "Replaceable parts" section to reflect the updated directory
structure. (`libs/README.md`,
[libs/README.mdR226-R238](diffhunk://#diff-34194a117b05d75d22ca968cdb7d540839dc7a0eb33960fbca668b5a6ade87cbR226-R238))
diff --git a/README.md b/README.md
@@ -38,7 +38,7 @@ The template supports multiple LLM (Large Language Model) providers, such as STA
 
 
 ## 1. Getting Started
-A [`Tiltfile`](./Tiltfile) is provided to get you started :rocket:. If Tilt is new for you, and you want to learn more about it, please take a look at the [Tilt guides](https://docs.tilt.dev/tiltfile_authoring).
+A [`Tiltfile`](./Tiltfile) is provided to get you started :rocket:. If Tilt is new for you, and you want to learn more about it, please take a look at the [Tilt guides](https://docs.tilt.dev/tiltfile_authoring.html).
 
 ### 1.1 Components
 
diff --git a/libs/README.md b/libs/README.md
@@ -151,11 +151,11 @@ The extracted information will be summarized using LLM. The summary, as well as
 
 | Name | Type | Default | Notes |
 |----------|---------|--------------|--------------|
-| file_service | [`admin_api_lib.file_services.file_service.FileService`](./admin-api-lib/src/admin_api_lib/file_services/file_service.py) | [`admin_api_lib.impl.file_services.s3_service.S3Service`](./admin_api_lib/src/admin_api_lib/impl/file_services/s3_service.py) | Handles operations on the connected storage. |
+| file_service | [`admin_api_lib.file_services.file_service.FileService`](./admin-api-lib/src/admin_api_lib/file_services/file_service.py) | [`admin_api_lib.impl.file_services.s3_service.S3Service`](./admin-api-lib/src/admin_api_lib/impl/file_services/s3_service.py) | Handles operations on the connected storage. |
 | large_language_model | `langchain_core.language_models.llms.BaseLLM` | `langchain_community.llms.vllm.VLLMOpenAI` or `langchain_community.llms.Ollama` | The LLm that is used for all LLM tasks. The default depends on the value of `rag_core_lib.impl.settings.rag_class_types_settings.RAGClassTypeSettings.llm_type` |
 | key_value_store | [`admin_api_lib.impl.key_db.file_status_key_value_store.FileStatusKeyValueStore`](./admin-api-lib/src/admin_api_lib/impl/key_db/file_status_key_value_store.py) | [`admin_api_lib.impl.key_db.file_status_key_value_store.FileStatusKeyValueStore`](./admin-api-lib/src/admin_api_lib/impl/key_db/file_status_key_value_store.py) | Is used for storing the available sources and their current state. |
-| chunker |  [`admin_api_lib.impl.chunker.chunker.Chunker`](./admin-api-lib/src/admin_api_lib/impl/chunker/chunker.py) | [`admin_api_lib.impl.chunker.text_chunker.TextChunker`](./admin-api-lib/src/admin_api_lib/impl/chunker/text_chunker.py) | Used for splitting the documents in managable chunks. |
-| document_extractor | [`admin_api_lib.extractor_api_client.openapi_client.api.extractor_api.ExtractorApi`](./admin-api-lib/src/admin_api_lib/extractor_api_client/openapi_client/api/extractor_api.py) | [`admin_api_lib.extractor_api_client.openapi_client.api.extractor_api.ExtractorApi`](./admin-api-lib/src/admin_api_lib.extractor_api_client/openapi_client/api/extractor_api.py) | Needs to be replaced if adjustments to the `extractor-api` is made. |
+| chunker |  [`admin_api_lib.chunker.chunker.Chunker`](./admin-api-lib/src/admin_api_lib/chunker/chunker.py) | [`admin_api_lib.impl.chunker.text_chunker.TextChunker`](./admin-api-lib/src/admin_api_lib/impl/chunker/text_chunker.py) | Used for splitting the documents in managable chunks. |
+| document_extractor | [`admin_api_lib.extractor_api_client.openapi_client.api.extractor_api.ExtractorApi`](./admin-api-lib/src/admin_api_lib/extractor_api_client/openapi_client/api/extractor_api.py) | [`admin_api_lib.extractor_api_client.openapi_client.api.extractor_api.ExtractorApi`](./admin-api-lib/src/admin_api_lib/extractor_api_client/openapi_client/api/extractor_api.py) | Needs to be replaced if adjustments to the `extractor-api` is made. |
 | rag_api | [`admin_api_lib.rag_backend_client.openapi_client.api.rag_api.RagApi`](./admin-api-lib/src/admin_api_lib/rag_backend_client/openapi_client/api/rag_api.py) | [`admin_api_lib.rag_backend_client.openapi_client.api.rag_api.RagApi`](./admin-api-lib/src/admin_api_lib/rag_backend_client/openapi_client/api/rag_api.py) | Needs to be replaced if changes to the `/information_pieces/remove` or `/information_pieces/upload` of the [`rag-core-api`](#rag-core-api) are made. |
 | summarizer_prompt | `str` | [`admin_api_lib.prompt_templates.summarize_prompt.SUMMARIZE_PROMPT`](./admin-api-lib/src/admin_api_lib/prompt_templates/summarize_prompt.py) | The prompt used of the summarization. |
 | langfuse_manager | [`rag_core_lib.impl.langfuse_manager.langfuse_manager.LangfuseManager`](./rag-core-lib/src/rag_core_lib/impl/langfuse_manager/langfuse_manager.py) | [`rag_core_lib.impl.langfuse_manager.langfuse_manager.LangfuseManager`](./rag-core-lib/src/rag_core_lib/impl/langfuse_manager/langfuse_manager.py) | Retrieves additional settings, as well as the prompt from langfuse if available. |
@@ -225,16 +225,15 @@ Technically, all parameters of the `SitemapLoader` from LangChain can be provide
 
 | Name | Type | Default | Notes |
 |----------|---------|--------------|--------------|
-| file_service | [`extractor_api_lib.file_services.file_service.FileService`](./extractor-api-lib/src/extractor_api_lib/file_services/file_service.py) | [`extractor_api_lib.file_services.s3_service.S3Service`](./extractor-api-lib/src/extractor_api_lib/file_services/s3_service.py) | Handles operations on the connected storage. |
-| database_converter | [`extractor_api_lib.document_parser.table_converters.dataframe_converter.DataframeConverter`](./extractor-api-lib/src/extractor_api_lib/document_parser/table_converters/dataframe_converter.py) | [`extractor_api_lib.document_parser.table_converters.dataframe2markdown.DataFrame2Markdown`](./extractor-api-lib/src/extractor_api_lib/document_parser/table_converters/dataframe2markdown.py) | Converts the extracted table from *pandas.DataFrame* to markdown. If you want the table to have another format, this would need to be adjusted. |
-| pdf_extractor | [`extractor_api_lib.document_parser.information_extractor.InformationExtractor`](./extractor-api-lib/src/extractor_api_lib/document_parser/information_extractor.py) |[`extractor_api_lib.document_parser.pdf_extractor.PDFExtractor`](./extractor-api-lib/src/extractor_api_lib/document_parser/pdf_extractor.py) | Extractor used for extracting information from PDF documents. |
-| ms_docs_extractor |  [`extractor_api_lib.document_parser.information_extractor.InformationExtractor`](./extractor-api-lib/src/extractor_api_lib/document_parser/information_extractor.py) |[`extractor_api_lib.document_parser.ms_docs_extractor.MSDocsExtractor`](./extractor-api-lib/src/extractor_api_lib/document_parser/ms_docs_extractor.py) | Extractor used for extracting information from Microsoft Documents like *.docx, etc. |
-| xml_extractor |  [`extractor_api_lib.document_parser.information_extractor.InformationExtractor`](./extractor-api-lib/src/extractor_api_lib/document_parser/information_extractor.py) | [`extractor_api_lib.document_parser.xml_extractor.XMLExtractor`](./extractor-api-lib/src/extractor_api_lib/document_parser/xml_extractor.py) | Extractor used for extracting content from XML documents. |
-| all_extractors | `dependency_injector.providers.List[extractor_api_lib.document_parser.information_extractor.InformationExtractor]` | `dependency_injector.providers.List(pdf_extractor, ms_docs_extractor, xml_extractor)` | List of all available extractors. If you add a new type of extractor you would have to add it to this list. |
-| general_extractor | [`extractor_api_lib.document_parser.information_extractor.InformationExtractor`](./extractor-api-lib/src/extractor_api_lib/document_parser/information_extractor.py) |[`extractor_api_lib.document_parser.general_extractor.GeneralExtractor`](./extractor-api-lib/src/extractor_api_lib/document_parser/general_extractor.py) | Combines multiple extractors and decides which one to use for the given file format. |
-| file_extractor | [`extractor_api_lib.api_endpoints.file_extractor.FileExtractor`](./extractor-api-lib/src/extractor_api_lib/api_endpoints/file_extractor.py) | [`extractor_api_lib.impl.api_endpoints.default_file_extractor.DefaultFileExtractor`](./extractor-api-lib/src/extractor_api_lib/impl/api_endpoints/default_file_extractor.py) | Implementation of the `/extract_from_file` endpoint. Uses *general_extractor*. |
+| file_service | [`extractor_api_lib.file_services.file_service.FileService`](./extractor-api-lib/src/extractor_api_lib/file_services/file_service.py) | [`extractor_api_lib.impl.file_services.s3_service.S3Service`](./extractor-api-lib/src/extractor_api_lib/impl/file_services/s3_service.py) | Handles operations on the connected storage. |
+| database_converter | [`extractor_api_lib.table_converter.dataframe_converter.DataframeConverter`](./extractor-api-lib/src/extractor_api_lib/table_converter/dataframe_converter.py) | [`extractor_api_lib.impl.table_converter.dataframe2markdown.DataFrame2Markdown`](./extractor-api-lib/src/extractor_api_lib/impl/table_converter/dataframe2markdown.py) | Converts the extracted table from *pandas.DataFrame* to markdown. If you want the table to have another format, this would need to be adjusted. |
+| pdf_extractor | [`extractor_api_lib.extractors.information_extractor.InformationExtractor`](./extractor-api-lib/src/extractor_api_lib/extractors/information_extractor.py) |[`extractor_api_lib.impl.extractors.file_extractors.pdf_extractor.PDFExtractor`](./extractor-api-lib/src/extractor_api_lib/impl/extractors/file_extractors/pdf_extractor.py) | Extractor used for extracting information from PDF documents. |
+| ms_docs_extractor |  [`extractor_api_lib.extractors.information_extractor.InformationExtractor`](./extractor-api-lib/src/extractor_api_lib/extractors/information_extractor.py) |[`extractor_api_lib.extractors.file_extractors.ms_docs_extractor.MSDocsExtractor`](./extractor-api-lib/src/extractor_api_lib/impl/extractors/file_extractors/ms_docs_extractor.py) | Extractor used for extracting information from Microsoft Documents like *.docx, etc. |
+| xml_extractor |  [`extractor_api_lib.extractors.information_extractor.InformationExtractor`](./extractor-api-lib/src/extractor_api_lib/extractors/information_extractor.py) | [`extractor_api_lib.extractors.file_extractors.xml_extractor.XMLExtractor`](./extractor-api-lib/src/extractor_api_lib/impl/extractors/file_extractors/xml_extractor.py) | Extractor used for extracting content from XML documents. |
+| all_extractors | `dependency_injector.providers.List[extractor_api_lib.extractors.information_extractor.InformationExtractor]` | `dependency_injector.providers.List(pdf_extractor, ms_docs_extractor, xml_extractor)` | List of all available extractors. If you add a new type of extractor you would have to add it to this list. |
+| general_file_extractor | [`extractor_api_lib.api_endpoints.file_extractor.FileExtractor`](./extractor-api-lib/src/extractor_api_lib/api_endpoints/file_extractor.py) |[`extractor_api_lib.impl.api_endpoints.general_file_extractor.GeneralFileExtractor`](./extractor-api-lib/src/extractor_api_lib/impl/api_endpoints/general_file_extractor.py) | Combines multiple file extractors and decides which one to use for the given file format. |
 | general_source_extractor | [`extractor_api_lib.api_endpoints.source_extractor.SourceExtractor`](./extractor-api-lib/src/extractor_api_lib/api_endpoints/source_extractor.py) | [`extractor_api_lib.impl.api_endpoints.general_source_extractor.GeneralSourceExtractor`](./extractor-api-lib/src/extractor_api_lib/impl/api_endpoints/general_source_extractor.py) | Implementation of the `/extract_from_source` endpoint.  Will decide the correct extractor for the source. |
-| confluence_extractor | [`extractor_api_lib.extractors.information_extractor.InformationExtractor`](./extractor-api-lib/src/extractor_api_lib/extractors/information_extractor.py) | [`extractor_api_lib.impl.extractors.confluence_extractor.ConfluenceExtractor`](./extractor-api-lib/src/extractor_api_lib/extractors/confluence_extractor.py) | Implementation of an esxtractor for the source `confluence`. |
+| confluence_extractor | [`extractor_api_lib.extractors.information_extractor.InformationExtractor`](./extractor-api-lib/src/extractor_api_lib/extractors/information_extractor.py) | [`extractor_api_lib.impl.extractors.confluence_extractor.ConfluenceExtractor`](./extractor-api-lib/src/extractor_api_lib/impl/extractors/confluence_extractor.py) | Implementation of an extractor for the source `confluence`. |
 | sitemap_extractor | [`extractor_api_lib.extractors.information_extractor.InformationExtractor`](./extractor-api-lib/src/extractor_api_lib/extractors/information_extractor.py) | [`extractor_api_lib.impl.extractors.sitemap_extractor.SitemapExtractor`](./extractor-api-lib/src/extractor_api_lib/impl/extractors/sitemap_extractor.py) | Implementation of an extractor for the source `sitemap`. Supports XML sitemap crawling with configurable parameters including URL filtering, custom headers, and crawling depth. Uses LangChain's SitemapLoader with support for custom parsing and meta functions via dependency injection. |
 | sitemap_parsing_function | `dependency_injector.providers.Factory[Callable]` | [`extractor_api_lib.impl.utils.sitemap_extractor_utils.custom_sitemap_parser_function`](./extractor-api-lib/src/extractor_api_lib/impl/utils/sitemap_extractor_utils.py) | Custom parsing function for sitemap content extraction. Used by the sitemap extractor to parse HTML content from web pages. Can be replaced to customize how web page content is processed and extracted. |
 | sitemap_meta_function | `dependency_injector.providers.Factory[Callable]` | [`extractor_api_lib.impl.utils.sitemap_extractor_utils.custom_sitemap_meta_function`](./extractor-api-lib/src/extractor_api_lib/impl/utils/sitemap_extractor_utils.py) | Custom meta function for sitemap content processing. Used by the sitemap extractor to extract metadata from web pages. Can be replaced to customize how metadata is extracted and structured from web content. |