Merge branch 'main' into framework

cooktheryan · web-flow · commit 1a8cf9eb5e22 · 2025-01-02T09:43:21.000-05:00
Signed-off-by: Ryan Cook &lt;rcook@redhat.com&gt;
diff --git a/.spellcheck-en-custom.txt b/.spellcheck-en-custom.txt
@@ -29,6 +29,7 @@ Containerfile
 cpp
 cuBLAS
 CUDA
+ctrl
 customizations
 CVE
 CVEs
@@ -43,6 +44,7 @@ Dependabot
 dev
 disambiguating
 ditaa
+Docling
 docstring
 downstreams
 dr
@@ -66,6 +68,7 @@ gguf
 GGUFs
 ggufs
 GiB
+github
 Gmail
 GPTDolomite
 gpu
@@ -79,10 +82,12 @@ ilab
 Ilya
 impactful
 Inferencing
+init
 instantiation
 instructlab
 io
 ISA
+init
 iters
 itertools
 Jie
@@ -119,6 +124,9 @@ mixtral
 MLX
 mlx
 MMLU
+modularize
+modularized
+Murdock
 Nakamura
 natively
 networkx
@@ -136,10 +144,12 @@ OpenAI
 optimizers
 orchestrator
 ots
+PaRAGon
 Params
 Pareja
 PEFT
 Pereira
+PID
 PlantUML
 PLOS
 pluggable
@@ -148,10 +158,12 @@ POC
 Podman
 podman
 posthog
+postprocessing
 pre
 preprint
 preprocessing
 prereqs
+productize
 productized
 PR's
 PSFL
@@ -162,6 +174,7 @@ pyproject
 PyTorch
 pyyaml
 qlora
+qna
 quantized
 Quantizing
 Radeon
@@ -199,8 +212,10 @@ Staar
 subcommand
 subcommands
 subdirectory
+subprocess
 Sudalairaj
 supportability
+systemd
 Taj
 tatsu
 TBD
@@ -219,7 +234,10 @@ triagers
 UI
 ui
 unquantized
+unstaged
 USM
+UUID
+UUIDs
 UX
 vectordbs
 venv
diff --git a/docs/cli/ilab-processes.md b/docs/cli/ilab-processes.md
@@ -0,0 +1,64 @@
+# Processes in InstructLab
+
+The ability to detach from processes is crucial to the user experience of InstructLab. However, the concept of multi-processing, process management, and the monitoring of processes is very complex.
+
+It is important to try and add this concept in as simply as possible, expanding on the state reporting, logging, and other features as we go along.
+
+## Phased approach to InstructLab Processes
+
+This document is going to describe phase 1 of implementing processes in InstructLab. Phase 1 is to be described as the "ilab simple process management system". This will depend purely on python packages, PID tracking, and log files to create the experience of detachable processes. The key here is the concept of the UUID, allowing a future REST API to keep track of InstructLab processes using these unique identifiers.
+
+We can re-visit all this in phase 2, when we discuss if we want to utilize something like systemd or a more in-depth process-monitor repo to track processes.
+
+### Phase 1
+
+Phase one would focus on adding the concept of detaching from processes, re-attaching to them, and managing the various artifacts from the processes.
+
+Process management would only apply to `ilab data generate` and `ilab model train` in a first iteration. This would be followed by commands like `ilab model evaluate`, `ilab model serve`, and `ilab model download`. All of these commands have long running processes that would benefit from detachment.
+
+The workflow would allow for:
+
+`ilab data generate -dt` (run a detached generation process)
+`ilab model train -dt` (run a detached training process)
+
+`ilab process list`
+
+```console=
++------------+-------+--------------------------------------+------------------------------------------------------------------------------------------------------------------+----------+
+| Type       | PID   | UUID                                 | Log File                                                                                                         | Runtime  |
++------------+-------+--------------------------------------+------------------------------------------------------------------------------------------------------------------+----------+
+| Generation | 39832 | 82d00a5b-5ed5-4cfd-9a75-a87e4f420b27 | /Users/charliedoern/.local/share/instructlab/logs/generation/generation-82d00a5b-5ed5-4cfd-9a75-a87e4f420b27.log | 69:26:28 |
+| Generation | 40791 | 09f9d301-4fd9-4045-bfda-8a56f1d96016 | /Users/charliedoern/.local/share/instructlab/logs/generation/generation-09f9d301-4fd9-4045-bfda-8a56f1d96016.log | 68:45:40 |
+| Generation | 47390 | 4ccabfa5-604f-49c6-b5c3-730ce328d62a | /Users/charliedoern/.local/share/instructlab/logs/generation/generation-4ccabfa5-604f-49c6-b5c3-730ce328d62a.log | 67:26:33 |
+| Generation | 50872 | 093ac2e9-080c-45fe-89c5-43d508d6369c | /Users/charliedoern/.local/share/instructlab/logs/generation/generation-093ac2e9-080c-45fe-89c5-43d508d6369c.log | 05:24:56 |
++------------+-------+--------------------------------------+------------------------------------------------------------------------------------------------------------------+----------+
+```
+
+`ilab process attach <UUID>`
+
+This command would re-attach to the given process, allowing to user to view the live logs of the process. `attach` would trail the log file and listen for user-input to kill the process.
+
+These commands will be done in a very simple way at first using the following architecture:
+
+1. a detached process be re-attachable by tailing the log file and then allowing the user to ctrl+c the process as normal using `KeyboardInterrupt`
+2. The process registry will be maintained for tracking UUIDs created via the `uuid` python package, the PID of the actual process, a `log_file` where the process will be outputting its logs to so that the user can re-attach, and the start time of the process. The log file directory will be tracked using our `DEFAULTS` package and will be standard throughout releases.
+
+The general flow would be:
+
+1. a user runs `ilab data generate -dt`
+2. a UUID, PID, and log file is added to the process registry.
+3. the process would exit, and print the UUID of the sdg run
+4. a user could attach to this process using `ilab process attach <UUID>`.
+5. This command would look in the process registry for the PID and/or UUID, get the log file, tail the log file, and listen for a ctrl+c keyboard interrupt.
+
+This allows us to detach from processes while still running them in the background and maintain log files all without the use of anything other than UUID and subprocess.
+
+#### Log file management
+
+If existing log files from the various libraries exist, those will be used in this scenario. If they do not, InstructLab will manage writing process logs to disk. Regardless of whether the libraries maintain their own log file, InstructLab will need to co-locate the log files in a centralized directory.
+
+If a log file exists, it will be copied and renamed into the following directory format:
+
+`~/.local/share/instructlab/logs/<command_name>/<command_name>-<timestamp>.log`
+
+If the log file does not exist, InstructLab will create one with this format. Libraries are responsible for standardizing where their logs are stored if they already exist so the Core package can access them in a uniform fashion and copy them to the proper directory.
diff --git a/docs/rag/rag-initial-code-location.md b/docs/rag/rag-initial-code-location.md
@@ -0,0 +1,109 @@
+# Code location for RAG
+
+| Created  | Dec 5, 2024 |
+| -------- | -------- |
+| Authors | Bill Murdock |
+| Replaces | N/A |
+| Replaced by | N/A |
+
+## What
+
+We want a retrieval-augmented generation (RAG) capability that provides outstanding results with minimal effort, is seamlessly integrated with InstructLab, and is also general enough to be used in other applications as well.
+
+## Why
+
+Many InstructLab users want to train a model and then use it to RAG.  Often they build something simple themselves for this purpose.  Two problems with this approach:
+
+- Building their own RAG is extra work.
+- Users who are not experts on RAG might not build a RAG that provides outstanding results.
+
+There is a very simple RAG capability at <https://github.com/instructlab/rag> .  It is not tightly integrated with InstructLab and it does not use any advanced RAG capabilities.  However, we have a request from a stakeholder to not just unilaterally delete it or replace it with something radically different.
+
+## Goals
+
+Provide a built-in alternative for users who do not want to build their own RAG.  Keep the existing capability at <https://github.com/instructlab/rag> somewhere, but potentially somewhere other than it is now (e.g., in a new branch of the existing repository).
+
+## Non-goals
+
+Evaluation of RAG will be addressed in one or more other development documents.  That topic is out of scope for this document.
+
+## Decision
+
+- For now, RAG will be located in the core repository in its own directory: `src/instructlab/rag` in the core InstructLab repository (<https://github.com/instructlab/instructlab>).
+
+## How
+
+### Phase 1
+
+- RAG will be located in the core repository in its own directory: `src/instructlab/rag` in the core InstructLab repository (<https://github.com/instructlab/instructlab>).
+- This directory will include all of the following:
+  - Loading the content from Docling-format JSON files (that are produced by SDG preprocessing).
+  - Chunking that content to sizes that fit the requirements of the selected embedding model for vector database storage and retrieval.
+  - Storing those chunks with their vector representations in a vector database.
+  - End-to-end runtime RAG.  The initial version of this includes the following:
+    - Taking as input a session history (including a current user query) and providing a response (e.g., something along the lines of the [OpenAI chat completion API](https://platform.openai.com/docs/api-reference/chat/create)).
+    - During that processing, it retrieves relevant search results from the vector database, it converts those into a prompt to send to the response generation model, it prompts that model, and it returns the response from that model.
+- This will be invoked from the existing `ilab` CLI, as described in the [RAG ingestion and chat pipelines](https://github.com/instructlab/dev-docs/pull/161) dev doc.
+
+### Future phases
+
+- In the near future, RAG might be moved to the existing <https://github.com/instructlab/rag> repository.
+  - If so, something will be done with the existing code in <https://github.com/instructlab/rag>, e.g., moving it to a branch of that repository or moving it to a different repository.
+- Alternatively, some or all of it might move to a new repository.
+  - For example, maybe the indexing and retrieval portions move to a separate retrieval repository while the rest of end-to-end runtime RAG might move somewhere else.
+- If/when we move ahead with any of these options, *we will open a new ADR for that decision*.
+- Also, the capabilities will keep improving and adding more functionality.
+
+## Alternatives
+
+- Put the indexing and run-time RAG code in a new repository.
+  - Pro: Having a dedicated repository gives the RAG team the most freedom and flexibility to make technical decisions that work for that team.
+  - Pro: Starting with a new repository provides a blank slate that can be set up in whatever way makes the most sense for that functionality.
+  - Pro: Having the capability in one repository makes it easier for consumers such as RamaLama to reuse it for their purposes too.
+  - Con: Creating and configuring a new repository is some work.  (This is a fairly small con, but a real one.)
+  - Con: Integrating a new repository into the continuous integration and delivery capabilities for both upstream InstructLab and downstream consumers is a *lot* of work.  This is a much bigger con.
+  - Con: All that extra work would almost certainly result in slower time to market.  This risks missing some market opportunities.
+- Put the indexing code in <https://github.com/instructlab/sdg> (SDG) and the run-time RAG code in <https://github.com/instructlab/instructlab> (core)
+  - Pro: This has the advantage of not adding any new dependencies.
+  - Pro: The document processing is already in SDG and chat functionality is already in core so this would require the fewest code changes.
+  - Con: Splitting the RAG functionality across multiple repositories makes it more complicated to reuse in other applications outside of InstructLab.
+  - Con: Many things we will want to do to add advanced functionality to make RAG more effective will require changes to both indexing and run-time RAG.  If those components are split across multiple repositories, that will make delivering such changes more complicated.
+- Start by putting the code into existing InstructLab repositories (either of the above options) and then split if off into its own repository later.
+  - Pro: Gets us integrated into InstructLab sooner.
+  - Con: Adds extra work to the second phase where we have to split it off into its own repository.
+  - Con: There is a risk that we never get around to splitting it off and we wind up stuck with the cons of being jammed in to other components indefinitely.
+- Put the indexing and run-time RAG code in a new repo outside <https://github.com/instructlab/>.
+  - Pro: This signals that this is not specific to InstructLab but is instead intended to be useful in a variety of applications.  That makes it more likely the work could have broader impact.
+  - Con: If we put this out there as something that is intended to be useful in a variety of applications, the pressure is on us to make sure it is differentiated from other broadly applicable RAG capabilities.  Hopefully that will be true eventually, but it probably won't be true for a while.  It might make more sense to give this some time to mature as a local component of InstructLab before trying to spin it off as its own thing.
+  - Con: If we put it out there as its own open source project, that project needs all of the infrastructure of a full open source activity (governing structures, communication tools and protocols, etc.).  That's a lot of work to set up.  Keeping it inside InstructLab for now lets us keep using the infrastructure that InstructLab has for this purpose).
+  - Con: If we put it out there as its own open source project, it needs a name.  It is a lot of work to come up with a good name and there will be a lot of stakeholders with an interest in the name that comes up.
+- Keep the indexing and run-time RAG code in <https://github.com/redhat-et/PaRAGon> which is an emerging technologies prototype for this work.
+  - Mostly the same pros and cons as putting it in a new repo outside InstructLab plus the following:
+  - Pro: A prototype for the code we want is already there.
+  - Pro: It already has its own distinctive name (PaRAGon).
+  - Con: The existing repository has its own simple command-line interface which is useful for the prototype but we don't want it in the capability we release because too many command-line interfaces will confuse users.
+  - Con: The name PaRAGon seems fine to me, but probably more stakeholders need to weigh in on what a name would be.
+  - Con: The `redhat-et` label suggests that this is something "owned" by Red Hat which makes sense for the prototype but not so much for something we want a community to own in the long run.
+- Put the indexing and run-time RAG code in <https://github.com/instructlab/rag> AND keep the existing RAG functionality in that repository intact.
+  - Pro: It already exists.
+  - Pro: It avoids the confusion of having two different RAG repositories in <https://github.com/instructlab/>.
+  - Con: It creates the confusion of having two different RAG solutions in the same repository.  We could mitigate that with developer documentation and marking legacy stuff as "deprecated".
+- Put the indexing and run-time RAG code in <https://github.com/instructlab/rag> AND eliminate the existing RAG functionality in that repository.
+  - Pro: It already exists.
+  - Pro: It avoids the confusion of having two different RAG repositories in <https://github.com/instructlab/>.
+  - Pro: It avoids the confusion of having two different RAG solutions since we'd be eliminating the old one.
+  - Con: There is still some interest in keeping this around.
+
+## Risks
+
+- Putting the RAG functionality in the core repository requires any application that wants to use this functionality to bring in the entire core which then brings in all of the libraries it depend on, so this becomes an enormous dependency.  This discourages reuse in other applications.  It *encourages* either of the following behaviors that would be unfortunate:
+  - Other applications pull directly from <https://github.com/redhat-et/PaRAGon> and in doing so duplicate the ongoing effort to harden that code base.
+  - Other applications may implement their own RAG solutions or pull from some other upstream unrelated to ours.
+- As noted earlier, putting the capability inside <https://github.com/instructlab/> signals that this is a component of InstructLab and not a generally useful feature.  That creates a risk that the work could miss out on additional opportunities for impact.  We hope to mitigate that risk by spinning it off to its own open source project when it is mature enough, but there is a risk that we will get distracted by other things and never get around to this.
+- The flow for document processing for InstructLab winds up being quite complicated in this proposal.  Since the existing document processing is in SDG, the flow for indexing for RAG winds up being a bit complicated (i.e., it starts with a CLI call handled by the core repo then goes to SDG for some of the document processing and then back to the core `/data` directory which then calls out the the `core/rag` directory for chunking and vector database indexing).  Having the document processing move from core to SDG and back to core and forward to RAG makes that capability more difficult to understand and maintain.  This complexity will be partially mitigated when the preprocessing code moves from SDG to core.  It will be further mitigated by having a clear, well-documented contract between core and the RAG repository indicating the responsibilities of each.
+
+## References
+
+- <https://github.com/redhat-et/PaRAGon>
+- <https://github.com/instructlab>
+- <https://github.com/instructlab/rag>
diff --git a/docs/sdg/sdg-refactor.md b/docs/sdg/sdg-refactor.md