Skip to content

Commit 1a8cf9e

Browse files
authored
Merge branch 'main' into framework
Signed-off-by: Ryan Cook <[email protected]>
2 parents eee82de + ad293d2 commit 1a8cf9e

File tree

4 files changed

+360
-0
lines changed

4 files changed

+360
-0
lines changed

.spellcheck-en-custom.txt

Lines changed: 18 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -29,6 +29,7 @@ Containerfile
2929
cpp
3030
cuBLAS
3131
CUDA
32+
ctrl
3233
customizations
3334
CVE
3435
CVEs
@@ -43,6 +44,7 @@ Dependabot
4344
dev
4445
disambiguating
4546
ditaa
47+
Docling
4648
docstring
4749
downstreams
4850
dr
@@ -66,6 +68,7 @@ gguf
6668
GGUFs
6769
ggufs
6870
GiB
71+
github
6972
Gmail
7073
GPTDolomite
7174
gpu
@@ -79,10 +82,12 @@ ilab
7982
Ilya
8083
impactful
8184
Inferencing
85+
init
8286
instantiation
8387
instructlab
8488
io
8589
ISA
90+
init
8691
iters
8792
itertools
8893
Jie
@@ -119,6 +124,9 @@ mixtral
119124
MLX
120125
mlx
121126
MMLU
127+
modularize
128+
modularized
129+
Murdock
122130
Nakamura
123131
natively
124132
networkx
@@ -136,10 +144,12 @@ OpenAI
136144
optimizers
137145
orchestrator
138146
ots
147+
PaRAGon
139148
Params
140149
Pareja
141150
PEFT
142151
Pereira
152+
PID
143153
PlantUML
144154
PLOS
145155
pluggable
@@ -148,10 +158,12 @@ POC
148158
Podman
149159
podman
150160
posthog
161+
postprocessing
151162
pre
152163
preprint
153164
preprocessing
154165
prereqs
166+
productize
155167
productized
156168
PR's
157169
PSFL
@@ -162,6 +174,7 @@ pyproject
162174
PyTorch
163175
pyyaml
164176
qlora
177+
qna
165178
quantized
166179
Quantizing
167180
Radeon
@@ -199,8 +212,10 @@ Staar
199212
subcommand
200213
subcommands
201214
subdirectory
215+
subprocess
202216
Sudalairaj
203217
supportability
218+
systemd
204219
Taj
205220
tatsu
206221
TBD
@@ -219,7 +234,10 @@ triagers
219234
UI
220235
ui
221236
unquantized
237+
unstaged
222238
USM
239+
UUID
240+
UUIDs
223241
UX
224242
vectordbs
225243
venv

docs/cli/ilab-processes.md

Lines changed: 64 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,64 @@
1+
# Processes in InstructLab
2+
3+
The ability to detach from processes is crucial to the user experience of InstructLab. However, the concept of multi-processing, process management, and the monitoring of processes is very complex.
4+
5+
It is important to try and add this concept in as simply as possible, expanding on the state reporting, logging, and other features as we go along.
6+
7+
## Phased approach to InstructLab Processes
8+
9+
This document is going to describe phase 1 of implementing processes in InstructLab. Phase 1 is to be described as the "ilab simple process management system". This will depend purely on python packages, PID tracking, and log files to create the experience of detachable processes. The key here is the concept of the UUID, allowing a future REST API to keep track of InstructLab processes using these unique identifiers.
10+
11+
We can re-visit all this in phase 2, when we discuss if we want to utilize something like systemd or a more in-depth process-monitor repo to track processes.
12+
13+
### Phase 1
14+
15+
Phase one would focus on adding the concept of detaching from processes, re-attaching to them, and managing the various artifacts from the processes.
16+
17+
Process management would only apply to `ilab data generate` and `ilab model train` in a first iteration. This would be followed by commands like `ilab model evaluate`, `ilab model serve`, and `ilab model download`. All of these commands have long running processes that would benefit from detachment.
18+
19+
The workflow would allow for:
20+
21+
`ilab data generate -dt` (run a detached generation process)
22+
`ilab model train -dt` (run a detached training process)
23+
24+
`ilab process list`
25+
26+
```console=
27+
+------------+-------+--------------------------------------+------------------------------------------------------------------------------------------------------------------+----------+
28+
| Type | PID | UUID | Log File | Runtime |
29+
+------------+-------+--------------------------------------+------------------------------------------------------------------------------------------------------------------+----------+
30+
| Generation | 39832 | 82d00a5b-5ed5-4cfd-9a75-a87e4f420b27 | /Users/charliedoern/.local/share/instructlab/logs/generation/generation-82d00a5b-5ed5-4cfd-9a75-a87e4f420b27.log | 69:26:28 |
31+
| Generation | 40791 | 09f9d301-4fd9-4045-bfda-8a56f1d96016 | /Users/charliedoern/.local/share/instructlab/logs/generation/generation-09f9d301-4fd9-4045-bfda-8a56f1d96016.log | 68:45:40 |
32+
| Generation | 47390 | 4ccabfa5-604f-49c6-b5c3-730ce328d62a | /Users/charliedoern/.local/share/instructlab/logs/generation/generation-4ccabfa5-604f-49c6-b5c3-730ce328d62a.log | 67:26:33 |
33+
| Generation | 50872 | 093ac2e9-080c-45fe-89c5-43d508d6369c | /Users/charliedoern/.local/share/instructlab/logs/generation/generation-093ac2e9-080c-45fe-89c5-43d508d6369c.log | 05:24:56 |
34+
+------------+-------+--------------------------------------+------------------------------------------------------------------------------------------------------------------+----------+
35+
```
36+
37+
`ilab process attach <UUID>`
38+
39+
This command would re-attach to the given process, allowing to user to view the live logs of the process. `attach` would trail the log file and listen for user-input to kill the process.
40+
41+
These commands will be done in a very simple way at first using the following architecture:
42+
43+
1. a detached process be re-attachable by tailing the log file and then allowing the user to ctrl+c the process as normal using `KeyboardInterrupt`
44+
2. The process registry will be maintained for tracking UUIDs created via the `uuid` python package, the PID of the actual process, a `log_file` where the process will be outputting its logs to so that the user can re-attach, and the start time of the process. The log file directory will be tracked using our `DEFAULTS` package and will be standard throughout releases.
45+
46+
The general flow would be:
47+
48+
1. a user runs `ilab data generate -dt`
49+
2. a UUID, PID, and log file is added to the process registry.
50+
3. the process would exit, and print the UUID of the sdg run
51+
4. a user could attach to this process using `ilab process attach <UUID>`.
52+
5. This command would look in the process registry for the PID and/or UUID, get the log file, tail the log file, and listen for a ctrl+c keyboard interrupt.
53+
54+
This allows us to detach from processes while still running them in the background and maintain log files all without the use of anything other than UUID and subprocess.
55+
56+
#### Log file management
57+
58+
If existing log files from the various libraries exist, those will be used in this scenario. If they do not, InstructLab will manage writing process logs to disk. Regardless of whether the libraries maintain their own log file, InstructLab will need to co-locate the log files in a centralized directory.
59+
60+
If a log file exists, it will be copied and renamed into the following directory format:
61+
62+
`~/.local/share/instructlab/logs/<command_name>/<command_name>-<timestamp>.log`
63+
64+
If the log file does not exist, InstructLab will create one with this format. Libraries are responsible for standardizing where their logs are stored if they already exist so the Core package can access them in a uniform fashion and copy them to the proper directory.

docs/rag/rag-initial-code-location.md

Lines changed: 109 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,109 @@
1+
# Code location for RAG
2+
3+
| Created | Dec 5, 2024 |
4+
| -------- | -------- |
5+
| Authors | Bill Murdock |
6+
| Replaces | N/A |
7+
| Replaced by | N/A |
8+
9+
## What
10+
11+
We want a retrieval-augmented generation (RAG) capability that provides outstanding results with minimal effort, is seamlessly integrated with InstructLab, and is also general enough to be used in other applications as well.
12+
13+
## Why
14+
15+
Many InstructLab users want to train a model and then use it to RAG. Often they build something simple themselves for this purpose. Two problems with this approach:
16+
17+
- Building their own RAG is extra work.
18+
- Users who are not experts on RAG might not build a RAG that provides outstanding results.
19+
20+
There is a very simple RAG capability at <https://github.com/instructlab/rag> . It is not tightly integrated with InstructLab and it does not use any advanced RAG capabilities. However, we have a request from a stakeholder to not just unilaterally delete it or replace it with something radically different.
21+
22+
## Goals
23+
24+
Provide a built-in alternative for users who do not want to build their own RAG. Keep the existing capability at <https://github.com/instructlab/rag> somewhere, but potentially somewhere other than it is now (e.g., in a new branch of the existing repository).
25+
26+
## Non-goals
27+
28+
Evaluation of RAG will be addressed in one or more other development documents. That topic is out of scope for this document.
29+
30+
## Decision
31+
32+
- For now, RAG will be located in the core repository in its own directory: `src/instructlab/rag` in the core InstructLab repository (<https://github.com/instructlab/instructlab>).
33+
34+
## How
35+
36+
### Phase 1
37+
38+
- RAG will be located in the core repository in its own directory: `src/instructlab/rag` in the core InstructLab repository (<https://github.com/instructlab/instructlab>).
39+
- This directory will include all of the following:
40+
- Loading the content from Docling-format JSON files (that are produced by SDG preprocessing).
41+
- Chunking that content to sizes that fit the requirements of the selected embedding model for vector database storage and retrieval.
42+
- Storing those chunks with their vector representations in a vector database.
43+
- End-to-end runtime RAG. The initial version of this includes the following:
44+
- Taking as input a session history (including a current user query) and providing a response (e.g., something along the lines of the [OpenAI chat completion API](https://platform.openai.com/docs/api-reference/chat/create)).
45+
- During that processing, it retrieves relevant search results from the vector database, it converts those into a prompt to send to the response generation model, it prompts that model, and it returns the response from that model.
46+
- This will be invoked from the existing `ilab` CLI, as described in the [RAG ingestion and chat pipelines](https://github.com/instructlab/dev-docs/pull/161) dev doc.
47+
48+
### Future phases
49+
50+
- In the near future, RAG might be moved to the existing <https://github.com/instructlab/rag> repository.
51+
- If so, something will be done with the existing code in <https://github.com/instructlab/rag>, e.g., moving it to a branch of that repository or moving it to a different repository.
52+
- Alternatively, some or all of it might move to a new repository.
53+
- For example, maybe the indexing and retrieval portions move to a separate retrieval repository while the rest of end-to-end runtime RAG might move somewhere else.
54+
- If/when we move ahead with any of these options, *we will open a new ADR for that decision*.
55+
- Also, the capabilities will keep improving and adding more functionality.
56+
57+
## Alternatives
58+
59+
- Put the indexing and run-time RAG code in a new repository.
60+
- Pro: Having a dedicated repository gives the RAG team the most freedom and flexibility to make technical decisions that work for that team.
61+
- Pro: Starting with a new repository provides a blank slate that can be set up in whatever way makes the most sense for that functionality.
62+
- Pro: Having the capability in one repository makes it easier for consumers such as RamaLama to reuse it for their purposes too.
63+
- Con: Creating and configuring a new repository is some work. (This is a fairly small con, but a real one.)
64+
- Con: Integrating a new repository into the continuous integration and delivery capabilities for both upstream InstructLab and downstream consumers is a *lot* of work. This is a much bigger con.
65+
- Con: All that extra work would almost certainly result in slower time to market. This risks missing some market opportunities.
66+
- Put the indexing code in <https://github.com/instructlab/sdg> (SDG) and the run-time RAG code in <https://github.com/instructlab/instructlab> (core)
67+
- Pro: This has the advantage of not adding any new dependencies.
68+
- Pro: The document processing is already in SDG and chat functionality is already in core so this would require the fewest code changes.
69+
- Con: Splitting the RAG functionality across multiple repositories makes it more complicated to reuse in other applications outside of InstructLab.
70+
- Con: Many things we will want to do to add advanced functionality to make RAG more effective will require changes to both indexing and run-time RAG. If those components are split across multiple repositories, that will make delivering such changes more complicated.
71+
- Start by putting the code into existing InstructLab repositories (either of the above options) and then split if off into its own repository later.
72+
- Pro: Gets us integrated into InstructLab sooner.
73+
- Con: Adds extra work to the second phase where we have to split it off into its own repository.
74+
- Con: There is a risk that we never get around to splitting it off and we wind up stuck with the cons of being jammed in to other components indefinitely.
75+
- Put the indexing and run-time RAG code in a new repo outside <https://github.com/instructlab/>.
76+
- Pro: This signals that this is not specific to InstructLab but is instead intended to be useful in a variety of applications. That makes it more likely the work could have broader impact.
77+
- Con: If we put this out there as something that is intended to be useful in a variety of applications, the pressure is on us to make sure it is differentiated from other broadly applicable RAG capabilities. Hopefully that will be true eventually, but it probably won't be true for a while. It might make more sense to give this some time to mature as a local component of InstructLab before trying to spin it off as its own thing.
78+
- Con: If we put it out there as its own open source project, that project needs all of the infrastructure of a full open source activity (governing structures, communication tools and protocols, etc.). That's a lot of work to set up. Keeping it inside InstructLab for now lets us keep using the infrastructure that InstructLab has for this purpose).
79+
- Con: If we put it out there as its own open source project, it needs a name. It is a lot of work to come up with a good name and there will be a lot of stakeholders with an interest in the name that comes up.
80+
- Keep the indexing and run-time RAG code in <https://github.com/redhat-et/PaRAGon> which is an emerging technologies prototype for this work.
81+
- Mostly the same pros and cons as putting it in a new repo outside InstructLab plus the following:
82+
- Pro: A prototype for the code we want is already there.
83+
- Pro: It already has its own distinctive name (PaRAGon).
84+
- Con: The existing repository has its own simple command-line interface which is useful for the prototype but we don't want it in the capability we release because too many command-line interfaces will confuse users.
85+
- Con: The name PaRAGon seems fine to me, but probably more stakeholders need to weigh in on what a name would be.
86+
- Con: The `redhat-et` label suggests that this is something "owned" by Red Hat which makes sense for the prototype but not so much for something we want a community to own in the long run.
87+
- Put the indexing and run-time RAG code in <https://github.com/instructlab/rag> AND keep the existing RAG functionality in that repository intact.
88+
- Pro: It already exists.
89+
- Pro: It avoids the confusion of having two different RAG repositories in <https://github.com/instructlab/>.
90+
- Con: It creates the confusion of having two different RAG solutions in the same repository. We could mitigate that with developer documentation and marking legacy stuff as "deprecated".
91+
- Put the indexing and run-time RAG code in <https://github.com/instructlab/rag> AND eliminate the existing RAG functionality in that repository.
92+
- Pro: It already exists.
93+
- Pro: It avoids the confusion of having two different RAG repositories in <https://github.com/instructlab/>.
94+
- Pro: It avoids the confusion of having two different RAG solutions since we'd be eliminating the old one.
95+
- Con: There is still some interest in keeping this around.
96+
97+
## Risks
98+
99+
- Putting the RAG functionality in the core repository requires any application that wants to use this functionality to bring in the entire core which then brings in all of the libraries it depend on, so this becomes an enormous dependency. This discourages reuse in other applications. It *encourages* either of the following behaviors that would be unfortunate:
100+
- Other applications pull directly from <https://github.com/redhat-et/PaRAGon> and in doing so duplicate the ongoing effort to harden that code base.
101+
- Other applications may implement their own RAG solutions or pull from some other upstream unrelated to ours.
102+
- As noted earlier, putting the capability inside <https://github.com/instructlab/> signals that this is a component of InstructLab and not a generally useful feature. That creates a risk that the work could miss out on additional opportunities for impact. We hope to mitigate that risk by spinning it off to its own open source project when it is mature enough, but there is a risk that we will get distracted by other things and never get around to this.
103+
- The flow for document processing for InstructLab winds up being quite complicated in this proposal. Since the existing document processing is in SDG, the flow for indexing for RAG winds up being a bit complicated (i.e., it starts with a CLI call handled by the core repo then goes to SDG for some of the document processing and then back to the core `/data` directory which then calls out the the `core/rag` directory for chunking and vector database indexing). Having the document processing move from core to SDG and back to core and forward to RAG makes that capability more difficult to understand and maintain. This complexity will be partially mitigated when the preprocessing code moves from SDG to core. It will be further mitigated by having a clear, well-documented contract between core and the RAG repository indicating the responsibilities of each.
104+
105+
## References
106+
107+
- <https://github.com/redhat-et/PaRAGon>
108+
- <https://github.com/instructlab>
109+
- <https://github.com/instructlab/rag>

0 commit comments

Comments
 (0)