'Large Language Data and Statistics': a library for generating LLM-driven insights into data.
Install the package from PyPI, along with the libraries listed in requirements.txt.
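For example (assuming the package is published on PyPI under the name `llads`):

```
pip install llads
```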
You can use any LLM served behind an OpenAI-compatible API, including a local LlamaCPP server. Note that the LLM needs to be powerful enough to properly parse and produce the expected outputs for the various steps of the chain. The following information is necessary for creating the LLM:
- api key (if using a cloud LLM provider)
- base url
- model name
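For example, creating the LLM against a local LlamaCPP server might look like this (a sketch; the base URL, port, and model name all depend on your local setup, and `system_prompts` is loaded as shown in the next section):

```python
from llads.customLLM import customLLM

# a sketch, assuming a llama.cpp server exposing its OpenAI-compatible
# API at the default local address; adjust to your own setup
llm = customLLM(
    api_key="not-needed",  # local servers typically don't check the key
    base_url="http://localhost:8080/v1",
    model_name="local-model",  # hypothetical model name
    temperature=0.0,
    max_tokens=2048,
    system_prompts=system_prompts,  # loaded as shown below
)
```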
```python
import pandas as pd

from llads.customLLM import customLLM

# optionally import the modules where your tools live to get proper imports
# in the generated full runnable Python script
import llads.tools
import llads.visualizations

# example tools included with the library; you can define and pass your own
from llads.tools import get_world_bank_gdp_data  # data tool for World Bank GDP data
from llads.visualizations import gen_plot  # line/bar plot visualization tool

# a good default is included in the repo, but you can edit it to your own needs
system_prompts = pd.read_csv("https://raw.githubusercontent.com/dhopp1/llads/refs/heads/main/system_prompts.csv")

# creating the LLM (gemini 2.0 flash as an example)
llm = customLLM(
    api_key="API_KEY",
    base_url="https://generativelanguage.googleapis.com/v1beta/openai",
    model_name="gemini-2.0-flash",
    temperature=0.0,
    max_tokens=2048,
    system_prompts=system_prompts,
)
# defining which tools the LLM has available to it
tools = [get_world_bank_gdp_data]
plot_tools = [gen_plot]
# generating a response
prompt = "What is the GDP of Italy and the UK as a % of Germany over the last 5 years?" # the user's initial question
results = llm.chat(
    prompt=prompt,
    tools=tools,
    plot_tools=plot_tools,
    validate=True,  # if True, the LLM will perform an additional validation step on its commentary
    use_free_plot=False,  # if False, the LLM must use one of the plot_tools; if True, it is free to make its own matplotlib plot
    prior_query_id=None,  # None, because this is the first query in the chat history
    modules=[llads.tools, llads.visualizations],  # optionally pass the modules where your tool and plot functions live to get proper imports in the full runnable Python script
)
# follow-up question
new_query = "Add France to the analysis"
followup_result = llm.chat(
    prompt=new_query,
    tools=tools,
    plot_tools=plot_tools,
    validate=True,
    use_free_plot=False,
    prior_query_id=results["tool_result"]["query_id"],  # pass our prior query id to make message history available
)
# you can access prior query results via the query id
llm._query_results[query_id]  # e.g., query_id = results["tool_result"]["query_id"]
```

The `chat()` function will produce a dictionary with the following values:
- `initial_prompt`: The question passed by the user
- `tool_result`: A dictionary with the following information:
  - `query_id`: The unique ID number of this query
  - `tool_call`: The name and arguments of the tools the LLM called
  - `invoked_result`: The actual DataFrame resulting from the tool calls
  - `n_tokens_input`: The number of tokens consumed by the LLM for input in this step
  - `n_tokens_output`: The number of tokens consumed by the LLM for output in this step
  - `seconds_taken`: How many seconds this step took to run
- `pd_code`: A dictionary with the following information:
  - `data_desc`: A text description of the data made available to the LLM via the tool call
  - `pd_code`: The Python code the LLM executed to edit the raw data available to it
  - `n_tokens_input`: The number of tokens consumed by the LLM for input in this step
  - `n_tokens_output`: The number of tokens consumed by the LLM for output in this step
  - `seconds_taken`: How many seconds this step took to run
- `dataset`: The actual DataFrame that is the result of the `pd_code` call
- `explanation`: A dictionary with the following information:
  - `explanation`: The LLM's explanation of the data manipulation process undergone to answer the user's question
  - `n_tokens_input`: The number of tokens consumed by the LLM for input in this step
  - `n_tokens_output`: The number of tokens consumed by the LLM for output in this step
  - `seconds_taken`: How many seconds this step took to run
- `commentary`: A dictionary with the following information:
  - `commentary`: The LLM's commentary on the final dataset answering the user's question
  - `n_tokens_input`: The number of tokens consumed by the LLM for input in this step
  - `n_tokens_output`: The number of tokens consumed by the LLM for output in this step
  - `seconds_taken`: How many seconds this step took to run
- `plots`: A dictionary with the following values:
  - `visualization_call`: A list of either the matplotlib code written or the plotting function calls run to create the visualizations answering the user's question
  - `invoked_result`: A list of the actual plot figures produced to answer the user's question
  - `n_tokens_input`: The number of tokens consumed by the LLM for input in this step
  - `n_tokens_output`: The number of tokens consumed by the LLM for output in this step
  - `seconds_taken`: How many seconds this step took to run
- `context_rich_prompt`: The prompt passed to the LLM containing the prior context. Empty string if it's the first question in the chat.
- `python_script`: The full, self-contained, runnable Python script that duplicates the entire pipeline.
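For example, you can pull out the individual pieces of a result like this (a minimal sketch; it assumes the entries of `plots["invoked_result"]` are matplotlib `Figure` objects):

```python
# the final DataFrame answering the question
df = results["dataset"]

# the LLM's explanation and commentary
print(results["explanation"]["explanation"])
print(results["commentary"]["commentary"])

# save the generated plots (assuming matplotlib Figure objects)
for i, fig in enumerate(results["plots"]["invoked_result"]):
    fig.savefig(f"plot_{i}.png")

# write out the full runnable script that reproduces the pipeline
with open("pipeline.py", "w") as f:
    f.write(results["python_script"])
```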
Note that for each step in the chain, you can provide additional context information specifically for that step by passing a string to any of the following arguments in the `chat()` function:
- `addt_context_gen_tool_call`
- `addt_context_gen_pandas_df`
- `addt_context_explain_pandas_df`
- `addt_context_gen_final_commentary`
- `addt_context_gen_plot_call`
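For instance, extra guidance can be passed to individual steps like this (a sketch; the context strings here are hypothetical example values):

```python
results = llm.chat(
    prompt=prompt,
    tools=tools,
    plot_tools=plot_tools,
    validate=True,
    use_free_plot=False,
    prior_query_id=None,
    # hypothetical example context strings
    addt_context_gen_tool_call="Always request data in current US dollars.",
    addt_context_gen_final_commentary="Keep the commentary under 100 words.",
)
```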
The chain itself consists of the following steps:
- The LLM determines which raw data functions it wants to call, and with which arguments, via the `llm.gen_tool_call()` function, then calls them and generates the raw datasets.
- Given the raw data available from the previous step, the `llm.gen_pandas_df()` function produces Python code to create a final result dataset.
- The LLM explains the data transformation steps via the `llm.explain_pandas_df()` function.
- The LLM is given the final full result dataset and writes commentary answering the user's question via the `llm.gen_final_commentary()` function.
- If `validate=True` in the `llm.gen_final_commentary()` call, the LLM performs a validation step on its commentary to look for and correct errors.
- The LLM produces a visualization to help answer the user's question, via either the `llm.gen_free_plot()` function (if `use_free_plot=True`) or the `llm.gen_plot_call()` function. The former allows the LLM to create any Matplotlib plot; the latter restricts it to calling one of the predefined visualization tools, which is useful if you want to customize style, etc.
If a `prior_query_id` is passed, at the very beginning of the pipeline the user's prompt will be augmented with the full context history of previous messages, including tool calls, data manipulation steps, commentary provided, and visualizations created.
The library contains the `get_world_bank_gdp_data` function as an example. To make additional data available to the LLM, you can define your own tools. For example, say we wanted to add a simple addition tool:
```python
from langchain_core.tools import tool

@tool
def add(first_int: int, second_int: int) -> int:
    "Add two integers."
    return first_int + second_int

# the LLM will now be able to choose either the addition tool or the World Bank GDP tool
tools = [add, get_world_bank_gdp_data]
```

As long as the inputs and outputs of the function are well defined, the LLM should be able to use it if helpful to answer a user's question.
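The new tool is then used in a normal `chat()` call (a sketch with a hypothetical prompt):

```python
# hypothetical prompt that should route to the new addition tool
results = llm.chat(
    prompt="What is 1234 plus 5678?",
    tools=tools,
    plot_tools=plot_tools,
    validate=True,
    use_free_plot=False,
    prior_query_id=None,
)
print(results["commentary"]["commentary"])
```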