'Large Language Data and Statistics': a library for generating LLM-driven insights into data.
Install the package from PyPI, along with the libraries listed in requirements.txt.
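For example (assuming the package is published on PyPI under the name `llads`):

```
pip install llads
```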
You can use any LLM served behind an OpenAI-compatible API, including a local LlamaCPP server. Note that the LLM needs to be powerful enough to properly parse and produce the expected outputs for the various steps of the chain. The following information is necessary for creating the LLM:
- api key (if using a cloud LLM provider)
- base url
- model name
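For example, creating the LLM against a local LlamaCPP server might look like this (a sketch; the base URL, port, and model name all depend on your local setup, and `system_prompts` is loaded as shown in the next section):

```python
from llads.customLLM import customLLM

# a sketch, assuming a llama.cpp server exposing its OpenAI-compatible
# API at the default local address; adjust to your own setup
llm = customLLM(
    api_key="not-needed",  # local servers typically don't check the key
    base_url="http://localhost:8080/v1",
    model_name="local-model",  # hypothetical model name
    temperature=0.0,
    max_tokens=2048,
    system_prompts=system_prompts,  # loaded as shown below
)
```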
```python
import pandas as pd

from llads.customLLM import customLLM

# optionally import the modules where your tools live to get proper imports
# in the generated full runnable Python script
import llads.tools
import llads.visualizations

# example tools included with the library; you can define and pass your own
from llads.tools import get_world_bank_gdp_data  # data tool for World Bank GDP data
from llads.visualizations import gen_plot  # line/bar plot visualization tool

# a good default is included in the repo, but you can edit it to your own needs
system_prompts = pd.read_csv("https://raw.githubusercontent.com/dhopp1/llads/refs/heads/main/system_prompts.csv")

# creating the LLM (gemini 2.0 flash as an example)
llm = customLLM(
    api_key="API_KEY",
    base_url="https://generativelanguage.googleapis.com/v1beta/openai",
    model_name="gemini-2.0-flash",
    temperature=0.0,
    max_tokens=2048,
    system_prompts=system_prompts,
)
# defining which tools the LLM has available to it
tools = [get_world_bank_gdp_data]
plot_tools = [gen_plot]
# generating a response
prompt = "What is the GDP of Italy and the UK as a % of Germany over the last 5 years?" # the user's initial question
results = llm.chat(
    prompt=prompt,
    tools=tools,
    plot_tools=plot_tools,
    validate=True,  # if True, the LLM will perform an additional validation step on its commentary
    use_free_plot=False,  # if False, the LLM must use one of the plot_tools; if True, it is free to make its own matplotlib plot
    prior_query_id=None,  # None, because this is the first query in the chat history
    modules=[llads.tools, llads.visualizations],  # optionally pass the modules where your tool and plot functions live to get proper imports in the full runnable Python script
)
# follow-up question
new_query = "Add France to the analysis"
followup_result = llm.chat(
    prompt=new_query,
    tools=tools,
    plot_tools=plot_tools,
    validate=True,
    use_free_plot=False,
    prior_query_id=results["tool_result"]["query_id"],  # pass our prior query id to make message history available
)
# you can access prior query results via the query id
llm._query_results[query_id]  # e.g., query_id = results["tool_result"]["query_id"]
```

The `chat()` function will produce a dictionary with the following values:
- `initial_prompt`: The question passed by the user
- `tool_result`: A dictionary with the following information:
  - `query_id`: The unique ID number of this query
  - `tool_call`: The name and arguments of the tools the LLM called
  - `invoked_result`: The actual DataFrame resulting from the tool calls
  - `n_tokens_input`: The number of tokens consumed by the LLM for input in this step
  - `n_tokens_output`: The number of tokens consumed by the LLM for output in this step
  - `seconds_taken`: How many seconds this step took to run
- `pd_code`: A dictionary with the following information:
  - `data_desc`: A text description of the data made available to the LLM via the tool call
  - `pd_code`: The Python code the LLM executed to edit the raw data available to it
  - `n_tokens_input`: The number of tokens consumed by the LLM for input in this step
  - `n_tokens_output`: The number of tokens consumed by the LLM for output in this step
  - `seconds_taken`: How many seconds this step took to run
- `dataset`: The actual DataFrame that is the result of the `pd_code` call
- `explanation`: A dictionary with the following information:
  - `explanation`: The LLM's explanation of the data manipulation process undergone to answer the user's question
  - `n_tokens_input`: The number of tokens consumed by the LLM for input in this step
  - `n_tokens_output`: The number of tokens consumed by the LLM for output in this step
  - `seconds_taken`: How many seconds this step took to run
- `commentary`: A dictionary with the following information:
  - `commentary`: The LLM's commentary on the final dataset answering the user's question
  - `n_tokens_input`: The number of tokens consumed by the LLM for input in this step
  - `n_tokens_output`: The number of tokens consumed by the LLM for output in this step
  - `seconds_taken`: How many seconds this step took to run
- `plots`: A dictionary with the following values:
  - `visualization_call`: A list of either the matplotlib code written or the plotting function calls run to create the visualizations answering the user's question
  - `invoked_result`: A list of the actual plot figures produced to answer the user's question
  - `n_tokens_input`: The number of tokens consumed by the LLM for input in this step
  - `n_tokens_output`: The number of tokens consumed by the LLM for output in this step
  - `seconds_taken`: How many seconds this step took to run
- `context_rich_prompt`: The prompt passed to the LLM containing the prior context. Empty string if it's the first question in the chat.
- `python_script`: The full, self-contained, runnable Python script that duplicates the entire pipeline.
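For example, you can pull out the individual pieces of a result like this (a minimal sketch; it assumes the entries of `plots["invoked_result"]` are matplotlib `Figure` objects):

```python
# the final DataFrame answering the question
df = results["dataset"]

# the LLM's explanation and commentary
print(results["explanation"]["explanation"])
print(results["commentary"]["commentary"])

# save the generated plots (assuming matplotlib Figure objects)
for i, fig in enumerate(results["plots"]["invoked_result"]):
    fig.savefig(f"plot_{i}.png")

# write out the full runnable script that reproduces the pipeline
with open("pipeline.py", "w") as f:
    f.write(results["python_script"])
```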
Note that for each step in the chain, you can provide additional context information specifically for that step by passing a string to any of the following arguments in the `chat()` function:
- `addt_context_gen_tool_call`
- `addt_context_gen_pandas_df`
- `addt_context_explain_pandas_df`
- `addt_context_gen_final_commentary`
- `addt_context_gen_plot_call`
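For instance, extra guidance can be passed to individual steps like this (a sketch; the context strings here are hypothetical example values):

```python
results = llm.chat(
    prompt=prompt,
    tools=tools,
    plot_tools=plot_tools,
    validate=True,
    use_free_plot=False,
    prior_query_id=None,
    # hypothetical example context strings
    addt_context_gen_tool_call="Always request data in current US dollars.",
    addt_context_gen_final_commentary="Keep the commentary under 100 words.",
)
```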
The chain itself consists of the following steps:
- The LLM determines which raw data functions it wants to call, and with which arguments, via the `llm.gen_tool_call()` function, then calls them and generates the raw datasets.
- Given the raw data available from the previous step, the `llm.gen_pandas_df()` function produces Python code to create a final result dataset.
- The LLM explains the data transformation steps via the `llm.explain_pandas_df()` function.
- The LLM is given the final full result dataset and writes commentary answering the user's question via the `llm.gen_final_commentary()` function.
- If `validate=True` in the `llm.gen_final_commentary()` call, the LLM performs a validation step on its commentary to look for and correct errors.
- The LLM produces a visualization to help answer the user's question, via either the `llm.gen_free_plot()` function (if `use_free_plot=True`) or the `llm.gen_plot_call()` function. The former allows the LLM to create any Matplotlib plot; the latter restricts it to calling one of the predefined visualization tools, which is useful if you want to customize style, etc.
If a `prior_query_id` is passed, at the very beginning of the pipeline the user's prompt will be augmented with the full context history of previous messages, including tool calls, data manipulation steps, commentary provided, and visualizations created.
The library contains the `get_world_bank_gdp_data` function as an example. To make additional data available to the LLM, you can define your own tools. For example, say we wanted to add a simple addition tool:
```python
from langchain_core.tools import tool

@tool
def add(first_int: int, second_int: int) -> int:
    "Add two integers."
    return first_int + second_int

# the LLM will now be able to choose either the addition tool or the World Bank GDP tool
tools = [add, get_world_bank_gdp_data]
```

As long as the inputs and outputs of the function are well defined, the LLM should be able to use it if helpful to answer a user's question.
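The new tool is then used in a normal `chat()` call (a sketch with a hypothetical prompt):

```python
# hypothetical prompt that should route to the new addition tool
results = llm.chat(
    prompt="What is 1234 plus 5678?",
    tools=tools,
    plot_tools=plot_tools,
    validate=True,
    use_free_plot=False,
    prior_query_id=None,
)
print(results["commentary"]["commentary"])
```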