diff --git a/.gitmodules b/.gitmodules index f9c7d6c..a39a28c 100644 --- a/.gitmodules +++ b/.gitmodules @@ -1,3 +1,3 @@ [submodule "autolabel"] - path = autolabel + path = docs/autolabel url = https://github.com/refuel-ai/autolabel.git diff --git a/autolabel b/docs/autolabel similarity index 100% rename from autolabel rename to docs/autolabel diff --git a/docs/autolabel/concepts/concepts.md b/docs/autolabel/concepts/concepts.md deleted file mode 100644 index 863d354..0000000 --- a/docs/autolabel/concepts/concepts.md +++ /dev/null @@ -1,67 +0,0 @@ -# Modules - -On this page, we will talk about the different pages that exist in Autolabel. We will first discuss the overview of a module and then go into the different subheadings, expanding and giving some examples for each. - -## Prompts - -Writing prompts is a crucial aspect of training language models for specific tasks. In this tutorial, we will explore the five essential parts of a prompt: the prefix prompt, task prompt, output prompt, seed examples, and current example. Understanding and constructing these components effectively can help guide the model's behavior and generate accurate and contextually appropriate responses. Let's delve into each part in detail. - -### Prefix Prompt -The prefix prompt is the initial line of the prompt, which sets the domain and provides task-independent information to the model. It helps the model understand the specific area or expertise it should embody while generating responses. For example, if the prefix prompt indicates a medical domain, the model will focus on generating responses that align with medical knowledge and terminology. -Example: -[Medical] In this prompt, the model should provide expert advice on diagnosing and treating common ailments. - -### Task Prompt -The task prompt explains the objective or task the model needs to accomplish. It describes the specific instructions or guidelines for completing the task. This section is crucial for clearly conveying the desired output from the model. -Example: -You are a medical expert. Given a patient's symptoms and medical history, provide a diagnosis and recommend appropriate treatment options. - -### Output Prompt -The output prompt informs the model about the expected answer format or structure. It defines the specific format in which the model should provide the answer. This step ensures consistency and enables easier processing of the model's responses. -Example: -Provide the diagnosis and treatment recommendations in JSON format, with the following keys: "diagnosis" and "treatment." The value for each key should be a string representing the diagnosis and treatment, respectively. - -### Seed Examples -Seed examples play a vital role in training the model by providing real-world examples from the task distribution. These examples help the model grasp the nature of the task, understand the expected outputs, and align its behavior accordingly. It is crucial to provide meaningful and diverse seed examples to facilitate accurate responses. -Example: -Seed Examples: - -Patient: Fever, sore throat, and fatigue. Medical History: None. -Diagnosis: "Common cold" -Treatment: "Rest, plenty of fluids, and over-the-counter cold medication." -Patient: Persistent cough, shortness of breath, and wheezing. Medical History: Asthma. -Diagnosis: "Asthma exacerbation" -Treatment: "Inhaled bronchodilators and corticosteroids as prescribed." - -### Current Example -The current example is the specific instance for which you seek the model's response. 
It provides the exact answer or label you want the model to assign to this particular example. -Example: -Current Example: -Patient: Severe headache, visual disturbances, and nausea. Medical History: None. -Desired Diagnosis: "Migraine" -Desired Treatment: "Prescribed pain-relief medication and lifestyle modifications." - -## Configs - -There are 3 modules required by every labeling run - -1. A task -2. An LLM -3. A dataset - -All 3 of these modules can be instantiated with configs. A config can be passed in as a dictionary or as the path to a json file. The config consists of different keys and the following section will list out each key along with the property of the module that it affects. - -### Config - -The Config class is used to parse, validate, and store information about the labeling task being performed. - -::: src.autolabel.configs.config.AutolabelConfig - rendering: - show_root_full_path: no - heading_level: 4 - -## Tasks - -### Classification -### Question Answering -### Entity matching -### Named Entity Recognition diff --git a/docs/autolabel/guide/accuracy/chain-of-thought.md b/docs/autolabel/guide/accuracy/chain-of-thought.md deleted file mode 100644 index 108048e..0000000 --- a/docs/autolabel/guide/accuracy/chain-of-thought.md +++ /dev/null @@ -1,86 +0,0 @@ -
- ![Chain-of-Thought prompting](../../../assets/standardvscotprompt.png){ width="600" } -
Chain of Thought Prompting (Wei et al)
-
- -LLMs find it hard to perform well on complex reasoning tasks. We can unlock the reasoning abilities of LLMs using chain of thought prompting. This involves asking the LLM to produce the reasoning before producing the answer (roughly analogous to "show me your work"). - -Chain of thought makes LLMs more effective at reasoning tasks like mathematical word problems, commonsense reasoning questions and complex medical questions. It also provides a window into the thought process of the LLM, though some research points the link between the generated explanation and the final answer may be weak. - -## Using Chain Of Thought in Autolabel [![open in colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1GYs0_4k8vhGk1LOJISppNN98DRq_Bur1#scrollTo=6xqMfKxa92Sj) - -Enabling chain-of-thought prompting for your task is straightforward with Autolabel. It works best when provided with a few seed examples with explanations. Thus enabling chain of thought requires a few things: - -1. Setting `chain_of_thought` flag in the labeling config. -2. Providing explanations or generating explanations for your seed examples automatically by using an LLM -3. Setting the `explanation_column` in the labeling config. -4. Altering the task guidelines and `example_template` to tell the model to generate an explanation before generating the final answer. - -We will go through using chain of thought on a dataset where it shows improvement, like the SQuAD question answering dataset. - -Let's see a datapoint before there is any explanation added to it. - -{{ read_csv('docs/assets/squad_preview.csv') }} - -Now we can manually write the explanation for this or a couple of seed examples easily. But this will be tiresome for > 10 examples. LLMs come to the rescue yet again! We can just define the config and ask the agent to generate explanations as well! - -```python -config = { - "task_name": "OpenbookQAWikipedia", - "task_type": "question_answering", - "dataset": { - "label_column": "answer", - "explanation_column": "explanation", - "delimiter": "," - }, - "model": { - "provider": "openai", - "name": "gpt-3.5-turbo", - }, - "prompt": { - "task_guidelines": "You are an expert at answering questions based on wikipedia articles. Your job is to answer the following questions using the context provided with the question. Use the context to answer the question - the answer is a continuous span of words from the context.\n", - "output_guidelines": "Your answer will consist of an explanation, followed by the correct answer. The last line of the response should always be is JSON format with one key: {\"label\": \"the correct answer\"}.\n If the question cannot be answered using the context and the context alone without any outside knowledge, the question is unanswerable. If the question is unanswerable, return the answer as {\"label\": \"unanswerable\"}\n", - "few_shot_examples": "seed.csv", - "few_shot_selection": "semantic_similarity", - "few_shot_num": 3, - "example_template": "Context: {context}\nQuestion: {question}\nAnswer: Let's think step by step.\n{explanation}\n{answer}", - "chain_of_thought": True - } -} -``` - -Notice the changes that we have made to the config compared to the config without Chain-of-Thought [here](../tasks/question_answering_task.md): - -- `chain_of_thought` flag - this tells labeling agent to expect an explanation for the answer, in the seed dataset as well as LLM generated responses. 
-
-- `explanation_column` - this is the column where the explanation for the seed examples will reside.
-- `example_template` - Notice that the template contains the explanation column as well. This tells the config where the explanation should be placed when using the seed examples. We use the `Let's think step by step` prompt to initiate the chain of thought in the model.
-- `output_guidelines` - We are explicitly prompting the LLM to first output an explanation, and then the final answer.
-
-Now, to generate explanations for the seed examples (in case they were not written manually), run:
-
-```py
-from autolabel import LabelingAgent
-agent = LabelingAgent(config)
-agent.generate_explanations("path_to_seed_examples.csv")
-```
-
-Once these explanations are generated, the dataset looks like this:
-
-{{ read_csv('docs/assets/squad_with_explanation_preview.csv') }}
-
-Now, to generate labels for this dataset, all we have to do is:
-
-```py
-from autolabel import AutolabelDataset
-ds = AutolabelDataset('data/squad_v2_test.csv', config = config)
-agent.plan(ds)
-agent.run(ds, max_items = 100)
-```
-
-Autolabel currently supports Chain-of-Thought prompting for the following tasks:
-
-1. Classification ([example](https://github.com/refuel-ai/autolabel/blob/main/examples/civil_comments/example_civil_comments.ipynb))
-2. Entity Match
-3. Question Answering ([example](https://github.com/refuel-ai/autolabel/blob/main/examples/squad_v2/example_squad_v2.ipynb))
-
-Support for other tasks coming soon!
diff --git a/docs/autolabel/guide/accuracy/confidence.md b/docs/autolabel/guide/accuracy/confidence.md
deleted file mode 100644
index e531cda..0000000
--- a/docs/autolabel/guide/accuracy/confidence.md
+++ /dev/null
@@ -1,63 +0,0 @@
-# Confidence
-
- ![ChatGPT summarizing a non-existent New York Times article even without access to the Internet](https://upload.wikimedia.org/wikipedia/commons/3/3a/ChatGPT_hallucination.png){ width="600" } -
ChatGPT summarizing a non-existent New York Times article even without access to the Internet
-
-
-One of the biggest criticisms of using LLMs so far has been hallucinations - LLMs can sound very confident in their language even when they are completely incorrect. `autolabel` provides a confidence score for each LLM output that is correlated with the likelihood of that output being correct, i.e. if the confidence score is high, it is more likely that the output is correct, and if the confidence score is low, it is more likely that the LLM has produced an incorrect output.
-
-## Computing Confidence Scores
-
-The `autolabel` library today relies on token level probabilities, also known as logprobs, to compute confidence scores. However, very few models today return token level probabilities alongside predictions. Out of all models supported by `autolabel` today, only the `text-davinci-003` model by `openai` can return logprobs. For all other models, Refuel has set up an in-house API to generate logprobs for a specific prediction given an input, regardless of the language model that was originally used to query for the prediction. For `text-davinci-003`, we use the logprobs returned by `openai`'s API instead of querying our in-house API.
-
-Generating confidence scores is simple - setting the key `compute_confidence` to `True` in the `model` dictionary of the config should initiate confidence score retrieval. Here is an example:
-
-```python
-{
-    "task_name": "PersonLocationOrgMiscNER",
-    "task_type": "named_entity_recognition",
-    "dataset": {
-        "label_column": "CategorizedLabels",
-        "text_column": "example",
-        "delimiter": "%"
-    },
-    "model": {
-        "provider": "anthropic",
-        "name": "claude-v1",
-        "compute_confidence": True
-    },
-    "prompt": {
-        "task_guidelines": "You are an expert at extracting entities from text.",
-        "labels": [
-            "Location",
-            "Organization",
-            "Person",
-            "Miscellaneous"
-        ],
-        "example_template": "Example: {example}\nOutput: {CategorizedLabels}",
-        "few_shot_examples": "../examples/conll2003/seed.csv",
-        "few_shot_selection": "semantic_similarity",
-        "few_shot_num": 5
-    }
-}
-```
-
-In the above example, by setting `compute_confidence` to True, `autolabel` will start calling Refuel's API to generate token level probabilities and compute confidence scores for each prediction. In order for this to run successfully, ensure that the following setup has been completed:
-
-Set the following environment variable:
-```
-export REFUEL_API_KEY=<your-refuel-api-key>
-```
-replacing `<your-refuel-api-key>` with your API key, which you can get from [here](https://refuel-ai.typeform.com/llm-access).
-
-## Interpreting Scores
-
-To see how confidence scores can be used to make a tradeoff between task performance and completion rate, let's take a look at the following example:
-
- ![Confidence Output for Conll](confidence.png){ width="600" } -
Library output when confidence is enabled
-
-
-`autolabel` outputs a table consisting of metrics at various confidence thresholds when `compute_confidence` is set to `True`. Specifically, this is the table we get when we label 100 examples from the CONLL-2003 dataset with semantic similarity enabled. The first row in the table corresponds to the overall performance: we were able to successfully label 98% of examples at an F1 score of 0.885. However, we can use this table to decide on a confidence threshold to accept predictions at and improve our metrics. For example, note that according to the highlighted row, if we accept labels with confidence scores above ~2.207, we can boost our F1 score to 0.95 while reducing the completion rate to 79%.
\ No newline at end of file
diff --git a/docs/autolabel/guide/accuracy/confidence.png b/docs/autolabel/guide/accuracy/confidence.png
deleted file mode 100644
index 3dbfd38..0000000
Binary files a/docs/autolabel/guide/accuracy/confidence.png and /dev/null differ
diff --git a/docs/autolabel/guide/accuracy/few-shot.md b/docs/autolabel/guide/accuracy/few-shot.md
deleted file mode 100644
index 0d677b9..0000000
--- a/docs/autolabel/guide/accuracy/few-shot.md
+++ /dev/null
@@ -1,256 +0,0 @@
-# Few-shot Prompting
-
-It has been shown that the specific seed examples used while constructing the prompt have an impact on the performance of the model. Seed examples are the labeled examples which are shown as demonstrations to the LLM to help it understand the task better. Optimally selecting the seed examples can help boost performance and save on labeling costs by reducing the context size.
-
-We support the following few-shot example selection techniques:
-
-1. **Fixed** - The same set of seed examples are used for every input data point.
-2. **Semantic_similarity** - Embeddings are computed for all the examples in the seed set and a vector similarity search finds the few shot examples which are closest to the input datapoint. Closer datapoints from the seed set can give the model more context on how similar examples have been labeled, helping it improve performance.
-3. **Max_marginal_relevance** - Semantic similarity search is used to retrieve a set of candidate examples. Then, a diversity-driven selection strategy is used amongst these candidates to select a final subset of examples that have the most coverage of the initial pool of candidate examples.
-4. **Label diversity** - This strategy focuses on ensuring that the few-shot examples selected provide coverage across all the valid output labels.
-5. **Label diversity with similarity** - This strategy is a combination of (2) and (4) above - it samples a fixed number of examples per valid label, and within each label it selects the examples that are most similar to the input.
-
-Example:
-
-[![open in colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1qgfy7odvkCNKrB58ozAF4qXzu10rRGKx#scrollTo=x0js54dB0D7J)
-
-Consider the following labeling runs for a classification task on the banking dataset. There are a total of 1998 items to be labeled and we assume a starting labeled seedset of 200 examples.
Here is the config to label this dataset in zero-shot fashion: - -```py -config_zero_shot = { - "task_name": "BankingComplaintsClassification", - "task_type": "classification", - "dataset": { - "label_column": "label", - "delimiter": "," - }, - "model": { - "provider": "openai", - "name": "gpt-3.5-turbo" - }, - "prompt": { - "task_guidelines": "You are an expert at understanding bank customers support complaints and queries.\nYour job is to correctly classify the provided input example into one of the following categories.\nCategories:\n{labels}", - "output_guidelines": "You will answer with just the the correct output label and nothing else.", - "labels": [ - "activate_my_card", - "age_limit", - "apple_pay_or_google_pay", - "atm_support", - "automatic_top_up", - "balance_not_updated_after_bank_transfer", - "balance_not_updated_after_cheque_or_cash_deposit", - "beneficiary_not_allowed", - "cancel_transfer", - "card_about_to_expire", - "card_acceptance", - "card_arrival", - "card_delivery_estimate", - "card_linking", - "card_not_working", - "card_payment_fee_charged", - "card_payment_not_recognised", - "card_payment_wrong_exchange_rate", - "card_swallowed", - "cash_withdrawal_charge", - "cash_withdrawal_not_recognised", - "change_pin", - "compromised_card", - "contactless_not_working", - "country_support", - "declined_card_payment", - "declined_cash_withdrawal", - "declined_transfer", - "direct_debit_payment_not_recognised", - "disposable_card_limits", - "edit_personal_details", - "exchange_charge", - "exchange_rate", - "exchange_via_app", - "extra_charge_on_statement", - "failed_transfer", - "fiat_currency_support", - "get_disposable_virtual_card", - "get_physical_card", - "getting_spare_card", - "getting_virtual_card", - "lost_or_stolen_card", - "lost_or_stolen_phone", - "order_physical_card", - "passcode_forgotten", - "pending_card_payment", - "pending_cash_withdrawal", - "pending_top_up", - "pending_transfer", - "pin_blocked", - "receiving_money", - "Refund_not_showing_up", - "request_refund", - "reverted_card_payment?", - "supported_cards_and_currencies", - "terminate_account", - "top_up_by_bank_transfer_charge", - "top_up_by_card_charge", - "top_up_by_cash_or_cheque", - "top_up_failed", - "top_up_limits", - "top_up_reverted", - "topping_up_by_card", - "transaction_charged_twice", - "transfer_fee_charged", - "transfer_into_account", - "transfer_not_received_by_recipient", - "transfer_timing", - "unable_to_verify_identity", - "verify_my_identity", - "verify_source_of_funds", - "verify_top_up", - "virtual_card_not_working", - "visa_or_mastercard", - "why_verify_identity", - "wrong_amount_of_cash_received", - "wrong_exchange_rate_for_cash_withdrawal" - ], - "example_template": "Input: {example}\nOutput: {label}" - } -} -``` - -```py -from autolabel import LabelingAgent, AutolabelDataset - -agent = LabelingAgent(config=config_zero_shot) -ds = AutolabelDataset('../examples/banking/test.csv', config = config_zero_shot) -labeled_dataset = agent.run(ds) -``` - -This zero-shot task execution results in an accuracy of 70.19%. - -Iterating on this, we compare a fixed few-shot example selection strategy, which randomly chooses k examples from the labeled seedset and appends these same k examples to each prompt for the 1998 items to be labeled. In this case, we use k=10 seed examples per prompt. 
To use this selection strategy, we need to modify the config: - -```py -config_fixed_few_shot = { - "task_name": "BankingComplaintsClassification", - "task_type": "classification", - "dataset": { - "label_column": "label", - "delimiter": "," - }, - "model": { - "provider": "openai", - "name": "gpt-3.5-turbo" - }, - "prompt": { - ... - "few_shot_examples": "../examples/banking/seed.csv", - "few_shot_selection": "fixed", - "few_shot_num": 10, - "example_template": "Input: {example}\nOutput: {label}" - } -} -``` - -```py -agent = LabelingAgent(config=config_fixed_few_shot) -ds = AutolabelDataset('../examples/banking/test.csv', config = config_fixed_few_shot) -labeled_dataset = agent.run(ds) -``` - -This leads to an accuracy of 73.16%, an improvement of ~3% over the zero-shot baseline. - -Finally, we compare a semantic similarity example selection strategy, which computes a text embedding for each of the 200 labeled seedset examples. Then, for each of the 1998 items to be labeled, we compute a text embedding and find the k most similar examples from the labeled seedset and append those k examples to the prompt for the current example. This leads to custom examples used for each item to be labeled, with the idea being that more similar examples and their corresponding labels may assist the LLM in labeling. Here is the config change to use semantic similarity as the example selection strategy: - -```py -config_semantic_similarity = { - "task_name": "BankingComplaintsClassification", - "task_type": "classification", - "dataset": { - "label_column": "label", - "delimiter": "," - }, - "model": { - "provider": "openai", - "name": "gpt-3.5-turbo" - }, - "prompt": { - ... - "few_shot_examples": "../examples/banking/seed.csv", - "few_shot_selection": "semantic_similarity", - "few_shot_num": 10, - "example_template": "Input: {example}\nOutput: {label}" - } -} -``` - -```py -agent = LabelingAgent(config=config_semantic_similarity) -ds = AutolabelDataset('../examples/banking/test.csv', config = config_semantic_similarity) -labeled_dataset = agent.run(ds) -``` - -With semantic similarity example selection, we obtain a 79.02% accuracy, a significant increase of ~6% over the fixed-shot strategy. - -Finally, let's take a look at label diversity set of example selection techniques in action: - -```py -config_label_diversity_random = { - "task_name": "ToxicCommentClassification", - "task_type": "classification", - "dataset": { - "label_column": "label", - "delimiter": "," - }, - "model": { - "provider": "openai", - "name": "gpt-3.5-turbo" - }, - "prompt": { - ... - "few_shot_examples": "../examples/civil_comments/seed.csv", - "few_shot_selection": "label_diversity_random", - "few_shot_num": 5, - "example_template": "Input: {example}\nOutput: {label}" - } -} -``` - -```py -agent = LabelingAgent(config=config_label_diversity_random) -ds = AutolabelDataset('../examples/civil_comments/test.csv', config = config_label_diversity_random) -labeled_dataset = agent.run(ds, max_items=200) -``` - -```py -config_label_diversity_similarity = { - "task_name": "ToxicCommentClassification", - "task_type": "classification", - "dataset": { - "label_column": "label", - "delimiter": "," - }, - "model": { - "provider": "openai", - "name": "gpt-3.5-turbo" - }, - "prompt": { - ... 
- "few_shot_examples": "../examples/civil_comments/seed.csv", - "few_shot_selection": "label_diversity_similarity", - "few_shot_num": 5, - "example_template": "Input: {example}\nOutput: {label}" - } -} -``` - -```py -agent = LabelingAgent(config=config_label_diversity_similarity) -ds = AutolabelDataset('../examples/civil_comments/test.csv', config = config_label_diversity_similarity) -labeled_dataset = agent.run(ds, max_items=200) -``` - -For this run on the civil comments dataset, label diversity at random achieved 80% accuracy and label diversity with semantic similarity achieved 78% accuracy. For the same subset of data, the use of regular semantic similarity example selection obtained 72% accuracy, making for a significant improvement by using label diversity. - -Label diversity example selection strategies are likely best suited for labeling tasks with a small number of unique labels, which is the case for the civil comments dataset with only 2 labels. This is because equal representation of all the possible labels may be less likely to bias the LLM towards a particular label. - -By default, Autolabel uses OpenAI to compute text embeddings for few shot example selection strategies that require them (semantic similarity, max marginal relevance). However, Autolabel also supports alternative embedding model providers such as Google Vertex AI and Huggingface as outlined [here](../llms/embeddings.md). - -It is almost always advisable to use an example selection strategy over a zero-shot approach in your autolabeling workflows, but the choice of which example selection strategy to use is dependent upon the specific labeling task and dataset. diff --git a/docs/autolabel/guide/accuracy/prompting-better.md b/docs/autolabel/guide/accuracy/prompting-better.md deleted file mode 100644 index a86a754..0000000 --- a/docs/autolabel/guide/accuracy/prompting-better.md +++ /dev/null @@ -1,82 +0,0 @@ -Like most LLM tasks, a critical part of improving LLM performance in autolabeling tasks is selecting a good prompt. Often, this entails finding a good balance between a descriptive set of instructions, while still remaining concise and clear. - -Consider the following example of refining a prompt used for a classification task on the civil-comments dataset. Each labeling run below included 500 examples and used the same LLM: gpt-3.5-turbo and used a fixed-shot example selection strategy with 4 seed examples. - -[![open in colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1IVHl2h5mxiFs1b5AwTtUKVqs0MXKXX8g#scrollTo=IYj0ijdKylNu) - -First attempt: -```json -config = { - "task_name": "ToxicCommentClassification", - "task_type": "classification", - "dataset": { - "label_column": "label", - "delimiter": "," - }, - "model": { - "provider": "openai", - "name": "gpt-3.5-turbo", - "compute_confidence": True - }, - "prompt": { - "task_guidelines": "You are an expert at identifying toxic comments and understanding if a comment is sexually explicit, obscene, toxic, insults a person, demographic or race. 
\nYour job is to correctly label the provided input example into one of the following categories:\n{labels}",
-        "labels": [
-            "toxic",
-            "not toxic"
-        ],
-        "few_shot_examples": "../examples/civil_comments/seed.csv",
-        "few_shot_selection": "fixed",
-        "few_shot_num": 4,
-        "example_template": "Input: {example}\nOutput: {label}"
-    }
-}
-```
-
-```py
-from autolabel import LabelingAgent, AutolabelDataset
-
-agent = LabelingAgent(config=config)
-dataset = AutolabelDataset('../examples/civil_comments/test.csv', config=config)
-labeled_dataset = agent.run(dataset, max_items = 100)
-```
-
-Accuracy: 68%
-
-This first basic prompt seems clear and concise, but only attains a baseline accuracy of 68%. We can analyze some of the errors the LLM is making to get a better idea of how to improve our prompt.
-
-```py
-df = labeled_dataset.df
-df[df['label'] != df['ToxicCommentClassification_llm_label']]
-```
-
-In doing so, we notice that a vast majority of the errors (97.2%) are misclassifications of civil comments as toxic by the LLM. For instance, one such example comment is:
-
-```json
-'This is malfeasance by the Administrator and the Board. They are wasting our money!'
-```
-
-The presence of generally negative words such as "malfeasance" and "wasting" may be misleading the LLM. Our prompt may need to include details that guide the LLM to correctly identify cases where the vocabulary used could be mistaken as toxic, but the surrounding context suggests that the comment is actually civil.
-
-Adding nuance to the prompt:
-
-We can replace the prompt in the above config with the following updated guidelines and re-run the labeling task:
-
-```json
-"task_guidelines": "You are an expert at identifying toxic comments. You aim to act in a fair and balanced manner, where comments that provide fair criticism of something or someone are labelled 'not toxic'. Similarly, criticisms of policy and politicians are marked 'not toxic', unless the comment includes obscenities, racial slurs or sexually explicit material. Any comments that are sexually explicit, obscene, or insults a person, demographic or race are not allowed and labeled 'toxic'.\nYour job is to correctly label the provided input example into one of the following categories:\n{labels}",
-```
-
-```py
-from autolabel import LabelingAgent, AutolabelDataset
-
-agent = LabelingAgent(config=config)
-dataset = AutolabelDataset('../examples/civil_comments/test.csv', config=config)
-labeled_dataset = agent.run(dataset, max_items = 100)
-```
-
-Accuracy: 74%
-
-In this second iteration, we added more detail to the prompt, such as addressing the nuances between "fair criticisms" and toxic comments. These additional details lead to better performance, reaching 74% accuracy. From a similar analysis of the LLM errors, we see that the previous misclassification example, along with several other similar ones, has now been correctly labeled.
-
-Further improvements:
-
-After experimenting with a few more variations of this prompt, we do not see significant improvements in performance for this task. As a result, after sufficient iteration on the prompt, it is better to look for performance gains through other modifications to the task configuration. For example, comparing different LLMs can often lead to significant improvements. With the same final prompt above, the text-davinci-003 model achieved 88% accuracy, a 14 percentage point increase compared to gpt-3.5-turbo.
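
As a rough sketch of that last comparison (assuming the same `config` dictionary and test file used in the runs above; the exact model swap shown here is illustrative, not part of the original walkthrough), only the `model` block needs to change before re-running the agent:

```python
from autolabel import LabelingAgent, AutolabelDataset

# Sketch: swap the LLM while keeping the prompt and dataset settings from above.
# The provider/name values mirror the text-davinci-003 comparison mentioned above.
config["model"] = {
    "provider": "openai",
    "name": "text-davinci-003",
    "compute_confidence": True,
}

agent = LabelingAgent(config=config)
dataset = AutolabelDataset('../examples/civil_comments/test.csv', config=config)
labeled_dataset = agent.run(dataset, max_items=100)
```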
diff --git a/docs/autolabel/guide/llms/benchmarks.md b/docs/autolabel/guide/llms/benchmarks.md deleted file mode 100644 index 0061084..0000000 --- a/docs/autolabel/guide/llms/benchmarks.md +++ /dev/null @@ -1,10 +0,0 @@ - -## Benchmarking LLMs for data labeling - - -Key takeaways from our [technical report](https://www.refuel.ai/blog-posts/llm-labeling-technical-report): - -* State of the art LLMs can label text datasets at the same or better quality compared to skilled human annotators, **but ~20x faster and ~7x cheaper**. -* For achieving the highest quality labels, GPT-4 is the best choice among out of the box LLMs (88.4% agreement with ground truth, compared to 86% for skilled human annotators). -* For achieving the best tradeoff between label quality and cost, GPT-3.5-turbo, PaLM-2 and open source models like FLAN-T5-XXL are compelling. -* Confidence based thresholding can be a very effective way to mitigate impact of hallucinations and ensure high label quality. diff --git a/docs/autolabel/guide/llms/embeddings.md b/docs/autolabel/guide/llms/embeddings.md deleted file mode 100644 index 7f8ba03..0000000 --- a/docs/autolabel/guide/llms/embeddings.md +++ /dev/null @@ -1,120 +0,0 @@ -# Embedding Models - -Autolabel also supports various models to compute text embeddings that are used in some few shot example selection strategies such as [semantic similarity and max marginal relevance](../accuracy/few-shot.md). Like the LLMs that Autolabel supports, each embedding model belongs to a provider. Currently the library supports embedding models from 3 providers: OpenAI, Google Vertex AI, and Huggingface. By default, if no embedding config is present in the labeling config but a few shot strategy that requires text embeddings is enabled, Autolabel defaults to use OpenAI embeddings and an OpenAI API key will be required. - -Details on how to set up the embedding config for each provider are below. - -## OpenAI -To use models from [OpenAI](https://platform.openai.com/docs/models), you can set `provider` to `openai` under the `embedding` key in the labeling configuration. Then, the specific model that will be queried can be specified using the `model` key. The default embedding model, if none is provided, is `text-embedding-ada-002` - -### Setup -To use OpenAI models with Autolabel, make sure to first install the relevant packages by running: -```bash -pip install 'refuel-autolabel[openai]' -``` -and also setting the following environment variable: -``` -export OPENAI_API_KEY= -``` -replacing `` with your API key, which you can get from [here](https://platform.openai.com/account/api-keys). - -### Example usage -Here is an example of setting config to a dictionary that will use OpenAI's `text-embedding-ada-002` model for computing text embeddings. Specifically, note that in the dictionary provided by the `embedding` tag, `provider` is set to `openai` and `model` is not set so it will default to `text-embedding-ada-002`. 
- -```python -config = { - "task_name": "OpenbookQAWikipedia", - "task_type": "question_answering", - "dataset": { - "label_column": "answer", - "delimiter": "," - }, - "model": { - "provider": "openai", - "name": "gpt-3.5-turbo", - "params": {} - }, - "embedding": { - "provider": "openai" - }, - "prompt": { - "task_guidelines": "You are an expert at answering questions.", - "example_template": "Question: {question}\nAnswer: {answer}" - } -} -``` - -## Hugging Face -To use models from [Hugging Face](https://huggingface.co/), you can set `provider` to `huggingface_pipeline` when creating a labeling configuration. The specific model that will be queried can be specified using the `name` key. - -This will run the model locally on a GPU (if available). You can also specify quantization strategy to load larger models in lower precision (and thus decreasing memory requirements). - -### Setup -To use Hugging Face models with Autolabel, make sure to first install the relevant packages by running: -```bash -pip install 'refuel-autolabel[huggingface]' -``` - -### Example usage -Here is an example of setting config to a dictionary that will use the `sentence-transformers/all-mpnet-base-v2` model for computing text embeddings. Specifically, note that in the dictionary provided by the `embedding` tag, `provider` is set to `huggingface_pipeline` and `model` is set to be `sentence-transformers/all-mpnet-base-v2`. - -```python -config = { - "task_name": "OpenbookQAWikipedia", - "task_type": "question_answering", - "dataset": { - "label_column": "answer", - "delimiter": "," - }, - "model": { - "provider": "huggingface_pipeline", - "name": "google/flan-t5-small", - "params": {} - }, - "embedding": { - "provider": "huggingface_pipeline", - "model": "sentence-transformers/all-mpnet-base-v2" - }, - "prompt": { - "task_guidelines": "You are an expert at answering questions.", - "example_template": "Question: {question}\nAnswer: {answer}" - } -} -``` - -## Google Vertex AI -To use models from [Google](https://developers.generativeai.google/products/palm), you can set the `provider` to `google` when creating a labeling configuration. The specific model that will be queried can be specified using the `model` key. - -### Setup -To use Google models with Autolabel, make sure to first install the relevant packages by running: -```bash -pip install 'refuel-autolabel[google]' -``` -and also setting up [Google authentication](https://cloud.google.com/docs/authentication/application-default-credentials) locally. - -### Example usage -Here is an example of setting config to a dictionary that will use google's `textembedding-gecko` model for computing text embeddings. Specifically, note that in the dictionary provided by the `embedding` tag, `provider` is set to `google` and `model` is set to be `textembedding-gecko`. 
- -```python -config = { - "task_name": "OpenbookQAWikipedia", - "task_type": "question_answering", - "dataset": { - "label_column": "answer", - "delimiter": "," - }, - "model": { - "provider": "google", - "name": "text-bison@001", - "params": {} - }, - "embedding": { - "provider": "google", - "model": "textembedding-gecko" - } - "prompt": { - "task_guidelines": "You are an expert at answering questions.", - "example_template": "Question: {question}\nAnswer: {answer}" - } -} -``` diff --git a/docs/autolabel/guide/llms/llms.md b/docs/autolabel/guide/llms/llms.md deleted file mode 100644 index be2df1b..0000000 --- a/docs/autolabel/guide/llms/llms.md +++ /dev/null @@ -1,466 +0,0 @@ -# Large Language Models (LLMs) - -Autolabel supports multiple LLMs for labeling data. Some LLMs are available by calling an API with the appropriate API keys (OpenAI, Anthropic, etc.) while others can be run locally (such as the ones available on Hugging Face). The LLM used to label can be controlled using the `provider` and `name` keys in the dictionary specified under `model` in the input config. - -Each LLM belongs to an LLM provider -- which refers to the organization or open-source framework through which we are able to access the LLM. A full list of LLM providers and LLMs that are currently supported is provided towards the end of this page. - -Autolabel makes it easy to try out different LLMs for your task and this page will walk you through how to get started with each LLM provider and model. Separately, we've also benchmarked multiple LLMs across different datasets - you can read the full technical report here [link to blog post] or check out the latest benchmark results [here](benchmarks.md). - -## Refuel - -To use models hosted by [Refuel](https://refuel.ai/), you can set `provider` to `refuel` when creating a labeling configuration. The specific model that will be queried can be specified using the `name` key. Autolabel currently supports two models: - -- `refuel-llm` -- `llama-13b-chat` - -You can access RefuelLLM, our recently announced LLM purpose built for data labeling, through Autolabel (Read more about it in this [blog post](http://www.refuel.ai/blog-posts/announcing-refuel-llm)). Refuel LLM is a Llama-v2-13b base model, instruction tuned on over 2500 unique (5.24B tokens) labeling tasks spanning categories such as classification, entity resolution, matching, reading comprehension and information extraction. You can experiment with the model in the playground [here](https://app.refuel.ai/playground). - -Refuel Performance - -You can request access to Refuel LLM [here](https://refuel-ai.typeform.com/llm-access). Read the docs about using RefuelLLM in autolabel [here](https://docs.refuel.ai/guide/llms/llms/#refuel). - -Llama-13b-chat is a 13 billion parameter model available on [Huggingface](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf). However, running such a huge model locally is a challenge, which is why we are currently hosting the model on our servers. - -### Setup - -To use Refuel models with Autolabel, make sure set the following environment variable: - -``` -export REFUEL_API_KEY= -``` - -replacing `` with your API key. - -### Getting a Refuel API key - -If you're interested in trying one of the LLMs hosted by Refuel, sign up for your Refuel API key by filling out the form here. We'll review your application and get back to you soon! - -### Example usage - -Here is an example of setting config to a dictionary that will use Refuel's `refuel-llm` model. 
Specifically, note that in the dictionary proivded by the `model` tag, `provider` is set to `refuel` and `name` is set to be `refuel-llm`. - -```python -config = { - "task_name": "OpenbookQAWikipedia", - "task_type": "question_answering", - "dataset": { - "label_column": "answer", - "delimiter": "," - }, - "model": { - "provider": "refuel", - "name": "refuel-llm", - "params": {} - }, - "prompt": { - "task_guidelines": "You are an expert at answering questions.", - "example_template": "Question: {question}\nAnswer: {answer}" - } -} -``` - -### Additional parameters - -A few parameters that can be passed in for `refuel` models to control the model behavior. For example: - -- `max_new_tokens` (int) - The maximum tokens to sample from the model -- `temperature` (float) - A float b/w 0 and 1 which indicates the diversity you want in the output. 0 uses greedy sampling. - -These parameters can be passed in via the `params` dictionary under `model`. Here is an example: - -```python -"model": { - "provider": "refuel", - "name": "refuel-llm", - "params": { - "max_new_tokens": 512, - "temperature": 0.1, - } -} -``` - -`refuel` hosted LLMs support all the parameters that can be passed as a part of [GenerationConfig](https://huggingface.co/docs/transformers/main_classes/text_generation#transformers.GenerationConfig) while calling generate functions of Hugging Face LLMs. - -## OpenAI - -To use models from [OpenAI](https://platform.openai.com/docs/models), you can set `provider` to `openai` when creating a labeling configuration. The specific model that will be queried can be specified using the `name` key. Autolabel currently supports the following models from OpenAI: - -- `text-davinci-003` -- `gpt-3.5-turbo`, `gpt-3.5-turbo-0301` and `gpt-3.5-turbo-0613` (4,096 max tokens) -- `gpt-3.5-turbo-16k` and `gpt-3.5-turbo-16k-0613` (16,384 max tokens) -- `gpt-4`, `gpt-4-0314` and `gpt-4-0613` (8,192 max tokens) -- `gpt-4-32k`, `gpt-4-32k-0314` and `gpt-4-32k-0613` (32,768 max tokens) - -`gpt-4` set of models are the most capable (and most expensive) from OpenAI, while `gpt-3.5-turbo` set of models are cheap (but still quite capable). Detailed pricing for these models is available [here](https://openai.com/pricing). - -### Setup - -To use OpenAI models with Autolabel, make sure to first install the relevant packages by running: - -```bash -pip install 'refuel-autolabel[openai]' -``` - -and also setting the following environment variable: - -``` -export OPENAI_API_KEY= -``` - -replacing `` with your API key, which you can get from [here](https://platform.openai.com/account/api-keys). - -### Example usage - -Here is an example of setting config to a dictionary that will use OpenAI's `gpt-3.5-turbo` model for labeling. Specifically, note that in the dictionary proivded by the `model` tag, `provider` is set to `openai` and `name` is set to be `gpt-3.5-turbo`. `name` can be switched to use any of the three models mentioned above. 
- -```python -config = { - "task_name": "OpenbookQAWikipedia", - "task_type": "question_answering", - "dataset": { - "label_column": "answer", - "delimiter": "," - }, - "model": { - "provider": "openai", - "name": "gpt-3.5-turbo", - "params": {} - }, - "prompt": { - "task_guidelines": "You are an expert at answering questions.", - "example_template": "Question: {question}\nAnswer: {answer}" - } -} -``` - -### Additional parameters - -A few parameters can be passed in alongside `openai` models to tweak their behavior: - -- `max_tokens` (int): The maximum tokens to sample from the model -- `temperature` (float): A float between 0 and 2 which indicates the diversity you want in the output. 0 uses greedy sampling (picks the most likely outcome). - -These parameters can be passed in via the `params` dictionary under `model`. Here is an example: - -```python -"model": { - "provider": "openai", - "name": "gpt-3.5-turbo", - "params": { - "max_tokens": 512, - "temperature": 0.1 - } -} -``` - -## Anthropic - -To use models from [Anthropic](https://www.anthropic.com/index/introducing-claude), you can set the `provider` to `anthropic` when creating a labeling configuration. The specific model that will be queried can be specified using the `name` key. Autolabel currently supports the following models from Anthropic: - -- `claude-instant-v1` -- `claude-v1` - -`claude-v1` is a state-of-the-art high-performance model, while `claude-instant-v1` is a lighter, less expensive, and much faster option. `claude-instant-v1` is ~6.7 times cheaper than `claude-v1`, at $1.63/1 million tokens. On the other hand `claude-v1` costs $11.02/1 million tokens. - -### Setup - -To use Anthropic models with Autolabel, make sure to first install the relevant packages by running: - -```bash -pip install 'refuel-autolabel[anthropic]' -``` - -and also setting the following environment variable: - -``` -export ANTHROPIC_API_KEY= -``` - -replacing `` with your API key, which you can get from [here](https://console.anthropic.com/docs/access). - -### Example usage - -Here is an example of setting config to a dictionary that will use anthropic's `claude-instant-v1` model for labeling. Specifically, note that in the dictionary proivded by the `model` tag, `provider` is set to `anthropic` and `name` is set to be `claude-instant-v1`. `name` can be switched to use any of the two models mentioned above. - -```python -config = { - "task_name": "OpenbookQAWikipedia", - "task_type": "question_answering", - "dataset": { - "label_column": "answer", - "delimiter": "," - }, - "model": { - "provider": "anthropic", - "name": "claude-instant-v1", - "params": {} - }, - "prompt": { - "task_guidelines": "You are an expert at answering questions.", - "example_template": "Question: {question}\nAnswer: {answer}" - } -} -``` - -### Additional parameters - -A few parameters that can be passed in for `anthropic` models to control the model behavior: - -- `max_tokens_to_sample` (int): The maximum tokens to sample from the model -- `temperature` (float): A float between 0 and 2 which indicates the diversity you want in the output. 0 uses greedy sampling (picks the most likely outcome). - -These parameters can be passed in via the `params` dictionary under `model`. 
Here is an example: - -```python -"model": { - "provider": "anthropic", - "name": "claude-instant-v1", - "params": { - "max_tokens_to_sample": 512, - "temperature": 0.1 - } -} -``` - -## Hugging Face - -To use models from [Hugging Face](https://huggingface.co/), you can set `provider` to `huggingface_pipeline` when creating a labeling configuration. The specific model that will be queried can be specified using the `name` key. Autolabel currently supports all Sequence2Sequence and Causal Language Models on Hugging Face. All models available on Hugging Face can be found [here](https://huggingface.co/docs/transformers/model_doc/openai-gpt#:~:text=TEXT-,MODELS,-ALBERT). Ensure that the model you choose can be loaded using `AutoModelForSeq2SeqLM` or `AutoModelForCausalLM`. Here are a few examples: - -Sequence2Sequence Language Models: - -- `google/flan-t5-small` (all flan-t5-\* models) -- `google/pegasus-x-base` -- `microsoft/prophetnet-large-uncased` - -Causal Language Models: - -- `gpt2` -- `openlm-research/open_llama_3b` -- `meta-llama/Llama-2-7b` - -This will run the model locally on a GPU (if available). You can also specify quantization strategy to load larger models in lower precision (and thus decreasing memory requirements). - -### Setup - -To use Hugging Face models with Autolabel, make sure to first install the relevant packages by running: - -```bash -pip install 'refuel-autolabel[huggingface]' -``` - -### Example usage - -Here is an example of setting config to a dictionary that will use `google/flan-t5-small` model for labeling via Hugging Face. Specifically, note that in the dictionary proivded by the `model` tag, `provider` is set to `huggingface_pipeline` and `name` is set to be `google/flan-t5-small`. `name` can be switched to use any model that satisfies the constraints above. - -```python -config = { - "task_name": "OpenbookQAWikipedia", - "task_type": "question_answering", - "dataset": { - "label_column": "answer", - "delimiter": "," - }, - "model": { - "provider": "huggingface_pipeline", - "name": "google/flan-t5-small", - "params": {} - }, - "prompt": { - "task_guidelines": "You are an expert at answering questions.", - "example_template": "Question: {question}\nAnswer: {answer}" - } -} -``` - -### Additional parameters - -A few parameters that can be passed in for `huggingface_pipeline` models to control the model behavior: - -- `max_new_tokens` (int) - The maximum tokens to sample from the model -- `temperature` (float) - A float b/w 0 and 1 which indicates the diversity you want in the output. 0 uses greedy sampling. -- `quantize` (int) - The model quantization to use. 32 bit by default, but we also support 16 bit and 8 bit support for models which have been hosted on Hugging Face. - -These parameters can be passed in via the `params` dictionary under `model`. Here is an example: - -```python -"model": { - "provider": "huggingface_pipeline", - "name": "google/flan-t5-small", - "params": { - "max_new_tokens": 512, - "temperature": 0.1, - "quantize": 8 - } -}, -``` - -To use Llama 2, you can use the following model configuration: - -```python -"model": { - "provider": "huggingface_pipeline", - "name": "meta-llama/Llama-2-7b", -} -``` - -## Google PaLM - -To use models from [Google](https://developers.generativeai.google/products/palm), you can set the `provider` to `google` when creating a labeling configuration. The specific model that will be queried can be specified using the `name` key. 
Autolabel currently supports the following models from Google: - -- `text-bison@001` -- `chat-bison@001` - -`text-bison@001` is often more suitable for labeling tasks due to its ability to follow natural language instructions. `chat-bison@001` is fine-tuned for multi-turn conversations. `text-bison@001` costs $0.001/1K characters and `chat-bison@001` costs half that at $0.0005/1K characters. Detailed pricing for these models is available [here](https://cloud.google.com/vertex-ai/pricing#generative_ai_models) - -### Setup - -To use Google models with Autolabel, make sure to first install the relevant packages by running: - -```bash -pip install 'refuel-autolabel[google]' -``` - -and also setting up [Google authentication](https://cloud.google.com/docs/authentication/application-default-credentials) locally. - -### Example usage - -Here is an example of setting config to a dictionary that will use google's `text-bison@001` model for labeling. Specifically, note that in the dictionary provided by the `model` tag, `provider` is set to `google` and `name` is set to be `text-bison@001`. `name` can be switched to use any of the two models mentioned above. - -```python -config = { - "task_name": "OpenbookQAWikipedia", - "task_type": "question_answering", - "dataset": { - "label_column": "answer", - "delimiter": "," - }, - "model": { - "provider": "google", - "name": "text-bison@001", - "params": {} - }, - "prompt": { - "task_guidelines": "You are an expert at answering questions.", - "example_template": "Question: {question}\nAnswer: {answer}" - } -} -``` - -### Additional parameters - -A few parameters can be passed in alongside `google` models to tweak their behavior: - -- `max_output_tokens` (int): Maximum number of tokens that can be generated in the response. -- `temperature` (float): A float between 0 and 1 which indicates the diversity you want in the output. 0 uses greedy sampling (picks the most likely outcome). - -These parameters can be passed in via the `params` dictionary under `model`. Here is an example: - -```python -"model": { - "provider": "google", - "name": "text-bison@001", - "params": { - "max_output_tokens": 512, - "temperature": 0.1 - } -} -``` - -### Model behavior - -`chat-bison@001` always responds in a "chatty" manner (example below), often returning more than just the requested label. This can cause problems on certain labeling tasks. - -### Content moderation - -Both Google LLMs seem to have much stricter content moderation rules than the other supported models. This can cause certain labeling jobs to completely fail as shown in our technical report [add link to technical report]. Consider a different model if your dataset has content that is likely to trigger Google's built-in content moderation. - -## Cohere - -To use models from [Cohere](https://cohere.com/), you can set the `provider` to `cohere` when creating a labeling configuration. The specific model that will be queried can be specified using the `name` key. Autolabel currently supports the following models from Cohere: - -- `command` (4096 max tokens) -- `command-light` (4096 max tokens) -- `base` (2048 max tokens) -- `base-light` (2048 max tokens) - -`command` is an instruction-following conversational model that performs language tasks with high quality, while `command-light` is an almost as capable, but much faster option. `base` is a model that performs generative language tasks, while `base-light` much faster but a little less capable. All models cost the same at $15/1 million tokens. 
Detailed pricing for these models is available [here](https://cohere.com/pricing). - -### Setup - -To use Cohere models with Autolabel, make sure to first install the relevant packages by running: - -```bash -pip install 'refuel-autolabel[cohere]' -``` - -and also setting the following environment variable: - -``` -export COHERE_API_KEY= -``` - -replacing `` with your API key, which you can get from [here](https://dashboard.cohere.ai/). - -### Example usage - -Here is an example of setting config to a dictionary that will use cohere's `command` model for labeling. Specifically, note that in the dictionary proivded by the `model` tag, `provider` is set to `cohere` and `name` is set to be `command`. `name` can be switched to use any of the four models mentioned above. - -```python -config = { - "task_name": "OpenbookQAWikipedia", - "task_type": "question_answering", - "dataset": { - "label_column": "answer", - "delimiter": "," - }, - "model": { - "provider": "cohere", - "name": "command", - "params": {} - }, - "prompt": { - "task_guidelines": "You are an expert at answering questions.", - "example_template": "Question: {question}\nAnswer: {answer}" - } -} -``` - -### Additional parameters - -A few parameters that can be passed in for `cohere` models to control the model behavior: - -- `max_tokens` (int): The maximum number of tokens to predict per generation -- `temperature` (float): The degree of randomness in generations from 0.0 to 5.0, lower is less random. - -These parameters can be passed in via the `params` dictionary under `model`. Here is an example: - -```python -"model": { - "provider": "cohere", - "name": "command", - "params": { - "max_tokens": 512, - "temperature": 0.1 - } -} -``` - -## Provider List - -The table lists out all the provider, model combinations that Autolabel supports today: - -| Provider | Name | -| -------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| openai | text-davinci-003 | -| openai | [gpt-3.5-turbo models](https://platform.openai.com/docs/models/gpt-3-5) | -| openai | [gpt-4 models](https://platform.openai.com/docs/models/gpt-4) | -| anthropic | claude-v1 | -| anthropic | claude-instant-v1 | -| huggingface_pipeline | [seq2seq models](https://huggingface.co/learn/nlp-course/chapter1/7?fw=pt#sequencetosequence-modelssequencetosequencemodels) and [causalLM models](https://huggingface.co/docs/transformers/tasks/language_modeling) | -| refuel | flan-t5-xxl | -| google | text-bison@001 | -| google | chat-bison@001 | -| cohere | command | -| cohere | command-light | -| cohere | base | -| cohere | base-light | diff --git a/docs/autolabel/guide/overview/getting-started.md b/docs/autolabel/guide/overview/getting-started.md deleted file mode 100644 index b50aa6c..0000000 --- a/docs/autolabel/guide/overview/getting-started.md +++ /dev/null @@ -1,177 +0,0 @@ -# Getting Started with Autolabel - -This page will walk you through your very first labeling task using Refuel Autolabel. Specifically, it'll go over: - -- Installation -- Overview of a dataset to label -- Labeling the dataset using Autolabel - -## Installation - -Autolabel is available on PyPI and can be installed by running: - -```bash -pip install 'refuel-autolabel[openai]' -``` - -Separate from the Autolabel library, you'll also need to install an integration with your favorite LLM provider. 
In the example below, we'll be using OpenAI, so you'll need to install the OpenAI SDK and set your API key as an environment variable: - -```bash -export OPENAI_API_KEY="" -``` - -To use a different LLM provider, follow the documentation [here](../llms/llms.md). - -## Goal: Sentiment Analysis on a Movie Review Dataset - -Let's say we wanted to run sentiment analysis on a dataset of movie reviews. We want to train our own ML model, but first, we need to label some data for training. - -Now, we could label a few hundred examples by hand which would take us a few hours. Instead, let's use Autolabel to get a clean, labeled dataset in a few minutes. - -A dataset[^1] containing 200 unlabeled movie reviews is available [here](https://github.com/refuel-ai/autolabel/blob/main/docs/assets/movie_reviews_preview.csv), and a couple of examples (with labels) are shown below: - -{{ read_csv('docs/assets/movie_reviews_preview.csv') }} - -Our goal is to label the full 200 examples using Autolabel. - -[^1]: Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). [Learning Word Vectors for Sentiment Analysis](https://ai.stanford.edu/~amaas/papers/wvSent_acl2011.pdf). The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011). - -## Labeling with AutoLabel - -Autolabel provides a simple 3-step process for labeling data: - -- Specify the configuration of your labeling task as a JSON -- Preview the labeling task against your dataset -- Label your data! - -### Specify the labeling task via configuration - -First, create a JSON file that specifies: - -- Task: `task_name` is `MovieSentimentReview` and the `task_type` is `classification` -- LLM: Choice of LLM provider and model - here we are using `gpt-3.5-turbo` from OpenAI -- Instructions: These are the labeling guidelines provided to the LLM for labeling - -```python -config = { - "task_name": "MovieSentimentReview", - "task_type": "classification", - "dataset": { - "label_column": "label" - }, - "model": { - "provider": "openai", - "name": "gpt-3.5-turbo" - }, - "prompt": { - "task_guidelines": "You are an expert at analyzing the sentiment of movie reviews. Your job is to classify the provided movie review into one of the following labels: {labels}", - "labels": [ - "positive", - "negative", - "neutral" - ], - "few_shot_examples": [ - { - "example": "I got a fairly uninspired stupid film about how human industry is bad for nature.", - "label": "negative" - }, - { - "example": "I loved this movie. 
I found it very heart warming to see Adam West, Burt Ward, Frank Gorshin, and Julie Newmar together again.", - "label": "positive" - }, - { - "example": "This movie will be played next week at the Chinese theater.", - "label": "neutral" - } - ], - "example_template": "Example: {example}\nLabel: {label}" - } -} -``` - -*To create a custom configuration, you can use the [CLI](https://docs.refuel.ai/guide/resources/CLI) or [write your own](https://docs.refuel.ai/guide/resources/configs/).* - -### Preview the labeling against your dataset - -First import `autolabel`, create a `LabelingAgent` object and then run the `plan` command against the dataset (available [here](https://docs.refuel.ai/guide/resources/refuel_datasets/) and can be downloaded through the `autolabel.get_data` function): - -```python -from autolabel import LabelingAgent, AutolabelDataset, get_data -get_data('movie_reviews') - -agent = LabelingAgent(config) -ds = AutolabelDataset('test.csv', config = config) -agent.plan(ds) -``` - -This produces: - -``` -Computing embeddings... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100/100 0:00:00 0:00:00 -┌──────────────────────────┬─────────┐ -│ Total Estimated Cost │ $0.538 │ -│ Number of Examples │ 200 │ -│ Average cost per example │ 0.00269 │ -└──────────────────────────┴─────────┘ -───────────────────────────────────────────── Prompt Example ───────────────────────────────────────────── -You are an expert at analyzing the sentiment of moview reviews. Your job is to classify the provided movie review as positive or negative. - -You will return the answer with just one element: "the correct label" - -Now I want you to label the following example: -Input: I was very excited about seeing this film, anticipating a visual excursus on the relation of artistic beauty and nature, containing the kinds of wisdom the likes of "Rivers and Tides." However, that's not what I received. Instead, I get a fairly uninspired film about how human industry is bad for nature. Which is clearly a quite unorthodox claim.

The photographer seems conflicted about the aesthetic qualities of his images and the supposed "ethical" duty he has to the workers occasionally peopling the images, along the periphery. And frankly, the images were not generally that impressive. And according to this "artist," scale is the basis for what makes something beautiful.

In all respects, a stupid film. For people who'd like to feel better about their environmental consciousness ... but not for any one who would like to think about the complexities of the issues surrounding it. -Output: -────────────────────────────────────────────────────────────────────────────────────────────────────────── -``` - -This shows you: - -- Number of examples to be labeled in the dataset: `200` -- Estimated cost of running this labeling task: `<$1` -- Exact prompt being sent to the LLM - -Having previewed the labeling, we are ready to start labeling. - -### Label your dataset - -Now, you can use the `run` command to label: - -```python -ds = AutolabelDataset('docs/assets/movie_reviews.csv', config = config) -ds = agent.run(ds) -``` - -This takes just a few minutes to run, and returns the labeled data as an Autolabel Dataset. We can explore this by running: - -```python -ds.df.head() -> - text ... MovieSentimentReview_llm_label -0 I was very excited about seeing this film, ant... ... negative -1 Serum is about a crazy doctor that finds a ser... ... negative -2 This movie was so very badly written. The char... ... negative -3 Hmmmm, want a little romance with your mystery... ... negative -4 I loved this movie. I knew it would be chocked... ... positive - -[5 rows x 4 columns] -``` - -At this point, we have a labeled dataset ready, and we can begin training our ML models. - -### Using Hugging Face datasets - -If you want to use a Hugging Face dataset directly, you can pass it into `agent.plan` and `agent.run` as you would a file path or `pandas.DataFrame`. - -```python -dataset = load_dataset(DATASET_NAME) -agent = LabelingAgent(config) - -agent.plan(test_dataset) -agent.run(test_dataset) -``` - -## Summary - -In this simple walkthrough, we have installed `autolabel`, gone over an example dataset to label (sentiment analysis for moview reviews) and used `autolabel` to label this dataset in just a few minutes. - -We hope that this gives you a glimpse of what you can do with Refuel. There are many other [labeling tasks](../tasks/classification_task.md) available within Autolabel, and if you have any questions, join our community here or [open an issue](https://github.com/refuel-ai/autolabel/issues/new/choose) on [Github](https://github.com/refuel-ai/autolabel). diff --git a/docs/autolabel/guide/overview/tutorial-classification.md b/docs/autolabel/guide/overview/tutorial-classification.md deleted file mode 100644 index 8a1a0dd..0000000 --- a/docs/autolabel/guide/overview/tutorial-classification.md +++ /dev/null @@ -1,387 +0,0 @@ -This is a detailed tutorial that walks you through many features of the Autolabel library while solving a problem faced by many companies - labeling toxic comments for content moderation. We will be using OpenAI's `gpt-3.5-turbo` for the data labeling, and Refuel's LLM for confidence estimation. - -If you want to run this code as you follow along, check out this Colab notebook: [![open in colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1t-9vNLkyoyySAG_0w3eR98biBOXlMO-E?usp=sharing) - - -## Autolabel installation - -Since we'll be using OpenAI along with Autolabel, we can install all necessary libraries by simply running: -```bash -pip install 'refuel-autolabel[openai]' -``` - -Now, we can set our OpenAI key as an environment variable to get started. You can always use an LLM of your choice - see more optioons and installation instructions [here](../llms/llms.md). 
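If you are following along in a notebook (such as the linked Colab), you can also set the key from Python instead of your shell. This is a minimal sketch; the placeholder value is not a real key and should be replaced with your own:

```python
import os

# Set the OpenAI API key for this process only.
# Replace the placeholder with your actual key before running any labeling.
os.environ["OPENAI_API_KEY"] = "<your-openai-api-key>"
```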
- -## Download and review dataset - -We'll be using a dataset called [Civil Comments](https://huggingface.co/datasets/civil_comments), which is [available through Autolabel](../resources/refuel_datasets.md). You can download it locally, by simply running: -```python -from autolabel import get_data - -get_data('civil_comments') -``` - -The output is: -``` -Downloading seed example dataset to "seed.csv"... -100% [..............................................................................] 65757 / 65757 -Downloading test dataset to "test.csv"... -100% [............................................................................] 610663 / 610663 -``` - -This results in two files being downloaded locally: - -* `seed.csv`: small dataset with labels that we'll rely on as helpful examples. -* `test.csv`: larger dataset that we are trying to label. - -A few examples are shown below: - -| label | examples | -| ---------- | ------------------------------------------------------------------------------------- | -| `toxic` | "The ignorance and bigotry comes from your post!" | -| `not toxic`| "This is malfeasance by the Administrator and the Board. They are wasting our money!" | - -## Start the labeling process -Labeling with Autolabel is a 3-step process: - -* First, we specify a labeling configuration (see `config` object below) and create a `LabelingAgent` -* Next, we do a dry-run on our dataset using the LLM specified in `config` by running `agent.plan` -* Finally, we run the labeling with `agent.run` - -### Experiment #1: Try simple labeling guidelines - -Define the configuration file below: -```python -config = { - "task_name": "ToxicCommentClassification", - "task_type": "classification", # classification task - "dataset": { - "label_column": "label", - }, - "model": { - "provider": "openai", - "name": "gpt-3.5-turbo" # the model we want to use - }, - "prompt": { - # very simple instructions for the LLM - "task_guidelines": "Does the provided comment contain 'toxic' language? Say toxic or not toxic.", - "labels": [ # list of labels to choose from - "toxic", - "not toxic" - ], - "example_template": "Input: {example}\nOutput: {label}" - } -} -``` -*To create a custom configuration, you can use the [CLI](https://docs.refuel.ai/guide/resources/CLI) or [write your own](https://docs.refuel.ai/guide/resources/configs).* - -Now, we do the dry-run with `agent.plan`: -```python -from autolabel import LabelingAgent, AutolabelDataset - -agent = LabelingAgent(config) -ds = AutolabelDataset('test.csv', config = config) -agent.plan(ds) -``` - -Output: -```console -┌──────────────────────────┬─────────┐ -│ Total Estimated Cost │ $4.4442 │ -│ Number of Examples │ 2000 │ -│ Average cost per example │ $0.0022 │ -└──────────────────────────┴─────────┘ -───────────────────────────────────────────────── Prompt Example ────────────────────────────────────────────────── -Does the provided comment contain 'toxic' language? Say toxic or not toxic. - -You will return the answer with just one element: "the correct label" - -Now I want you to label the following example: -Input: [ Integrity means that you pay your debts.]. Does this apply to President Trump too? 
-Output: - -``` - -Finally, we run the data labeling: -```python -ds = agent.run(ds, max_items=100) -``` - -``` -┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓ -┃ support ┃ threshold ┃ accuracy ┃ completion_rate ┃ -┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩ -│ 100 │ -inf │ 0.54 │ 1.0 │ -└─────────┴───────────┴──────────┴─────────────────┘ -``` - -54% accuracy is not very good! Let's see if we can improve this further! - -### Experiment #2: Few-shot prompting to provide helpful examples - -Similar to how human labelers find it helpful to use relevant examples when making a decision, LLM performance for labeling also goes up when choosing helpful examples in the prompt. For this next experiment, we will pick a few helpful examples from `seed.csv`. More information on few-shot prompting can be found [here](../accuracy/few-shot.md). - -We take the previous config, and just update the following fields: -```python -{ - "task_name": "ToxicCommentClassification", - "task_type": "classification", - "dataset": { - "label_column": "label", - "delimiter": "," - }, - "model": { - "provider": "openai", - "name": "gpt-3.5-turbo", - }, - "prompt": { - "task_guidelines": "Does the provided comment contain 'toxic' language? Say toxic or not toxic.", - "labels": [ - "toxic", - "not toxic" - ], - "few_shot_examples": [ - { - "example": "It's ridiculous that these guys are being called 'protesters'. Being armed is a threat of violence, which makes them terrorists.", - "label": "toxic" - }, - { - "example": "This is so cool. It's like, 'would you want your mother to read this??' Really great idea, well done!", - "label": "not toxic" - }, - { - "example": "This bitch is nuts. Who would read a book by a woman", - "label": "toxic" - }, - { - "example": "It was a great show. Not a combo I'd of expected to be good together but it was.", - "label": "not toxic" - } - ], - "few_shot_selection": "fixed", - "few_shot_num": 4, - "example_template": "Input: {example}\nOutput: {label}" - } -} -``` - -That's it! We are now ready to create a `LabelingAgent` and run the same `agent.plan` and `agent.run` commands. - -```console -┌──────────────────────────┬─────────┐ -│ Total Estimated Cost │ $4.9442 │ -│ Number of Examples │ 2000 │ -│ Average cost per example │ $0.0025 │ -└──────────────────────────┴─────────┘ -───────────────────────────────────────────────── Prompt Example ────────────────────────────────────────────────── -Does the provided comment contain 'toxic' language? Say toxic or not toxic. - -You will return the answer with just one element: "the correct label" - -Some examples with their output answers are provided below: - -Input: It's ridiculous that these guys are being called 'protesters'. Being armed is a threat of violence, which makes them terrorists. -Output: toxic - -Input: This is so cool. It's like, 'would you want your mother to read this??' Really great idea, well done! -Output: not toxic - -Input: This bitch is nuts. Who would read a book by a woman -Output: toxic - -Input: It was a great show. Not a combo I'd of expected to be good together but it was. -Output: not toxic - -Now I want you to label the following example: -Input: [ Integrity means that you pay your debts.] Does this apply to President Trump too? -Output: -``` - -With additional examples, the cost has gone up slightly. 
Now, we run the labeling with: - -```python -labels, df, metrics = agent.run(ds, max_items=100) -``` - -```console -┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓ -┃ support ┃ threshold ┃ accuracy ┃ completion_rate ┃ -┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩ -│ 100 │ -inf │ 0.68 │ 1.0 │ -└─────────┴───────────┴──────────┴─────────────────┘ -``` - -Nice! We improved performance from 54% to 68% by providing a few examples to the LLM. - -### Experiment #3: Improving task guidelines after reviewing errors (prompt engineering) - -Typically, you can improve the accuracy by reviewing mistakes and updating the task guidelines (see another example [here](../accuracy/prompting-better.md)). You can review some of the mistakes from the previous run by looking at the output Pandas DataFrame produced called `df`: -```python -df[df['label'] != df['ToxicCommentClassification_llm_label']].head(10) -``` - -Let's say we update our task guidelines to be more explicit about how should the LLM make the decision about whether a comment is toxic or not: - -```python -{ - "task_name": "ToxicCommentClassification", - "task_type": "classification", - "dataset": { - "label_column": "label", - "delimiter": "," - }, - "model": { - "provider": "openai", - "name": "gpt-3.5-turbo", - }, - "prompt": { - "task_guidelines": "You are an expert at identifying toxic comments. You aim to act in a fair and balanced manner, where comments that provide fair criticism of something or someone are labelled 'not toxic'. Similarly, criticisms of policy and politicians are marked 'not toxic', unless the comment includes obscenities, racial slurs or sexually explicit material. Any comments that are sexually explicit, obscene, or insults a person, demographic or race are not allowed and labeled 'toxic'. \nYour job is to correctly label the provided input example into one of the following categories:\n{labels}", - "labels": [ - "toxic", - "not toxic" - ], - "few_shot_examples": [ - { - "example": "It's ridiculous that these guys are being called 'protesters'. Being armed is a threat of violence, which makes them terrorists.", - "label": "toxic" - }, - { - "example": "This is so cool. It's like, 'would you want your mother to read this??' Really great idea, well done!", - "label": "not toxic" - }, - { - "example": "This bitch is nuts. Who would read a book by a woman", - "label": "toxic" - }, - { - "example": "It was a great show. Not a combo I'd of expected to be good together but it was.", - "label": "not toxic" - } - ], - "few_shot_selection": "fixed", - "few_shot_num": 4, - "example_template": "Input: {example}\nOutput: {label}" - } -} -``` - -Now, when we run `agent.run`, we get the following results: - -``` -┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓ -┃ support ┃ threshold ┃ accuracy ┃ completion_rate ┃ -┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩ -│ 100 │ -inf │ 0.78 │ 1.0 │ -└─────────┴───────────┴──────────┴─────────────────┘ -``` - -We now hit an accuracy of 78%, which is very promising! If we spend more time improving the guidelines or choosing different examples, we can push accuracy even further. - -### Experiment #4: Experimenting with LLMs - -We've iterated a fair bit on prompts, and few-shot examples. Let's evaluate a few different LLMs provided by the library out of the box. 
For example, we observe that we can boost performance even further by using `text-davinci-003`: - -```python -{ - "task_name": "ToxicCommentClassification", - "task_type": "classification", - "dataset": { - "label_column": "label", - "delimiter": "," - }, - "model": { - "provider": "openai", - "name": "text-davinci-003", - }, - "prompt": { - "task_guidelines": "You are an expert at identifying toxic comments. You aim to act in a fair and balanced manner, where comments that provide fair criticism of something or someone are labelled 'not toxic'. Similarly, criticisms of policy and politicians are marked 'not toxic', unless the comment includes obscenities, racial slurs or sexually explicit material. Any comments that are sexually explicit, obscene, or insults a person, demographic or race are not allowed and labeled 'toxic'. \nYour job is to correctly label the provided input example into one of the following categories:\n{labels}", - "labels": [ - "toxic", - "not toxic" - ], - "few_shot_examples": [ - { - "example": "It's ridiculous that these guys are being called 'protesters'. Being armed is a threat of violence, which makes them terrorists.", - "label": "toxic" - }, - { - "example": "This is so cool. It's like, 'would you want your mother to read this??' Really great idea, well done!", - "label": "not toxic" - }, - { - "example": "This bitch is nuts. Who would read a book by a woman", - "label": "toxic" - }, - { - "example": "It was a great show. Not a combo I'd of expected to be good together but it was.", - "label": "not toxic" - } - ], - "few_shot_selection": "fixed", - "few_shot_num": 4, - "example_template": "Input: {example}\nOutput: {label}" - } -} -``` - -While the per token API price for this model is higher, we're able to boost the accuracy to 88%! - -```console -┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓ -┃ support ┃ threshold ┃ accuracy ┃ completion_rate ┃ -┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩ -│ 100 │ -inf │ 0.88 │ 1.0 │ -└─────────┴───────────┴──────────┴─────────────────┘ -``` - -### Experiment #5: Using confidence scores - -Refuel provides LLMs that can compute confidence scores for every label, if the LLM you've chosen doesn't provide token-level log probabilities. This is helpful, because you can calibrate a confidence threshold for your labeling task, and then route less confident labels to humans, while you still get the benefits of auto-labeling for the confident examples. Let's see how this works. - -First, set your Refuel API key as an environment variable (and if you don't have this key yet, sign up here). 
-```python -os.environ['REFUEL_API_KEY'] = '' -``` - -Now, update your configuration: -```python -config["model"]["compute_confidence"] = True -``` - -Finally, let's run `agent.run` as before - this produces the table below: -``` -Metric: auroc: 0.8858 -Actual Cost: 0.0376 -┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓ -┃ support ┃ threshold ┃ accuracy ┃ completion_rate ┃ -┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩ -│ 100 │ -inf │ 0.78 │ 1.0 │ -│ 1 │ 0.9988 │ 1.0 │ 0.01 │ -│ 12 │ 0.9957 │ 1.0 │ 0.12 │ -│ 13 │ 0.9949 │ 0.9231 │ 0.13 │ -│ 54 │ 0.9128 │ 0.9815 │ 0.54 │ -│ 55 │ 0.9107 │ 0.9636 │ 0.55 │ -│ 63 │ 0.6682 │ 0.9683 │ 0.63 │ -│ 66 │ 0.6674 │ 0.9242 │ 0.66 │ -│ 67 │ 0.6673 │ 0.9254 │ 0.67 │ -│ 69 │ 0.6671 │ 0.8986 │ 0.69 │ -│ 71 │ 0.6667 │ 0.9014 │ 0.71 │ -│ 72 │ 0.6667 │ 0.8889 │ 0.72 │ -│ 78 │ 0.4819 │ 0.8974 │ 0.78 │ -│ 79 │ 0.4774 │ 0.8861 │ 0.79 │ -│ 87 │ 0.4423 │ 0.8966 │ 0.87 │ -│ 100 │ 0.0402 │ 0.78 │ 1.0 │ -└─────────┴───────────┴──────────┴─────────────────┘ -``` - -The rows in this table show labeling performance at different confidence thresholds, and set an autolabeling confidence threshold at the desired accuracy. For instance, from the table above we can set the confidence threshold at 0.6682 which allows us to label at 96% accuracy with a completion rate of 63%. - -If you want to run this code as you follow along, check out this Colab notebook: [![open in colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1t-9vNLkyoyySAG_0w3eR98biBOXlMO-E?usp=sharing) - -## Final thoughts - -Hopefully, this tutorial was helpful in understanding how Autolabel can help you label datasets quickly and at high quality. A Jupyter notebook for this tutorial can be found [here](https://github.com/refuel-ai/autolabel/blob/main/examples/civil_comments/example_civil_comments.ipynb). - -You can find more example notebooks [here](https://github.com/refuel-ai/autolabel/tree/main/examples), including for tasks such as question answering, named entity recognition, etc. - -Drop us a message in our Discord if you want to chat with us, or go to [Github](https://github.com/refuel-ai/autolabel/issues) to report any issues! \ No newline at end of file diff --git a/docs/autolabel/guide/reliability/llm-output-caching.md b/docs/autolabel/guide/reliability/llm-output-caching.md deleted file mode 100644 index 58f2749..0000000 --- a/docs/autolabel/guide/reliability/llm-output-caching.md +++ /dev/null @@ -1,28 +0,0 @@ -# LLM Output Caching - -To help reduce time and cost when iterating the prompt for better labeling accuracy, we cache the calls made to the LLM. - -## Cache Entry - -A cache entry has the following attributes: - -- `Model Name` -- `Prompt` -- `Model Params` - -This means that anytime there are changes to either the language model or the prompt, the model will be called for producing label. Also, changes to the model parameters like the `max_tokens` or `temperature` could affect the label output and therefore modifying such parameters result in new calls to the model instead of using cached calls. - -## Caching Storage - -The cached entries are stored in a SQLite database. We will be adding support for In Memory cache and Redis cache in future. - -## Disable Caching - -The cache is enabled by default and if you wish to disable it, you can set `cache=False` when initializing the LabelingAgent. 
- -```python - -from autolabel import LabelingAgent - -agent = LabelingAgent(config='examples/configs/civil_comments.json', cache=False) -``` diff --git a/docs/autolabel/guide/reliability/state-management.md b/docs/autolabel/guide/reliability/state-management.md deleted file mode 100644 index 31b9d34..0000000 --- a/docs/autolabel/guide/reliability/state-management.md +++ /dev/null @@ -1,27 +0,0 @@ -# State Management - -Labeling a large dataset can take some time and if you're running the task on a Jupyter notebook and your machine decides to sleep during the time, it could be really frustrating. (we've been there! :crying_cat_face:). - -Therefore, we periodically save the progress of the labeling task in a SQLite database, so if the task is interrupted, you can resume it from where you left off. - -## Task Run State - -When a labeling task is triggered, a task run entry gets initialized inside the database. We maintain the dataset index till where the labels have been computed. After every small chunk (size 5) of data gets labeled, the dataset index gets updated and the labels are persisted. - -In case the labeling process get interrupted/terminated and you trigger the task with the same parameters again, the library first checks for a previous instance of the same task. - -If there was an incomplete task present, you would be prompted with details of the previous run and asked to resume the task. -If you choose to resume the previous task, it gets loaded into the memory and resumed from previous state otherwise the previous entry gets deleted. - -## Deep Dive - -You'd likely never have to interact with the database directly but in case you wish to look at the state of the database, you can do that using any CLI or GUI that supports SQL. -The database is saved in the same directory from where you run the LabelingAgent notebook and is named `.autolabel.db`. - -We have the following tables: - -- `datasets`: Stores the dataset file information -- `tasks`: Stores the labeling task attributes -- `task_runs`: Stores the current state of a labeling task run -- `annotations`: Stores the LLM annotation corresponding to the task run -- `generation_cache`: Cache for the LLM calls diff --git a/docs/autolabel/guide/resources/CLI.md b/docs/autolabel/guide/resources/CLI.md deleted file mode 100644 index dd2ec60..0000000 --- a/docs/autolabel/guide/resources/CLI.md +++ /dev/null @@ -1,286 +0,0 @@ -The Autolabel CLI was created to make the [config](https://docs.refuel.ai/guide/resources/configs) file creation process easier. It is a simple command line interface that will ask you a series of questions and then generate a config file for you. To use it, simply run the following command: - -```bash -autolabel config -``` - -### **Walkthrough: Creating a Config for Civil Comments** - -
  1. -The first step is to run the autolabel command with the config argument: - -```bash -autolabel config -``` - -
  2. -The program will prompt you to enter the task name. - -``` -Enter the task name: ToxicCommentClassification -``` - -
  3. -Next, you need to choose the task type from the provided options: - -``` -Choose a task type -> classification - named_entity_recognition - question_answering - entity_matching - multilabel_classification -``` - -
  4. -Now, the program will ask for dataset configuration details. You need to specify the delimiter used in your dataset, the label column name, and an optional explanation column name: - -``` -Dataset Configuration -Enter the delimiter (,): -Enter the label column name: label -Enter the explanation column name (optional): -``` - -Anything surrounded by parentheses at the end of a prompt will be used as the default value if you don't input anything. Make sure to change this if it does not line up with your task. - -
  5. -The program will then ask for model configuration. You will need to specify the model provider from the options. Next, enter the model name, optional model parameters, whether the model should compute confidence, and the strength of the logit bias: - -``` -Model Configuration -Enter the model provider -> openai - anthropic - huggingface_pipeline - refuel - google - cohere -Enter the model name: gpt-3.5-turbo -Enter a model parameter name (or leave blank for none): -Should the model compute confidence? [y/n] (n): -What is the strength of logit bias? (0.0): 100 -``` - -
  6. -Next, you will configure the task prompt. First, enter the task guidelines. In the task guidelines, {num_labels} and {labels} will be replaced by the number of labels and the labels list, respectively. Next, specify the labels. Then, write the example template with placeholders for the column names you want to use in the prompt. You can also add an output guideline and format if needed. Lastly, you can choose whether to use a chain of thought: - -``` -Prompt Configuration -Enter the task guidelines (Your job is to correctly label the provided input example into one of the following {num_labels} categories. -Categories: -{labels} -): -Enter a valid label (or leave blank for none): toxic -Enter a valid label (or leave blank to finish): not toxic -Enter a valid label (or leave blank to finish): -Enter the example template: Example: {example}\nLabel: {label} -Enter the value for example (or leave blank for none): -Enter the output guideline (optional): -Enter the output format (optional): -Should the prompt use a chain of thought? [y/n] (n): -``` - -
  7. -The program will then display the configuration that you have provided as a Python dictionary: - -```python -{ - 'task_name': 'ToxicCommentClassification', - 'task_type': 'classification', - 'dataset': {'delimiter': ',', 'label_column': 'label'}, - 'model': {'provider': 'openai', 'name': 'gpt-3.5-turbo', 'compute_confidence': False, 'logit_bias': 100.0}, - 'prompt': { - 'task_guidelines': 'Your job is to correctly label the provided input example into one of the following {num_labels} categories.\nCategories:\n{labels}\n', - 'labels': ['toxic', 'not toxic'], - 'example_template': 'Example: {example}\nLabel: {label}', - 'chain_of_thought': False - } -} -``` - -
  8. -Finally, the program will write the configuration to a file named "{your_task_name}_config.json". - -``` -Writing config to ToxicCommentClassification_config.json -``` - -
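Once the config file has been written, it can be fed straight back into a labeling run. The snippet below is a minimal sketch, assuming the file was written to the current directory and that `test.csv` is the dataset you want to label; it mirrors the `plan`/`run` flow shown elsewhere in this guide:

```python
import json

from autolabel import LabelingAgent, AutolabelDataset

# Load the config produced by the CLI walkthrough above
with open("ToxicCommentClassification_config.json") as f:
    config = json.load(f)

agent = LabelingAgent(config)
ds = AutolabelDataset("test.csv", config=config)

agent.plan(ds)      # dry run: estimated cost and an example prompt
ds = agent.run(ds)  # label the dataset
```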
-That's it! You have successfully created a config for a task using the CLI program. The generated configuration file can now be used for any labeling runs with autolabel! - -### Providing a seed file - -You can provide a seed file to the CLI to help it generate the config file. Providing a seed file to the CLI allows it to automatically provide drop-down menus for column name inputs, detect labels that are already present in the seed file, and fill the few shot examples by row number in the seed file. To do this, simply run the following command: - -```bash -autolabel config -``` - -For example, if you have a file called `seed.csv` in the current directory, you would run the following command: - -```bash -autolabel config seed.csv -``` - -Here's an example of what the prompt configuration section would look like with a seed file: - -``` -Detected 2 unique labels in seed dataset. Use these labels? [y/n]: y -Enter the example template: Example: {example}\nLabel: {label} -Use seed.csv as few shot example dataset? [y/n]: n -Enter the value for example or row number (or leave blank for none): 3 -{'example': "When all else fails, change the subject to Hillary's emails.", 'label': 'not toxic'} -Enter the value for example or row number (or leave blank to finish): 7 -{ - 'example': 'He may like the internal forum, but the reality is he has affirmed traditional doctrine and practices. While he does like the internal forum he has not changed anything.', - 'label': 'not toxic' -} -Enter the value for example or row number (or leave blank to finish): 24 -{'example': '........... said the blind dumb and deaf lemming.', 'label': 'toxic'} -Enter the value for example or row number (or leave blank to finish): 64 -{ - 'example': 'Do you have a citation for that statement or did you just make it up yourself? BTW, this thread is about the unhealthy liar the Democrats have -nominated.', - 'label': 'not toxic' -} -Enter the value for example or row number (or leave blank to finish): -Enter the few shot selection algorithm -> fixed - semantic_similarity - max_marginal_relevance - label_diversity_random - label_diversity_similarity -Enter the number of few shot examples to use (4): -``` - -As you can see, the CLI automatically detected the labels in the seed file and used them to generate the labels list. It also automatically filled the few shot examples with the examples from the seed file after letting the user choose the rows to use. - -### Specifying Model Parameters - -To specify model parameters, you can simply enter the parameter name and value when prompted. For example, if you wanted to specify the `temperature` parameter for the `gpt-3.5-turbo` model, you would run the following command: - -``` -Enter a model parameter name (or leave blank for none): temperature -Enter the value for max_tokens: 0.5 -``` - -### Providing Few Shot Examples - -To provide few shot examples, you can simply input the example when prompted (after entering the example template). The CLI will go through the example template and ask for any values specified in that. For example, if you template is `Example: {example}\nLabel: {label}`, you could add a few shot example as shown below: - -``` -Enter the example template: Example: {example}\nLabel: {label} -Enter the value for example (or leave blank for none): You're ugly and dumb -Enter the value for label: toxic -Enter the value for example (or leave blank to finish): I love your art! 
-Enter the value for label: not toxic -Enter the value for example (or leave blank to finish): It was a great show. Not a combo I'd of expected to be good together but it was. -Enter the value for label: not toxic -Enter the value for example (or leave blank to finish): It's ridiculous that these guys are being called 'protesters'. Being armed is a threat of violence, which makes them terrorists -Enter the value for label: toxic -Enter the value for example (or leave blank to finish): -Enter the few shot selection algorithm -> fixed - semantic_similarity - max_marginal_relevance - label_diversity_random - label_diversity_similarity -Enter the number of few shot examples to use (4): -``` - -Since we only added 4 examples, we chose the `fixed` few shot selection algorithm and left the number of few shot examples to use at 4 since we want to use all of them in every prompt. - -## The `init` command - -If you would prefer to edit a json file directly, you can use the `init` command to generate a config file for you. To do this, simply run the following command: - -```bash -autolabel init -``` - -By default, this will create a config file that looks like the one below: - -```json -{ - "task_name": "[TODO] Enter task name", - "task_type": "[TODO] Enter task type", - "dataset": { - "delimiter": "[TODO] Enter delimiter", - "label_column": "[TODO] Enter label column name" - }, - "model": { - "provider": "openai", - "name": "gpt-3.5-turbo", - "params": {} - }, - "prompt": { - "task_guidelines": "[TODO] Enter task guidelines", - "example_template": "[TODO] Enter example template", - "few_shot_examples": "[TODO] Enter few shot examples", - "few_shot_selection": "[TODO] Enter few shot selection", - "few_shot_num": "[TODO] Enter few shot num" - } -} -``` - -`init` will also take a seed file as an argument. Combined with other options, this can result in a very quick config file generation process. For example, if you have a file called `seed.csv` in the current directory, you could run the following command: - -```bash -autolabel init seed.csv --task-name ToxicCommentClassification --task-type classification --delimiter , --label-column label --task-guidelines "You are an expert at identifying toxic comments." --example-template "Example: {example}\nLabel: {label}" --few-shot-examples seed.csv --few-shot-selection semantic_similarity --few-shot-num 5 --guess-labels -``` - -Resulting in the following config file for the civil comments dataset: - -```json -{ - "task_name": "ToxicCommentClassification", - "task_type": "classification", - "dataset": { - "delimiter": ",", - "label_column": "label" - }, - "model": { - "provider": "openai", - "name": "gpt-3.5-turbo", - "params": {} - }, - "prompt": { - "task_guidelines": "You are an expert at identifying toxic comments.", - "example_template": "Example: {example}\nLabel: {label}", - "few_shot_examples": "seed.csv", - "few_shot_selection": "semantic_similarity", - "few_shot_num": 5, - "labels": ["not toxic", "toxic"] - } -} -``` - -## The `plan` command - -The `plan` command works identically to running `LabelingAgent({config}).plan({dataset})` in python. To use it, simply run the following command: - -```bash -autolabel plan -``` - -## The `run` command - -The `run` command works identically to running `LabelingAgent({config}).run({dataset})` in python. 
To use it, simply run the following command: - -```bash -autolabel run -``` - -## Help - -If any of the commands are unclear, you can run `autolabel --help` to see the help menu or `autolabel --help` to see the help menu for a specific command. diff --git a/docs/autolabel/guide/resources/autolabel_dataset.md b/docs/autolabel/guide/resources/autolabel_dataset.md deleted file mode 100644 index 49dadb3..0000000 --- a/docs/autolabel/guide/resources/autolabel_dataset.md +++ /dev/null @@ -1,8 +0,0 @@ -## Autolabel Dataset - -Autolabel interacts primarily with dataset objects. These dataset objects are the input and the output for every agent function. `agent.run`, `agent.plan` and `agent.transform` all accept AutolabelDataset as an input and output an Autolabel Dataset. Use this object to talk to autolabel and run evaluations, transformations as well as understand the labels that a model outputs. We provide utility functions to help with understanding where the labeling process can be improved. - -::: src.autolabel.dataset.dataset.AutolabelDataset -rendering: -show_root_heading: yes -show_root_full_path: no diff --git a/docs/autolabel/guide/resources/configs.md b/docs/autolabel/guide/resources/configs.md deleted file mode 100644 index d25cf41..0000000 --- a/docs/autolabel/guide/resources/configs.md +++ /dev/null @@ -1,351 +0,0 @@ -Each labeling run with the autolabel library requires a config to be specified. The config has 5 top-level keys and several nested keys, many of which are optional. - - -##Task Name - - -The task name is just a user-provided name for the labeling task and is only used to construct display names for various labeling artifacts (i.e. column names in the output labeled csv/dataframe) - - -```json title="Example" -"task_name": "CompanyEntityMatch" -``` - - -##Task Type - - -The task type determines how the Autolabel library should construct the request to the LLM as well as how the LLM response should be parsed and which metrics should be computed. Currently, the library supports the following task types: - - -- entity_matching -- classification -- named_entity_recognition -- question_answering - - -```json title="Example" -"task_type": "entity_matching" -``` - - -##Dataset - - -The dataset config contains information about the dataset to be labeled. Specifically, there are 4 dataset config keys: - - -1. label_column (optional): The label column specifies the column containing the labels for each item to use for metric computation if labels are available for the dataset -2. explanation_column (optional): The explanation column specifies the column containing explanations for each item to use for chain-of-thought prompting if it is enabled in the config. -3. delimiter (optional): This key specifies the delimiter used for parsing the dataset CSV. By default, it is assumed to be a comma: "," -4. text_column (required for named entity recognition): The text column is only necessary for named entity recognition tasks and specifies the column containing the text that we intend to label and is used for determining text spans. 
- - -```json title="Example 1: Classification task" -"dataset": { - "label_column": "label", - "delimiter": "," - } -``` - - -```json title="Example 2: Chain of thought" - "dataset": { - "label_column": "answer", - "explanation_column": "explanation", - "delimiter": "," - } -``` - - -```json title="Example 3: Named entity recognition task" - "dataset": { - "label_column": "CategorizedLabels", - "text_column": "example", - "delimiter": "," - } -``` - - - - - - -##Model - - -The model config contains information about the LLM provider and specific model we intend to use for labeling. There are 4 model config keys: - - -1. provider: This key specifies the LLM provider. -2. name: The model name specifies which of the provider's models to use for generating labels. -3. params (optional): Params is a dictionary that allows the user to configure model-specific paramaters. Here is an example model params dict: - - max_tokens: Max tokens specifies the maximum total input and output tokens for each LLM call. - - temperature: The temperature controls how deterministic the LLM responses should be. - - model_kwargs: The model kwargs contains the logprobs key which, when present, configures the LLM request to have the LLM return log probabilities - - -4. compute_confidence (optional): This boolean determines whether to compute and output confidence scores. - - -```json title="Example 1: Compute confidence" -"model": { - "provider": "openai", - "name": "gpt-3.5-turbo", - "compute_confidence": True - } -``` - - -```json title="Example 2: Defining model params" -"model": { - "provider": "openai", - "name": "gpt-3.5-turbo", - "params": { - "max_tokens": 512, - "temperature": 0.1 - } -} -``` - - -##Embedding - - -The embedding config contains information about the text embedding model provider and the specific model we intend to use for computing text embeddings. There are 2 embedding config keys: - - -1. provider: This key specifies the text embedding model provider. -2. model: The model specifies which of the provider's text embedding models to use for generating labels. This key is optional and a default text embedding model is used if no model is specified - - -```json title="Example 1: Huggingface sentence transformers model" -"embedding": { - "provider": "huggingface_pipeline", - "model": "sentence-transformers/all-mpnet-base-v2" - } -``` - - -```json title="Example 2: Google model with no model name" -"embedding": { - "provider": "google" - } -``` - - -##Prompt - - -The prompt config contains information about how the prompt should be constructed in the request to the LLM. There are 9 prompt config keys. - - -1. task_guidelines: The task guidelines should contain a description of the specific labeling task, including any nuanced details about how to correctly label each item. -2. labels (required for some tasks): The labels defines the full list of labels for the model. -3. few_shot_examples (optional): The few shot examples is either a list or path to the CSV of possible seed examples to append to the prompt. -4. few_shot_selection (optional): The few shot selection is the specific strategy to use for selecting examples to use in the prompt. Currently, there are 3 example selection strategies implemented: - - - fixed - - semantic_similarity - - max_marginal_relevance - -5. few_shot_num (optional): The few shot number determines how many seed examples to select and include in the prompt -6. example_template: The example template determines how each example should be formatted in the prompt. 
You can reference columns from the dataset by wrapping the column name with curly braces -7. output_guidelines (optional): The output guidelines specify how the output should be returned by the LLM (i.e. just return the label vs. format as CSV). It is not recommended to add output guidelines for most use cases as default guidelines are already set. -8. output_format (optional): The format of the output is either "csv" or "json", but it is not recommended to override the default selection. -9. chain_of_thought (optional): This boolean determines whether to use chain of thought in the prompt or not. - - -```json title="Example 1: Classification task" -"prompt": { - "task_guidelines": "You are an expert at identifying toxic comments. You aim to act in a fair and balanced manner, where comments that provide fair criticism of something or someone are labelled 'not toxic'. Similarly, criticisms of policy and politicians are marked 'not toxic', unless the comment includes obscenities, racial slurs or sexually explicit material. Any comments that are sexually explicit, obscene, or insults a person, demographic or race are not allowed and labeled 'toxic'. \nYour job is to correctly label the provided input example into one of the following categories:\n{labels}", - "labels": [ - "toxic", - "not toxic" - ], - "example_template": "Input: {example}\nOutput: {label}" - } -``` - - -```json title="Example 2: Use seed examples" - "prompt": { - "task_guidelines": "You are provided with descriptions of companies from their websites, and wikipedia pages. Your job is to categorize whether the descriptions are about the same company (duplicate) or different companies (not duplicate). Your answer must be from one of the following options:\n{labels}", - "labels": [ - "not duplicate", - "duplicate" - ], - "example_template": "Company 1 description: {entity1}\nCompany 2 description: {entity2}\nDuplicate or not: {label}", - "few_shot_examples": [ - { - "entity1": "lac wisconsin branding 95 1 & 96 1 the rock frequency 96.1 mhz translator s 95.1 w236ag fond du lac first air date 1965 as wcwc fm at 95.9 format mainstream rock erp 4 000 watts haat 123 meters 404 ft class a facility id 54510 transmitter coordinates 43 49 10.00 n 88 43 20.00 w 43.8194444 n 88.7222222 w 43.8194444 ; 88.7222222 coordinates 43 49 10.00 n 88 43 20.00 w 43.8194444 n 88.7222222 w 43.8194444 ; 88.7222222 former callsigns wcwc fm 1965 1980 wyur 1980 1994 former frequencies 95.9 mhz 1965 affiliations cbs radio network westwood one premiere radio networks owner radio plus inc. sister stations wfdl wfdl fm wmdc webcast listen live website 961tcx . com studios in fond du lac wtcx 96.1 fm 95 1 & 96 1 the rock is a radio station broadcasting a mainstream rock music format . 1 licensed to ripon wisconsin usa the station is currently owned by radio plus inc. and features programing from cbs radio network dial global and premiere radio networks . 2 wtcx was originally on 95.9 mhz . be", - "entity2": "closings contact next racing rocks local news breaking wiaa releases football playoffs matchups and brackets october 15 2016 local news here are the full brackets for the state of wisconsin division 1 2 seed fond du lac hosts 7 seed milwaukee washington friday october 21 at 7pm division 5 3 seed wla hosts 6 seed ... read more 10 15 16 fdl man injured in hit and run car vs. bike crash october 15 2016 local news a fond du lac man received non life threatening injuries in a car versus bicycle hit and run crash in dodge county . 
the dodge county sheriff s office says shortly after 8pm friday a car ... read more 10 15 16 ripon woman remains in critical condition following one vehicle crash october 15 2016 local news a ripon woman injured in a one vehicle crash after apparently falling asleep at the wheel remains in critical condition . the fond du lac county sheriff s office says 29 year old raquel amador ... read more wiaa releases football groupings october 15 2016 local news 2016 wiaa fo", - "label": "duplicate" - }, - { - "entity1": "stacy spikes hamet watt headquarters new york city united states website http www.moviepass.com moviepass is a subscription based service for going to movie theaters available in the united states . the service gives members across the country the ability to see up to one 2d movie every 24 hours for a fixed monthly fee . members may choose which theaters they wish to attend and there are no blackout dates . moviepass works in nearly all movie theaters that accept the mastercard credit card making it one of the largest subscription based theater networks in america . prices vary by local market and start at 30 per month . moviepass was launched in february 2011 and is headquartered in new york city . 1 contents 1 service 2 purchasing a ticket 3 history 4 media coverage 5 references service edit the moviepass service works via a smartphone app iphone android and a specially designed reloadable debit card which is mailed to new subscribers when they sign up . purchasing a ticket edit in o", - "entity2": "repair buy warranty get service buy warranty home warranty pricing & plans planning on moving home matters blog what s covered service professionals customer reviews benefits faqs appliance discount contract policies decor cost savers lawn & garden lifestyle quick tips real estate repair & maintenance tech close home warranty learn more what s covered service professionals faqs pricing and plans get a quote see plans planning on moving real estate plans buying a home selling a home home matters blog decor cost savers lawn & garden lifestyle quick tips real estate repair & maintenance tech our partner sites real estate professionals contractors 888 429 8247 email us log in back to top get a personalized quote explore plans in your area get covered in 3 easy steps . please correct highlighted fields request service log in create account oven on the fritz appliance breakdowns happen . get covered . get a personalized quote explore plans in your area get covered in 3 easy steps . please co", - "label": "not duplicate" - }, - { - "entity1": "of over 110 gyms worldwide including 86 franchise locations in ma pa ny nj ct wa or ca tx fl ky va puerto rico and australia and is rapidly expanding across the u.s. and around the globe . contents 1 history 2 description 3 references 4 external links history edit crunch was founded in a basement level aerobics studio in new york city s east village in 1989 by doug levine . 1 with the collaboration of fitness instructors the group fitness programming was started at crunch . offerings such as hip hop aerobics co ed action wrestling and cyked yoga cycling were introduced . 2 in clubs members have access to innovative group fitness classes state of the art equipment personal and group training full service locker rooms and much more . select locations offer an exclusive crunch retail line that can also be purchased from the crunch online store . 3 in january 2014 crunch released its online workout extension called crunch live . 
this subscription based online video library has over 95 work", - "entity2": "gallery esp en best rate guarantee check availability call us room only 1 800 990 8250 hotel air 1 800 219 2727 canada 1 855 478 2811 airport transportation travel agents close best rate guaranteebook your all inclusive stay hotel hotel air arrive departure adults 1 2 3 4 5 6 7 8 children 0 1 2 3 4 5 6 7 8 select property pacifica golf & spa resort the towers at pacifica sunset beach golf & spa resort ros resort & spa los cabos montecristo estates mazatl n emerald bay resort & spa emerald estates luxury villas departure country argentina australia austria bahamas belgium brazil canada chile colombia costa rica denmark ecuador finland france germany greece honduras iceland israel italy japan luxembourg mexico netherlands new zealand nicaragua norway panama paraguay peru portugal puerto rico republic of ireland republic of korea south africa spain sweden switzerland turks and caicos islands united kingdom united states uruguay venezuela departure city akron canton ohio reg . albany ny al", - "label": "not duplicate" - } - ], - "few_shot_selection": "fixed", - "few_shot_num": 3 - } -``` - - - - - - -##Full Example Configs -```json title="Example 1: Company Entity Match" -{ - "task_name": "CompanyEntityMatch", - "task_type": "entity_matching", - "dataset": { - "label_column": "label", - "delimiter": "," - }, - "model": { - "provider": "openai", - "name": "gpt-3.5-turbo" - }, - "prompt": { - "task_guidelines": "You are provided with descriptions of companies from their websites, and wikipedia pages. Your job is to categorize whether the descriptions are about the same company (duplicate) or different companies (not duplicate). Your answer must be from one of the following options:\n{labels}", - "labels": [ - "not duplicate", - "duplicate" - ], - "example_template": "Company 1 description: {entity1}\nCompany 2 description: {entity2}\nDuplicate or not: {label}", - "few_shot_examples": [ - { - "entity1": "lac wisconsin branding 95 1 & 96 1 the rock frequency 96.1 mhz translator s 95.1 w236ag fond du lac first air date 1965 as wcwc fm at 95.9 format mainstream rock erp 4 000 watts haat 123 meters 404 ft class a facility id 54510 transmitter coordinates 43 49 10.00 n 88 43 20.00 w 43.8194444 n 88.7222222 w 43.8194444 ; 88.7222222 coordinates 43 49 10.00 n 88 43 20.00 w 43.8194444 n 88.7222222 w 43.8194444 ; 88.7222222 former callsigns wcwc fm 1965 1980 wyur 1980 1994 former frequencies 95.9 mhz 1965 affiliations cbs radio network westwood one premiere radio networks owner radio plus inc. sister stations wfdl wfdl fm wmdc webcast listen live website 961tcx . com studios in fond du lac wtcx 96.1 fm 95 1 & 96 1 the rock is a radio station broadcasting a mainstream rock music format . 1 licensed to ripon wisconsin usa the station is currently owned by radio plus inc. and features programing from cbs radio network dial global and premiere radio networks . 2 wtcx was originally on 95.9 mhz . be", - "entity2": "closings contact next racing rocks local news breaking wiaa releases football playoffs matchups and brackets october 15 2016 local news here are the full brackets for the state of wisconsin division 1 2 seed fond du lac hosts 7 seed milwaukee washington friday october 21 at 7pm division 5 3 seed wla hosts 6 seed ... read more 10 15 16 fdl man injured in hit and run car vs. 
bike crash october 15 2016 local news a fond du lac man received non life threatening injuries in a car versus bicycle hit and run crash in dodge county . the dodge county sheriff s office says shortly after 8pm friday a car ... read more 10 15 16 ripon woman remains in critical condition following one vehicle crash october 15 2016 local news a ripon woman injured in a one vehicle crash after apparently falling asleep at the wheel remains in critical condition . the fond du lac county sheriff s office says 29 year old raquel amador ... read more wiaa releases football groupings october 15 2016 local news 2016 wiaa fo", - "label": "duplicate" - }, - { - "entity1": "stacy spikes hamet watt headquarters new york city united states website http www.moviepass.com moviepass is a subscription based service for going to movie theaters available in the united states . the service gives members across the country the ability to see up to one 2d movie every 24 hours for a fixed monthly fee . members may choose which theaters they wish to attend and there are no blackout dates . moviepass works in nearly all movie theaters that accept the mastercard credit card making it one of the largest subscription based theater networks in america . prices vary by local market and start at 30 per month . moviepass was launched in february 2011 and is headquartered in new york city . 1 contents 1 service 2 purchasing a ticket 3 history 4 media coverage 5 references service edit the moviepass service works via a smartphone app iphone android and a specially designed reloadable debit card which is mailed to new subscribers when they sign up . purchasing a ticket edit in o", - "entity2": "repair buy warranty get service buy warranty home warranty pricing & plans planning on moving home matters blog what s covered service professionals customer reviews benefits faqs appliance discount contract policies decor cost savers lawn & garden lifestyle quick tips real estate repair & maintenance tech close home warranty learn more what s covered service professionals faqs pricing and plans get a quote see plans planning on moving real estate plans buying a home selling a home home matters blog decor cost savers lawn & garden lifestyle quick tips real estate repair & maintenance tech our partner sites real estate professionals contractors 888 429 8247 email us log in back to top get a personalized quote explore plans in your area get covered in 3 easy steps . please correct highlighted fields request service log in create account oven on the fritz appliance breakdowns happen . get covered . get a personalized quote explore plans in your area get covered in 3 easy steps . please co", - "label": "not duplicate" - }, - { - "entity1": "of over 110 gyms worldwide including 86 franchise locations in ma pa ny nj ct wa or ca tx fl ky va puerto rico and australia and is rapidly expanding across the u.s. and around the globe . contents 1 history 2 description 3 references 4 external links history edit crunch was founded in a basement level aerobics studio in new york city s east village in 1989 by doug levine . 1 with the collaboration of fitness instructors the group fitness programming was started at crunch . offerings such as hip hop aerobics co ed action wrestling and cyked yoga cycling were introduced . 2 in clubs members have access to innovative group fitness classes state of the art equipment personal and group training full service locker rooms and much more . 
select locations offer an exclusive crunch retail line that can also be purchased from the crunch online store . 3 in january 2014 crunch released its online workout extension called crunch live . this subscription based online video library has over 95 work", - "entity2": "gallery esp en best rate guarantee check availability call us room only 1 800 990 8250 hotel air 1 800 219 2727 canada 1 855 478 2811 airport transportation travel agents close best rate guaranteebook your all inclusive stay hotel hotel air arrive departure adults 1 2 3 4 5 6 7 8 children 0 1 2 3 4 5 6 7 8 select property pacifica golf & spa resort the towers at pacifica sunset beach golf & spa resort ros resort & spa los cabos montecristo estates mazatl n emerald bay resort & spa emerald estates luxury villas departure country argentina australia austria bahamas belgium brazil canada chile colombia costa rica denmark ecuador finland france germany greece honduras iceland israel italy japan luxembourg mexico netherlands new zealand nicaragua norway panama paraguay peru portugal puerto rico republic of ireland republic of korea south africa spain sweden switzerland turks and caicos islands united kingdom united states uruguay venezuela departure city akron canton ohio reg . albany ny al", - "label": "not duplicate" - } - ], - "few_shot_selection": "fixed", - "few_shot_num": 3 - } -} -``` - - -```json title="Example 2: Banking Complaints Classification" -{ - "task_name": "BankingComplaintsClassification", - "task_type": "classification", - "dataset": { - "label_column": "label", - "delimiter": "," - }, - "model": { - "provider": "openai", - "name": "gpt-3.5-turbo" - }, - "prompt": { - "task_guidelines": "You are an expert at understanding bank customers support complaints and queries.\nYour job is to correctly classify the provided input example into one of the following categories.\nCategories:\n{labels}", - "output_guidelines": "You will answer with just the the correct output label and nothing else.", - "labels": [ - "activate_my_card", - "age_limit", - "apple_pay_or_google_pay", - "atm_support", - "automatic_top_up", - "balance_not_updated_after_bank_transfer", - "balance_not_updated_after_cheque_or_cash_deposit", - "beneficiary_not_allowed", - "cancel_transfer", - "card_about_to_expire", - "card_acceptance", - "card_arrival", - "card_delivery_estimate", - "card_linking", - "card_not_working", - "card_payment_fee_charged", - "card_payment_not_recognised", - "card_payment_wrong_exchange_rate", - "card_swallowed", - "cash_withdrawal_charge", - "cash_withdrawal_not_recognised", - "change_pin", - "compromised_card", - "contactless_not_working", - "country_support", - "declined_card_payment", - "declined_cash_withdrawal", - "declined_transfer", - "direct_debit_payment_not_recognised", - "disposable_card_limits", - "edit_personal_details", - "exchange_charge", - "exchange_rate", - "exchange_via_app", - "extra_charge_on_statement", - "failed_transfer", - "fiat_currency_support", - "get_disposable_virtual_card", - "get_physical_card", - "getting_spare_card", - "getting_virtual_card", - "lost_or_stolen_card", - "lost_or_stolen_phone", - "order_physical_card", - "passcode_forgotten", - "pending_card_payment", - "pending_cash_withdrawal", - "pending_top_up", - "pending_transfer", - "pin_blocked", - "receiving_money", - "Refund_not_showing_up", - "request_refund", - "reverted_card_payment?", - "supported_cards_and_currencies", - "terminate_account", - "top_up_by_bank_transfer_charge", - "top_up_by_card_charge", - 
"top_up_by_cash_or_cheque", - "top_up_failed", - "top_up_limits", - "top_up_reverted", - "topping_up_by_card", - "transaction_charged_twice", - "transfer_fee_charged", - "transfer_into_account", - "transfer_not_received_by_recipient", - "transfer_timing", - "unable_to_verify_identity", - "verify_my_identity", - "verify_source_of_funds", - "verify_top_up", - "virtual_card_not_working", - "visa_or_mastercard", - "why_verify_identity", - "wrong_amount_of_cash_received", - "wrong_exchange_rate_for_cash_withdrawal" - ], - "few_shot_examples": "seed.csv", - "few_shot_selection": "semantic_similarity", - "few_shot_num": 10, - "example_template": "Input: {example}\nOutput: {label}" - } -} -``` - diff --git a/docs/autolabel/guide/resources/refuel_datasets.md b/docs/autolabel/guide/resources/refuel_datasets.md deleted file mode 100644 index 10ddb7a..0000000 --- a/docs/autolabel/guide/resources/refuel_datasets.md +++ /dev/null @@ -1,27 +0,0 @@ -Autolabel provides datasets out-of-the-box so you can easily get started with LLM-powered labeling. The full list of datasets is below: - -| Dataset | Task Type | -| ---------------| ----------------------| -| banking | Classification | -| civil_comments | Classification | -| ledgar | Classification | -| walmart_amazon | Entity Matching | -| company | Entity Matching | -| squad_v2 | Question Answering | -| sciq | Question Answering | -| conll2003 | Named Entity Matching | - - -## Downloading any dataset - -To download a specific dataset, such as `squad_v2`, run: -```python -from autolabel import get_data - -get_data('civil_comments') -> Downloading seed example dataset to "seed.csv"... -> 100% [..............................................................................] 65757 / 65757 - -> Downloading test dataset to "test.csv"... -> 100% [............................................................................] 610663 / 610663 -``` \ No newline at end of file diff --git a/docs/autolabel/guide/resources/synthetic_dataset_generation.md b/docs/autolabel/guide/resources/synthetic_dataset_generation.md deleted file mode 100644 index 5436583..0000000 --- a/docs/autolabel/guide/resources/synthetic_dataset_generation.md +++ /dev/null @@ -1,68 +0,0 @@ -Few shot learning is one of the most powerful tools that autolabel offers to improve the accuracy of LLM generated labels. However, curating a seed dataset to use for few shot learning can be a time consuming and tedious process. To make this process easier, autolabel's LabelingAgent provides a method to generate synthetic datasets. These datasets can be used as seed datasets for few shot learning or any other purpose. This guide will walk you through the process of generating a synthetic dataset using autolabel. - -Currently, autolabel supports synthetic dataset generation for classification and entity matching tasks. We plan to add support for other task types in the future. - -### **Walkthrough: Creating a Synthetic Dataset for Banking** - -
    -
  1. The first step is to import the LabelingAgent from autolabel. This is the main class that we will use to generate the synthetic dataset. - -```python -from autolabel import LabelingAgent -``` - -
  2. The next step is to create the task config. Make sure to add the `dataset_generation` section to the config. This section contains the parameters for the dataset generation process. The `guidelines` parameter is a string containing the guidelines for the dataset generation task. The `num_rows` parameter is an integer indicating the number of rows per label to generate in the dataset. - -```python -config = { - "task_name": "BankingComplaintsClassification", - "task_type": "classification", - "dataset": { - "label_column": "label", - "delimiter": "," - }, - "model": { - "provider": "openai", - "name": "gpt-3.5-turbo" - }, - "prompt": { - "task_guidelines": "You are an expert at understanding bank customers support complaints and queries.\nYour job is to correctly classify the provided input example into one of the following categories.\nCategories:\n{labels}", - "output_guidelines": "You will answer with just the correct output label and nothing else.", - "labels": { - "activate_my_card": "the customer cannot activate their credit or debit card", - "age_limit": "the customer is under the age limit", - "apple_pay_or_google_pay": "the customer is having trouble using apple pay or google pay", - ... # more labels - }, - "example_template": "Input: {example}\nOutput: {label}" - }, - "dataset_generation": { - "num_rows": 5, - "guidelines": "You are an expert at generating synthetic data. You will generate a dataset that satisfies the following criteria:\n1. The data should be diverse and cover a wide range of scenarios.\n2. The data should be as realistic as possible, closely mimicking real-world data.\n3. The data should vary in length, some shorter and some longer.\n4. The data should be generated in a csv format.\n\nEach row should contain a realistic bank complaint. Use CSV format, with each line containing just the complaint and nothing else." - } -} -``` - -Note that here, we defined labels as a dictionary where the keys are the valid labels and the values are descriptions for those labels. This helps the LLM understand what each label means and can result in a higher quality dataset. -
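For comparison, the classification guides elsewhere in these docs pass `labels` as a plain list of valid labels; a sketch of that form, which skips the per-label descriptions that help the generator:

```json
"labels": [
    "activate_my_card",
    "age_limit",
    "apple_pay_or_google_pay"
]
```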
  3. Now all that's left is to run the code that generates the dataset! - -```python -agent = LabelingAgent(config) -ds = agent.generate_synthetic_dataset() -``` -
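Before moving on, it can be useful to spot-check what was generated. A small sketch, assuming the returned dataset exposes its underlying DataFrame as `df` (the same attribute the transforms guide accesses via `ds.df`):

```python
# Peek at a few generated rows and check how many examples were produced per label
print(ds.df.head())
print(ds.df["label"].value_counts())
```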
- -That's it! You now have a synthetic dataset that you can use for few shot learning or for any other purpose. You can save the dataset to a csv file using the following code: - -```python -ds.save("synthetic_dataset.csv") -``` - -### Model and Model Parameters - -To edit the model used for synthetic dataset generation, simply change the `model` section of the config. We've found that setting a higher temperature for this task generally results in more realistic datasets. We recommend experimenting with different models and model parameters to see what works best for your use case. diff --git a/docs/autolabel/guide/tasks/attribute_extraction.md b/docs/autolabel/guide/tasks/attribute_extraction.md deleted file mode 100644 index 70843e2..0000000 --- a/docs/autolabel/guide/tasks/attribute_extraction.md +++ /dev/null @@ -1,117 +0,0 @@ -## Introduction - -Attribute Extraction is a task that shows up in real world frequently. This task extracts multiple attributes or features from a single piece of text. For eg. extracting the colour, price and name from a product description paragraph. Instead of making multiple calls to the llm, we can extract all attributes in one call! Additionally, if the attributes are related to each other, doing attribute extraction means that the relationships between the outputs are respected i.e suppose we extract the length of a shirt along with its letter size. Doing attribute extraction would make sure the letter and the integer length are consistent. - -## Example [![open in colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/12kyDbJltfrBW7WxKV38NOQVE-df6IOIT) - -### Dataset - -Lets walk through using Autolabel for attribute extraction on the ethos dataset. The ethos dataset comprises of hate speech on social media platforms. Every datapoints consists of an exmaple with hate speech and corresponding to it, there are three attributes, i.e violence, gender and directed_vs_generalized. - -```json -{ - "example": "tweet containing hate speech", - "violence": "violent", - "directed_vs_generalized": "directed", - "gender": "false" -} -``` - -Thus the dataset contains of 4 columns, the example along with the 3 attributes. Here, Autolabel would be given the example input for a new datapoint and told to predict the labels for the 3 attributes. - -## Config - -In order to run Autolabel, we need a config defining the 3 important things - task, llm and dataset. Let's assume gpt-3.5-turbo as the LLM for this section. - -```json -config = { - "task_name": "EthosAttributeExtraction", - "task_type": "attribute_extraction", - "dataset": { - "text_column": "text", - "delimiter": "," - }, - "model": { - "provider": "openai", - "name": "gpt-3.5-turbo" - }, - "prompt": { - "task_guidelines": "You are an expert at classifying hate speech and identifying the type of hate speech. Read the following tweets and extract the following attributes from the text.", - "attributes": [ - { - "name": "violence", - "options": ["not_violent", "violent"], - "description": "If the tweet mentions violence towards a person or a group." - }, - { - "name": "directed_vs_generalized", - "options": [ - "generalized", - "directed" - ], - "description": "If the hate speech is generalized towards a group or directed towards a specific person." - }, - { - "name": "gender", - "options": [ - "true", - "false" - ], - "description": "If the hate speech uses gendered language and attacks a particular gender." 
- } - ], - "few_shot_examples": "seed.csv", - "few_shot_selection": "fixed", - "few_shot_num": 5, - "example_template": "Text: {text}\nOutput: {output_dict}" - } -} -``` - -The `task_type` sets up the config for a specific task, attribute_extraction in this case. - -Take a look at the prompt section of the config. This defines the settings related to defining the task and the machinery around it. - -The `task_guidelines` key is the most important key, it defines the task for the LLM to understand and execute on. In this case, we first set up the task and tell the model the kind of data present in the dataset, by telling it that it is an expert at classifying hate speech. - -The `attributes` key is the most important key for defining attribute extraction well. For every attribute, we have atleast 2 keys - - a. `name` - This is the name of the attribute. -b. `description` - This is the description of an attribute. This describes the attribute more concretely and prompts the model to extract the corresponding attribute. -c. `options` - You can also define a list of options for the LLM. This is an optional field. In case the attribute has a list of values from which to choose the value, fill this list. Otherwise, the attribute is prompted to be any possible textual value. - -The `example_template` is one of the most important keys to set for a task. This defines the format of every example that will be sent to the LLM. This creates a prompt using the columns from the input dataset. Here we define the `output_dict` key, which is used in the example template for attribute extraction tasks. This will create a json of all the attributes, as key value pairs. The LLM is also prompted to output the attributes in a json. - -### Run the task - -```py -from autolabel import LabelingAgent, AutolabelDataset -agent = LabelingAgent(config) -ds = AutolabelDataset('test.csv', config = config) -agent.plan(ds) -agent.run(ds, max_items = 100) -``` - -### Evaluation metrics - -On running the above config, this is an example output expected for labeling 100 items. - -``` -Actual Cost: 0.0665 -┏━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━┓ -┃ violence:… ┃ violence:… ┃ violence:… ┃ directed_… ┃ directed… ┃ directed_… ┃ gender:s… ┃ gender:co… ┃ gender:a… ┃ -┡━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━┩ -│ 100 │ 1.0 │ 0.89 │ 100 │ 1.0 │ 0.89 │ 100 │ 1.0 │ 0.94 │ -└────────────┴────────────┴────────────┴────────────┴───────────┴────────────┴───────────┴────────────┴───────────┘ -``` - -**Accuracy** - This is calculated by taking the exact match of the predicted tokens and their correct class. This may suffer from class imbalance. - -**Completion Rate** - There can be errors while running the LLM related to labeling for eg. the LLM may give a label which is not in the label list or provide an answer which is not parsable by the library. In this cases, we mark the example as not labeled successfully. The completion rate refers to the proportion of examples that were labeled successfully. - -### Confidence - -You can calculate per attribute confidence metric as well by setting compute_confidence as true in the model config. This can help you decide which examples to keep per attribute. - -### Notebook - -You can find a Jupyter notebook with code that you can run on your own [here](https://github.com/refuel-ai/autolabel/blob/main/examples/ethos/example_ethos.ipynb). 
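As a reference for the Confidence section above, enabling per-attribute confidence is a one-line change to the model block of the config; a sketch, assuming the flag is spelled `compute_confidence` as described there:

```json
"model": {
    "provider": "openai",
    "name": "gpt-3.5-turbo",
    "compute_confidence": true
}
```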
diff --git a/docs/autolabel/guide/tasks/classification_task.md b/docs/autolabel/guide/tasks/classification_task.md deleted file mode 100644 index 4045e3d..0000000 --- a/docs/autolabel/guide/tasks/classification_task.md +++ /dev/null @@ -1,170 +0,0 @@ -## Introduction - -Text classification is a fundamental task in natural language processing (NLP) that involves categorizing textual data into predefined classes or categories. It is employed in various applications such as sentiment analysis, spam detection, topic classification, intent recognition, and document categorization and can be used in any setting where there are well defined categories which the LLM can understand and put an input into. - -## Example [![open in colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1x_CTOBG8uKV6O4wxsqWfaBL6G88szrDM) - -### Dataset - -Lets walk through using Autolabel for text classification on the Banking77 dataset. The Banking77 dataset comprises of 13,083 customer service queries labeled with 77 intents. It focuses on fine-grained single-domain intent detection. Every datapoint consists of an example and its corresponding label as shown below. The label belongs to a set of 77 predefined intents that the customer had for the particular datapoint for eg. activate_my_card, card_delivery_estimate, get_physical_card. - -```json -{ - "example": "What can I do if my card still hasn't arrived after 2 weeks?", - "label": "card_arrival" -} -``` - -Thus the dataset consists of just two columns, example and label. Here, Autolabel would be given the example input for a new datapoint and told to predict the label column which in this case is label. - -### Config - -In order to run Autolabel, we need a config defining the 3 important things - task, llm and dataset. Let's assume gpt-3.5-turbo as the LLM for this section. - -```json -config = { - "task_name": "BankingClassification", - "task_type": "classification", - "dataset": { - "label_column": "label", - "delimiter": "," - }, - "model": { - "provider": "openai", - "name": "gpt-3.5-turbo" - }, - "prompt": { - "task_guidelines": """You are an expert at understanding banking transaction complaints.\nYour job is to correctly label the provided input example into one of the following {num_labels} categories:\n{labels}""", - "output_guidelines": "You will just return one line consisting of the label for the given example.", - "labels": [ - "activate_my_card", - "age_limit", - "apple_pay_or_google_pay", - ... - ], - "example_template": "Example: {example}\nOutput: {label}" - } -} -``` -The `task_type` sets up the config for a specific task, classification in this case. - -Take a look at the prompt section of the config. This defines the settings related to defining the task and the machinery around it. - -The `task_guidelines` key is the most important key, it defines the task for the LLM to understand and execute on. In this case, we first set up the task and tell the model the kind of data present in the dataset, by telling it that it is an expert at understanding banking transaction complaints. Next, we define the task more concretely using the num_labels and labels appropriately. `{num_labels}` will be internally translated by the library to be the number of elements in the `labels` list (defined below). `{labels}` will be translated to be all the labels in the `labels` list separated by a newline. 
These are essential for setting up classification tasks by telling it the labels that it is constrained to, along with any meaning associated with a label. - -The `labels` key defines the list of possible labels for the banking77 dataset which is a list of 77 possible labels. - -The `example_template` is one of the most important keys to set for a task. This defines the format of every example that will be sent to the LLM. This creates a prompt using the columns from the input dataset, and sends this prompt to the LLM hoping for the llm to generate the column defined under the `label_column`, which is label in our case. For every input, the model will be given the example with all the columns from the datapoint filled in according to the specification in the `example_template`. The `label_column` will be empty, and the LLM will generate the label. The `example_template` will be used to format all seed examples. - -### Few Shot Config - -Let's assume we have access to a dataset of labeled seed examples. Here is a config which details how to use it. - -```json -config = { - "task_name": "BankingClassification", - "task_type": "classification", - "dataset": { - "label_column": "label", - "delimiter": "," - }, - "model": { - "provider": "openai", - "name": "gpt-3.5-turbo" - }, - "prompt": { - "task_guidelines": """You are an expert at understanding banking transaction complaints.\nYour job is to correctly label the provided input example into one of the following {num_labels} categories:\n{labels}""", - "output_guidelines": "You will just return one line consisting of the label for the given example.", - "labels": [ - "activate_my_card", - "age_limit", - "apple_pay_or_google_pay", - ... - ], - "few_shot_examples": "../examples/banking/seed.csv", - "few_shot_selection": "semantic_similarity", - "few_shot_num": 5, - "example_template": "Example: {example}\nOutput: {label}" - } -} -``` - -The `few_shot_examples` key defines the seed set of labeled examples that are present for the model to learn from. A subset of these examples will be picked while querying the LLM in order to help it understand the task better, and understand corner cases. - -For the banking dataset, we found `semantic_similarity` search to work really well. This looks for examples similar to a query example from the seed set and sends those to the LLM when querying for a particular input. This is defined in the `few_shot_selection` key. - -`few_shot_num` defines the number of examples selected from the seed set and sent to the LLM. Experiment with this number based on the input token budget and performance degradation with longer inputs. - -### Run the task - -```py -from autolabel import LabelingAgent -agent = LabelingAgent(config) -ds = AutolabelDataset('data/banking77.csv', config = config) -agent.plan(ds) -agent.run(ds, max_items = 100) -``` - -### Evaluation metrics - -On running the above config, this is an example output expected for labeling 100 items. -``` -Cost in $=0.00, support=50, threshold=-inf, accuracy=0.6600, completion_rate=1.0000 -Actual Cost: 0.0058579999999999995 -┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓ -┃ support ┃ threshold ┃ accuracy ┃ completion_rate ┃ -┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩ -│ 100 │ -inf │ 0.76 │ 1.0 │ -└─────────┴───────────┴──────────┴─────────────────┘ -``` - -**Accuracy** - We use accuracy as the main metric for evaluating classification tasks. This is done by checking the fraction of examples which are given the correct label in the training dataset. 
- -**Completion Rate** - There can be errors while running the LLM related to labeling for eg. the LLM may give a label which is not in the label list or provide an answer which is not parsable by the library. In this cases, we mark the example as not labeled successfully. The completion rate refers to the proportion of examples that were labeled successfully. - -### Notebook -You can find a Jupyter notebook with code that you can run on your own [here](https://github.com/refuel-ai/autolabel/blob/main/examples/banking/example_banking.ipynb) - -## Classification Tasks with a Large Number of Classes - -For classification tasks with a wide variety of possible classes, it is beneficial to run autolabel with `label_selection` turned on. In this mode, Autolabel will prune the list of possible classes to only include those that are similar to the example being labeled. This not only helps improve accuracy, but also substantially reduces labeling costs, as the size of the prompt decreases when classes are pruned. - -To enable label_selection, simply set `label_selection` to `true` in your config file. Similarly, you can choose how many classes to select in the similarity search by setting `label_selection_count` to a value of your choosing. - -```json - "label_selection": true, - "label_selection_count": 10 -``` - -In this example, the list of classes will be reduced to only the 10 classes most similar to the example being labeled. - -```json -config = { - "task_name": "BankingClassification", - "task_type": "classification", - "dataset": { - "label_column": "label", - "delimiter": "," - }, - "model": { - "provider": "openai", - "name": "gpt-3.5-turbo" - }, - "prompt": { - "task_guidelines": """You are an expert at understanding banking transaction complaints.\nYour job is to correctly label the provided input example into one of the following {num_labels} categories:\n{labels}""", - "output_guidelines": "You will just return one line consisting of the label for the given example.", - "labels": [ - "activate_my_card", - "age_limit", - "apple_pay_or_google_pay", - ... - ], - "few_shot_examples": "../examples/banking/seed.csv", - "few_shot_selection": "semantic_similarity", - "few_shot_num": 5, - "example_template": "Example: {example}\nOutput: {label}", - "label_selection": true, - "label_selection_count": 10 - } -} -``` diff --git a/docs/autolabel/guide/tasks/entity_matching_task.md b/docs/autolabel/guide/tasks/entity_matching_task.md deleted file mode 100644 index 0d7ccb2..0000000 --- a/docs/autolabel/guide/tasks/entity_matching_task.md +++ /dev/null @@ -1,168 +0,0 @@ -## Introduction - -Entity matching in natural language processing (NLP) is a task that involves identifying and matching entities from different sources or datasets based on various fields or attributes. The goal is to determine if two entities refer to the same real-world object or entity, even if they are described differently or come from different data sources. - -## Example [![open in colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1kOoUhUY8rmISxVpQJETQ3Xc9TLrNWiRn#scrollTo=c93fae0b) - -### Dataset - -Lets walk through using Autolabel for entity matching on the Walmart-Amazon dataset. This dataset consists of duplicate products listed on both Walmart and Amazon. These products would have different names and descriptions but would be the same product. 
The dataset consists of such examples, where given the name and the description, the task is to predict if the products are duplicate or not. An example from the Walmart-Amazon dataset, - -```json -{ - "entity1": "Title: zotac geforce gt430 1gb ddr3 pci-express 2.0 graphics card; Category: electronics - general; Brand: zotac; ModelNo: zt-40604-10l; Price: 88.88;", - "entity2": "Title: evga geforce gts450 superclocked 1 gb gddr5 pci-express 2.0 graphics card 01g-p3-1452-tr; Category: graphics cards; Brand: evga; ModelNo: 01g-p3-1452-tr; Price: 119.88;", - "label": "not duplicate" -} -``` - -The the dataset consists of two columns `entity1` and `entity2` which define the two entities. There could also be multiple columns defining an entity. The `label` column here defines if the two entities are duplicates or not. - -### Config - -In order to run Autolabel, we need a config defining the 3 important things - task, llm and dataset. Let's assume gpt-3.5-turbo as the LLM for this section. - -```json -config = { - "task_name": "ProductCatalogEntityMatch", - "task_type": "entity_matching", - "dataset": { - "label_column": "label", - "delimiter": "," - }, - "model": { - "provider": "openai", - "name": "gpt-3.5-turbo", - "params": {} - }, - "prompt": { - "task_guidelines": "You are an expert at identifying duplicate products from online product catalogs.\nYou will be given information about two product entities, and your job is to tell if they are the same (duplicate) or different (not duplicate). Your answer must be from one of the following options:\n{labels}", - "labels": [ - "duplicate", - "not duplicate" - ], - "few_shot_examples": [ - { - "entity1": "Title: lexmark extra high yield return pgm print cartridge - magenta; Category: printers; Brand: lexmark; ModelNo: c782u1mg; Price: 214.88;", - "entity2": "Title: lexmark 18c1428 return program print cartridge black; Category: inkjet printer ink; Brand: lexmark; ModelNo: 18c1428; Price: 19.97;", - "label": "not duplicate" - }, - { - "entity1": "Title: edge tech proshot 4gb sdhc class 6 memory card; Category: usb drives; Brand: edge tech; ModelNo: pe209780; Price: 10.88;", - "entity2": "Title: 4gb edge proshot sdhc memory card class6; Category: computers accessories; Brand: edge; ModelNo: nan; Price: 17.83;", - "label": "duplicate" - }, - { - "entity1": "Title: tomtom one carry case; Category: gps; Brand: tomtom; ModelNo: 9n00 .181; Price: 19.96;", - "entity2": "Title: tomtom one carrying case; Category: cases; Brand: tomtom; ModelNo: 9n00 .181; Price: 4.99;", - "label": "duplicate" - }, - { - "entity1": "Title: iosafe rugged 250gb usb 3.0 portable external hard drive; Category: hard drives; Brand: iosafe; ModelNo: pa50250u5yr; Price: 249.99;", - "entity2": "Title: lacie rugged all-terrain 500 gb firewire 800 firewire 400 usb 2.0 portable external hard drive 301371; Category: external hard drives; Brand: lacie; ModelNo: 301371; Price: nan;", - "label": "not duplicate" - } - ], - "few_shot_selection": "fixed", - "few_shot_num": 3, - "example_template": "Entity1: {entity1}\nEntity2: {entity2}\nOutput: {label}" - } -} -``` -The `task_type` sets up the config for a specific task, entity_matching in this case. - -Take a look at the prompt section of the config. This defines the settings related to defining the task and the machinery around it. - -The `task_guidelines` key is the most important key, it defines the task for the LLM to understand and execute on. 
In this case, we first set up the task and tell the model the kind of data present in the dataset, by telling it that it is an expert at identifying duplicate products. Next, we explain the task to the model, saying that it has to identify if the given products are duplicate or not. We also make the output format clear by telling the model it has to choose from the options duplicate or not duplicate. - -The `example_template` is one of the most important keys to set for a task. This defines the format of every example that will be sent to the LLM. This creates a prompt using the columns from the input dataset, and sends this prompt to the LLM, hoping for the LLM to generate the column defined under the `label_column`, which is label in our case. For every input, the model will be given the example with all the columns from the datapoint filled in according to the specification in the `example_template`. The `label_column` will be empty, and the LLM will generate the label. The `example_template` will be used to format all seed examples. Here we give the model both the entities separated by newlines and ask if the entities are duplicate or not duplicate. - -The `few_shot_examples` here is a list of JSON inputs which define handpicked examples to use as seed examples for the model. These labeled examples help the model understand the task better and how it is supposed to answer a question. If there is a large number of examples, we can specify a path to a csv instead of a list of examples. - -`few_shot_num` defines the number of examples selected from the seed set and sent to the LLM. Experiment with this number based on the input token budget and performance degradation with longer inputs. - -`few_shot_selection` is set to fixed in this case as we want to use all examples as seed examples. However, if we want to use a subset of examples as seed examples from a larger set, we can set the appropriate strategy like `semantic_similarity` here to dynamically select good seed examples. - -### Alternate config with multiple columns - -Let's consider the case in which there are multiple columns in the dataset which are combined to create an input for the model. - -```json -config = { - "task_name": "ProductCatalogEntityMatch", - "task_type": "entity_matching", - "dataset": { - "label_column": "label", - "delimiter": "," - }, - "model": { - "provider": "openai", - "name": "gpt-3.5-turbo" - }, - "prompt": { - "task_guidelines": "You are an expert at identifying duplicate products from online product catalogs.\nYou will be given information about two product entities, and your job is to tell if they are the same (duplicate) or different (not duplicate). 
Your answer must be from one of the following options:\n{labels}", - "labels": [ - "duplicate", - "not duplicate" - ], - "example_template": "Title of entity1: {Title_entity1}; category of entity1: {Category_entity1}; brand of entity1: {Brand_entity1}; model number of entity1: {ModelNo_entity1}; price of entity1: {Price_entity1}\nTitle of entity2: {Title_entity2}; category of entity2: {Category_entity2}; brand of entity2: {Brand_entity2}; model number of entity2: {ModelNo_entity2}; price of entity2: {Price_entity2}\nDuplicate or not: {label}", - "few_shot_examples": [ - { - "Title_entity1": "lexmark extra high yield return pgm print cartridge - magenta", - "Category_entity1": "printers", - "Brand_entity1": "lexmark", - "ModelNo_entity1": "c782u1mg", - "Price_entity1": "214.88", - "Title_entity2": "lexmark 18c1428 return program print cartridge black", - "Category_entity2": "inkjet printer ink", - "Brand_entity2": "lexmark", - "ModelNo_entity2": "18c1428", - "Price_entity2": "19.97", - "label": "not duplicate" - }, - { - "Title_entity1": "edge tech proshot 4gb sdhc class 6 memory card", - "Category_entity1": "usb drives", - "Brand_entity1": "edge tech", - "ModelNo_entity1": "pe209780", - "Price_entity1": "10.88", - "Title_entity2": "4gb edge proshot sdhc memory card class6", - "Category_entity2": "computers accessories", - "Brand_entity2": "edge", - "ModelNo_entity2": "nan", - "Price_entity2": "17.83", - "label": "duplicate" - } - ], - "few_shot_selection": "fixed", - "few_shot_num": 2 - } -} -``` - -Notice how in this case, we specify how the different columns defining different aspects of every column are stitched together to form the final example template. - -### Run the task - -```py -from autolabel import LabelingAgent -agent = LabelingAgent(config) -ds = AutolabelDataset('data/walmart_amazon_test.csv', config = config) -agent.plan(ds) -agent.run(ds, max_items = 100) -``` - -### Evaluation metrics - -On running the above config, this is an example output expected for labeling 100 items. -``` -┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓ -┃ support ┃ threshold ┃ accuracy ┃ completion_rate ┃ -┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩ -│ 100 │ -inf │ 0.96 │ 1.0 │ -└─────────┴───────────┴──────────┴─────────────────┘ -``` - -**Accuracy** - This measures the proportion of examples which are marked correctly by the model - for eg which mark duplicate entities correctly. - -**Completion Rate** - There can be errors while running the LLM related to labeling for eg. the LLM may give a label which is not in the label list or provide an answer which is not parsable by the library. In this cases, we mark the example as not labeled successfully. The completion rate refers to the proportion of examples that were labeled successfully. diff --git a/docs/autolabel/guide/tasks/multilabel_classification_task.md b/docs/autolabel/guide/tasks/multilabel_classification_task.md deleted file mode 100644 index 5d31653..0000000 --- a/docs/autolabel/guide/tasks/multilabel_classification_task.md +++ /dev/null @@ -1,133 +0,0 @@ -## Introduction - -Multilabel text classification is a fundamental task in natural language processing (NLP) where textual data is categorized into predefined classes or categories. It expands upon traditional text classification by assigning multiple labels to each text instance. This approach finds applications in sentiment analysis, spam detection, topic classification, intent recognition, and document categorization. 
By considering multiple labels, it allows for a more nuanced representation of text data, accommodating scenarios where multiple topics or attributes are associated with a document. Multilabel text classification enables a flexible and comprehensive approach to categorizing textual data, providing a richer understanding of content and facilitating more nuanced decision-making in various NLP applications. - -## Example [![open in colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1so1yjErzejgGXzNxUAgCNxSYPtI2Rl6E) - -### Dataset - -Lets walk through using Autolabel for multilabel text classification on the [sem_eval_2018_task_1 dataset](https://huggingface.co/datasets/sem_eval_2018_task_1) which we call twitter-emotion-detection for clarity. The twitter-emotion-detection dataset comprises of 10,983 English tweets and 11 emotions. If no emotions were selected for a row, we classified it as `neutral`. - -```json -{ - "example": "I blew that opportunity -__- #mad", - "label": "anger, disgust, sadness" -} -``` - -Thus the dataset consists of just two columns, example and labels. Here, Autolabel would be given the example input for a new datapoint and told to predict the label column which in this case is labels. - -### Config - -In order to run Autolabel, we need a config defining the 3 important things - task, llm and dataset. Let's assume gpt-3.5-turbo as the LLM for this section. - -```json -config = { - "task_name": "EmotionClassification", - "task_type": "multilabel_classification", - "dataset": { - "label_column": "labels", - "label_separator": ", ", - "delimiter": "," - }, - "model": { - "provider": "openai", - "name": "gpt-3.5-turbo" - }, - "prompt": { - "task_guidelines": "You are an expert at classifying tweets as neutral or one or more of the given emotions that best represent the mental state of the poster.\nYour job is to correctly label the provided input example into one or more of the following categories:\n{labels}", - "output_guidelines": "You will return the answer as a comma separated list of labels sorted in alphabetical order. For example: \"label1, label2, label3\"", - "labels": [ - "neutral", - "anger", - "anticipation", - ... - ], - "example_template": "Input: {example}\nOutput: {labels}" - } -} -``` - -The `task_type` sets up the config for a specific task, multilabel_classification in this case. - -Take a look at the prompt section of the config. This defines the settings related to defining the task and the machinery around it. - -The `task_guidelines` key is the most important key, it defines the task for the LLM to understand and execute on. In this case, we first set up the task and tell the model the kind of data present in the dataset, by telling it that it is an expert at classifying tweets. Next, we define the task more concretely using labels appropriately. `{labels}` will be translated to be all the labels in the `labels` list separated by a newline. These are essential for setting up classification tasks by telling it the labels that it is constrained to, along with any meaning associated with a label. - -The `labels` key defines the list of possible labels for the twitter-emotion-detection dataset which is a list of 12 possible labels. - -The `example_template` is one of the most important keys to set for a task. This defines the format of every example that will be sent to the LLM. 
This creates a prompt using the columns from the input dataset, and sends this prompt to the LLM hoping for the llm to generate the column defined under the `label_column`, which is labels in our case. For every input, the model will be given the example with all the columns from the datapoint filled in according to the specification in the `example_template`. The `label_column` will be empty, and the LLM will generate the labels. The `example_template` will be used to format all seed examples. - -### Few Shot Config - -Let's assume we have access to a dataset of labeled seed examples. Here is a config which details how to use it. - -```json -config = { - "task_name": "EmotionClassification", - "task_type": "multilabel_classification", - "dataset": { - "label_column": "labels", - "label_separator": ", ", - "delimiter": "," - }, - "model": { - "provider": "openai", - "name": "gpt-3.5-turbo" - }, - "prompt": { - "task_guidelines": "You are an expert at classifying tweets as neutral or one or more of the given emotions that best represent the mental state of the poster.\nYour job is to correctly label the provided input example into one or more of the following categories:\n{labels}", - "output_guidelines": "You will return the answer as a comma separated list of labels sorted in alphabetical order. For example: \"label1, label2, label3\"", - "labels": [ - "neutral", - "anger", - "anticipation", - ... - ], - "few_shot_examples": "seed.csv", - "few_shot_selection": "semantic_similarity", - "few_shot_num": 5, - "example_template": "Input: {example}\nOutput: {labels}" - } -} - -``` - -The `few_shot_examples` key defines the seed set of labeled examples that are present for the model to learn from. A subset of these examples will be picked while querying the LLM in order to help it understand the task better, and understand corner cases. - -For the twitter dataset, we found `semantic_similarity` search to work really well. This looks for examples similar to a query example from the seed set and sends those to the LLM when querying for a particular input. This is defined in the `few_shot_selection` key. - -`few_shot_num` defines the number of examples selected from the seed set and sent to the LLM. Experiment with this number based on the input token budget and performance degradation with longer inputs. - -### Run the task - -```py -from autolabel import LabelingAgent, AutolabelDataset -agent = LabelingAgent(config) -ds = AutolabelDataset('twitter_emotion_detection.csv', config = config) -agent.plan(ds) -agent.run(ds, max_items = 100) -``` - -### Evaluation metrics - -On running the above config, this is an example output expected for labeling 100 items. - -``` -Actual Cost: 0.0025 -┏━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓ -┃ f1 ┃ support ┃ accuracy ┃ completion_rate ┃ -┡━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩ -│ 0.4507 │ 100 │ 0.08 │ 1.0 │ -└────────┴─────────┴──────────┴─────────────────┘ -``` - -**Accuracy** - This is calculated by taking the exact match of the predicted tokens and their correct class. This may suffer from class imbalance. - -**F1** - This is calculated using the precision and recall of the predicted tokens and their classes. We use a macro average to get to one F1 score for all classes. - -**Completion Rate** - There can be errors while running the LLM related to labeling for eg. the LLM may give a label which is not in the label list or provide an answer which is not parsable by the library. In this cases, we mark the example as not labeled successfully. 
The completion rate refers to the proportion of examples that were labeled successfully. - -### Notebook - -You can find a Jupyter notebook with code that you can run on your own [here](https://github.com/refuel-ai/autolabel/blob/main/examples/twitter_emotion_detection/example_twitter_emotion_detection.ipynb). diff --git a/docs/autolabel/guide/tasks/named_entity_recognition_task.md b/docs/autolabel/guide/tasks/named_entity_recognition_task.md deleted file mode 100644 index e39ad7a..0000000 --- a/docs/autolabel/guide/tasks/named_entity_recognition_task.md +++ /dev/null @@ -1,90 +0,0 @@ -## Introduction - -Named Entity Recognition (NER) is a crucial task in natural language processing (NLP) that involves identifying and classifying named entities in text. Named entities refer to specific individuals, organizations, locations, dates, quantities, and other named entities present in the text. The goal of NER is to extract and classify these entities accurately, providing valuable information for various NLP applications such as information extraction, question answering, and sentiment analysis. - -## Example [![open in colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1M87KAnjf0KEtAI69BnYc_pwsjfkTvMrK#scrollTo=c93fae0b) - -### Dataset - -Lets walk through using Autolabel for named entity recognition on the CONLL2003 dataset. The CONLL2003 dataset comprises of sentences with entities in the sentence labeled LOC (location), ORG (organization), PER (person) or MISC (Miscellaneous). - -```json -{ - "example": "The role of the 70,000 mainly Kurdish village guards who fight Kurdistan Workers Party ( PKK ) guerrillas in the southeast has been questioned recently after media allegations that many of them are involved in common crime .", - "CategorizedLabels": "{'Location': [], 'Organization': ['Kurdistan Workers Party', 'PKK'], 'Person': [], 'Miscellaneous': ['Kurdish']}" -} -``` - -Thus the dataset consists of the `example` and `CategorizedLabels` columns. Here `example` mentions the sentence which needs to be labeled. The `CategorizedLabels` contains the entities for every label as a list. - -### Config - -In order to run Autolabel, we need a config defining the 3 important things - task, llm and dataset. Let's assume gpt-3.5-turbo as the LLM for this section. - -```py -config = { - "task_name": "PersonLocationOrgMiscNER", - "task_type": "named_entity_recognition", - "dataset": { - "label_column": "CategorizedLabels", - "text_column": "example" - }, - "model": { - "provider": "anthropic", - "name": "claude-v1" - }, - "prompt": { - "task_guidelines": "You are an expert at extracting Person, Organization, Location, and Miscellaneous entities from text. Your job is to extract named entities mentioned in text, and classify them into one of the following categories.\nCategories:\n{labels}\n ", - "labels": [ - "Location", - "Organization", - "Person", - "Miscellaneous" - ], - "example_template": "Example: {example}\nOutput: {CategorizedLabels}", - "few_shot_examples": "data/conll2003_seed.csv", - "few_shot_selection": "semantic_similarity", - "few_shot_num": 5 - } -} -``` -The `task_type` sets up the config for a specific task, named_entity_recognition in this case. - -Take a look at the prompt section of the config. This defines the settings related to defining the task and the machinery around it. - -The `task_guidelines` key is the most important key, it defines the task for the LLM to understand and execute on. 
In this case, we first set up the task and tell the model the kind of data present in the dataset, by telling it that it is an expert at extracting entities from text and classifying them into the necessary labels. Next, we tell the model the list of categories that it should classify every entity into. This ensures that every entity is assigned to one category. - -The `example_template` is one of the most important keys to set for a task. This defines the format of every example that will be sent to the LLM. This creates a prompt using the columns from the input dataset, and sends this prompt to the LLM hoping for the llm to generate the column defined under the `label_column`, which is `CategorizedLabels` in our case. For every input, the model will be given the example with all the columns from the datapoint filled in according to the specification in the `example_template`. The `label_column` will be empty, and the LLM will generate the label. The `example_template` will be used to format all seed examples. - -The `few_shot_examples` here is a path to a csv which defines a set of labeled examples which the model can use to understand the task better. These examples will be used as a reference by the model. - -`few_shot_num` defines the number of examples selected from the seed set and sent to the LLM. Experiment with this number based on the input token budget and performance degradation with longer inputs. - -`few_shot_selection` is set to `semantic_similarity` in this case as we want to use a subset of examples as seed examples from a larger set to get dynamically good seed examples. - -### Run the task - -```py -from autolabel import LabelingAgent, AutolabelDataset -agent = LabelingAgent(config) -ds = AutolabelDataset('examples/squad_v2/test.csv', config = config) -agent.plan(ds, max_items = 100) -agent.run(ds, max_items = 100) -``` - -### Evaluation metrics - -On running the above config, this is an example output expected for labeling 100 items. -``` -┏━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓ -┃ f1 ┃ support ┃ threshold ┃ accuracy ┃ completion_rate ┃ -┡━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩ -│ 0.7834 │ 100 │ -inf │ 0.7834 │ 1.0 │ -└────────┴─────────┴───────────┴──────────┴─────────────────┘ -``` - -**Accuracy** - This is calculated by taking the exact match of the predicted tokens and their correct class. This may suffer from class imbalance. - -**F1** - This is calculated using the precision and recall of the predicted tokens and their classes. We use a macro average to get to one F1 score for all classes. - -**Completion Rate** - There can be errors while running the LLM related to labeling for eg. the LLM may provide an answer which is not parsable by the library. In this cases, we mark the example as not labeled successfully. The completion rate refers to the proportion of examples that were labeled successfully. diff --git a/docs/autolabel/guide/tasks/question_answering_task.md b/docs/autolabel/guide/tasks/question_answering_task.md deleted file mode 100644 index fe18e46..0000000 --- a/docs/autolabel/guide/tasks/question_answering_task.md +++ /dev/null @@ -1,138 +0,0 @@ -## Introduction - -Question answering is the most fundamental task that can be solved using LLMs. Most tasks can be reduced to some form of question answering where the model is optionally given some context and then asked to answer a question. There can be a broad classification of question answering tasks into 2 categories - - -1. 
Open Book QA - In this variant, the model is given a context along with a question and then asked to answer using the context. Here, we do not rely on knowledge present in the model parameters and instead rely on the reasoning abilities and commonsense properties of the model to answer correctly. - -2. Closed Book QA - In this variant, the model is just given a question, without any context or knowledge source and asked to answer based on pretrained knowledge. This requires more knowledge to be present in the model parameters and thus favours bigger LLMs. - -In addition to context, question answering tasks can also differ in the way that the answers are generated. The easiest form is one where there is a predefined set of options (for eg. yes or no) and the model needs to choose from one of these options. Another variant allows separate options for each question similar to SAT questions. The last variant is one where the model is free to generate its own answers. This variant is harder to evaluate because multiple answers could mean the same thing. - -## Example [![open in colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/13DiE1dfG7pYGV2FLWkxPSbyTbABIm34I#scrollTo=c93fae0b) - -### Dataset - -Lets walk through using Autolabel for question answering on the Squad dataset. The Squad dataset comprises of 100k questions and answers along with a context for each question which contains the answer for the question. Additionally, the correct answer is a continuous text span from the context. However, in addition to correct answers, it also contains 50k pairs where the question is unanswerable given the context, that is, the context does not have enough information to answer the question correctly. Here is an example datapoint from the dataset, - -```json -{ - "question": "When did Beyonce start becoming popular?", - "context": "Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles 'Crazy in Love' and 'Baby Boy'.", - "answer": "in the late 1990s" -} -``` - -Thus the dataset consists of the `question`, `context` and `answer`. For datasets like SciQ, there may be an additional field called `options` which is a list of strings which are possible answers for a particular question. - -### Config - -In order to run Autolabel, we need a config defining the 3 important things - task, llm and dataset. Let's assume gpt-3.5-turbo as the LLM for this section. - -```json -config = { - "task_name": "OpenbookQAWikipedia", - "task_type": "question_answering", - "dataset": { - "label_column": "answer", - "delimiter": "," - }, - "model": { - "provider": "openai", - "name": "gpt-3.5-turbo", - "params": {} - }, - "prompt": { - "task_guidelines": "You are an expert at answering questions based on wikipedia articles. Your job is to answer the following questions using the context provided with the question. The answer is a continuous span of words from the context. 
Use the context to answer the question. If the question cannot be answered using the context, answer the question as unanswerable.", - "few_shot_examples": [ - { - "question": "What was created by the modern Conservative Party in 1859 to define basic Conservative principles?", - "answer": "unanswerable", - "context": "The modern Conservative Party was created out of the 'Pittite' Tories of the early 19th century. In the late 1820s disputes over political reform broke up this grouping. A government led by the Duke of Wellington collapsed amidst dire election results. Following this disaster Robert Peel set about assembling a new coalition of forces. Peel issued the Tamworth Manifesto in 1834 which set out the basic principles of Conservatism; – the necessity in specific cases of reform in order to survive, but an opposition to unnecessary change, that could lead to 'a perpetual vortex of agitation'. Meanwhile, the Whigs, along with free trade Tory followers of Robert Peel, and independent Radicals, formed the Liberal Party under Lord Palmerston in 1859, and transformed into a party of the growing urban middle-class, under the long leadership of William Ewart Gladstone." - }, - { - "question": "When is King Mom symbolically burnt?", - "answer": "On the evening before Lent", - "context": "Carnival means weeks of events that bring colourfully decorated floats, contagiously throbbing music, luxuriously costumed groups of celebrants of all ages, King and Queen elections, electrifying jump-ups and torchlight parades, the Jouvert morning: the Children's Parades and finally the Grand Parade. Aruba's biggest celebration is a month-long affair consisting of festive 'jump-ups' (street parades), spectacular parades and creative contests. Music and flamboyant costumes play a central role, from the Queen elections to the Grand Parade. Street parades continue in various districts throughout the month, with brass band, steel drum and roadmarch tunes. On the evening before Lent, Carnival ends with the symbolic burning of King Momo." - }, - { - "question": "How far does the Alps range stretch?", - "answer": "the Mediterranean Sea north above the Po basin, extending through France from Grenoble, eastward through mid and southern Switzerland", - "context": "The Alps are a crescent shaped geographic feature of central Europe that ranges in a 800 km (500 mi) arc from east to west and is 200 km (120 mi) in width. The mean height of the mountain peaks is 2.5 km (1.6 mi). The range stretches from the Mediterranean Sea north above the Po basin, extending through France from Grenoble, eastward through mid and southern Switzerland. The range continues toward Vienna in Austria, and east to the Adriatic Sea and into Slovenia. To the south it dips into northern Italy and to the north extends to the south border of Bavaria in Germany. In areas like Chiasso, Switzerland, and Neuschwanstein, Bavaria, the demarcation between the mountain range and the flatlands are clear; in other places such as Geneva, the demarcation is less clear. The countries with the greatest alpine territory are Switzerland, France, Austria and Italy." - } - ], - "few_shot_selection": "fixed", - "few_shot_num": 3, - "example_template": "Context: {context}\nQuestion: {question}\nAnswer: {answer}" - } -} -``` -The `task_type` sets up the config for a specific task, question_answering in this case. - -Take a look at the prompt section of the config. This defines the settings related to defining the task and the machinery around it. 
- -The `task_guidelines` key is the most important key, it defines the task for the LLM to understand and execute on. In this case, we first set up the task and tell the model the kind of data present in the dataset, by telling it that it is an expert at understanding wikipedia articles. Next, we define the task more concretely by telling the model how to answer the question given the context. We tell the model that the answer is a continuous text span from the context and that in some cases, the answer can be unanswerable and how the model should handle such questions. - -The `example_template` is one of the most important keys to set for a task. This defines the format of every example that will be sent to the LLM. This creates a prompt using the columns from the input dataset, and sends this prompt to the LLM hoping for the llm to generate the column defined under the `label_column`, which is answer in our case. For every input, the model will be given the example with all the columns from the datapoint filled in according to the specification in the `example_template`. The `label_column` will be empty, and the LLM will generate the label. The `example_template` will be used to format all seed examples. Here we also see the ordering of the context followed by question and answer, and also see the `Context: ` string to inform the model which part of the text is the context. - -The `few_shot_examples` here is a list of json inputs which define handpicked examples to use as seed examples for the model. These labeled examples help the model understand the task better and how it supposed to answer a question. If there is a larger number of examples, we can specify a path to a csv instead of a list of examples. - -`few_shot_num` defines the number of examples selected from the seed set and sent to the LLM. Experiment with this number based on the input token budget and performance degradation with longer inputs. - -`few_shot_selection` is set to fixed in this case as we want to use all examples as seed examples. However, if we want to use a subset of examples as seed examples from a larger set, we can set the appropriate strategy like `semantic_similarity` here to get dynamic good seed examples. - -### Alternate config for ClosedBook QA - -Let's consider a dataset like sciq which is a closed book QA with multiple choice questions. Here we have an example config for this dataset, - -```json -config = { - "task_name": "ClosedBookQAScienceQuestions", - "task_type": "question_answering", - "dataset": { - "label_column": "answer", - "delimiter": "," - }, - "model": { - "provider": "openai", - "name": "gpt-3.5-turbo", - "params": {} - }, - "prompt": { - "task_guidelines": "You are an expert at answering science questions. Choose an answer from the given options. Use your knowledge of science and common sense to best answer the question.", - "few_shot_examples": "../examples/squad_v2/seed.csv", - "few_shot_selection": "fixed", - "few_shot_num": 3, - "example_template": "Question: {question}\nOptions: {options}\nAnswer: {answer}" - } -} -``` - -Notice in this case we don't have the `context` and pass in the `options` as list of string options. These are present in the dataset and are appropriately called in the example template. 
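For intuition, an illustrative (made-up) row for such a multiple-choice dataset could look like the following; the exact column format depends on how the CSV was prepared, but the three fields referenced in the example template are question, options and answer:

```json
{
    "question": "Which gas do plants primarily absorb from the atmosphere during photosynthesis?",
    "options": "['carbon dioxide', 'oxygen', 'nitrogen', 'argon']",
    "answer": "carbon dioxide"
}
```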
### Run the task

```py
from autolabel import LabelingAgent, AutolabelDataset
agent = LabelingAgent(config)
ds = AutolabelDataset('data/squad_v2_test.csv', config = config)
agent.plan(ds)
agent.run(ds, max_items = 100)
```

### Evaluation metrics

Running the above config on 100 items produces output like the following:

```
Actual Cost: 0.13500600000000001
┏━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ f1                 ┃ support ┃ threshold ┃ accuracy ┃ completion_rate ┃
┡━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ 0.7018720299348971 │ 100     │ -inf      │ 0.59     │ 1.0             │
└────────────────────┴─────────┴───────────┴──────────┴─────────────────┘
```

**Accuracy** - This is the exact match performance based on the reference answer. The model receives a score of 1 if its answer matches the correct answer exactly, and 0 otherwise. This is particularly harsh on the model in cases where it is not given multiple choices, e.g. SQuAD: even if the model gets one word wrong without changing the meaning, it is penalized.

**F1** - This is calculated by treating the predicted and ground truth answers as lists of tokens. An F1 score is computed for every example, and these scores are averaged over the entire dataset to get the final score. An exact match gets an F1 score of 1. This metric allows the model to make small mistakes in the predicted tokens, and may be a more accurate metric for cases where the answers are not restricted to a set of options.

**Completion Rate** - There can be labeling-related errors while running the LLM; for example, the LLM may give a label which is not in the label list, or provide an answer which is not parsable by the library. In these cases, we mark the example as not labeled successfully. The completion rate is the proportion of examples that were labeled successfully.

diff --git a/docs/autolabel/guide/transforms/image_transform.md b/docs/autolabel/guide/transforms/image_transform.md
deleted file mode 100644
index 4e08140..0000000
--- a/docs/autolabel/guide/transforms/image_transform.md
+++ /dev/null
@@ -1,50 +0,0 @@

The image transform allows users to extract text from image files. Autolabel uses optical character recognition (OCR) to read the images. To use this transform, follow these steps:

## Installation

Use the following command to download all dependencies for the image transform.

```bash
pip install pillow pytesseract
```

The tesseract engine is also required for OCR text extraction. See the [tesseract docs](https://tesseract-ocr.github.io/tessdoc/Installation.html) for installation instructions.

## Parameters for this transform

1. file_path_column: the name of the column containing the file paths of the image files to extract text from
2. lang: a string indicating the language of the text in the image file.
   See the [tesseract docs](https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html) for a full list of supported languages.

## Using the transform

Below is an example of an image transform to extract text from an image file:

```json
{
    ..., # other config parameters
    "transforms": [
        ..., # other transforms
        {
            "name": "image",
            "params": {
                "file_path_column": "file_path",
                "lang": "eng"
            },
            "output_columns": {
                "content_column": "content",
                "metadata_column": "metadata"
            }
        }
    ]
}
```

## Run the transform

```python
from autolabel import LabelingAgent, AutolabelDataset
agent = LabelingAgent(config)
ds = agent.transform(ds)
```

This runs the transformation. We will see the extracted content in the configured output column. Access it using `ds.df` on the AutolabelDataset.

diff --git a/docs/autolabel/guide/transforms/introduction.md b/docs/autolabel/guide/transforms/introduction.md
deleted file mode 100644
index d97bb25..0000000
--- a/docs/autolabel/guide/transforms/introduction.md
+++ /dev/null
@@ -1,79 +0,0 @@

Autolabel supports transformation of the input data! Input datasets come in many shapes and formats. We help you ingest your data in the format that you want, in a way that is most useful for the downstream LLM or labeling task you have in mind. We have tried to make the transforms performant and configurable, and to format their outputs in a way that is useful for the LLM.

## Example

Here we will show you how to run an example transform. We will use the Webpage Transform to ingest national park websites and label the state that every national park belongs to. You can find a Jupyter notebook with code that you can run on your own [here](https://github.com/refuel-ai/autolabel/blob/main/examples/transforms/example_webpage_transform.ipynb).

Use this webpage transform yourself here in a Colab - [![open in colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1PwrdBUUX1u4X2SWjgKYNxB11Gb7XEIZs#scrollTo=1f17f05a)

### Changes to config

```json
{
    "task_name": "NationalPark",
    "task_type": "question_answering",
    "dataset": {
    },
    "model": {
        "provider": "openai",
        "name": "gpt-3.5-turbo"
    },
    "transforms": [{
        "name": "webpage_transform",
        "params": {
            "url_column": "url"
        },
        "output_columns": {
            "content_column": "content"
        }
    }],
    "prompt": {
        "task_guidelines": "You are an expert at understanding websites of national parks. You will be given a webpage about a national park. Answer with the US State that the national park is located in.",
        "output_guidelines": "Answer in one word the state that the national park is located in.",
        "example_template": "Content of wikipedia page: {content}\nState:"
    }
}
```

Notice the `transforms` key in the config. This is where we define our transforms. Note that this is a list, meaning we can define multiple transforms here. Every element of this list is a transform: a JSON object with 3 required keys.

1. `name`: This tells the agent which transform needs to be loaded. Here we are using the webpage transform.
2. `params`: This is the set of parameters that will be passed to the transform. Read the documentation of each transform to see what params can be passed to it. Here we pass the `url_column`, i.e. the column containing the webpages that need to be loaded.
3. `output_columns`: Each transform can define multiple outputs.
   In this dictionary, we map each output we need (in this case `content_column`) to the name of the column in the output dataset that we want to populate with it.

### Running the transform

```
from autolabel import LabelingAgent, AutolabelDataset
agent = LabelingAgent(config)
ds = agent.transform(ds)
```

This runs the transformation. We will see the extracted content in the configured output column. Access it using `ds.df` on the AutolabelDataset.

### Running the labeling job

```
ds = agent.run(ds)
```

Simply run the labeling job on the transformed dataset. This will extract the state of the national park from each webpage.
![Transformation Labeling Run](/assets/transform_output.png){ width="600" }

*Output of the transformation labeling run*
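Once the labeling run finishes, the transformed and labeled rows are available on the dataset object. Here is a small sketch of inspecting and saving them; the exact names of the generated label columns depend on your config and Autolabel version, and the output file name is only a placeholder.

```python
# `ds.df` exposes the underlying pandas DataFrame of the AutolabelDataset.
print(ds.df.head())

# Persist the transformed + labeled rows for later analysis
# (the output file name here is just an example).
ds.df.to_csv("national_parks_labeled.csv", index=False)
```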
## Custom Transforms

We currently support the following transforms:

1. Webpage Transform
2. PDF Transform

We expect this list to grow in the future, and we need the help of the community to build transforms that work best for their data. For this, we provide an abstraction that is easy to use: any new transform just needs to extend the `BaseTransform` class, as outlined below.

::: src.autolabel.transforms.base
rendering:
show_root_heading: yes
show_root_full_path: no

### `_apply()` `abstractmethod`
::: src.autolabel.transforms.base.BaseTransform._apply
\ No newline at end of file
diff --git a/docs/autolabel/guide/transforms/pdf_transform.md b/docs/autolabel/guide/transforms/pdf_transform.md
deleted file mode 100644
index 3806e85..0000000
--- a/docs/autolabel/guide/transforms/pdf_transform.md
+++ /dev/null
@@ -1,77 +0,0 @@

The PDF transform allows users to extract text from PDF files. Autolabel offers both direct text extraction, useful for PDFs that contain selectable text, and optical character recognition (OCR) text extraction, useful for PDFs that contain images. To use this transform, follow these steps:

## Installation

For direct text extraction, install the pdfplumber package:

```bash
pip install pdfplumber
```

For OCR text extraction, install the pdf2image and pytesseract packages:

```bash
pip install pdf2image pytesseract
```

The tesseract engine is also required for OCR text extraction. See the [tesseract docs](https://tesseract-ocr.github.io/tessdoc/Installation.html) for installation instructions.

## Parameters for this transform
1. file_path_column: the name of the column containing the file paths of the pdf files to extract text from
2. ocr_enabled: a boolean indicating whether to use OCR text extraction or not
3. page_format: a string containing the format to use for each page of the pdf file. The following fields can be used in the format string:
    - page_num: the page number of the page
    - page_content: the content of the page
4. page_sep: a string containing the separator to use between each page of the pdf file
5. lang: a string indicating the language of the text in the pdf file. See the [tesseract docs](https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html) for a full list of supported languages
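As a rough illustration of how `page_format` and `page_sep` interact, here is a simplified sketch of the behaviour described in the next section; it is not the library's actual implementation.

```python
# Simplified sketch: combine extracted pages using page_format and page_sep.
pages = ["Hello,", "World!"]  # text extracted from each page
page_format = "{page_num} - {page_content}"
page_sep = "\n"

content = page_sep.join(
    page_format.format(page_num=i, page_content=text)
    for i, text in enumerate(pages, start=1)
)
# content == "1 - Hello,\n2 - World!"
print(content)
```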
### Output Format

The `page_format` and `page_sep` parameters define how the text extracted from the pdf will be formatted. For example, if the pdf file contained 2 pages with "Hello," on the first page and "World!" on the second, a `page_format` of `{page_num} - {page_content}` and a `page_sep` of `\n` would result in the following output:

```python
"1 - Hello,\n2 - World!"
```

The metadata column contains a dict with the field "num_pages" indicating the number of pages in the pdf file.

## Using the transform

Below is an example of a pdf transform to extract text from a pdf file:

```json
{
    ..., # other config parameters
    "transforms": [
        ..., # other transforms
        {
            "name": "pdf",
            "params": {
                "file_path_column": "file_path",
                "ocr_enabled": true,
                "page_format": "Page {page_num}: {page_content}",
                "page_sep": "\n\n"
            },
            "output_columns": {
                "content_column": "content",
                "metadata_column": "metadata"
            }
        }
    ]
}
```

## Run the transform

```python
from autolabel import LabelingAgent, AutolabelDataset
agent = LabelingAgent(config)
ds = agent.transform(ds)
```

This runs the transformation. We will see the extracted content in the configured output column. Access it using `ds.df` on the AutolabelDataset.

diff --git a/docs/autolabel/guide/transforms/webpage_transform.md b/docs/autolabel/guide/transforms/webpage_transform.md
deleted file mode 100644
index a8742fe..0000000
--- a/docs/autolabel/guide/transforms/webpage_transform.md
+++ /dev/null
@@ -1,53 +0,0 @@

The Webpage transform supports loading and processing webpage urls. Given a url, this transform will send a request to load the webpage and then parse the returned webpage to collect the text to send to the LLM.

Use this transform yourself here in a Colab - [![open in colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1PwrdBUUX1u4X2SWjgKYNxB11Gb7XEIZs#scrollTo=1f17f05a)

To use this transform, follow these steps:

## Installation

Use the following command to download all dependencies for the webpage transform. `beautifulsoup4` must be version `4.12.2` or higher.

```bash
pip install beautifulsoup4 httpx fake_useragent
```

Make sure to do this before running the transform.

## Parameters for this transform

1. `url_column: str (Required)`: The column to retrieve the url from. This is the webpage that will be loaded by the transform.
2. `timeout: int (Optional: Default = 5)`: The number of seconds to wait for the webpage to load; the request times out after this. On timeout, we log an error and return an empty response.
3. `headers: Dict[str,str] (Optional: Default = {})`: Any headers that need to be passed into the webpage load request. Under the hood, these headers are passed along with the HTTP request used to fetch the webpage.

## Using the transform

Below is an example of a webpage transform to extract text from a webpage:

```json
{
    ..., # other config parameters
    "transforms": [
        ..., # other transforms
        {
            "name": "webpage_transform",
            "params": {
                "url_column": "url"
            },
            "output_columns": {
                "content_column": "webpage_content"
            }
        }
    ]
}
```

## Run the transform

```python
from autolabel import LabelingAgent, AutolabelDataset
agent = LabelingAgent(config)
ds = agent.transform(ds)
```

This runs the transformation. We will see the content in the `webpage_content` column. Access it using `ds.df` on the AutolabelDataset.
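Since the example above only sets `url_column`, here is a sketch of the same transform entry with the optional `timeout` and `headers` parameters filled in. The specific values (and the header contents) are placeholders chosen for illustration.

```python
# Sketch of a webpage transform entry that also uses the optional parameters
# documented above; the values shown are illustrative placeholders.
webpage_transform = {
    "name": "webpage_transform",
    "params": {
        "url_column": "url",
        "timeout": 10,  # wait up to 10 seconds for each page to load
        "headers": {"Accept-Language": "en-US,en;q=0.9"},
    },
    "output_columns": {
        "content_column": "webpage_content",
    },
}

# Assuming `config` is the labeling config shown earlier in these docs:
config["transforms"] = [webpage_transform]
```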
diff --git a/docs/autolabel/introduction.md b/docs/autolabel/introduction.md
deleted file mode 100644
index f4ca640..0000000
--- a/docs/autolabel/introduction.md
+++ /dev/null
@@ -1,35 +0,0 @@

**Autolabel** is a Python library to label, clean and enrich datasets with Large Language Models (LLMs).

## 🌟 (New!) Access RefuelLLM through Autolabel

You can access RefuelLLM, our recently announced LLM purpose-built for data labeling, through Autolabel (read more about it in this [blog post](http://www.refuel.ai/blog-posts/announcing-refuel-llm)). Refuel LLM is a Llama-v2-13b base model, instruction tuned on over 2500 unique (5.24B tokens) labeling tasks spanning categories such as classification, entity resolution, matching, reading comprehension and information extraction. You can experiment with the model in the playground [here](https://app.refuel.ai/playground).

Refuel Performance

You can request access to Refuel LLM [here](https://refuel-ai.typeform.com/llm-access). Read the docs about using RefuelLLM in Autolabel [here](https://docs.refuel.ai/autolabel/guide/llms/llms/#refuel).

## Features

- Autolabel data for [NLP tasks](https://docs.refuel.ai/autolabel/guide/tasks/classification_task/) such as classification, question answering, named entity recognition, entity matching and more.
- Seamlessly use commercial and open source [LLMs](https://docs.refuel.ai/autolabel/guide/llms/llms/) from providers such as OpenAI, Anthropic, HuggingFace, Google and more.
- Leverage research-proven LLM techniques to boost label quality, such as few-shot learning and chain-of-thought prompting.
- [Confidence estimation](https://docs.refuel.ai/autolabel/guide/accuracy/confidence/) and explanations out of the box for every single output label.
- [Caching and state management](https://docs.refuel.ai/autolabel/guide/reliability/state-management/) to minimize costs and experimentation time.

## Getting Started

You can get started with Autolabel by simply bringing the dataset you want to label, picking your favorite LLM and writing a few lines of code (a minimal sketch is included at the end of this page).

- [Installation and your first labeling task](guide/overview/getting-started.md): Steps to install Autolabel and run sentiment analysis for movie reviews using OpenAI's `gpt-3.5-turbo`.
- [Classification tutorial](guide/overview/tutorial-classification.md): A deeper dive into how Autolabel can be used to detect toxic comments at 95%+ accuracy.
- [Command Line Interface](https://docs.refuel.ai/autolabel/guide/resources/CLI): Learn how to use Autolabel's CLI to intuitively create configs from the command line.
- [Here](https://github.com/refuel-ai/autolabel/tree/main/examples) are more examples with sample notebooks that show how Autolabel can be used for different NLP tasks.

## Resources

- Discord: Join our Discord community for conversations on LLMs, Autolabel and so much more!
- [Github](https://github.com/refuel-ai/autolabel): Create an issue to report any bugs or give us a star on Github.
- [Contribute](https://github.com/refuel-ai/autolabel/blob/main/CONTRIBUTING.md): Share your feedback or add new features, and help us improve Autolabel!
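As a rough illustration of those "few lines of code": the sketch below is modeled on the config and API calls shown elsewhere in these docs. The task name, guidelines, column names and file path are placeholders; consult the Getting Started and config guides for a complete, working example and the exact set of supported keys (e.g. `labels`).

```python
from autolabel import LabelingAgent, AutolabelDataset

# Illustrative config; adjust the task guidelines, labels and columns
# to match your own dataset.
config = {
    "task_name": "MovieReviewSentiment",          # placeholder name
    "task_type": "classification",
    "dataset": {"label_column": "label", "delimiter": ","},
    "model": {"provider": "openai", "name": "gpt-3.5-turbo"},
    "prompt": {
        "task_guidelines": "You are an expert at classifying the sentiment of movie reviews.",
        "labels": ["positive", "negative"],
        "example_template": "Review: {review}\nSentiment: {label}",
    },
}

agent = LabelingAgent(config)
ds = AutolabelDataset("movie_reviews.csv", config=config)  # placeholder CSV path
agent.plan(ds)                       # dry run: preview prompts and estimated cost
ds = agent.run(ds, max_items=100)    # label the first 100 rows
```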
\ No newline at end of file diff --git a/docs/autolabel/reference/cache.md b/docs/autolabel/reference/cache.md deleted file mode 100644 index 8985261..0000000 --- a/docs/autolabel/reference/cache.md +++ /dev/null @@ -1,14 +0,0 @@ -::: src.autolabel.cache.base.BaseCache -rendering: -show_root_heading: yes -show_root_full_path: no - -::: src.autolabel.cache.sqlalchemy_generation_cache.SQLAlchemyGenerationCache -rendering: -show_root_heading: yes -show_root_full_path: no - -::: src.autolabel.cache.sqlalchemy_transform_cache.SQLAlchemyTransformCache -rendering: -show_root_heading: yes -show_root_full_path: no diff --git a/docs/autolabel/reference/configs.md b/docs/autolabel/reference/configs.md deleted file mode 100644 index 7934aad..0000000 --- a/docs/autolabel/reference/configs.md +++ /dev/null @@ -1,9 +0,0 @@ -::: src.autolabel.configs.base - rendering: - show_root_full_path: no - show_root_toc_entry: no - -::: src.autolabel.configs.config - rendering: - show_root_full_path: no - show_root_toc_entry: no \ No newline at end of file diff --git a/docs/autolabel/reference/data_models.md b/docs/autolabel/reference/data_models.md deleted file mode 100644 index b8614ab..0000000 --- a/docs/autolabel/reference/data_models.md +++ /dev/null @@ -1,45 +0,0 @@ -The Data Model classes are used to save the progress of AutoLabel jobs in an SQL database. - -Saved data is stored in .autolabel.db - -Every Data Model class implements its own "get" and "create" methods for accessing this saved data. - -::: src.autolabel.data_models.annotation.AnnotationModel -rendering: -show_root_heading: yes -show_root_full_path: no - -::: src.autolabel.data_models.generation_cache.GenerationCacheEntryModel -rendering: -show_root_heading: yes -show_root_full_path: no - -::: src.autolabel.data_models.transform_cache.TransformCacheEntryModel -rendering: -show_root_heading: yes -show_root_full_path: no - -::: src.autolabel.data_models.dataset.DatasetModel -rendering: -show_root_heading: yes -show_root_full_path: no - -::: src.autolabel.data_models.task.TaskModel -rendering: -show_root_heading: yes -show_root_full_path: no - -::: src.autolabel.data_models.task_run.TaskRunModel -rendering: -show_root_heading: yes -show_root_full_path: no - -::: src.autolabel.database.state_manager.StateManager -rendering: -show_root_heading: yes -show_root_full_path: no - -::: src.autolabel.database.engine.create_db_engine -rendering: -show_root_heading: yes -show_root_full_path: no diff --git a/docs/autolabel/reference/example_select.md b/docs/autolabel/reference/example_select.md deleted file mode 100644 index e96fcb2..0000000 --- a/docs/autolabel/reference/example_select.md +++ /dev/null @@ -1,16 +0,0 @@ -::: src.autolabel.few_shot - rendering: - show_root_heading: no - show_root_full_path: no - merge_init_into_class: no - show_root_toc_entry: no - -::: src.autolabel.few_shot.fixed_example_selector.FixedExampleSelector - rendering: - show_root_heading: yes - show_root_full_path: no - -::: src.autolabel.few_shot.vector_store - rendering: - show_root_heading: yes - show_root_full_path: no diff --git a/docs/autolabel/reference/labeler.md b/docs/autolabel/reference/labeler.md deleted file mode 100644 index 81cdea2..0000000 --- a/docs/autolabel/reference/labeler.md +++ /dev/null @@ -1,4 +0,0 @@ -::: src.autolabel.labeler.LabelingAgent - rendering: - show_root_heading: yes - show_root_full_path: no \ No newline at end of file diff --git a/docs/autolabel/reference/models.md b/docs/autolabel/reference/models.md deleted file mode 100644 index 
e034272..0000000 --- a/docs/autolabel/reference/models.md +++ /dev/null @@ -1,36 +0,0 @@ -::: src.autolabel.models.base.BaseModel - rendering: - show_root_heading: yes - show_root_full_path: no - -::: src.autolabel.models - rendering: - show_root_heading: no - show_root_full_path: no - merge_init_into_class: no - show_root_toc_entry: no - -::: src.autolabel.models.anthropic.AnthropicLLM - rendering: - show_root_heading: yes - show_root_full_path: no - -::: src.autolabel.models.hf_pipeline.HFPipelineLLM - rendering: - show_root_heading: yes - show_root_full_path: no - -::: src.autolabel.models.openai.OpenAILLM - rendering: - show_root_heading: yes - show_root_full_path: no - -::: src.autolabel.models.palm.PaLMLLM - rendering: - show_root_heading: yes - show_root_full_path: no - -::: src.autolabel.models.refuel.RefuelLLM - rendering: - show_root_heading: yes - show_root_full_path: no \ No newline at end of file diff --git a/docs/autolabel/reference/schema.md b/docs/autolabel/reference/schema.md deleted file mode 100644 index cc11ce6..0000000 --- a/docs/autolabel/reference/schema.md +++ /dev/null @@ -1,4 +0,0 @@ -::: src.autolabel.schema - rendering: - show_root_full_path: no - show_root_toc_entry: no \ No newline at end of file diff --git a/docs/autolabel/reference/tasks.md b/docs/autolabel/reference/tasks.md deleted file mode 100644 index 22d5c06..0000000 --- a/docs/autolabel/reference/tasks.md +++ /dev/null @@ -1,29 +0,0 @@ -::: src.autolabel.tasks.base.BaseTask - rendering: - show_root_heading: yes - show_root_full_path: no - -::: src.autolabel.tasks.classification.ClassificationTask - rendering: - show_root_heading: yes - show_root_full_path: no - -::: src.autolabel.tasks.entity_matching.EntityMatchingTask - rendering: - show_root_heading: yes - show_root_full_path: no - -::: src.autolabel.tasks.question_answering.QuestionAnsweringTask - rendering: - show_root_heading: yes - show_root_full_path: no - -::: src.autolabel.tasks.named_entity_recognition.NamedEntityRecognitionTask - rendering: - show_root_heading: yes - show_root_full_path: no - -::: src.autolabel.tasks.utils - rendering: - show_root_heading: yes - show_root_full_path: no \ No newline at end of file diff --git a/docs/index.md b/docs/index.md index 5e5283e..d01029a 100644 --- a/docs/index.md +++ b/docs/index.md @@ -22,7 +22,7 @@ hide: Label, clean and enrich text datasets with LLMs. 
- [:octicons-arrow-right-24: Docs](autolabel/introduction.md) + [:octicons-arrow-right-24: Docs](autolabel/docs/index.md) [:octicons-arrow-right-24: Github](https://github.com/refuel-ai/autolabel) diff --git a/mkdocs.yml b/mkdocs.yml index dfd5568..7119e67 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -9,49 +9,49 @@ nav: - Integrations: - Introduction: integrations/introduction.md - AWS: integrations/aws.md - - Autolabel: - - Introduction: autolabel/introduction.md - - Getting Started: autolabel/guide/overview/getting-started.md - - Tutorial - Toxic comment classification: autolabel/guide/overview/tutorial-classification.md + - Autolabel Docs: + - Introduction: autolabel/docs/index.md + - Getting Started: autolabel/docs/guide/overview/getting-started.md + - Tutorial - Toxic comment classification: autolabel/docs/guide/overview/tutorial-classification.md - Models: - - LLMs: autolabel/guide/llms/llms.md - - Embedding Models: autolabel/guide/llms/embeddings.md - - Benchmarks: autolabel/guide/llms/benchmarks.md + - LLMs: autolabel/docs/guide/llms/llms.md + - Embedding Models: autolabel/docs/guide/llms/embeddings.md + - Benchmarks: autolabel/docs/guide/llms/benchmarks.md - Labeling Tasks: - - Classification Task: autolabel/guide/tasks/classification_task.md - - Multilabel Classification Task: autolabel/guide/tasks/multilabel_classification_task.md - - Entity Matching Task: autolabel/guide/tasks/entity_matching_task.md - - Named Entity Recognition Task: autolabel/guide/tasks/named_entity_recognition_task.md - - Question Answering Task: autolabel/guide/tasks/question_answering_task.md - - Attribute Extraction: autolabel/guide/tasks/attribute_extraction.md + - Classification Task: autolabel/docs/guide/tasks/classification_task.md + - Multilabel Classification Task: autolabel/docs/guide/tasks/multilabel_classification_task.md + - Entity Matching Task: autolabel/docs/guide/tasks/entity_matching_task.md + - Named Entity Recognition Task: autolabel/docs/guide/tasks/named_entity_recognition_task.md + - Question Answering Task: autolabel/docs/guide/tasks/question_answering_task.md + - Attribute Extraction: autolabel/docs/guide/tasks/attribute_extraction.md - Transformations: - - Introduction: autolabel/guide/transforms/introduction.md - - Webpage Transform: autolabel/guide/transforms/webpage_transform.md - - PDF Transform: autolabel/guide/transforms/pdf_transform.md - - Image Transform: autolabel/guide/transforms/image_transform.md + - Introduction: autolabel/docs/guide/transforms/introduction.md + - Webpage Transform: autolabel/docs/guide/transforms/webpage_transform.md + - PDF Transform: autolabel/docs/guide/transforms/pdf_transform.md + - Image Transform: autolabel/docs/guide/transforms/image_transform.md - Improving Labeling Accuracy: - - Prompting Better: autolabel/guide/accuracy/prompting-better.md - - Few-shot Prompting: autolabel/guide/accuracy/few-shot.md - - Confidence: autolabel/guide/accuracy/confidence.md - - Chain of Thought: autolabel/guide/accuracy/chain-of-thought.md + - Prompting Better: autolabel/docs/guide/accuracy/prompting-better.md + - Few-shot Prompting: autolabel/docs/guide/accuracy/few-shot.md + - Confidence: autolabel/docs/guide/accuracy/confidence.md + - Chain of Thought: autolabel/docs/guide/accuracy/chain-of-thought.md - Reliability and Robustness: - - LLM Output Caching: autolabel/guide/reliability/llm-output-caching.md - - State Management: autolabel/guide/reliability/state-management.md - - Working with Autolabel: - - Configs: autolabel/guide/resources/configs.md - - 
AutolabelDataset: autolabel/guide/resources/autolabel_dataset.md - - CLI: autolabel/guide/resources/CLI.md - - Refuel-provided Datasets: autolabel/guide/resources/refuel_datasets.md - - Synthetic Dataset Generation: autolabel/guide/resources/synthetic_dataset_generation.md + - LLM Output Caching: autolabel/docs/guide/reliability/llm-output-caching.md + - State Management: autolabel/docs/guide/reliability/state-management.md + - Working with Autolabel: + - Configs: autolabel/docs/guide/resources/configs.md + - AutolabelDataset: autolabel/docs/guide/resources/autolabel_dataset.md + - CLI: autolabel/docs/guide/resources/CLI.md + - Refuel-provided Datasets: autolabel/docs/guide/resources/refuel_datasets.md + - Synthetic Dataset Generation: autolabel/docs/guide/resources/synthetic_dataset_generation.md - Reference: - - AutoLabeler: autolabel/reference/labeler.md - - Config: autolabel/reference/configs.md - - Models: autolabel/reference/models.md - - Tasks: autolabel/reference/tasks.md - - Schema: autolabel/reference/schema.md - - Cache: autolabel/reference/cache.md - - Example Selector: autolabel/reference/example_select.md - - Data Models: autolabel/reference/data_models.md + - AutoLabeler: autolabel/docs/reference/labeler.md + - Config: autolabel/docs/reference/configs.md + - Models: autolabel/docs/reference/models.md + - Tasks: autolabel/docs/reference/tasks.md + - Schema: autolabel/docs/reference/schema.md + - Cache: autolabel/docs/reference/cache.md + - Example Selector: autolabel/docs/reference/example_select.md + - Data Models: autolabel/docs/reference/data_models.md theme: name: material favicon: assets/favicon.ico @@ -84,7 +84,7 @@ plugins: - mkdocstrings: handlers: python: - paths: [autolabel] # search packages in the autolabel folder + paths: [autolabel/docs] # search packages in the autolabel/docs folder - search - mkdocs-jupyter - table-reader