This repository was archived by the owner on Oct 25, 2024. It is now read-only.
Changes from 63 commits (158 commits total)
f820019
add retrieval dataset construction codes
Liangyx2 Mar 1, 2024
06f8162
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 1, 2024
5ef0332
Update llm_generate_raw_data.py
Liangyx2 Mar 1, 2024
ee1db83
Delete intel_extension_for_transformers/neural_chat/tools/evaluation/…
Liangyx2 Mar 1, 2024
89597f2
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 1, 2024
b132d66
Delete intel_extension_for_transformers/neural_chat/tools/evaluation/…
Liangyx2 Mar 1, 2024
8e955ce
update
Liangyx2 Mar 1, 2024
635b906
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 1, 2024
d7d3d03
Delete intel_extension_for_transformers/neural_chat/tools/evaluation/…
Liangyx2 Mar 1, 2024
c9fec02
Delete intel_extension_for_transformers/neural_chat/tools/evaluation/…
Liangyx2 Mar 1, 2024
5e32113
Delete intel_extension_for_transformers/neural_chat/tools/evaluation/…
Liangyx2 Mar 1, 2024
f67622c
Delete intel_extension_for_transformers/neural_chat/tools/evaluation/…
Liangyx2 Mar 1, 2024
f2e344a
Delete intel_extension_for_transformers/neural_chat/tools/evaluation/…
Liangyx2 Mar 1, 2024
383e5b3
Update prompt.py
Liangyx2 Mar 4, 2024
81014d1
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 4, 2024
4b7bec7
Update llm_generate_raw_data.py
Liangyx2 Mar 4, 2024
0df51a6
Update llm_generate_raw_data.py
Liangyx2 Mar 4, 2024
95b16bd
Update retrieval_dataset_construction.py
Liangyx2 Mar 4, 2024
80dd21b
Update llm_generate_raw_data.py
Liangyx2 Mar 4, 2024
f495b22
Update mine_hard_negatives_check_similarity.py
Liangyx2 Mar 4, 2024
593dee3
add test_evaluation.py to nightly test
Liangyx2 Mar 4, 2024
cf59b18
Update and rename requirements.txt to requirements_cpu.txt
Liangyx2 Mar 4, 2024
40e0b0e
Create requirements_cuda.txt
Liangyx2 Mar 4, 2024
bf1b1aa
Update requirements.txt
Liangyx2 Mar 4, 2024
5552ebc
Update retrieval_dataset_construction.py
Liangyx2 Mar 4, 2024
d3b7579
Update llm_generate_raw_data.py
Liangyx2 Mar 4, 2024
f500b2b
Update retrieval_dataset_construction.py
Liangyx2 Mar 4, 2024
b65c4bf
Update llm_generate_raw_data.py
Liangyx2 Mar 4, 2024
c43ab73
Update test_evaluation.py
Liangyx2 Mar 4, 2024
feda3c0
Update retrieval_dataset_construction.py
Liangyx2 Mar 4, 2024
1c2c22c
Update mine_hard_negatives_check_similarity.py
Liangyx2 Mar 4, 2024
55a5cda
add README.md
Liangyx2 Mar 6, 2024
7a74f86
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 6, 2024
39754d0
Update README.md
Liangyx2 Mar 7, 2024
d7e95f0
add evaluate_retrieval.py
Liangyx2 Mar 8, 2024
186ab43
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 8, 2024
1496219
Update test_evaluation.py
Liangyx2 Mar 11, 2024
03a768e
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 11, 2024
128d587
Update test_evaluation.py
Liangyx2 Mar 11, 2024
25177bd
Merge branch 'main' into yuxiang/evaluation
XuehaoSun Mar 11, 2024
705752a
add README.md
Liangyx2 Mar 11, 2024
675fe2e
Update prompt.py
Liangyx2 Mar 12, 2024
988e542
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 12, 2024
d0c3c34
add llm_generate_truth.py and data
Liangyx2 Mar 12, 2024
be1106b
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 12, 2024
48788d4
add ragas_evaluation.py
Liangyx2 Mar 12, 2024
54cc6c0
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 12, 2024
e1b5585
Create requirements.txt
Liangyx2 Mar 12, 2024
88a4293
Update llm_generate_truth.py
Liangyx2 Mar 12, 2024
83060f9
Update evaluate_retrieval.py
Liangyx2 Mar 12, 2024
76b1175
Update ragas_evaluation.py
Liangyx2 Mar 12, 2024
b775095
Update test_evaluation.py
Liangyx2 Mar 12, 2024
edbb32c
Update llm_generate_truth.py
Liangyx2 Mar 12, 2024
8962abf
Update README.md
Liangyx2 Mar 14, 2024
2ef4e05
Update README.md
Liangyx2 Mar 14, 2024
d2ab7d8
add README.md
Liangyx2 Mar 14, 2024
bcdf209
Update README.md
Liangyx2 Mar 14, 2024
102649b
Update README.md
Liangyx2 Mar 14, 2024
36a28a4
Update README.md
Liangyx2 Mar 14, 2024
548fdd9
Add files via upload
Liangyx2 Mar 15, 2024
36448ea
Delete intel_extension_for_transformers/neural_chat/tests/ci/tools/te…
Liangyx2 Mar 15, 2024
26e3e9d
Update requirements.txt
Liangyx2 Mar 15, 2024
e4793d3
Update README.md
Liangyx2 Mar 15, 2024
0569b54
Update hn_mine.py
Liangyx2 Mar 15, 2024
2d15ec0
Update README.md
Liangyx2 Mar 15, 2024
e8127e9
Update ragas_evaluation.py
Liangyx2 Mar 18, 2024
321e9b6
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 18, 2024
f9b4dab
Update requirements.txt
Liangyx2 Mar 18, 2024
76dc219
Update README.md
Liangyx2 Mar 18, 2024
b9db553
Update README.md
Liangyx2 Mar 18, 2024
d7b68cb
Update README.md
Liangyx2 Mar 18, 2024
48de606
Update requirements.txt
Liangyx2 Mar 18, 2024
415ebc8
Update ragas_evaluation.py
Liangyx2 Mar 18, 2024
f03badd
Update test_evaluation.py
Liangyx2 Mar 18, 2024
2b92e74
Update README.md
Liangyx2 Mar 18, 2024
9091729
Update retrieval_dataset_construction.py
Liangyx2 Mar 18, 2024
be32736
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 18, 2024
2c4f452
Update hn_mine.py
Liangyx2 Mar 18, 2024
c48f66a
Update llm_generate_raw_data.py
Liangyx2 Mar 18, 2024
654c44a
Update mine_hard_negatives_check_similarity.py
Liangyx2 Mar 18, 2024
5208c98
Update hn_mine.py
Liangyx2 Mar 18, 2024
ace1090
Update test_evaluation.py
Liangyx2 Mar 18, 2024
83f10e9
Update ragas_evaluation.py
Liangyx2 Mar 18, 2024
ac0aef1
Update README.md
Liangyx2 Mar 18, 2024
8deaabd
Update README.md
Liangyx2 Mar 19, 2024
2eb084c
Update README.md
Liangyx2 Mar 19, 2024
510e801
Update README.md
Liangyx2 Mar 19, 2024
dd1f37c
Update README.md
Liangyx2 Mar 19, 2024
ed95d2d
Update prompt.py
Liangyx2 Mar 19, 2024
e253f41
Update ragas_evaluation.py
Liangyx2 Mar 19, 2024
fc0b6b9
add evaluate_retrieval_auto.py
Liangyx2 Mar 20, 2024
6f081b5
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 20, 2024
746adec
Update evaluate_retrieval_auto.py
Liangyx2 Mar 21, 2024
100322e
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 21, 2024
5e07789
Update evaluate_retrieval.py
Liangyx2 Mar 21, 2024
0a2f742
Update ragas_evaluation.py
Liangyx2 Mar 21, 2024
1752684
Update test_evaluation.py
Liangyx2 Mar 21, 2024
2a2238e
Update ragas_evaluation.py
Liangyx2 Mar 22, 2024
e8f0f9c
Update README.md
Liangyx2 Mar 22, 2024
8d65078
Update and rename evaluate_retrieval_auto.py to evaluate_retrieval_be…
Liangyx2 Mar 22, 2024
a951a89
Update evaluate_retrieval_benchmark.py
Liangyx2 Mar 25, 2024
13921f6
add retrieval_benchmark.py
Liangyx2 Mar 25, 2024
02c0813
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 25, 2024
d212d66
Update retrieval_benchmark.py
Liangyx2 Mar 25, 2024
20529a4
add ragas_benchmark ragas_evaluation_benchmark
Liangyx2 Mar 26, 2024
5026421
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 26, 2024
cfa7d9c
Update retrieval_benchmark.py
Liangyx2 Mar 26, 2024
8d1215e
Update evaluate_retrieval_benchmark.py
Liangyx2 Mar 26, 2024
3458a8e
Update retrieval_benchmark.py
Liangyx2 Mar 26, 2024
4effd37
Update ragas_evaluation_benchmark.py
Liangyx2 Mar 26, 2024
3c38ae6
Update ragas_benchmark.py
Liangyx2 Mar 26, 2024
b02da07
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 26, 2024
a2a7de1
Update ragas_evaluation_benchmark.py
Liangyx2 Mar 26, 2024
4191f4b
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 26, 2024
35b2d7d
Update evaluate_retrieval_benchmark.py
Liangyx2 Mar 27, 2024
56037b9
Update ragas_evaluation_benchmark.py
Liangyx2 Mar 27, 2024
de44f0d
add retrieval_benchmark.sh
Liangyx2 Mar 27, 2024
67456e4
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 27, 2024
2a91336
add ragas_benchmark.sh
Liangyx2 Mar 27, 2024
8f05a34
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 27, 2024
c64ca3c
add data.txt
Liangyx2 Mar 27, 2024
fbef1f6
Update ragas_benchmark.sh
Liangyx2 Mar 27, 2024
f50aeb4
Update ragas_evaluation_benchmark.py
Liangyx2 Mar 28, 2024
84aea7c
Update ragas_benchmark.sh
Liangyx2 Mar 28, 2024
ad1814a
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 28, 2024
932562d
Update and rename ragas_benchmark.py to ragas_superbenchmark.py
Liangyx2 Mar 28, 2024
50d8c83
Update evaluate_retrieval_benchmark.py
Liangyx2 Mar 28, 2024
a4ea5dd
Update retrieval_benchmark.sh
Liangyx2 Mar 28, 2024
6e29d43
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 28, 2024
702f9a9
Update and rename retrieval_benchmark.py to retrieval_superbenchmark.py
Liangyx2 Mar 28, 2024
0452526
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 28, 2024
008a892
add README.md
Liangyx2 Mar 28, 2024
5303837
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 28, 2024
8957b18
Update README.md
Liangyx2 Mar 28, 2024
96f477c
Update README.md
Liangyx2 Mar 29, 2024
c99856d
Update README.md
Liangyx2 Mar 29, 2024
19dfb93
Update README.md
Liangyx2 Apr 1, 2024
99940f3
Update README.md
Liangyx2 Apr 1, 2024
464d52b
Update README.md
Liangyx2 Apr 1, 2024
da2e829
Update README.md
Liangyx2 Apr 1, 2024
3ce2cb2
Update README.md
Liangyx2 Apr 1, 2024
268d89c
Update README.md
Liangyx2 Apr 1, 2024
40fc2e9
Update README.md
Liangyx2 Apr 1, 2024
13bb3b8
Update README.md
Liangyx2 Apr 1, 2024
763bd1d
add config file form rag evaluation
xmx-521 Apr 10, 2024
092e951
complete config superbenchmark
xmx-521 Apr 15, 2024
e931143
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Apr 15, 2024
f0a0cd6
Merge branch 'main' into yuxiang/evaluation
XuhuiRen May 8, 2024
895075b
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 8, 2024
6b60154
Create test_evaluation.py in CI
Liangyx2 May 10, 2024
c73a68f
Update requirements.txt
Liangyx2 May 11, 2024
c6f8906
Merge branch 'main' into yuxiang/evaluation
Liangyx2 May 11, 2024
7c80ce2
Merge branch 'main' into yuxiang/evaluation
VincyZhang May 13, 2024
576ce57
Merge branch 'main' into yuxiang/evaluation
Liangyx2 May 14, 2024
2a3ddd9
Merge branch 'main' into yuxiang/evaluation
Liangyx2 May 15, 2024
b4c0e67
Update ragas_evaluation_benchmark.py
Liangyx2 Jun 3, 2024
e75bbe4
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jun 3, 2024
a0853a8
Merge branch 'main' into yuxiang/evaluation
Liangyx2 Jun 3, 2024
36 changes: 36 additions & 0 deletions intel_extension_for_transformers/neural_chat/prompts/prompt.py
@@ -321,3 +321,39 @@ def generate_sqlcoder_prompt(qurey, metadata_file):
qurey=qurey, table_metadata_string=table_metadata_string
)
return prompt

QUERYGENERATE_PROMPT = """
Task: You are asked to act as a human annotator. Your role is to generate 2 specific, open-ended questions based on the provided context.
Each question should aim to extract or clarify key information from the context, focusing on a single aspect or detail.
The questions must be directly related to the context to form a query-positive pair, suitable for use in constructing a retrieval dataset.
---
Requirements:
1. Questions should be based on keywords, such as phrases at the beginning of the context, phrases before a colon, and recurring phrases in the context.
2. Use the terms in the context instead of pronouns.
---
Desired format:
1. <question_1>
2. <question_2>
---
Context:
### {context}
---
Generated questions:
"""

TRUTHGENERATE_PROMPT = """
Task: You are asked to act as a human annotator. Your role is to generate the right answer based on the context and question provided.
Answers should aim to extract or clarify the key information of the question from the context, focusing on a single aspect or detail.
The answer must be directly related to the context and the question, suitable for use in constructing a synthetic retrieval evaluation dataset.
---
Desired format:
1. <ground_truth>
---
Question:
### {question}
---
Context:
### {context}
---
Generated ground_truth:
"""
@@ -0,0 +1,79 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
# Copyright (c) 2023 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import unittest, os, shutil
from unittest.mock import patch
from intel_extension_for_transformers.neural_chat.tools.evaluation.data_augmentation import retrieval_dataset_construction, llm_generate_truth
from intel_extension_for_transformers.neural_chat.tools.evaluation.retriever import evaluate_retrieval


class TestEvaluation(unittest.TestCase):
def setUp(self) -> None:
if os.path.exists("data.jsonl"):
os.remove("data.jsonl")
if os.path.exists("data_minedHN.jsonl"):
os.remove("data_minedHN.jsonl")
if os.path.exists("data_minedHN_split.jsonl"):
os.remove("data_minedHN_split.jsonl")
if os.path.exists("ground_truth.jsonl"):
os.remove("ground_truth.jsonl")
if os.path.exists("output"):
shutil.rmtree("output", ignore_errors=True)
return super().setUp()

def tearDown(self) -> None:
if os.path.exists("data.jsonl"):
os.remove("data.jsonl")
if os.path.exists("data_minedHN.jsonl"):
os.remove("data_minedHN.jsonl")
if os.path.exists("data_minedHN_split.jsonl"):
os.remove("data_minedHN_split.jsonl")
if os.path.exists("ground_truth.jsonl"):
os.remove("ground_truth.jsonl")
if os.path.exists("output"):
shutil.rmtree("output", ignore_errors=True)
return super().tearDown()

def test_retrieval_dataset_construction(self):
argv = ['--llm_model', '/tf_dataset2/models/nlp_toolkit/neural-chat-7b-v3-1', \
'--embedding_model', '/tf_dataset2/inc-ut/gte-base', \
'--input', '/intel-extension-for-transformers/intel_extension_for_transformers/neural_chat/assets/docs/retrieve_multi_doc/', \
'--output', 'data', \
'--range_for_sampling', '2-2', \
'--negative_number', '1']
with patch('sys.argv', ['python retrieval_dataset_construction.py'] + argv):
retrieval_dataset_construction.main()
self.assertTrue(os.path.exists("data_minedHN_split.jsonl"))

def test_llm_generate_truth(self):
argv = ['--llm_model', '/tf_dataset2/models/nlp_toolkit/neural-chat-7b-v3-1', \
'--input', '/intel-extension-for-transformers/intel_extension_for_transformers/neural_chat/tools/evaluation/data_augmentation/example.jsonl', \
'--output', 'ground_truth.jsonl']
with patch('sys.argv', ['python llm_generate_truth.py'] + argv):
llm_generate_truth.main()
self.assertTrue(os.path.exists("ground_truth.jsonl"))

def test_evaluate_retrieval(self):
argv = ['--index_file_jsonl_path', '/intel-extension-for-transformers/intel_extension_for_transformers/neural_chat/tools/evaluation/data_augmentation/candidate_context.jsonl', \
'--query_file_jsonl_path', '/intel-extension-for-transformers/intel_extension_for_transformers/neural_chat/tools/evaluation/data_augmentation/example.jsonl', \
'--embedding_model', '/tf_dataset2/inc-ut/gte-base']
with patch('sys.argv', ['python evaluate_retrieval.py'] + argv):
result = evaluate_retrieval.main()
self.assertIsNotNone(result)

if __name__ == '__main__':
unittest.main()
@@ -34,6 +34,7 @@ langchain_core==0.1.18
langid
librosa
markdown
modelscope
neural-compressor
neural_speed
num2words
@@ -0,0 +1,16 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
# Copyright (c) 2023 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
@@ -0,0 +1,114 @@
# Retrieval Data Augmentation

## 1. Introduction
This example shows how to use data augmentation to construct a retrieval dataset.

* **Context to Question and Mine Hard Negatives**
This step generates several specific, open-ended questions based on the context of the provided input file. The questions are directly related to the context, forming query-positive pairs suitable for constructing a retrieval dataset. We then mine hard negatives by sampling passages from the corpus, a widely used method for improving the quality of fine-tuned sentence embedding models.

* **Context, Question to Ground Truth**
This step generates the correct answer based on the provided context and question. The answer is directly related to both, making it suitable for constructing a synthetic retrieval evaluation dataset.

## 2. Supported Devices
CPU, CUDA

## 3. Requirements
```
git clone https://github.com/intel/intel-extension-for-transformers.git
cd intel-extension-for-transformers/intel_extension_for_transformers/neural_chat
pip install -r requirements.txt
cd pipeline/plugins/retrieval
pip install -r requirements.txt
```

* **On CPU**
```
cd intel-extension-for-transformers/intel_extension_for_transformers/neural_chat/tools/evaluation/data_augmentation
pip install -r requirements_cpu.txt
```

* **On CUDA**
```
cd intel-extension-for-transformers/intel_extension_for_transformers/neural_chat/tools/evaluation/data_augmentation
pip install -r requirements_cuda.txt
```

## 4. Retrieval Dataset Construction
### Context to Questions and Mine Hard Negatives
* **On CPU**
```
cd intel-extension-for-transformers/intel_extension_for_transformers/neural_chat/tools/evaluation
python -m data_augmentation.retrieval_dataset_construction \
--llm_model <llm model path> \
--embedding_model <embedding model path> \
--input <your input file path>
```

* **On CUDA**
```
cd intel-extension-for-transformers/intel_extension_for_transformers/neural_chat/tools/evaluation
python -m data_augmentation.retrieval_dataset_construction \
--llm_model <llm model path> \
--embedding_model <embedding model path> \
--input <your input file path> \
--use_gpu_for_searching True
```

**Some Important Arguments**:
- `llm_model`: The path for the LLM model.
- `embedding_model`: The path for the text embedding model.
- `input`: The path of the file/folder/link of the content.
- `output`: The base name of the output files. The default value is 'data', which produces 'data.jsonl', 'data_minedHN.jsonl', and 'data_minedHN_split.jsonl'.
- `temperature`: The value is used to modulate the next token probabilities, and will influence the distribution of similarity scores. The default value is 0.8.
- `top_p`: If set to float < 1, only the smallest set of most probable tokens with probabilities that add up to top_p or higher are kept for generation. The default value is 0.9.
- `top_k`: The number of highest probability vocabulary tokens to keep for top-k-filtering. The default value is 40.
- `repetition_penalty`: The parameter for repetition penalty. 1.0 means no penalty. The default value is 2.0.
- `max_new_tokens`: The maximum number of tokens to generate, ignoring the number of tokens in the prompt. The default value is 48.
- `do_sample`: Whether or not to use sampling; greedy decoding is used otherwise. The default value is True.
- `num_beams`: Number of beams for beam search. 1 means no beam search. The default value is 2.
- `num_return_sequences`: The number of independently computed returned sequences for each element in the batch. The default value is 2.
- `use_cache`: Whether or not the model should use the past last key/values attentions (if applicable to the model) to speed up decoding. The default value is True.
- `range_for_sampling`: The range to sample negatives from. For example, `2-100` means sampling `negative_number` negatives from the top2-top100 documents. You can set a larger range to reduce the difficulty of the negatives (e.g., set it to `60-300` to sample negatives from the top60-top300 passages). The default value is '2-10'.
- `negative_number`: The number of sampled negatives. The default value is 5.
- `use_gpu_for_searching`: Whether to use faiss-gpu to retrieve negatives. The default value is False.
- `similarity_threshold`: The cosine similarity threshold used to filter the generated queries. The default value is 0.6.
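
To make the sampling arguments concrete, here is a minimal sketch (an illustration, not the tool's actual implementation) of how `range_for_sampling` and `negative_number` interact when drawing negatives from a ranked candidate list:

```python
import random

def sample_negatives(ranked_docs, range_for_sampling="2-10", negative_number=5, seed=0):
    """Sample negatives from a slice of a ranked candidate list.

    '2-10' selects the candidates ranked 2 through 10 (1-indexed),
    skipping rank 1, which is typically the positive itself.
    """
    start, end = (int(x) for x in range_for_sampling.split("-"))
    pool = ranked_docs[start - 1:end]          # ranks start..end, inclusive
    rng = random.Random(seed)                   # seeded for reproducibility
    return rng.sample(pool, min(negative_number, len(pool)))
```

A larger range (e.g. `60-300`) draws from lower-ranked, less similar passages, which makes the negatives easier.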

**Result**:
Three files will be generated. The default output files are `data.jsonl`, `data_minedHN.jsonl`, `data_minedHN_split.jsonl`. The third is the final output dataset, where each line is a dict like this:
```
{"query": str, "pos": List[str], "neg": List[str]}
```
`query` is the generated question, `pos` is a list of positive texts drawn from the context of the provided input file, and `neg` is a list of negative texts.
See [augmented_example.jsonl](https://github.com/intel/intel-extension-for-transformers/blob/master/intel_extension_for_transformers/neural_chat/tools/evaluation/data_augmentation/augmented_example.jsonl) for a data file.
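
A minimal sketch (assuming the default output name shown above) of loading and shape-checking the final dataset:

```python
import json

def load_retrieval_dataset(path):
    """Load a JSONL file where each line is
    {"query": str, "pos": List[str], "neg": List[str]}."""
    samples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            sample = json.loads(line)
            # Shape check against the documented schema.
            assert isinstance(sample["query"], str)
            assert isinstance(sample["pos"], list)
            assert isinstance(sample["neg"], list)
            samples.append(sample)
    return samples

# e.g. samples = load_retrieval_dataset("data_minedHN_split.jsonl")
```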


### Context, Question to Ground Truth
```
cd intel-extension-for-transformers/intel_extension_for_transformers/neural_chat/tools/evaluation/data_augmentation
python llm_generate_truth.py \
--llm_model <llm model path> \
--input example.jsonl \
--output ground_truth.jsonl
```

**Some Important Arguments**:
- `llm_model`: The path for the LLM model.
- `input`: The path of JSON data including queries and positives where each line is a dict like this:```{"query": str, "pos": List[str]}```. See [example.jsonl](https://github.com/intel/intel-extension-for-transformers/blob/master/intel_extension_for_transformers/neural_chat/tools/evaluation/data_augmentation/example.jsonl) for a data file.
- `output`: The path of the output JSON data.
- `temperature`: The value is used to modulate the next token probabilities, and will influence the distribution of similarity scores. The default value is 0.8.
- `top_p`: If set to float < 1, only the smallest set of most probable tokens with probabilities that add up to top_p or higher are kept for generation. The default value is 0.9.
- `top_k`: The number of highest probability vocabulary tokens to keep for top-k-filtering. The default value is 40.
- `repetition_penalty`: The parameter for repetition penalty. 1.0 means no penalty. The default value is 2.0.
- `max_new_tokens`: The maximum number of tokens to generate, ignoring the number of tokens in the prompt. The default value is 48.
- `do_sample`: Whether or not to use sampling; greedy decoding is used otherwise. The default value is True.
- `num_beams`: Number of beams for beam search. 1 means no beam search. The default value is 2.
- `num_return_sequences`: The number of independently computed returned sequences for each element in the batch. The default value is 2.
- `use_cache`: Whether or not the model should use the past last key/values attentions (if applicable to the model) to speed up decoding. The default value is True.

**Result**:
Each line of the output JSON data is a dict like this:
```
{"question": str, "context": List[str], "ground_truth": str}
```
`ground_truth` is the generated ground truth, based on the question and context provided.
See [ground_truth.jsonl](https://github.com/intel/intel-extension-for-transformers/blob/master/intel_extension_for_transformers/neural_chat/tools/evaluation/data_augmentation/ground_truth.jsonl) for a data file.
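
A minimal sketch of iterating over the ground-truth file and checking each record against the documented schema:

```python
import json

def iter_ground_truth(path):
    """Yield {"question", "context", "ground_truth"} records from a JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            # Validate the documented record shape before yielding.
            assert isinstance(record["question"], str)
            assert isinstance(record["context"], list)
            assert isinstance(record["ground_truth"], str)
            yield record
```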
@@ -0,0 +1,16 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
# Copyright (c) 2023 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
@@ -0,0 +1,10 @@
{"question": "What types of platforms does the organization focus on?", "answer": "The organization focuses on delivering open software and hardware platforms with industry-defining standards, as well as leadership products, open and secure platforms, and resilient manufacturing."}
{"question": "What are the core values that drive our company's actions?", "answer": "The core values driving the company's actions include focusing on having a positive impact on business, society, and the planet by working together with talented individuals. They also emphasize delivering leadership products, open and secure platforms, and resilient manufacturing to support global digitalization and ensure customer success."}
{"question": "What types of companies does Intel invest in?", "answer": "Intel invests in public and private companies."}
{"question": "How has technology been central to our lives in recent years?", "answer": "In recent years, technology has become more essential as it permeates various aspects of our daily lives. This includes advancements in communication, entertainment, transportation, healthcare, and many other sectors. All these rely heavily on semiconductors, which play a crucial role in powering and enabling these technologies."}
{"question": "What is Intel's focus in terms of delivering leadership products?", "answer": "Intel's focus in terms of delivering leadership products includes providing open and secure platforms as well as resilient manufacturing for enabling global digitalization and fueling customer success."}
{"question": "How has Intel been affected by the COVID-19 pandemic so far, and what?", "answer": "Intel has not provided specific details on how they have been directly affected by the COVID-19 pandemic. However, it can be inferred that like many other companies, they might have experienced challenges related to supply chain disruptions, workforce adjustments, and potential changes in demand for their products due to the global economic impact of the pandemic."}
{"question": "How does the company protect personal data to prevent unauthorized access or misuse?", "answer": "The text provided doesn't specifically mention how the company protects personal data to prevent unauthorized access or misuse. However, it highlights the potential consequences of such incidents, which might imply that they have measures in place to minimize these risks."}
{"question": "What are the conditions for accessing third-party IP?", "answer": "The conditions for accessing third-party IP can vary depending on the specific agreement between the parties involved. However, generally, it includes ensuring availability on commercially reasonable terms or at all."}
{"question": "How many customers contribute to the majority of our revenue?", "answer": "A limited number of customers contribute to the majority of your revenue."}
{"question": "When does Intel plan to deliver on its goal of five manufacturing technology nodes in four years?", "answer": "Intel remains on track to deliver on this goal within four years."}