ycc789741ycc/sentence-embedding-dataframe-cache

Introduction

Implements a sentence embedding retriever with a local cache backed by an embedding store.

Features

  • Embedding store abstraction class

  • Jina client implementation of the embedding store

  • LFU and LRU cache eviction policies for bounded cache sizes; if no eviction policy is specified, the cache grows without eviction

  • Save the cache to a Parquet file

  • Load the cache from an existing Parquet file
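The cache-first lookup at the heart of the store can be sketched as follows (a minimal illustration, not embestore's actual code; the `CachedRetriever` class and `embed_fn` parameter are hypothetical):

```python
from typing import Callable, Dict, List, Text

import numpy as np


class CachedRetriever:
    """Hypothetical sketch of a cache-first embedding retriever."""

    def __init__(self, embed_fn: Callable[[List[Text]], np.ndarray]) -> None:
        self._embed_fn = embed_fn  # model call, only used on cache misses
        self._cache: Dict[Text, np.ndarray] = {}

    def retrieve_embeddings(self, sentences: List[Text]) -> np.ndarray:
        # Embed only the sentences not already cached.
        misses = [s for s in sentences if s not in self._cache]
        if misses:
            for s, vec in zip(misses, self._embed_fn(misses)):
                self._cache[s] = vec
        return np.stack([self._cache[s] for s in sentences])


calls = []

def fake_model(sentences):
    # Deterministic stand-in for a sentence-transformer model.
    calls.append(list(sentences))
    return np.array([[float(len(s)), 1.0] for s in sentences])

store = CachedRetriever(fake_model)
store.retrieve_embeddings(["hello", "world"])
store.retrieve_embeddings(["hello"])  # served from the cache, no model call
```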

Quick Start

Environment

  • Python 3.9

  • Linux/macOS

Option 1. Using a Jina flow to serve the embedding model

  • Installation
pip install "embestore[jina]"
  • To start the Jina flow service with the default sentence transformer model sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
embestore serve start-jina
# To use sentence-transformers/all-MiniLM-L6-v2 instead, for example:

export SENTENCE_TRANSFORMER=sentence-transformers/all-MiniLM-L6-v2
embestore serve start-jina
  • Retrieve the embedding
from embestore.store.jina import JinaEmbeddingStore

JINA_EMBEDDING_STORE_GRPC = "grpc://0.0.0.0:54321"


query_sentences = ["I want to listen the music.", "Music don't want to listen me."]

jina_embedding_store = JinaEmbeddingStore(embedding_grpc=JINA_EMBEDDING_STORE_GRPC)
embeddings = jina_embedding_store.retrieve_embeddings(sentences=query_sentences)

>>> embeddings
array([[ 2.26917475e-01,  8.17841291e-02,  2.35427842e-02,
        -3.02357599e-02,  1.15757119e-02, -8.42996314e-02,
         4.42815214e-01,  1.80795133e-01,  1.04702041e-01,
         ...
]])
  • Stop the docker container
embestore serve stop-jina

Option 2. Using a local sentence embedding model

  • Installation
pip install "embestore[sentence-transformers]"
  • Serve the sentence embedding model sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 in memory
from embestore.store.torch import TorchEmbeddingStore

query_sentences = ["I want to listen the music.", "Music don't want to listen me."]


torch_embedding_store = TorchEmbeddingStore()
embeddings = torch_embedding_store.retrieve_embeddings(sentences=query_sentences)

>>> embeddings
array([[ 2.26917475e-01,  8.17841291e-02,  2.35427842e-02,
        -3.02357599e-02,  1.15757119e-02, -8.42996314e-02,
         4.42815214e-01,  1.80795133e-01,  1.04702041e-01,
         ...
]])

Option 3. Inheriting from the abstraction class

  • Installation
pip install embestore
from typing import List, Text

import numpy as np
from sentence_transformers import SentenceTransformer

from embestore.store.base import EmbeddingStore

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2").eval()


class TorchEmbeddingStore(EmbeddingStore):
    def _retrieve_embeddings_from_model(self, sentences: List[Text]) -> np.ndarray:
        return model.encode(sentences)
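The same pattern works with any embedding function, which makes a deterministic stand-in handy for unit tests. A sketch under assumptions: only `_retrieve_embeddings_from_model` needs overriding, as in the example above, and the stub base class here stands in for `embestore.store.base.EmbeddingStore` so the snippet runs without the package (`HashEmbeddingStore` is hypothetical):

```python
from typing import List, Text

import numpy as np


class EmbeddingStore:
    """Stand-in for embestore.store.base.EmbeddingStore, for illustration only."""

    def retrieve_embeddings(self, sentences: List[Text]) -> np.ndarray:
        return self._retrieve_embeddings_from_model(sentences)


class HashEmbeddingStore(EmbeddingStore):
    """Hypothetical deterministic store for tests: no model download required."""

    def _retrieve_embeddings_from_model(self, sentences: List[Text]) -> np.ndarray:
        # Seed a generator per sentence so vectors are stable within a run.
        rngs = [np.random.default_rng(abs(hash(s)) % (2**32)) for s in sentences]
        return np.stack([r.standard_normal(8) for r in rngs])


store = HashEmbeddingStore()
emb = store.retrieve_embeddings(["a", "b"])
```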

Save the cache

torch_embedding_store.save("cache.parquet")

Load from the cache

torch_embedding_store = TorchEmbeddingStore("cache.parquet")

Apply eviction policy

  • LRU
torch_embedding_store = TorchEmbeddingStore(max_size=100, eviction_policy="lru")
  • LFU
torch_embedding_store = TorchEmbeddingStore(max_size=100, eviction_policy="lfu")
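For reference, the LRU policy can be sketched with an `OrderedDict` (an illustration of the eviction rule, not embestore's implementation; the `LRUCache` class is hypothetical):

```python
from collections import OrderedDict


class LRUCache:
    """Minimal LRU sketch: evicts the least recently used entry at capacity."""

    def __init__(self, max_size: int) -> None:
        self.max_size = max_size
        self._data: "OrderedDict[str, object]" = OrderedDict()

    def get(self, key):
        if key in self._data:
            self._data.move_to_end(key)  # mark as most recently used
            return self._data[key]
        return None

    def put(self, key, value) -> None:
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.max_size:
            self._data.popitem(last=False)  # drop the least recently used entry


cache = LRUCache(max_size=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")     # "a" becomes most recently used
cache.put("c", 3)  # evicts "b"
```

LFU differs only in the eviction rule: it tracks an access count per key and drops the least frequently used entry instead of the least recently used one.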

Road Map

[TODO] Badges

About

Uses a pandas DataFrame to implement the sentence embedding store.
