[Bug]: Metadata/where edge cases #4388

@hesreallyhim

What happened?

Related: #4346

Description

In addition to the observation in #4346 that a document ID can be an empty string, I found some metadata and `where` edge cases that we may wish to disallow:

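import chromadb

# Setup assumed for reproduction; the collection name is arbitrary.
client = chromadb.Client()
collection = client.create_collection("edge_cases")

collection.upsert(
    embeddings=[
        [1.1, 2.3, 3.2],
        [4.5, 6.9, 4.4],
        [1.1, 2.3, 3.2],
        [4.5, 6.9, 4.4],
        [1.7, 4.3, 3.2],
        [4.7, 4.8, 3.2],
    ],
    metadatas=[
        {"uri": "img1.png", "style": "style1"},
        {"": "img2.png"},  # empty-string key
        {"": "", "$nin": "uhoh"},  # empty key and value, plus an operator-like key
        {"uri": "img4.png", "computed": "style" + "1"},  # Python stores "style1"
        {"uri": "img5.png", "uhoh": "$contains"},  # operator-like value
        {"uri": "$contains", "$nin": "$nin", "bool": True, "num": 21},
    ],
    documents=["doc9", "doc2", "doc3", "doc4", "doc5", "doc6"],
    ids=["id1", "id2", "id3", "id4", "id5", "id6"],
)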

result1 = collection.query(
    query_embeddings=[[1.1, 2.3, 3.2], [5.1, 4.3, 2.2]],
    n_results=15,
    where={"style": {"$eq": "style1"}},
)
print("result1", result1)

result2 = collection.query(
    query_embeddings=[[1.1, 2.3, 3.2], [5.1, 4.3, 2.2]],
    n_results=15,
    where={"$nin": "uhoh"},  # operator name used as a metadata key
)
print("result2", result2)

result3 = collection.query(
    query_embeddings=[[1.1, 2.3, 3.2], [5.1, 4.3, 2.2]],
    n_results=15,
    where={"uhoh": "$contains"},  # operator name used as a metadata value
)
print("result3", result3)

result4 = collection.query(
    query_embeddings=[[1.1, 2.3, 3.2], [5.1, 4.3, 2.2]],
    n_results=15,
    where={"$nin": "$nin"},  # operator name as both key and value
)
print("result4", result4)

result5 = collection.query(
    query_embeddings=[[1.1, 2.3, 3.2], [5.1, 4.3, 2.2]],
    n_results=15,
    where={"bool": 5 == 5},  # Python evaluates this to True
)
print("result5", result5)

result6 = collection.query(
    query_embeddings=[[1.1, 2.3, 3.2], [5.1, 4.3, 2.2]],
    n_results=15,
    where={"bool": True or False},  # evaluates to True
)
print("result6", result6)

result7 = collection.query(
    query_embeddings=[[1.1, 2.3, 3.2], [5.1, 4.3, 2.2]],
    n_results=15,
    where={"bool": False or 0 or True or 6},  # chain of falsy values, then matches True
)
print("result7", result7)

result8 = collection.query(
    query_embeddings=[[1.1, 2.3, 3.2], [5.1, 4.3, 2.2]],
    n_results=15,
    where={"bool": False or "truthy" or True or 6},  # a non-matching truthy value cuts off the real match
)
print("result8", result8)

result9 = collection.query(
    query_embeddings=[[1.1, 2.3, 3.2], [5.1, 4.3, 2.2]],
    n_results=15,
    where={"num": {"$in": list(range(25))}},
)
print("result9", result9)

# result10 = collection.query(
#     query_embeddings=[[1.1, 2.3, 3.2], [5.1, 4.3, 2.2]],
#     n_results=15,
#     where={"num": {"$in": list(range(10000000000000000000))}},  # DoS attack(?)
# )
# print("result10", result10)

View in this colab notebook:

https://colab.research.google.com/drive/1BKGRLM9CmuGHHFW0hBorlLN6g-Coz2U1#scrollTo=64dWyeEdKAX9
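For the result5 through result8 queries, it may help to note that Python evaluates the right-hand expression before the `where` dict is ever handed to `collection.query`, so Chroma only sees the resulting value. A minimal sketch in plain Python (no Chroma calls; the `where5` to `where8` names are just for illustration):

```python
# Plain Python only: build the same `where` dicts and inspect what they contain.
where5 = {"bool": 5 == 5}
where6 = {"bool": True or False}
where7 = {"bool": False or 0 or True or 6}
where8 = {"bool": False or "truthy" or True or 6}

for name, where in [
    ("where5", where5),
    ("where6", where6),
    ("where7", where7),
    ("where8", where8),
]:
    print(name, where)

# where5, where6 and where7 all hold {"bool": True};
# where8 holds {"bool": "truthy"}, because `or` returns the first truthy operand.
```

So the first three are ordinary equality filters on `True`, and the fourth is an equality filter on the string "truthy".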

The last query (the commented-out result10, with a huge `$in` list) is maybe a potential DoS attack for Chroma Cloud(??)
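To make "disallow" a bit more concrete, here is a rough sketch of the kind of client-side checks I have in mind. Everything in it is hypothetical: `validate_metadata_keys`, `oversized_lists` and `MAX_IN_VALUES` are names I made up, not Chroma APIs or settings, and the cap of 1000 is arbitrary. Note also that the exact list in result10 (10^19 elements) could not realistically be built in memory on the client, so the concern is more about lists that are large but still buildable.

```python
from typing import Any, List, Mapping

MAX_IN_VALUES = 1000  # hypothetical cap, not an actual Chroma limit


def validate_metadata_keys(metadata: Mapping[str, Any]) -> List[str]:
    """Return a list of problems found in the keys of one metadata dict."""
    problems = []
    for key in metadata:
        if key == "":
            problems.append("empty metadata key")
        elif key.startswith("$"):
            problems.append(f"metadata key {key!r} looks like an operator")
    return problems


def oversized_lists(where: Mapping[str, Any], limit: int = MAX_IN_VALUES) -> List[str]:
    """Return "field.$op" paths whose $in/$nin operand has more than `limit` items.

    Only handles the flat {"field": {"$in": [...]}} shape; nested $and/$or
    clauses are not walked in this sketch.
    """
    found = []
    for field, condition in where.items():
        if not isinstance(condition, Mapping):
            continue
        for op, operand in condition.items():
            if op in ("$in", "$nin") and isinstance(operand, (list, tuple)):
                if len(operand) > limit:
                    found.append(f"{field}.{op}")
    return found


print(validate_metadata_keys({"": "", "$nin": "uhoh"}))
print(oversized_lists({"num": {"$in": list(range(25))}}, limit=10))
```

Whether checks like these belong in the client, the server, or both is an open question.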

I think this also means that ID filtering (new feature) will already allow for "operations" ("$gte", etc.), since a list comprehension passed to "$in" can simulate much of the same functionality.
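As a rough illustration of that point, the sketch below builds a `where` filter that behaves like `$gte` using only `$in` and a list comprehension. `simulate_gte` is a hypothetical helper, not part of Chroma, and it only works when the set of candidate values is known and finite. It uses the `num` field from the queries above; the same idea would apply to a list of IDs, assuming `$in` is accepted there.

```python
from typing import Any, Dict, Iterable


def simulate_gte(field: str, threshold: Any, candidates: Iterable[Any]) -> Dict[str, Any]:
    """Build an $in filter matching every candidate value >= threshold."""
    return {field: {"$in": [c for c in candidates if c >= threshold]}}


where = simulate_gte("num", 20, range(25))
print(where)  # {'num': {'$in': [20, 21, 22, 23, 24]}}
```

The resulting dict could then be passed as `where=` to `collection.query`, in the same way as the result9 query.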

Versions

Chroma 1.0.7
Python 3.11.12

Relevant log output

Labels: bug (Something isn't working)