MSMARCO 10m Recall Operations #1092

Merged
john-wagster merged 3 commits into elastic:master from john-wagster:msmarco_10m_recall_ops
Mar 18, 2026
@john-wagster

Added a 10m recall top-100 dataset, brute forced using the following approach:

  • ingest the 10m docs
  • get the existing queries-recall.json
  • script over that data, pulling out each query's embedding (the emb field) and rerunning a brute-force kNN using script_score like this:
{
  "query": {
    "script_score": {
      "query": {"match_all": {}},
      "script": {
        "source": "double value = dotProduct(params.query_vector, 'emb'); return sigmoid(1, Math.E, -value);",
        "params": {
          "query_vector": [0.02849635089343171, 0.027870560018247403]
        }
      }
    }
  }
}
  • regenerate a new queries-recall.json by replacing each ids list with the ids produced by brute forcing over only the 10m docs. The result is what's in the msmarco-v2-vector/queries-recall-10m.json.bz2 file.
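
A minimal Python sketch of the steps above, for illustration only. It assumes each record in queries-recall.json is a JSON object with at least the emb and ids fields mentioned above; the function names, the injected search callable, and the size and _source settings are hypothetical and not taken from the script actually used for this PR. logistic_score just spells out what sigmoid(1, Math.E, -value) evaluates to in the script.

```python
import math
from typing import Callable, Dict, Iterable, List


def build_brute_force_body(query_vector: List[float], size: int = 100) -> Dict:
    """Build an exact kNN request: script_score over match_all scores every doc."""
    return {
        "size": size,        # assumption: top 100, per the PR description
        "_source": False,    # assumption: only the hit ids are needed
        "query": {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    "source": (
                        "double value = dotProduct(params.query_vector, 'emb'); "
                        "return sigmoid(1, Math.E, -value);"
                    ),
                    # The key must match the name the script reads from params.
                    "params": {"query_vector": query_vector},
                },
            }
        },
    }


def logistic_score(dot: float) -> float:
    """sigmoid(1, e, -dot) = 1 / (1 + e^-dot): maps a dot product into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-dot))


def regenerate_recall_records(
    records: Iterable[Dict],
    search: Callable[[Dict], List[str]],
) -> List[Dict]:
    """Copy each record, replacing its ids with the brute-force result ids.

    `search` is a hypothetical callable that sends the body to the 10m index
    and returns the hit ids in rank order.
    """
    regenerated = []
    for record in records:
        new_record = dict(record)
        new_record["ids"] = search(build_brute_force_body(record["emb"]))
        regenerated.append(new_record)
    return regenerated
```

Passing search in as a callable keeps the sketch independent of any particular client; in practice it would wrap whatever issues the request against the 10m index.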

@john-wagster

john-wagster commented Mar 17, 2026

Did a run of this to make sure it works as expected:

config

Note: the data was already indexed as part of a prior run, but as things stand the track looks for initial_indexing_ingest_doc_count to signal that a 10m recall run is requested.

"track.params": {
  "mapping_type": "vectors-only",
  "vector_index_type": "bbq_disk",
  "initial_ingest_clients": 16,
  "initial_indexing_ingest_doc_count": 10000000,
  "corpora": ["msmarco-v2_base64-initial-indexing-1"],
  "include_initial_indexing": false,
  "include_recall": true,
  "include_parallel_indexing": false,
  "search_ops": [[10, 100, 5, 1]],
  "standalone_search_warmup_iterations": 1000,
  "standalone_search_iterations": 1000
}

outcome

knn-recall-10-100-5-1	0.826
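
For reference, a small sketch of how a recall figure like this is commonly computed: the fraction of the brute-force (ground truth) top-k ids that also appear in the approximate top-k. This is the usual definition, assumed here; it is not taken from the track's recall implementation, and recall_at_k is a hypothetical helper.

```python
from typing import Sequence


def recall_at_k(retrieved: Sequence[str], ground_truth: Sequence[str], k: int) -> float:
    """Fraction of the true top-k ids that show up in the retrieved top-k."""
    truth = set(ground_truth[:k])
    if not truth:
        return 0.0
    # Set intersection so a duplicated id in `retrieved` is not counted twice.
    hits = len(set(retrieved[:k]) & truth)
    return hits / len(truth)
```

The per-query values would then be averaged over all queries in the recall file to give a single number.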

@john-wagster

Also worth noting: I generated a top-10 dataset as well but did not include it in this PR. It's available in case we find it more useful (faster?) for whatever reason. This recall test seems pretty fast, though.

@davidkyle davidkyle left a comment

Nice, LGTM

@john-wagster john-wagster merged commit 0247086 into elastic:master Mar 18, 2026
15 checks passed