TODO sort and sort out:
- Models
- Better prompting with templates: Setting a mandatory document start #29
- Retry with feedback and retry without feedback. #30
- Implement "chain of thought" tasks #31
- Deal with dependencies requested by LLMs #174
- Evaluation run for all "good open weight models" with all available quantizations and different GPUs #209
- Exclude `openrouter` models `auto` and `flavor-of-the-week` automatically in the provider
- Rethink retry logic for LLM Providers #305
- Openrouter Provider preferences #286
- Include more models (we have the main problem that there are multiple models coming out every day; we should not wait for a "new version" of the eval but test these models right away and compare them. Big problem: how do we promote findings?)
- Nous-Hermes-2-SOLAR-10.7B; Also "Tree of Thoughts" approach might be interesting as a task
- Maybe use https://huggingface.co/inference-endpoints/dedicated
- Metrics & Reporting
- Evaluation folder with date cannot be created on windows #151
- extend the `report` command such that it takes result CSVs and automatically (see the sketch below)
  - does the summing and aggregation (if we still want that to be a separate step)
  - finds the maximum scores for that evaluation run
- once we have the leaderboard, we basically want to configure the repository such that we just add a model to some config somewhere and the GitHub actions run automatically and benchmark this model
- or in a similar fashion, we just do a new release and the GitHub actions run automatically and benchmark everything for the new version
- Automatically updated leaderboard for this repository: Do an up-to-date leaderboard/dashboard for current models current evaluation #26
- Take a look at current leaderboards and evals to know what could be interesting
Current popular code leaderboards are [LiveCodeBench](https://huggingface.co/spaces/livecodebench/leaderboard), the [BigCode models leaderboard](https://huggingface.co/spaces/bigcode/bigcode-models-leaderboard), [CyberSecEval](https://huggingface.co/spaces/facebook/CyberSecEval) and [CanAICode](https://huggingface.co/spaces/mike-ravkine/can-ai-code-results)
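
A minimal sketch of the CSV aggregation the extended `report` command could do. The column layout (`model`, `score`) and the per-model summing are assumptions for illustration; the real result CSVs may be laid out differently.

```go
// Minimal sketch: sum up scores per model across result CSVs and report the
// maximum total. Assumes hypothetical columns "model" (index 0) and "score" (index 1).
package main

import (
	"encoding/csv"
	"fmt"
	"os"
	"strconv"
)

func main() {
	totals := map[string]float64{}
	for _, path := range os.Args[1:] {
		file, err := os.Open(path)
		if err != nil {
			panic(err)
		}
		records, err := csv.NewReader(file).ReadAll()
		file.Close()
		if err != nil {
			panic(err)
		}
		if len(records) < 2 {
			continue // Only a header or an empty file: nothing to aggregate.
		}
		for _, record := range records[1:] { // Skip the header row.
			score, err := strconv.ParseFloat(record[1], 64)
			if err != nil {
				panic(err)
			}
			totals[record[0]] += score
		}
	}

	maximum := 0.0
	for model, total := range totals {
		fmt.Printf("%s: %.2f\n", model, total)
		if total > maximum {
			maximum = total
		}
	}
	fmt.Printf("maximum score: %.2f\n", maximum)
}
```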
- Scoring
- Infer if a model produced "too much" code #44
- Introduce an AST-differ that also gives metrics #80
- Add linters where each error is a metric #81
- Include metrics about the models for comparing models #82
- Coverage for Java is tracked for lines, while Go is tracked for ranges #193
- Weight "executed code" more prominently #233
- AST differ: Introduce an AST-differ that also gives metrics #80
- Linters: Add linters where each error is a metric #81
- Automatically infer "Extra code": Infer if a model produced "too much" code #44
- Figure out the "perfect" coverage score so we can display percentage of coverage reached
- Make coverage metric fair
- "Looking through logs... Java consistently has more code than Go for the same tasks, which yields more coverage. So a model that solves all Java tasks but no Go is automatically higher ranked than the opposite." -> only Symflower coverage will make this fair
- distinguish between latency (time-to-first-token) and throughput (tokens generated per second); see the sketch below
- Failing tests should receive a score penalty
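
For the latency/throughput item above, a minimal sketch of how the two metrics could be separated. The token stream over a channel is an assumption for illustration; in practice it would be fed from a provider's streaming API.

```go
// Minimal sketch: measure time-to-first-token (latency) separately from
// tokens-per-second (throughput) over a hypothetical stream of response tokens.
package main

import (
	"fmt"
	"time"
)

func measure(tokens <-chan string) (timeToFirstToken time.Duration, tokensPerSecond float64) {
	start := time.Now()
	count := 0
	for range tokens {
		if count == 0 {
			timeToFirstToken = time.Since(start)
		}
		count++
	}
	total := time.Since(start)
	if total > 0 {
		tokensPerSecond = float64(count) / total.Seconds()
	}
	return timeToFirstToken, tokensPerSecond
}

func main() {
	tokens := make(chan string)
	go func() {
		defer close(tokens)
		for i := 0; i < 10; i++ {
			time.Sleep(50 * time.Millisecond) // Simulate generation delay.
			tokens <- "token"
		}
	}()
	latency, throughput := measure(tokens)
	fmt.Printf("time to first token: %v, throughput: %.1f tokens/s\n", latency, throughput)
}
```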
- Metrics
- Track query tokens and save them to CSV #347
- Non-benchmark metrics (cost, weights open, ...): Include metrics about the models for comparing models #82
- Save the descriptions of the models as well: https://openrouter.ai/api/v1/models. The reason is that these can change over time, and we need to know after a while what they were, e.g. right now I would like to know if mistral-7b-instruct for the last evaluation was v0.1 or not. (See the sketch below.)
- Query REAL costs of all the testing of a model: the reason this is interesting is that some models have HUGE outputs, and since more output means more costs, this should be addressed in the score.
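
A minimal sketch for archiving the model descriptions mentioned above. The endpoint is the one from the item; the file naming and the fact that we just dump the raw JSON are assumptions.

```go
// Minimal sketch: download the model descriptions that OpenRouter serves at
// https://openrouter.ai/api/v1/models and store them next to the evaluation
// results, stamped with the current date.
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

func main() {
	response, err := http.Get("https://openrouter.ai/api/v1/models")
	if err != nil {
		panic(err)
	}
	defer response.Body.Close()

	data, err := io.ReadAll(response.Body)
	if err != nil {
		panic(err)
	}

	// Hypothetical file name; adjust to wherever evaluation artifacts are stored.
	fileName := fmt.Sprintf("openrouter-models-%s.json", time.Now().Format("2006-01-02"))
	if err := os.WriteFile(fileName, data, 0644); err != nil {
		panic(err)
	}
}
```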
- Reporting
- Do an up-to-date leaderboard/dashboard for current models current evaluation #26
- Bar charts should have their value on the bar. The axis values do not work that well.
- Pick an example or several examples per category: goal is to find interesting results automatically, because it will get harder and harder to go manually through results.
- Total-scores vs costs scatterplot. Result is the upper-left-corner sweet spot: cheap and good results.
- Scoring, Categorization, Bar Charts split by language.
- Pie chart of the whole evaluation's costs: for each LLM show how much it costs. Result is to see which LLMs cost the most to run the eval.
- deep-dive content
- What are results that align with expectations? What are results that go against expectations? E.g. are there small LLMs that are better than big ones?
- Are there big LLMs that totally fail?
- Are there small LLMs that are surprisingly good?
- What about LLMs where the community doesn't know that much yet, e.g. Snowflake, DBRX, ...
- Order models by open-weight, allows commercial use, closed, and price(!) and size: e.g. https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1 is great because it is open-weight and Apache 2.0 so commercial use is allowed. Should be better rated than GPT-4.
- Categorize by parameters/experts https://www.reddit.com/r/LocalLLaMA/comments/1cdivc8/comment/l1davhv/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button
- Compare input/output/request/... costs https://twitter.com/oleghuman/status/1786296672785420744
- Logging
- Remove absolute paths completely, e.g. in stack traces too (see the sketch below).
- Log request and response in their own files, so both can be used 1:1 (character for character) directly for debugging them: Keep individual coverage files and LLM query/responses #204
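
A minimal sketch for the absolute-path item above. It only covers paths below the current working directory; temporary directories and tool-installation paths would need the same treatment.

```go
// Minimal sketch: make log output machine-independent by replacing the
// absolute working directory (e.g. in stack traces) with a relative placeholder.
package main

import (
	"fmt"
	"os"
	"strings"
)

// redactWorkingDirectory replaces the absolute working directory in a log message.
func redactWorkingDirectory(message string) string {
	workingDirectory, err := os.Getwd()
	if err != nil {
		return message
	}
	return strings.ReplaceAll(message, workingDirectory, ".")
}

func main() {
	workingDirectory, _ := os.Getwd()
	fmt.Println(redactWorkingDirectory("panic at " + workingDirectory + "/evaluate/task.go:42"))
	// Output: panic at ./evaluate/task.go:42
}
```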
- Tooling & Installation
- CI and tools
- `InstallToolsPath` is not used for test execution (`make test`) #93
- Test for pulling Ollama model is flaky #135
- Flaky CI because of corrupted Z3 installation #107
- Follow up: Ollama Support #100
- `ollama_llama_server` and other background processes we start must be killed on CTRL+C #164 (see the sketch at the end of this list)
- Enable Ruby tests in Windows CI #334
- Move Dependency installation of Docker into multistage builds #319
- Rescore existing models / eval with fixes, e.g. when we build a better code repair tool, the LLM answer did not change, so we should rescore right away with the new version of the tool over a whole result of an eval.
- Automatic tool installation with fixed version
- Go
- Java
- Ensure that non-critical CLI input validation (such as unavailable models) does not panic
- Ollama support
- Install and test Ollama on MacOS
- Install and test Ollama on Windows
- Allow forwarding CLI commands to be evaluated: Ollama provider #27 (comment)
- Refactor `Model` and `Provider` to be in the same package: Preload Ollama models before inference and unload afterwards #121 (comment)
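
A minimal sketch for killing spawned processes on CTRL+C using a signal-aware context. Note that grandchild processes such as `ollama_llama_server` may additionally require killing the whole process group, which is platform-specific and not shown here.

```go
// Minimal sketch: terminate a spawned background process (e.g. the Ollama
// server) when the evaluation receives SIGINT (CTRL+C).
package main

import (
	"context"
	"os"
	"os/exec"
	"os/signal"
)

func main() {
	// The context is canceled as soon as SIGINT arrives.
	ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt)
	defer stop()

	// exec.CommandContext kills the child process when the context is canceled.
	cmd := exec.CommandContext(ctx, "ollama", "serve")
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		// An error is expected when the command was interrupted.
		os.Exit(1)
	}
}
```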
- Evaluation
- Interactive result comparison #208
- Benchmark quantized models, because they need less memory
- Do an evaluation with different temperatures
- Java
- Let the Java test case for `No test files` actually identify an error that there are no test files (needs to be implemented in `symflower test`)
- LLM
- Improve LLM prompt
- Take a look at https://x.com/dottxtai/status/1798443290913853770
- Add an app-name to the requests so people know we are the eval: https://openrouter.ai/docs#quick-start shows that other openapi-packages implement custom headers, but the one Go package we are using does not implement that. So do a PR to contribute. (See the sketch below.)
- We need to fork or use another package: Roadmap for v0.5.0 #79 (comment)
- Add markers for system and user, e.g. Ollama provider #27 (comment)
- Think about a standardized way of printing outputs, e.g. JSON https://twitter.com/WuMinghao_nlp/status/1789094583290507626
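
A minimal sketch for the app-name item above, injecting identification headers through a custom `http.RoundTripper`. OpenRouter documents the `HTTP-Referer` and `X-Title` headers for identifying an app; the concrete header values below are placeholders.

```go
// Minimal sketch: attach app-identification headers to every outgoing request
// via a custom http.RoundTripper.
package main

import "net/http"

// appHeaderTransport adds identification headers to every outgoing request.
type appHeaderTransport struct {
	base http.RoundTripper
}

func (t *appHeaderTransport) RoundTrip(request *http.Request) (*http.Response, error) {
	request = request.Clone(request.Context()) // Do not modify the caller's request.
	// Placeholder values: use the repository URL and the eval's name.
	request.Header.Set("HTTP-Referer", "https://github.com/symflower/eval-dev-quality")
	request.Header.Set("X-Title", "eval-dev-quality")
	return t.base.RoundTrip(request)
}

func main() {
	client := &http.Client{
		Transport: &appHeaderTransport{base: http.DefaultTransport},
	}
	_ = client // Pass this client to the API package, if it allows injecting one.
}
```

This only helps if the Go API package we use allows injecting a custom `http.Client`, which is exactly the gap the item describes.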
- Prepare language and evaluation logic for multiple files:
- Use `symflower symbols` to receive files (see the sketch below)
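
A minimal sketch for shelling out to `symflower symbols`; the exact arguments and output format are assumptions and need to be checked against the actual CLI.

```go
// Minimal sketch: run "symflower symbols" on a repository and print its output
// line by line. Assumption: the command takes a repository path and prints one
// entry per line.
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

func main() {
	output, err := exec.Command("symflower", "symbols", ".").CombinedOutput()
	if err != nil {
		panic(fmt.Sprintf("%s: %s", err, output))
	}
	for _, line := range strings.Split(strings.TrimSpace(string(output)), "\n") {
		fmt.Println(line)
	}
}
```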
- Evaluation tasks
- Evaluation task: TDD #194
- Assess failing tests #235
- Add evaluation task for "querying the relative test file path of a relative implementation file path", e.g. "What is the relative test file path for some/implementation/file.go?" ... it is "some/implementation/file_test.go" for most cases. (See the sketch at the end of this section.)
- Add evaluation task for code refactoring: two functions with the same code -> extract into a helper function
- Add evaluation task for implementing and fixing bugs using TDD
- Check determinism of models e.g. execute each plain repository X-times, and then check if they are stable.
- Code repair
- 0-shot, 1-shot, ...
- With LLM repair
- With tool repair
- Do test file paths through
  - `symflower symbols`
  - Task for models
- Move towards generated cases so models cannot integrate fixed cases to always have 100% score
- Think about adding more training data generation features: this will also help with dynamic cases
- Heard that Snowflake Arctic is very open with how they gathered training data... so we see what LLM creators think and want of training data
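
A minimal sketch of the expected answer for the test-file-path task above, using the Go `_test.go` convention; other languages need their own conventions (e.g. `FooTest.java` in a Maven layout).

```go
// Minimal sketch: derive the conventional Go test file path for an
// implementation file path.
package main

import (
	"fmt"
	"strings"
)

// testFilePath returns the conventional Go test file path for an implementation file.
func testFilePath(implementationFilePath string) string {
	return strings.TrimSuffix(implementationFilePath, ".go") + "_test.go"
}

func main() {
	fmt.Println(testFilePath("some/implementation/file.go"))
	// Output: some/implementation/file_test.go
}
```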
- Documentation
- Clean up and extend README
- Better examples for contributions
- Overhaul explanation of "why" we need evaluation, i.e. why is it good to evaluate for an empty function that does nothing.
- Extend "how to extend the benchmark" section with instructions on how to add new tasks + languages, so we can even use LLMs to add new stuff
- Write down a playbook for evaluations, e.g. one thing that should happen is that we let the benchmark run 5 times and then sum up points, but ... the runs should have at least a one-hour break in between to not run into cached responses.
- Content
- Benchmark that showcases base models vs. their fine-tuned coding models, e.g. in v0.5.0 we see that Codestral, codellama, ... are worse
- Snowflake against Databricks would be a nice comparison since they align company-wise and are new
- Write Tutorial for using Ollama
- YouTube video for using Ollama
- Blog post about the different suffixes of models, e.g. "chat" and "instruct", and eval them somehow. Idea from https://www.reddit.com/r/LocalLLaMA/comments/1bz5oyx/comment/kyrfap4/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button
- Blog post about HumanEval
- Blog post about training a small LLM directly on HumanEval
- Blog post about "non-determinism of LLMs" https://community.openai.com/t/a-question-on-determinism/8185 good starting point, and how we can make them at least more stable.
- Blogpost idea: misleading comments, weird coding style... how much does it take to confuse the most powerful AI? @ahumenberger
- Maybe not only comments. What about obfuscated code, e.g. function and variable names are just random strings?
- Research
- Take a look at https://twitter.com/SMT_Solvers/status/1783540994304066006
- Take a look at all of OpenRouter's API features, e.g. https://openrouter.ai/docs#parameters
- https://www.reddit.com/r/LocalLLaMA/comments/1cihrdt/comment/l29o97q/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button mentions that disabling repetition_penalty helps with performance and gives better results for coding.
- Requested new models for the eval: https://www.reddit.com/r/LocalLLaMA/comments/1cihrdt/comment/l2d4im0/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button
- Look at what Paul Gauthier is doing with the benchmarking of Aider (and at Aider itself) https://twitter.com/paulgauthier/status/1787827703158386921?s=46 - seems like a perfect match for what we want to do as tasks
- Look at MLX, which might help with execution for us https://twitter.com/loganthorneloe/status/1787845883519775120
- Take a look at xLSTM https://twitter.com/HochreiterSepp/status/1788072466675335185
- Take a look at eval https://twitter.com/JiaweiLiu_/status/1783959954321252697
- Take a look at evaluation framework https://twitter.com/hamelhusain/status/1788936691576975382?s=61
- Dig through https://arxiv.org/pdf/2405.14782 thanks to https://x.com/clefourrier/status/1793913394871062970
- Take a look at https://x.com/dottxtai/status/1798443290913853770
- Think about a commercial effort around the eval, so that we can balance some of the costs that go into maintaining this eval