The current version of the scoring function `compute_scores` seems to weigh non-rare states higher than rare ones. Instead of selecting states that explored rare value combinations, it prefers the ones where the value/address count is higher. Consider the following example input:
```c
uint64_t main() {
  uint64_t i;
  uint64_t j;
  uint64_t* x;

  i = 0;
  j = 0;

  x = malloc(8);
  *x = 0;

  read(0, x, 1);

  if (*x < 25) {
    *x = 0;
    while (i < 150) {
      i = i + 1;
    }
    return 1;
  } else {
    *x = 0;
    while (j < 150) {
      j = j + 1;
    }
    return 0;
  }
}
```
In this example I expect the simulation to fall into the true branch in about 10% of all runs, because the single byte read from stdin is below 25 for 25 of its 256 possible values. Hence incrementing the memory location of `i` will, with high probability, be rarer than incrementing `j`. I'd expect rarity simulation to select the states that are looping in the true branch. Note that the loop bound of 150 was chosen so that a state needs to survive and be selected for exactly one iteration to reach an exit path (assuming the default step size of 1000).
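The 10% figure can be sanity-checked by exhaustively enumerating all 256 values of the single input byte (a toy check, independent of monster itself):

```rust
fn main() {
    // read(0, x, 1) reads exactly one byte, so *x is one of 256 values.
    let true_branch = (0u32..=255).filter(|&b| b < 25).count();
    let probability = true_branch as f64 / 256.0;

    // 25 of 256 inputs take the true branch, i.e. roughly 9.8%.
    assert_eq!(true_branch, 25);
    println!("{true_branch} of 256 inputs ({:.1}%)", probability * 100.0);
}
```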
This, however, does not happen when using the default `harmonic` mean; see the following run. I also dump the scores vector for debugging purposes and to show that at least one state did fall into the true branch (two in this case, the ones with the lower score):
```
$ time ./target/debug/monster rarity --states 10 --selection 3 --iterations 2 ./examples/select-rare.m
Running rarity simulation round 1...
Creating 10 new states (0 copied, 10 new)
Running engines (took 23.706977ms)
Remove 0 successfully exited states from selection
Scoring states (took 402.021139ms)
scores: [0.14409221902017258, 0.14727540500736333, 0.14727540500736333, 0.14409221902017258, 0.14727540500736333, 0.14727540500736333, 0.14727540500736333, 0.14727540500736333, 0.14727540500736333, 0.14727540500736333]
selecting rarest 3 states
Running rarity simulation round 2...
Creating 7 new states (4 copied, 3 new)
Running engines (took 14.214393ms)
Remove 7 successfully exited states from selection
Scoring states (took 126.964981ms)
scores: [0.013235294117646955, 0.013235294117646955, 0.01311953352769669]
selecting rarest 3 states
no bug found in binary

real 0m1,181s
user 0m1,091s
sys 0m0,084s
```
When running the same example with the non-default `arithmetic` mean, the rarer state is preferred, because `score_with_mean` returns `Ordering::Less` in this case. See the following run as an example. Again, notice the single state with a lower score (which incidentally does get selected this time):
```
$ time ./target/debug/monster rarity --states 10 --selection 3 --iterations 2 --mean arithmetic ./examples/select-rare.m
Running rarity simulation round 1...
Creating 10 new states (0 copied, 10 new)
Running engines (took 24.153033ms)
Remove 0 successfully exited states from selection
Scoring states (took 403.984164ms)
scores: [1065.6, 1065.6, 1065.6, 1062.4, 1065.6, 1065.6, 1065.6, 1065.6, 1065.6, 1065.6]
selecting rarest 3 states
Running rarity simulation round 2...
Creating 7 new states (4 copied, 3 new)
bug found:
exit code greater than zero
concrete inputs read: [21]
pc: 0x1005c

real 0m1,050s
user 0m0,971s
sys 0m0,076s
```
Even though the arithmetic mean works out in this example, I still believe both variants are somewhat faulty. Here is my understanding of the scoring functions that are currently implemented:
- `harmonic`: The harmonic mean over all (non-inverse) counters of a state, but with n being the number of states instead of the number of counters in the state.
- `arithmetic`: The arithmetic mean over all (non-inverse) counters of a state, again with n being the number of states instead of the number of counters in the state. This time `Ordering::Less` is used, which somewhat hides the issue.
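To make the problem with the current `harmonic` variant concrete, here is a toy illustration with made-up counter values (two states with two counters each; this is not monster's actual data layout):

```rust
// Harmonic mean of a state's raw (non-inverse) counters.
fn harmonic_mean(counters: &[f64]) -> f64 {
    counters.len() as f64 / counters.iter().map(|c| 1.0 / c).sum::<f64>()
}

fn main() {
    // Invented counters: state A explored combinations seen only once each
    // (rare), state B explored combinations seen ten times each (common).
    let rare = [1.0, 1.0];
    let common = [10.0, 10.0];

    // The harmonic mean of raw counters gives the RARE state the LOWER
    // score (1.0 vs 10.0), so any selection that keeps the highest-scoring
    // states ends up keeping the common ones.
    assert!(harmonic_mean(&rare) < harmonic_mean(&common));
    println!("rare: {}, common: {}", harmonic_mean(&rare), harmonic_mean(&common));
}
```

Dividing by the number of states instead of the number of counters only rescales both scores by the same constant, so it cannot repair this ordering.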
Here are three alternatives that I think would be better and that we could try:
- `sum`: The sum of inverse(!) counter values, with no mean calculation. I believe this is the scoring function originally described in the paper; we should at least add it to our portfolio, even if we decide not to make it the default.
- `arithmetic*`: The arithmetic mean of the inverse(!) counter values, but importantly with n being the number of counters in the given state. Otherwise we are just dividing by a constant and not normalizing with respect to the number of counters (i.e. the number of memory locations with a concrete value) in each individual state.
- `harmonic*`: The harmonic mean of the inverse(!) counter values, again with n set as above, for the same reason.
Note that all three proposed scores use the intuitive `Ordering::Greater`, because higher scores represent rarer states in all three cases. The inverse is taken on the individual counter values before the mean calculation, not by flipping the interpretation of the mean; I believe this makes a difference in semantics. It also makes interpretation of the score uniform: higher is rarer, higher is better.
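The three proposals above can be sketched as follows. The function names and the representation of a state's counters as a `u64` slice are my own assumptions for illustration, not monster's actual types:

```rust
// `sum`: sum of inverse counter values, no mean.
fn sum_of_inverses(counters: &[u64]) -> f64 {
    counters.iter().map(|&c| 1.0 / c as f64).sum()
}

// `arithmetic*`: arithmetic mean of inverse counter values, with n being
// the number of counters in THIS state (not the number of states).
fn arithmetic_mean_of_inverses(counters: &[u64]) -> f64 {
    sum_of_inverses(counters) / counters.len() as f64
}

// `harmonic*`: harmonic mean of inverse counter values. Since 1/(1/c) = c,
// this reduces to n divided by the sum of the raw counters.
fn harmonic_mean_of_inverses(counters: &[u64]) -> f64 {
    counters.len() as f64 / counters.iter().map(|&c| c as f64).sum::<f64>()
}

fn main() {
    // Invented counters: each combination of the rare state was seen once,
    // each combination of the common state ten times.
    let rare = [1u64, 1];
    let common = [10u64, 10];

    // All three scores rank the rare state higher, so Ordering::Greater
    // ("highest score is selected first") is correct for all of them.
    assert!(sum_of_inverses(&rare) > sum_of_inverses(&common));
    assert!(arithmetic_mean_of_inverses(&rare) > arithmetic_mean_of_inverses(&common));
    assert!(harmonic_mean_of_inverses(&rare) > harmonic_mean_of_inverses(&common));
}
```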
WDYT about this alternative? Or am I completely off base here? All comments welcome!