Revised metrics proposal #15
Open: riedgar-ms wants to merge 44 commits into fairlearn:master from riedgar-ms:riedgar-ms/revised-metrics
Commits (44, all by riedgar-ms):
8c96695 Starting work
531c195 More text
66eab2f Working through next bit
cf8ae38 Adding another example
1d8987c Add in multiple sensitive features and segmentation
6d5c5e3 Sketch out the multiple metrics
00e0a1a Forgot to add a line
5ce78ec Think a bit about multiple metrics and getting the names
9061c70 Add a note about some convenience wrappers
5cd4e88 Some notes on pitfalls
75e1d1d Some more questions
aac662c Change Segmented Metrics to be Conditional Metrics
bf96d2b Working through some of the suggested changes
416d427 Some more fixes
fbb634f More fixes
77777ba Fix an errant sex
6d69fa9 Typo fix
13107f4 Update the metric_ property
048e484 Add extra clarifying note about datatypes for intersections
9ff4e91 Expand on note for conditional parity input types
5040649 Further updates to the text
210872e Add some suggestions for alternative names
3ba87aa Adding notebook of samples
c13da58 More examples in notebook
13069d5 Put in remaining comparisons
2e10c46 Starting to change over to constructor method etc.
8296f3a More on methods
f755711 More changes based on prior discussion
0f28fe3 Another correction
2ea2037 Some fixes to remove `group_summary()` (not yet complete)
41bb325 Some extensive edits
205c239 Add make_grouped_scorer()
70d8811 Errant group_summary
96703c6 Add link to SLEP006
7521261 Minor update to notebook
f9ed32c Some small updates to the proposal
091c47e Add make_derived_metric back in
180370b Add note about meetings
16983bf Merge remote-tracking branch 'upstream/master' into riedgar-ms/revise…
acfacab Update after today's discussion
697eee0 Merge remote-tracking branch 'upstream/master' into riedgar-ms/revise…
f0cdb5c Update to reflect reality
4feb19f Remove uneeded notebook
5649f92 Fix the odd typo
# Updates for Metrics

This is an update for the existing metrics document, which is being left in place for now as a point of comparison.

We are having meetings to discuss this proposal.
Please reach out to `[email protected]` if you would like to join.

## Assumed data

In the following, we assume that variables of the following form have been defined:

```python
y_true = [0, 1, 0, 0, 1, 1, ...]
y_pred = [1, 0, 0, 1, 0, 1, ...]  # Content can be different for other metrics (see below)
A_1 = ['C', 'B', 'B', 'C', ...]
A_2 = ['M', 'N', 'N', 'P', ...]
A = pd.DataFrame(np.transpose([A_1, A_2]), columns=['SF 1', 'SF 2'])

weights = [1, 2, 3, 2, 2, 1, ...]
```

We aim to be agnostic as to the contents of the `y_true` and `y_pred` arrays; the meaning is imposed on them by the underlying metrics.
Here we have shown binary values for a simple classification problem, but they could be floating point values from a regression, or even collections of classes and associated probabilities.

## Basic Calls

### Existing Syntax

Our basic method is `group_summary()`:

```python
>>> result = flm.group_summary(skm.accuracy_score,
                               y_true, y_pred,
                               sensitive_features=A_1)
>>> print(result)
{'overall': 0.4, 'by_group': {'B': 0.6536, 'C': 0.213}}
>>> print(type(result))
<class 'sklearn.utils.Bunch'>
```
The `Bunch` is an object which can be accessed in two ways: either as a dictionary (`result['overall']`) or via properties named by the dictionary keys (`result.overall`).
Note that the `by_group` key accesses another `Bunch`.
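For illustration, with the `result` computed above both access styles return the same value:
```python
>>> result['by_group']['B']
0.6536
>>> result.by_group.B
0.6536
```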
We allow for sample weights (and other arguments which require slicing) via `indexed_params`, and pass through other arguments to the underlying metric function (in this case, `normalize`):
```python
>>> flm.group_summary(skm.accuracy_score,
                      y_true, y_pred,
                      sensitive_features=A_1,
                      indexed_params=['sample_weight'],
                      sample_weight=weights, normalize=False)
{'overall': 20, 'by_group': {'B': 60, 'C': 21}}
```

We also provide some wrappers for common metrics from SciKit-Learn:
```python
>>> flm.accuracy_score_group_summary(y_true, y_pred,
                                     sensitive_features=A_1)
{'overall': 0.4, 'by_group': {'B': 0.6536, 'C': 0.213}}
```

### Proposed Change

We propose to introduce a new object, the `MetricFrame` (name discussion below).
Users will compute metrics by passing arguments into the constructor:
```python
>>> metrics = MetricFrame(skm.accuracy_score,
                          y_true, y_pred,
                          sensitive_features=A_1)
>>> metrics.overall
accuracy_score    0.4
>>> metrics.by_group
   accuracy_score
B          0.6536
C          0.213
```
The `overall` property is a Pandas Series, indexed by the name of the underlying metric.
The `by_group` property is a Pandas DataFrame, with a column named by the underlying metric.
The rows of the `by_group` property are set to the unique values of the `sensitive_features=` argument.

Sample-based parameters (such as sample weights) can be passed in using the `sample_params=` argument:
```python
>>> metrics = MetricFrame(skm.accuracy_score,
                          y_true, y_pred,
                          sensitive_features=A_1,
                          sample_params={'sample_weight': weights})
```
If the underlying metric requires other arguments (such as the `beta=` argument to `fbeta_score()`), then `functools.partial()` must be used.
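For example, a minimal sketch of using `functools.partial()` to fix `beta` before constructing the proposed `MetricFrame` (assuming the data defined earlier):
```python
import functools
import sklearn.metrics as skm

# Bind beta ahead of time, then hand the wrapped metric to the constructor
fbeta_06 = functools.partial(skm.fbeta_score, beta=0.6)

metrics = MetricFrame(fbeta_06,
                      y_true, y_pred,
                      sensitive_features=A_1)
```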
## Obtaining Scalars

### Existing Syntax

We provide methods for turning the `Bunch`es returned from `group_summary()` into scalars:
```python
>>> difference_from_summary(result)
0.4406
>>> ratio_from_summary(result)
0.3259
>>> group_max_from_summary(result)
0.6536
>>> group_min_from_summary(result)
0.2130
```
We also provide wrappers such as `accuracy_score_difference()`, `accuracy_score_ratio()` and `accuracy_score_min()` for user convenience.

### Proposed Change

Although the functionality of `group_min_from_summary()` and `group_max_from_summary()` can be accessed by calling `metrics.by_group.min()` and `metrics.by_group.max()`, we will provide `.group_min()` and `.group_max()` methods for completeness.

For differences and ratios, we will provide `.difference()` and `.ratio()` methods.
These will take an optional `method=` argument, to indicate how the values are to be calculated.
For now, the valid values of this argument will be `between_groups` (indicating that only the values in the `by_group` property should be used) and `to_overall` (indicating that all results should be calculated relative to the appropriate values in the `overall` property).
First, computing the difference:
```python
>>> metrics.difference(method='between_groups')
accuracy_score    0.4406
dtype: float64
>>> metrics.difference(method='to_overall')
accuracy_score    0.2536  # max(abs(0.6536-0.4), abs(0.213-0.4))
dtype: float64
```

Note that the result type is a Series (for reasons which will become clear below).
The `ratio()` method would behave in a similar way:
```python
>>> metrics.ratio(method='between_groups')
accuracy_score    0.3259
dtype: float64
>>> metrics.ratio(method='to_overall')
accuracy_score    0.6120  # min(0.4/0.6536, 0.4/0.213)
dtype: float64
```
## Intersections of Sensitive Features

### Existing Syntax

Our current API does not support evaluating metrics on intersections of sensitive features (e.g. "black and female", "black and male", "white and female", "white and male").
To achieve this, users currently need to write something along the lines of:
```python
>>> A_combined = A['SF 1'] + '-' + A['SF 2']

>>> accuracy_score_group_summary(y_true, y_pred,
                                 sensitive_features=A_combined)
{'overall': 0.4, 'by_group': {'B-M': 0.4, 'B-N': 0.5, 'B-P': 0.5, 'C-M': 0.5, 'C-N': 0.6, 'C-P': 0.7}}
```
This is unnecessarily cumbersome.
It is also possible that some combinations might not appear in the data (especially as more sensitive features are combined), but identifying which ones were not represented in the dataset would be tedious.

### Proposed Change

If `sensitive_features=` is a DataFrame (or a list of Series, a list of numpy arrays, a 2D numpy array, etc.), we can generate our results in terms of a MultiIndex. Using the `A` DataFrame defined above, a user might write:
```python
>>> result = MetricFrame(skm.accuracy_score,
                         y_true, y_pred,
                         sensitive_features=A)
>>> result.by_group
           accuracy_score
SF 1 SF 2
B    M               0.50
     N               0.40
     P               0.55
C    M               0.45
     N               0.70
     P               0.63
```
If a particular combination of sensitive features had no representatives, then we would return `NaN` for that entry.
Although this example passes a DataFrame for `sensitive_features=`, we should aim to support lists of Series and `numpy.ndarray` as well, as sketched below.
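For instance, a sketch (under the proposed API) of supplying the same two sensitive features as a list of numpy arrays rather than a DataFrame; since no column names are available, the levels of the resulting MultiIndex would need generated names:
```python
import numpy as np

# Hypothetical call with a list of arrays instead of a DataFrame;
# any generated level names would be illustrative only
result = MetricFrame(skm.accuracy_score,
                     y_true, y_pred,
                     sensitive_features=[np.asarray(A_1), np.asarray(A_2)])
```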
The `difference()` and `ratio()` methods would act on this DataFrame as before.

## Control Metrics

Control Metrics (alternatively known as Conditional Metrics) are specified separately from the sensitive features, since the aggregation functions discussed above do not act across them.
Within the `by_group` property, they behave like additional sensitive features.

### Existing Syntax

Not supported.
Users would have to devise the required code themselves.

### Proposed Change

The `MetricFrame` constructor will need an additional argument `control_features=` to specify the control features.
It will accept similar types to the `sensitive_features=` argument.
Suppose we have another column called `income_level` with unique values 'Low' and 'High':
```python
>>> metric = MetricFrame(skm.accuracy_score,
                         y_true, y_pred,
                         sensitive_features=A_1,
                         control_features=income_level)
>>> metric.overall
      accuracy_score
High            0.46
Low             0.61
>>> metric.by_group
        accuracy_score
High B            0.40
     C            0.55
Low  B            0.55
     C            0.65
```
The `overall` property is now a DataFrame, with rows corresponding to the unique values of the control feature(s).
Similarly, the `by_group` property now has a Pandas MultiIndex on its rows, combining the value(s) of the control feature(s) with those of the sensitive feature(s).

Note that it is possible to have multiple sensitive features, and multiple control features.
Operations such as `.group_max()` and `.difference()` will act on each combination of control feature values, and aggregate across the sensitive features.
So, for example:
```python
>>> metric.difference(method='between_groups')
      accuracy_score
High            0.15
Low             0.10
>>> metric.difference(method='to_overall')
      accuracy_score
High            0.09
Low             0.06
```

If users find it more convenient to have the control features as sub-columns under the metrics, the `unstack()` method of the pandas DataFrame can be used, as sketched below.
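For instance, a sketch assuming the `metric` object constructed above; `level=0` is the control-feature level of the row index:
```python
# Pivot the income_level values out of the row index and into the columns
wide = metric.by_group.unstack(level=0)
print(wide)
#   accuracy_score
#             High   Low
# B           0.40  0.55
# C           0.55  0.65
```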
## Multiple Metrics

Finally, we can also allow for the evaluation of multiple metrics at once.

### Existing Syntax

This is not supported.
Users would have to devise their own means.

### Proposed Change

We allow a dictionary of metric functions to be passed to the `MetricFrame` constructor.
The properties then extend themselves:
```python
>>> result = MetricFrame({'accuracy': skm.accuracy_score, 'precision': skm.precision_score},
                         y_true, y_pred,
                         sensitive_features=A_1)
>>> result.overall
accuracy     0.3
precision    0.5
dtype: float64
>>> result.by_group
   accuracy  precision
B       0.4       0.70
C       0.6       0.75
```
Note that we use the dictionary keys, rather than the function names, in the output.
This should generalise to the other methods described above, as sketched below.
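For example, `difference(method='between_groups')` would then return one entry per metric, computed from the `by_group` values shown above (abs(0.4-0.6) and abs(0.70-0.75)); a sketch of the expected output:
```python
>>> result.difference(method='between_groups')
accuracy     0.20
precision    0.05
dtype: float64
```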
When users wish to use the `sample_params=` argument, they should pass in a dictionary of dictionaries, matching the functions by key:
```python
metric_fns = {'accuracy': skm.accuracy_score, 'precision': skm.precision_score}
sample_params = {'accuracy': {'sample_weight': weights}, 'precision': {'sample_weight': weights}}
result = MetricFrame(metric_fns,
                     y_true, y_pred,
                     sensitive_features=A_1,
                     sample_params=sample_params)
```
The outer set of dictionary keys given to `sample_params=` should be a subset of the keys of the metric function dictionary.
This is somewhat repetitious (note the repeated `sample_weight` above), but trying to share arguments between functions is likely to lead to a worse mess.

## Generality

Throughout this document, we have been describing the case of classification metrics.
However, we do not actually require this.
It is the underlying metric function which gives meaning to the `y_true` and `y_pred` lists.
So long as these are of equal length (and equal in length to the sensitive feature list, which _will_ be treated as categorical), `MetricFrame` does not care about their datatypes.
For example, each entry in `y_pred` could be a dictionary of predicted classes and accompanying probabilities.
Or the user might be working on a regression problem, where both `y_true` and `y_pred` are floating point numbers (or `y_pred` might even be a tuple of predicted value and error).
So long as the underlying metric understands the data structures, `MetricFrame` will not care.
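As an illustration of this generality, a non-scalar metric such as scikit-learn's `confusion_matrix` could be evaluated per group; a sketch using the proposed constructor, in which each entry of `by_group` would hold an array rather than a scalar:
```python
import sklearn.metrics as skm

cm_frame = MetricFrame(skm.confusion_matrix,
                       y_true, y_pred,
                       sensitive_features=A_1)
# Each cell of cm_frame.by_group contains a confusion matrix, not a scalar
print(cm_frame.by_group)
```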
There will be an effect on the `difference()` and `ratio()` methods, however.
Although the `overall` and `by_group` properties will work fine, `difference()` and `ratio()` may not.
After all, what does "take the ratio of two confusion matrices" even mean?

We should try to trap these cases and throw a meaningful exception (rather than propagating whatever exception happens to emerge from the underlying libraries).
Since `difference()` and `ratio()` will only work when the metric has produced scalar results, this should be a straightforward test using [`isscalar()` from Numpy](https://numpy.org/doc/stable/reference/generated/numpy.isscalar.html), as sketched below.
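A minimal sketch of such a guard; the helper name and arguments are hypothetical, not part of the proposal:
```python
import numpy as np

def _require_scalar_results(by_group_values, operation_name):
    # Hypothetical internal check: refuse to aggregate non-scalar metric results
    if not all(np.isscalar(v) for v in by_group_values):
        raise ValueError(
            f"{operation_name}() is only defined for metrics which return scalars")
```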
## Pitfalls

There are some potential pitfalls which could trap the unwary.

The biggest of these relate to missing classes in the subgroups.
To take an extreme case, suppose that the `B` group specified by `SF 1` were only ever predicted to be in classes H or J, while the `C` group were only ever predicted to be in classes K or L.
The user could request precision scores, but the results would not really be comparable between the two groups.
With intersections of sensitive features, cases like this become more likely.

Metrics in SciKit-Learn usually have arguments such as `pos_label=` and `labels=` to allow the user to specify the expected labels, and to adjust their behaviour accordingly (see the sketch below).
However, we do not require that users stick to the metrics defined in SciKit-Learn.
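For instance, a sketch of pinning the label set with `functools.partial()` so that every subgroup is scored over the same classes; the label values here are purely illustrative:
```python
import functools
import sklearn.metrics as skm

# Score precision over an explicit, fixed label set in every subgroup
precision_fixed_labels = functools.partial(skm.precision_score,
                                           labels=['H', 'J', 'K', 'L'],
                                           average='macro',
                                           zero_division=0)

metrics = MetricFrame(precision_fixed_labels,
                      y_true, y_pred,
                      sensitive_features=A_1)
```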
Unfortunately, the generality of `MetricFrame` means that we cannot solve this for the user.
It cannot even tell if it is evaluating a classification or regression problem.