Conversation
feat: create markdown version of the trust meter introduction google doc section.

The Problem section introduces the issues associated with outliers. The Purpose section is revised to reflect the scope of the document (it applies to outliers, or at least those who are members of marginalized groups, not just those with disabilities). The Scope section is revised to include systems that serve an advisory role, such as those discussed by the Committee.
| There may not yet be many of these for AI systems, but with careful presentation |
| we can likely use some from older systems. (2) it would be great to have |
Originally by Jutta on January 26, 2026 in a comment on the Trust Meter Introduction Google Doc.
We have been collecting examples from harm and incident databases that are moderated and verified.
Originally by @jasonjgw on February 9, 2026 in a comment on the Trust Meter Introduction Google Doc.
We still don't have a good survey of potential harms, partly because there
isn't (or at least, I don't have) a suitable variety of example applications.
The harms caused are, I assume, related to the purposes for which systems may
be used. For example, the risks associated with a tool for recommending
administrative decisions are different from those of a system that assists in
research or summarizes documentation.

It also isn't obvious that the harms of under-representation are different
from those of biased representation, particularly if the effects on a system's
behavior are indistinguishable.
| The purpose of this note is to outline some of the potential problems for people |
| with disabilities and those who support them posed by AI tools, and approaches |
Originally by @jasonjgw on February 26, 2026 in a comment on the Trust Meter Introduction Google Doc.
as well as other individuals and groups who are statistical outliers for purposes of the relevant AI application. We'll need a section that discusses and provides representative examples of outliers. People may have outlier status due to protected characteristics (race/ethnicity, gender, disability, socio-economic position, etc.) alone or in various combinations, i.e., intersectionality. Presumably they may be outliers simply in virtue of being different from most of the population with respect to a variable relevant to the AI application, even if unrelated to a characteristic that typically gives rise to discriminatory treatment in society.
| to mitigating these problems. The focus is on problems that arise from people |
| with disabilities being different, or represented as being different, from other |
| people, rather than on the problem of bias, which arises when AI tools may embed |
| biased attitudes about people with disabilities. |
Originally by Jutta on January 26, 2026 in a comment on the Trust Meter Introduction Google Doc.
or because people are under-represented in the data.
Originally by @jasonjgw on February 9, 2026 in a comment on the Trust Meter Introduction Google Doc.
As a preliminary
question of scope, however, suppose that a model of either type derives very
little information from its training corpus about people with disabilities,
and what it does learn tends to reflect stereotypes. This is a combination of
biased information (not our topic, if I understand correctly) and limited
information (the people in question are greatly under-represented in the
corpus), hence the two problems compound. Thus it might be worth acknowledging
an interaction between the two issues, even though biased information (by
contrast with missing information and outlier status) isn't our central topic
here.
| - provides foundational guidance to support adoption but does not prescribe |
| conformance requirements |
|
| ## Statistical Discrimination |
Originally by @vr619536 on February 18, 2026 in a comment on the Trust Meter Introduction Google Doc.
We will want to expand on these, in part to connect the document to the SOW and the original TOR.
| - The technical specification applies to machine‑learning‑based classification |
| systems used in decision‑making. |
Originally by @jasonjgw on February 26, 2026 in a comment on the Trust Meter Introduction Google Doc.
Does this include systems providing advice or information that may be used in making decisions? For example, AI Answers isn't used directly in decision-making, but the information and references it offers may influence subsequent decisions.
| The people who have the most stake in the correct operation of a system are |
| often the people whose cases are being handled. There should be an easy feedback |
| system in place so that users can speak up when they feel the system has not |
| handled their case correctly. |
Originally by @jasonjgw on February 26, 2026 in a comment on the Trust Meter Introduction Google Doc.
Moreover, if a system is used in decision-making, standards of procedural fairness (e.g., administrative law requirements) may necessitate the availability of procedures by which those affected can challenge decisions and invoke additional human review.
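A minimal sketch of the kind of feedback/appeal record such a mechanism might keep; the fields are hypothetical, chosen only to illustrate the audit trail that user feedback and procedural-fairness review would require.

```python
# Hypothetical record of a user's challenge to how their case was handled.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class CaseFeedback:
    case_id: str
    submitted_by: str
    complaint: str                      # the user's account of the mishandling
    requests_human_review: bool = True  # invoke additional human review
    received_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))
    resolution: Optional[str] = None    # filled in after review
```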
| For ET systems ##can someone look in the literature for how one can improve a |
| classifier system?## |
Originally by @klown on February 4, 2026 in a comment on the Trust Meter Introduction Google Doc.
I started to look into this, and found a high-level overview article entitled "What to Do When Your Classification Model Isn’t Performing Well". It lists approximately 17 techniques across 7 broad categories. The categories are performance metrics, data quality, feature performance, model architecture and hyper-parameters, over- and under-fitting, cross-validation, and ensemble methods. This is a rather large landscape of things to consider. I can't tell which are the important ones in this context.
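For concreteness, here is a minimal sketch of two of the techniques in that landscape (cross-validation and ensembles), assuming scikit-learn; it is an illustration only, not a judgment about which techniques matter in this context.

```python
# Cross-validation: estimate performance on held-out folds rather than a
# single train/test split. Ensemble: combine many weak models, which often
# improves robustness.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

baseline = LogisticRegression(max_iter=1000)
print("baseline CV accuracy:", cross_val_score(baseline, X, y, cv=5).mean())

ensemble = RandomForestClassifier(n_estimators=200, random_state=0)
print("ensemble CV accuracy:", cross_val_score(ensemble, X, y, cv=5).mean())
```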
| Another approach that could be tried would be adding mishandled cases, with the |
| proper handling described, to future prompts. Modern FB systems support very |
| long prompts, so a good deal of corrective information could be added in this |
| way. |
Originally by @jasonjgw on February 26, 2026 in a comment on the Trust Meter Introduction Google Doc.
with the qualification that the system may not reliably act on this guidance as desired.
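A minimal sketch of how the corrective-prompt idea might look. The case texts are elided placeholders, and the prompt wording is invented; as the comment above notes, the system may not reliably act on this guidance.

```python
# Build a prompt that prepends previously mishandled cases, each with its
# proper handling, before the new case. The resulting prompt would be sent
# to whatever FB system is in use.
CORRECTIONS = [
    {"case": "...", "proper_handling": "..."},  # a mishandled case and its fix
]

def build_prompt(new_case: str) -> str:
    parts = ["Handle the case below. Past cases are included, each with "
             "the handling it should have received, for guidance."]
    for c in CORRECTIONS:
        parts.append(f"Case: {c['case']}\nCorrect handling: {c['proper_handling']}")
    parts.append(f"Case: {new_case}\nCorrect handling:")
    return "\n\n".join(parts)
```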
| As with other aspects of these systems, these improvement processes need to be |
| monitored and checked. For example, adding a case with its proper processing |
| does not guarantee that the system will respond correctly to future cases. |
Originally by @jasonjgw on February 26, 2026 in a comment on the Trust Meter Introduction Google Doc.
Is there a risk that improving a system's performance in some cases may effectively reduce its performance in others? Does this necessitate re-testing of a large set of cases before deploying an update?
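A minimal sketch of the re-testing idea raised in the comment above, assuming a hypothetical evaluate_case() harness for the system under test.

```python
# Before deploying an update, re-run a frozen reference set and report any
# case that passed before but fails now. evaluate_case() is a hypothetical
# function returning True when the updated system handles a case acceptably.
def find_regressions(evaluate_case, reference_cases, previously_passing):
    return [case for case in reference_cases
            if case in previously_passing and not evaluate_case(case)]

# Deploy only if find_regressions(...) returns an empty list.
```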
| called for. These same problems beset human-provided services, too. Can we |
| harness AI, with its flexibility and ability to process vast amounts of |
| information, to do better? |
Originally by @jasonjgw on February 26, 2026 in a comment on the Trust Meter Introduction Google Doc.
I think (as Clayton notes informally in the cover letter at the start of the text) a case can be made for adding a section to the document that describes and exemplifies the harms which systems may cause, directly or indirectly. Guidance could be given on how to decide the threshold question of whether to deploy a machine learning-based system for a given purpose, in light of the potential harms it may cause and the risks of alternative (including manual or other conventional technological) means of achieving the same task objectives. Reference may need to be made in this connection to AI ethics literature.

In addition, should a summary section be added to the document that captures the considerations which should be taken into account at each stage of a system's development and use (e.g., design, implementation, pre-operational testing, deployment, monitoring and update)?
under-represented and misrepresented in data provided to AI systems.
Section 3 ("Statistical Discrimination") seems duplicative of the material in the preceding section. Maybe it should be deleted, or perhaps rewritten with new material.

If appropriate in a Canadian standards committee draft, I think the Cover Note should be revised as a request for comments from reviewers, emphasizing the lacunae in the circulated version of the document. The means provided for submitting review comments (email, GitHub, and any others) should be specified. I can prepare a PR if this approach is considered desirable.
and amplifying the preceding definitions in this section rather than introducing new distinctions.
members of marginalized groups more broadly, in accordance with the scope of the project and the proposed revision of the introductory sections.
collection of good, reproducible examples, as discussed with Clayton recently.
The following is a list of substantial components missing from the current draft that have been suggested, or at least discussed as possibilities. Thanks to Clayton Lewis for insightful conversation surrounding these ideas.
Markup changes to the draft document.
feat: copy official scope clause
Clarify and separately state how the scope is interpreted for purposes of the document.
Add Problem section, and revise Purpose and Scope sections.
Delete duplicative "Statistical Discrimination" section.
Corrections, additions and clarifications
Added section on statistical discrimination
| ## Definitions |
|
| ### Example Trained – “ET” |
|
| The AI tool’s training data is examples with predefined correct responses. A |
| classification AI is an example of an ET tool. |
|
| ### Foundation Based – “FB” |
|
| The AI tool is trained, but not using specific examples. Instead, it is |
| trained on vast amounts of text or media. An example of an FB tool is a Large |
| Language Model (LLM). |
Originally by Julia Stoyanovich on March 24, 2026
The distinction between example-trained (ET) and foundation-based (FB) AI tools in the Definitions section could be sharpened. Foundation models are also trained on examples — they learn from billions of sentences, images, or other data. The difference is that nobody hand-labeled each example with a correct answer; instead, the model picks up patterns on its own (for instance, by learning to predict the next word in a sentence). So the ET/FB distinction doesn't quite capture what makes these tools different. I'd suggest framing it instead around what the tool is designed to do. Here is a revised version of the Definitions and Kinds of AI Tools sections, combined into one section.
See changes in PR #80 (rendered markdown)
| ## Potential problems for marginalized groups |
|
| ### Problems with ET tools |
|
| These different kinds of tools present different kinds of problems, with |
| different possible remedies. Let’s consider first ET tools, along with FB tools |
| in which examples are used, either in fine tuning or in prompts. An obvious |
| problem arises if the examples used in creating or shaping these tools don’t |
| include examples that reflect the situations or needs of a diversity of people, |
| especially those at risk of discrimination. We’ll call this the _representation_ |
| problem. It’s clear that if members of marginalized groups and their |
| circumstances aren’t represented in the shaping of a tool, there's a real risk |
| that the tool will produce inappropriate responses. |
|
| However, even if representation is achieved, ET tools can be problematic for |
| people who are outliers in relevant respects. Commonly, ET tools work by |
| creating a mathematical model of the examples on which they are trained. This |
| model can’t capture all the details of the examples, but forms a simplified, |
| approximate picture of the examples. The training process pushes the model to |
| do a good job on the average, not to give the correct response on every example. |
| It follows that the simplified model will be more accurate on examples whose |
| features are common in the collection of examples than on examples whose |
| features are uncommon. |
|
| This works against people who are already at risk of discrimination, in many |
| cases. Their situations are often different from the average or the norm, in |
| relevant respects. For example, a person with a disability may have an unusual |
| employment record. A model that does well in evaluating applicants with common |
| employment records, and so looks good on the average, may do poorly for people |
| with unusual records. We’ll call this the _averaging_ problem, associated with |
| unusual examples, namely outliers. |
|
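A minimal sketch of the averaging problem on synthetic data, assuming scikit-learn and NumPy; the subgroup sizes and patterns are invented for illustration.

```python
# A single classifier is trained on data in which a small subgroup (5%)
# follows a different pattern from the majority. Minimizing average error
# lets the model look good overall while failing the subgroup.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Majority: the label depends on feature 0. Minority: it depends on
# feature 1 instead -- an under-represented, different pattern.
X_major = rng.normal(size=(1900, 2))
y_major = (X_major[:, 0] > 0).astype(int)
X_minor = rng.normal(size=(100, 2))
y_minor = (X_minor[:, 1] > 0).astype(int)

model = LogisticRegression().fit(np.vstack([X_major, X_minor]),
                                 np.concatenate([y_major, y_minor]))

print("majority accuracy:", model.score(X_major, y_major))  # high
print("minority accuracy:", model.score(X_minor, y_minor))  # near chance
```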
| This problem is sometimes described as the problem of _out of sample_ data. But |
| that’s what we’re calling the _representation_ problem. The averaging problem |
| can occur even for examples that are included in the training examples, that is, |
| for cases that are in sample, not out of sample. |
|
| In the literature, an outlier is often defined as an exemplar that is so |
| different from the other examples in a population that it must represent a |
| different population, or result from a different process, than the rest of the |
| population. In experimental data, it’s not uncommon for outliers to be excluded |
| from analysis, as reflecting some irrelevant failure of procedure. |
|
| In our situation, excluding outliers is obviously not appropriate. But |
| identifying them can be a method of mitigating the averaging problem, as we’ll |
| discuss. |
|
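One way such identification might work in practice, sketched with scikit-learn's IsolationForest (an assumed choice; other outlier detectors would serve equally well): flag cases that are unusual relative to the training data so they can be routed to human review rather than handled automatically.

```python
# Fit an outlier detector on typical cases, then flag unusual new cases.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 4))            # typical cases
detector = IsolationForest(random_state=0).fit(X_train)

X_new = np.vstack([rng.normal(size=(5, 4)),     # typical new cases
                   [[6.0, -5.0, 7.0, -6.0]]])   # an unusual case
for case, flag in zip(X_new, detector.predict(X_new)):  # -1 marks an outlier
    if flag == -1:
        print("outlier -- consider human review:", case)
```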
| ### Problems with FB tools |
|
| We consider here problems that aren’t traceable to a collection of examples that |
| was used to train or shape the responses of a tool, but rather to the |
| characteristics of FB tools. All of these problems are becoming less common as |
| the technology advances, but all remain issues that tool creators need to |
| address today. |
|
| #### Brittleness |
|
| This refers to changes in behaviour in response to small changes in inputs. It’s |
| also called prompt sensitivity. Traditional automation is notoriously brittle, |
| in that it commonly responds only to well-formed inputs. An input that is even |
| slightly incorrect, or slightly outside the designed scope of a traditional |
| tool, may not be processed at all. By contrast, FB tools may respond |
| appropriately to a wide range of inputs. For example, questions can often be |
| framed in many different ways, and be answered appropriately. However, it can |
| also happen that one input gets an appropriate response, and another, that seems |
| as if it should be equivalent, gets a different response. For example, [Wang et |
| al.](https://www.nature.com/articles/s41746-024-01029-4.pdf) found that |
| seemingly equivalent medical questions often received different answers from FB |
| systems. |
|
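A minimal sketch of a prompt-sensitivity check along the lines of the Wang et al. finding: pose paraphrases of the same question and flag pairs whose answers diverge. ask_model() and answers_agree() are hypothetical stand-ins for the FB system under test and an appropriate equivalence check (exact match, human review, etc.); the paraphrases are invented examples.

```python
# Compare answers across seemingly equivalent phrasings of one question.
PARAPHRASES = [
    "What benefits am I eligible for as a part-time worker?",
    "As someone who works part time, which benefits can I claim?",
    "Which benefits can a part-time employee receive?",
]

def check_prompt_sensitivity(ask_model, answers_agree):
    answers = [ask_model(p) for p in PARAPHRASES]
    return [(PARAPHRASES[i], PARAPHRASES[j])       # divergent prompt pairs
            for i in range(len(answers))
            for j in range(i + 1, len(answers))
            if not answers_agree(answers[i], answers[j])]
```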
| A special form of prompt sensitivity is sycophancy: FB systems will sometimes |
| offer answers that the user appears to want, based on the input the user |
| provides. |
|
| #### Hallucination |
|
| Sometimes FB systems fabricate answers, for example by referring to sources of |
| information that don’t actually exist. |
|
| These problems obviously pose issues for all users, but they may have special |
| impact on people with cognitive limitations, who may be less able to detect and |
| correct them. |
|
| #### Opacity |
|
| Our understanding of how FB systems actually work is very limited. Although |
| their basic structure and operation are completely known, how an FB system |
| responds in any given situation is determined by a huge number of parameters, |
| interacting in very complex ways. This means that when a problem occurs in an FB |
| system, it is in general not clear how to correct it. Further training (fine |
| tuning) or adding material to prompts may work, but it is hard to be sure, or to |
| know how these corrections may affect other aspects of the system’s behavior. |
| This isn’t a problem for users, directly, but it is a problem for tool creators. |
|
| #### Variability |
|
| Another problem for tool creators is that different FB tools behave differently. |
| A tactic that works well for Claude, say adding some language to prompts, may |
| work differently for Gemini. This makes it difficult for tool creators to learn |
| from one another. |
|
| #### Problems of lookup |
|
| The issues we’ve been discussing apply to the lookup capabilities of FB tools, |
| as well as to the core capabilities of such tools. Will the tool frame its |
| lookup correctly? Will it interpret what it finds correctly? It’s hard to be |
| certain. |
Originally by Julia Stoyanovich on March 24, 2026
The current text organizes Potential Problems for Marginalized Groups by tool type (ET vs. FB), but several of the problems actually cut across that boundary. For instance, representation gaps in training data affect any tool that learns from data, whether task-specific or general-purpose. Opacity is discussed under FB tools but applies equally to task-specific models. And some important problems, like generalization out of context, aren't mentioned at all. I'd suggest organizing instead by the problem itself, and then briefly noting which kinds of tools are most susceptible. This better serves the spec's goal of helping implementers anticipate harms regardless of the specific technology they are using. Here is a revised version of the Potential Problems section.
See changes in PR #80 (rendered markdown)
automation assessment
Incorporating Julia's feedback
| * Examples of harm that has resulted or may result from the application of AI |
| systems, especially in cases of outliers. Harms demonstrated by earlier |
| systems (not employing machine learning) are also of interest, in so far as |
| they are relevant. |
@jasonjgw and @claytonhalllewis, this news report from yesterday seems like a relevant example of harms.
…nd "Conclusion" sections.
Julia's suggestions 2026-Mar-26
…rimination Restore the Statistical Discrimination section.
These are in response to an email discussion with Clayton
| An extreme case of misrepresentation arises for **outliers**: groups or individuals so |
| rare in the training data that the model has too few examples to learn their patterns at |
| all — not enough data to even compute meaningful statistics. |
Originally by @Jutta-Inclusive on March 27, 2026 in a comment on the Julia's Feedback Google Doc.
Disabled people tend to be an n of 1 with respect to many population data sets in the characteristics or patterns used to guide a decision. This is what we have termed "statistical discrimination" in the standard. Disability generally means difference from average as well as extreme heterogeneity compared to other protected groups.
docs: incorporating Julia's suggestions from 29 Mar 26
Remove the reference in the Cover Note to an editorial comment later in the draft which no longer exists.
Fix: grammar "incorrectly labeled" instead of "incorrect labeled".
preceding material.
Editorial changes
publication formalities
[x] This isn't a duplicate of an existing pull request
Description
Making a pull request of the draft specification to allow people to add inline comments on the draft as appropriate.