
Working draft of the technical specification #60

Merged

jobara merged 69 commits into main from draft on Mar 31, 2026
Conversation

@klown (Contributor) commented Mar 5, 2026

- [x] This isn't a duplicate of an existing pull request

Description

Making a pull request of the draft specification so that people can add inline comments as appropriate.

feat: create markdown version of the trust meter introduction google doc
jasonjgw and others added 4 commits March 6, 2026 07:22

The Problem section introduces the issues associated with outliers.

The Purpose section is revised to reflect the scope of the document (it
applies to outliers - or at least those who are members of marginalized
groups, not just those with disabilities).

The Scope section is revised to include systems that serve an advisory
role, such as have been discussed by the Committee.
Comment thread trust-meter-tech-spec.md Outdated
Comment on lines +13 to +14
There may not yet be many of these for AI systems, but with careful presentation
we can likely use some from older systems. (2) it would be great to have

Originally by Jutta on January 26, 2026 in a comment on the Trust Meter Introduction Google Doc.

We have been collecting examples from harm and incident databases that are moderated and verified.


Originally by @jasonjgw on February 9, 2026 in a comment on the Trust Meter Introduction Google Doc.

We still don't have a good survey of potential harms, partly because there
isn't (or at least, I don't have) a suitable variety of example applications.
The harms caused are, I assume, related to the purposes for which systems may
be used. For example, the risks associated with a tool for recommending
administrative decisions are different from those of a system that assists in
research or summarizes documentation.

It also isn't obvious that the harms of under-representation are different
from those of biased representation, particularly if the effects on a system's
behavior are indistinguishable.

Comment thread trust-meter-tech-spec.md Outdated
Comment on lines +29 to +30
The purpose of this note is to outline some of the potential problems for people
with disabilities and those who support them posed by AI tools, and approaches

Originally by @jasonjgw on February 26, 2026 in a comment on the Trust Meter Introduction Google Doc.

as well as other individuals and groups who are statistical outliers for purposes of the relevant AI application. We'll need a section that discusses and provides representative examples of outliers. People may have outlier status due to protected characteristics (race/ethnicity, gender, disability, socio-economic position, etc.) alone or in various combinations, i.e., intersectionality. Presumably they may be outliers simply in virtue of being different from most of the population with respect to a variable relevant to the AI application, even if unrelated to a characteristic that typically gives rise to discriminatory treatment in society.

Comment thread trust-meter-tech-spec.md Outdated
to mitigating these problems. The focus is on problems that arise from people
with disabilities being different, or represented as being different, from other
people, rather than on the problem of bias, which arises when AI tools embed
biased attitudes about people with disabilities.

Originally by Jutta on January 26, 2026 in a comment on the Trust Meter Introduction Google Doc.

or because people are under-represented in the data.


Originally by @jasonjgw on February 9, 2026 in a comment on the Trust Meter Introduction Google Doc.

As a preliminary
question of scope, however, suppose that a model of either type derives very
little information from its training corpus about people with disabilities,
and what it does learn tends to reflect stereotypes. This is a combination of
biased information (not our topic, if I understand correctly) and limited
information (the people in question are greatly under-represented in the
corpus), hence the two problems compound. Thus it might be worth acknowledging
an interaction between the two issues, even though biased information (by
contrast with missing information and outlier status) isn't our central topic
here.

Comment thread trust-meter-tech-spec.md Outdated
- provides foundational guidance to support adoption but does not prescribe
conformance requirements

## Statistical Discrimination

Originally by @vr619536 on February 18, 2026 in a comment on the Trust Meter Introduction Google Doc.

We will want to expand on these, in part to connect the doc to the SOW and the original TOR.

Comment thread trust-meter-tech-spec.md
Comment on lines +53 to +54
- The technical specification applies to machine‑learning‑based classification
systems used in decision‑making.

Originally by @jasonjgw on February 26, 2026 in a comment on the Trust Meter Introduction Google Doc.

Does this include systems providing advice or information that may be used in making decisions? For example, AI Answers isn't used directly in decision-making, but the information and references it offers may influence subsequent decisions.

Comment thread trust-meter-tech-spec.md Outdated
Comment on lines +345 to +348
The people who have the most stake in the correct operation of a system are
often the people whose cases are being handled. There should be an easy feedback
system in place so that users can speak up when they feel the system has not
handled their case correctly.

Originally by @jasonjgw on February 26, 2026 in a comment on the Trust Meter Introduction Google Doc.

Moreover, if a system is used in decision-making, standards of procedural fairness (e.g., administrative law requirements) may necessitate the availability of procedures by which those affected can challenge decisions and invoke additional human review.

Comment thread trust-meter-tech-spec.md Outdated
Comment on lines +363 to +364
For ET systems ##can someone look in the literature for how one can improve a
classifier system?##

Originally by @klown on February 4, 2026 in a comment on the Trust Meter Introduction Google Doc.

I started to look into this, and found a high-level overview article entitled "What to Do When Your Classification Model Isn’t Performing Well". It lists approximately 17 techniques across 7 broad categories. The categories are performance metrics, data quality, feature performance, model architecture and hyper-parameters, over- and under-fitting, cross-validation, and ensemble methods. This is a rather large landscape of things to consider. I can't tell which are the important ones in this context.

Comment thread trust-meter-tech-spec.md Outdated
Comment on lines +371 to +374
Another approach that could be tried would be adding mishandled cases, with the
proper handling described, to future prompts. Modern FB systems support very
long prompts, so a good deal of corrective information could be added in this
way.

Originally by @jasonjgw on February 26, 2026 in a comment on the Trust Meter Introduction Google Doc.

with the qualification that the system may not reliably act on this guidance as desired.

Comment thread trust-meter-tech-spec.md Outdated
Comment on lines +376 to +378
As with other aspects of these systems, these improvement processes need to be
monitored and checked. For example, adding a case with its proper processing
does not guarantee that the system will respond correctly to future cases.

Originally by @jasonjgw on February 26, 2026 in a comment on the Trust Meter Introduction Google Doc.

Is there a risk that improving a system's performance in some cases may effectively reduce its performance in others? Does this necessitate re-testing of a large set of cases before deploying an update?

Comment thread trust-meter-tech-spec.md Outdated
Comment on lines +386 to +388
called for. These same problems beset human-provided services, too. Can we
harness AI, with its flexibility, and ability to process vast amounts of
information, to do better?

Originally by @jasonjgw on February 26, 2026 in a comment on the Trust Meter Introduction Google Doc.

I think (as Clayton notes informally in the cover letter at the start of the text) a case can be made for adding a section to the document that describes and exemplifies the harms which systems may cause, directly or indirectly. Guidance could be given on how to decide the threshold question of whether to deploy a machine learning-based system for a given purpose, in light of the potential harms it may cause and the risks of alternatives (including manual or other conventional technological means) for achieving the same task objectives. Reference may need to be made in this connection to AI ethics literature. In addition, should a summary section be added to the document that captures the considerations which should be taken into account at each stage of a system's development and use (e.g., design, implementation, pre-operational testing, deployment, monitoring and update)?

@jasonjgw (Contributor) commented Mar 9, 2026

Section 3 ("Statistical Discrimination") seems duplicative of the material in the preceding section. Maybe it should be deleted, or perhaps rewritten with new material.

@jasonjgw (Contributor) commented Mar 9, 2026

If appropriate in a Canadian standards committee draft, I think the Cover Note should be revised as a request for comments from reviewers, emphasizing the lacunae in the circulated version of the document. The means provided for submitting review comments (email, GitHub, and any others) should be specified.

I can prepare a PR if this approach is considered desirable.

jasonjgw added 5 commits March 9, 2026 19:33
and amplifying the preceding definitions in this section rather than
introducing new distinctions.

members of marginalized groups more broadly, in accordance with the
scope of the project and the proposed revision of the introductory
sections.
collection of good, reproducible examples, as discussed with Clayton
recently.
@jasonjgw (Contributor) commented:

The following is a list of substantial components missing from the current draft that have been suggested, or at least discussed as possibilities. Thanks to Clayton Lewis for insightful conversation surrounding these ideas.

  • Actual and hypothetical examples of harms caused by AI systems due at least in part to outliers. This should also provide examples of people whose interests are at risk. Currently, there are very few examples in the draft, hence the call for more examples in the Cover Note.
  • A possible discussion of how to integrate the proposed mitigation strategies at various stages of a system's development, deployment, revision, and monitoring.
  • A discussion of how to decide whether or not to deploy an AI system for specific purposes, given the discrimination risk. This includes balancing potential harms against the possibly different harms that may result from applying conventional (non-AI/ML) approaches to the same problem.

Comment thread trust-meter-tech-spec.md Outdated
Comment on lines +95 to +106
## Definitions

### Example Trained – “ET”

The AI tool's training data consists of examples with predefined correct
responses. A classification AI is an example of an ET tool.

### Foundation Based – “FB”

The AI tool is trained, but not using specific examples. Instead, it is
trained on vast amounts of text or media. An example of an FB tool is a Large
Language Model (LLM).

Originally by Julia Stoyanovich on March 24, 2026

The distinction between example-trained (ET) and foundation-based (FB) AI tools in the Definitions section could be sharpened. Foundation models are also trained on examples — they learn from billions of sentences, images, or other data. The difference is that nobody hand-labeled each example with a correct answer; instead, the model picks up patterns on its own (for instance, by learning to predict the next word in a sentence). So the ET/FB distinction doesn't quite capture what makes these tools different. I'd suggest framing it instead around what the tool is designed to do. Here is a revised version of the Definitions and Kinds of AI Tools sections, combined into one section.

See changes in PR #80 (rendered markdown)

Comment thread trust-meter-tech-spec.md Outdated
Comment on lines +179 to +285
## Potential problems for marginalized groups

### Problems with ET tools

These different kinds of tools present different kinds of problems, with
different possible remedies. Let’s consider first ET tools, along with FB tools
in which examples are used, either in fine tuning or in prompts. An obvious
problem arises if the examples used in creating or shaping these tools don’t
include examples that reflect the situations or needs of a diversity of people,
especially those at risk of discrimination. We’ll call this the _representation_
problem. It’s clear that if members of marginalized groups and their
circumstances aren’t represented in the shaping of a tool, there's a real risk
that the tool will produce inappropriate responses.

However, even if representation is achieved, ET tools can be problematic for
people who are outliers in relevant respects. Commonly, ET tools work by
creating a mathematical model of the examples on which they are trained. This
model can’t capture all the details of the examples, but forms a simplified,
approximate picture of the examples. The training process pushes the model to
do a good job on the average, not to give the correct response on every example.
It follows that the simplified model will be more accurate on examples whose
features are common in the collection of examples than on examples whose
features are uncommon.

In many cases, this works against people who are already at risk of
discrimination. Their situations are often different from the average or the norm, in
relevant respects. For example, a person with a disability may have an unusual
employment record. A model that does well in evaluating applicants with common
employment records, and so looks good on the average, may do poorly for people
with unusual records. We’ll call this the _averaging_ problem, associated with
unusual examples, namely outliers.

This problem is sometimes considered as the problem of _out of sample_ data. But
that’s what we’re calling the _representation_ problem. The averaging problem
can occur even for examples that are included in the training examples, that is,
for cases that are in sample, not out of sample.

In the literature, an outlier is often defined as an exemplar that is so
different from the other examples in a population that it must represent a
different population, or result from a different process, than the rest of the
population. In experimental data, it’s not uncommon for outliers to be excluded
from analysis, as reflecting some irrelevant failure of procedure.

In our situation, excluding outliers is obviously not appropriate. But
identifying them can be a method of mitigating the averaging problem, as we’ll
discuss.

### Problems with FB tools

We consider here problems that aren’t traceable to a collection of examples that
was used to train or shape the responses of a tool, but rather to the
characteristics of FB tools. All of these problems are becoming less common as
the technology advances, but all remain issues that tool creators need to
address today.

#### Brittleness

This refers to changes in behaviour in response to small changes in inputs. It’s
also called prompt sensitivity. Traditional automation is notoriously brittle,
in that it commonly responds only to well-formed inputs. An input that is even
slightly incorrect, or slightly outside the designed scope of a traditional
tool, may not be processed at all. By contrast, FB tools may respond
appropriately to a wide range of inputs. For example, questions can often be
framed in many different ways, and be answered appropriately. However, it can
also happen that one input gets an appropriate response, and another, that seems
as if it should be equivalent, gets a different response. For example, [Wang et
al.](https://www.nature.com/articles/s41746-024-01029-4.pdf) found that
seemingly equivalent medical questions often received different answers from FB
systems.

A special form of prompt sensitivity is sycophancy: FB systems will sometimes
offer the answers a user appears to want, based on the input the user
provides.

#### Hallucination

Sometimes FB systems fabricate answers, for example by referring to sources of
information that don’t actually exist.

These problems obviously pose issues for all users, but they may have special
impact on people with cognitive limitations, who may be less able to detect and
correct them.

#### Opacity

Our understanding of how FB systems actually work is very limited. Although
their basic structure and operation are completely known, how an FB system
responds in any given situation is determined by a huge number of parameters,
interacting in very complex ways. This means that when a problem occurs in an FB
system, it is in general not clear how to correct it. Further training (fine
tuning) or adding material to prompts may work, but it is hard to be sure, or to
know how these corrections may affect other aspects of the system’s behavior.
This isn’t a problem for users, directly, but it is a problem for tool creators.

#### Variability

Another problem for tool creators is that different FB tools behave differently.
A tactic that works well for Claude, say adding some language to prompts, may
work differently for Gemini. This makes it difficult for tool creators to learn
from one another.

#### Problems of lookup

The issues we’ve been discussing apply to the lookup capabilities of FB tools,
as well as to the core capabilities of such tools. Will the tool frame its
lookup correctly? Will it interpret what it finds correctly? It’s hard to be
certain.

Originally by Julia Stoyanovich on March 24, 2026

The current text organizes Potential Problems for Marginalized Groups by tool type (ET vs. FB), but several of the problems actually cut across that boundary. For instance, representation gaps in training data affect any tool that learns from data, whether task-specific or general-purpose. Opacity is discussed under FB tools but applies equally to task-specific models. And some important problems, like generalization out of context, aren't mentioned at all. I'd suggest organizing instead by the problem itself, and then briefly noting which kinds of tools are most susceptible. This better serves the spec's goal of helping implementers anticipate harms regardless of the specific technology they are using. Here is a revised version of the Potential Problems section.

See changes in PR #80 (rendered markdown)

Comment thread trust-meter-tech-spec.md
Comment on lines +11 to +14
* Examples of harm that has resulted or may result from the application of AI
systems, especially in cases of outliers. Harms demonstrated by earlier
systems (not employing machine learning) are also of interest, in so far as
they are relevant.

Comment thread trust-meter-tech-spec.md
Comment on lines +206 to +208
An extreme case of misrepresentation arises for **outliers**: groups or individuals so
rare in the training data that the model has too few examples to learn their patterns at
all — not enough data to even compute meaningful statistics.

Originally by @Jutta-Inclusive on March 27, 2026 in a comment on the Julia's Feedback Google Doc.

Disabled people tend to be an n of 1 with respect to many population data sets in the characteristics or patterns used to guide a decision. This is what we have termed "statistical discrimination" in the standard. Disability generally means difference from average as well as extreme heterogeneity compared to other protected groups.

jasonjgw and others added 7 commits March 30, 2026 08:53
docs: incorporating Julia's suggestions from 29 Mar 26
Remove the reference in the Cover Note to an editorial comment later in the draft which no longer exists.
Fix: grammar "incorrectly labeled" instead of "incorrect labeled".
jobara merged commit ec59d90 into main Mar 31, 2026
5 checks passed