86 changes: 86 additions & 0 deletions wiki/average_entropy.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,86 @@
*[[Aggregation functions]] average_entropy*

## syntax

- average_entropy(*a*)
- average_entropy(*a*, *relation*)

## definition

- average_entropy(*a*) results in a [[parameter]] with the **Shannon entropy** (in bits) of the non-[[null]] values of [[attribute]] *a*.
- average_entropy(*a*, *relation*) results in an attribute with the Shannon entropy (in bits) of the non-null values of attribute *a*, grouped by *[[relation]]*. The [[domain unit]] of the resulting attribute is the [[values unit]] of the *relation*.

## description

The Shannon entropy of a set of N observations is defined as:

```
average_entropy(a) = H(a) = -∑ pᵢ · log₂(pᵢ)
```

where pᵢ = nᵢ / N is the relative frequency of each distinct non-null value and N = ∑ nᵢ is the total number of non-null observations.

This is also known as the *average* (per-observation) Shannon entropy, because it equals the total [[entropy]] divided by N, the number of non-null observations:

```
average_entropy(a) = entropy(a) / N
```

For a uniform distribution over k distinct values, `average_entropy(a)` equals `log₂(k)`.

The result is 0 when all observations have the same value (no uncertainty), or when N = 0 (empty partition).
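
The definition above can be sketched in Python. This is only an illustration and not the GeoDMS implementation: the function name mirrors the operator, and `None` standing in for null is an assumption made for this example.

```python
import math
from collections import Counter


def average_entropy(values):
    """Shannon entropy (in bits) of the non-null values; None plays the role of null."""
    counts = Counter(v for v in values if v is not None)
    n = sum(counts.values())
    if n == 0:
        return 0.0  # empty partition
    # -sum(p * log2(p)) rewritten as sum(p * log2(1 / p)), with p = c / n
    return sum((c / n) * math.log2(n / c) for c in counts.values())


print(round(average_entropy([2, 0, 1, 0, 1, 1, None]), 3))  # 1.459
print(average_entropy([0, 1, 2, 3]))  # 2.0 = log2(4), uniform over k = 4 values
print(average_entropy([7, 7, 7]))     # 0.0, a single distinct value
```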

## applies to

- attribute *a* with any scalar [[value type]]
- *relation* with value type of the group CanBeDomainUnit

## conditions

1. The domain of [[argument]] *a* and *relation* must match.

## since version

14.4.0

## example

```
parameter<float64> avgEntropyLifeStyleCode := average_entropy(City/LifeStyleCode);
// result ≈ 1.459

attribute<float64> avgEntropyLifeStyleCodePerRegion (Region) := average_entropy(City/LifeStyleCode, City/Region_rel);
```

| City/LifeStyleCode | City/Region_rel |
|-------------------:|----------------:|
| 2 | 0 |
| 0 | 1 |
| 1 | 2 |
| 0 | 1 |
| 1 | 3 |
| 1 | null |
| null | 3 |

*domain City, nr of rows = 7*

For the total: non-null values are [2, 0, 1, 0, 1, 1], so N = 6, counts: 0→2, 1→3, 2→1.
`average_entropy = -(2/6·log₂(2/6) + 3/6·log₂(3/6) + 1/6·log₂(1/6)) ≈ 1.459`

| **avgEntropyLifeStyleCodePerRegion** |
|-------------------------------------:|
| **0** |
| **0** |
| **0** |
| **0** |
| **0** |

*domain Region, nr of rows = 5*

Each region contains at most one distinct non-null value, so average_entropy = 0 for all regions (cities are numbered from 0). Region 3 contains City 6 with a null LifeStyleCode (excluded) and City 4 with LifeStyleCode = 1, which leaves a single distinct value and therefore average_entropy = 0. Region 4 contains no cities at all, so N = 0 and average_entropy = 0.
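
The grouped variant can be reproduced with a small Python sketch. It is illustrative only: the function name and the `n_groups` argument are hypothetical, `None` stands in for null, and skipping rows with a null relation is an assumption carried over from the behaviour documented for [[frequency_table]].

```python
import math
from collections import Counter


def average_entropy_per_group(values, relation, n_groups):
    """Per-group Shannon entropy in bits; rows with a None value or None group are skipped."""
    counts = [Counter() for _ in range(n_groups)]
    for value, group in zip(values, relation):
        if value is None or group is None:
            continue
        counts[group][value] += 1
    result = []
    for group_counts in counts:
        n = sum(group_counts.values())
        if n == 0:
            result.append(0.0)  # empty partition
        else:
            result.append(sum((c / n) * math.log2(n / c) for c in group_counts.values()))
    return result


life_style_code = [2, 0, 1, 0, 1, 1, None]
region_rel = [0, 1, 2, 1, 3, None, 3]
print(average_entropy_per_group(life_style_code, region_rel, 5))
# [0.0, 0.0, 0.0, 0.0, 0.0]
```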

## see also

- [[entropy]] - the total Shannon entropy (N · H), i.e. the sum of individual information contributions
- [[modus]] - the most frequently occurring value
- [[unique_count]] - number of distinct non-null values
83 changes: 83 additions & 0 deletions wiki/entropy.md
@@ -0,0 +1,83 @@
*[[Aggregation functions]] entropy*

## syntax

- entropy(*a*)
- entropy(*a*, *relation*)

## definition

- entropy(*a*) results in a [[parameter]] with the **total Shannon entropy** (in bits) of the non-[[null]] values of [[attribute]] *a*.
- entropy(*a*, *relation*) results in an attribute with the total Shannon entropy (in bits) of the non-null values of attribute *a*, grouped by *[[relation]]*. The [[domain unit]] of the resulting attribute is the [[values unit]] of the *relation*.

## description

The total Shannon entropy of a set of N observations is defined as:

```
entropy(a) = N · H(a)
= -∑ nᵢ · log₂(nᵢ / N)
```

where nᵢ is the count of each distinct non-null value and N = ∑ nᵢ is the total number of non-null observations.

This equals N times the average (per-element) Shannon entropy H(a). See [[average_entropy]] for the average Shannon entropy H(a).

For a uniform distribution over k distinct values, `entropy(a)` equals `N · log₂(k)`.

The result is 0 when all observations have the same value (no uncertainty), or when N = 0 (empty partition).
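
As a rough Python sketch of the formula above (not the GeoDMS implementation; the function name mirrors the operator and `None` stands in for null, both assumptions for this example):

```python
import math
from collections import Counter


def entropy(values):
    """Total Shannon entropy N * H (in bits) of the non-null values; None plays the role of null."""
    counts = Counter(v for v in values if v is not None)
    n = sum(counts.values())
    if n == 0:
        return 0.0  # empty partition
    # -sum(n_i * log2(n_i / N)) rewritten as sum(n_i * log2(N / n_i))
    return sum(c * math.log2(n / c) for c in counts.values())


city_life_style_code = [2, 0, 1, 0, 1, 1, None]
total = entropy(city_life_style_code)
n = sum(v is not None for v in city_life_style_code)
print(round(total / n, 3))    # 1.459, the per-observation entropy H(a)
print(entropy([0, 1, 2, 3]))  # 8.0 = N * log2(k) with N = 4 and k = 4
```

Dividing the total by N recovers the per-observation value documented for [[average_entropy]].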

## applies to

- attribute *a* with any scalar [[value type]]
- *relation* with value type of the group CanBeDomainUnit

## conditions

1. The domain of [[argument]] *a* and *relation* must match.

## since version

14.4.0

## example

```
parameter<float64> entropyLifeStyleCode := entropy(City/LifeStyleCode);
// result ≈ 8.755

attribute<float64> entropyLifeStyleCodePerRegion (Region) := entropy(City/LifeStyleCode, City/Region_rel);
```

| City/LifeStyleCode | City/Region_rel |
|-------------------:|----------------:|
| 2 | 0 |
| 0 | 1 |
| 1 | 2 |
| 0 | 1 |
| 1 | 3 |
| 1 | null |
| null | 3 |

*domain City, nr of rows = 7*

For the total: non-null values are [2, 0, 1, 0, 1, 1], so N = 6, counts: 0→2, 1→3, 2→1.
`entropy = -(2·log₂(2/6) + 3·log₂(3/6) + 1·log₂(1/6)) ≈ 8.755`

| **entropyLifeStyleCodePerRegion** |
|----------------------------------:|
| **0** |
| **0** |
| **0** |
| **0** |
| **0** |

*domain Region, nr of rows = 5*

Each region contains at most one distinct non-null value, so entropy = 0 for all regions (cities are numbered from 0). Region 3 contains City 6 with a null LifeStyleCode (excluded) and City 4 with LifeStyleCode = 1, which leaves a single distinct value and therefore entropy = 0. Region 4 contains no cities at all, so N = 0 and entropy = 0.

## see also

- [[average_entropy]] - the Shannon entropy per element (H = entropy / N), i.e. the standard Shannon entropy formula
- [[modus]] - the most frequently occurring value
- [[unique_count]] - number of distinct non-null values
74 changes: 74 additions & 0 deletions wiki/frequency_table.md
@@ -0,0 +1,74 @@
*[[Aggregation functions]] frequency_table*

## syntax

- frequency_table(*a*)
- frequency_table(*a*, *relation*)

## definition

- frequency_table(*a*) results in a [[parameter]] with a string that lists each distinct non-[[null]] value of [[attribute]] *a* together with the number of times it occurs, as value-count pairs separated by "; ".
- frequency_table(*a*, *relation*) results in an attribute with such strings, one per partition defined by *[[relation]]*. The [[domain unit]] of the resulting attribute is the [[values unit]] of the *relation*. Each partition string contains the value-count pairs for the non-null values of *a* belonging to that partition.

## description

The result per partition is a string of the form `value1: count1; value2: count2; ...`, where:

- values are listed in ascending order (the order defined by the [[values unit]] of attribute *a*),
- only values with a non-zero count are included,
- null values in *a* are **excluded** from the counts.

To include null values in the frequency table, use [[frequency_table_with_null]] instead.
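
The string format described above can be mimicked in a few lines of Python. This is a sketch under assumptions rather than the GeoDMS implementation: the function name mirrors the operator, `None` stands in for null, and Python's default ordering stands in for the order of the values unit.

```python
from collections import Counter


def frequency_table(values):
    """Return 'value: count' pairs for the non-null values, ascending by value, joined by '; '."""
    counts = Counter(v for v in values if v is not None)
    return "; ".join(f"{value}: {count}" for value, count in sorted(counts.items()))


print(frequency_table([2, 0, 1, 0, 1, 1, None]))  # 0: 2; 1: 3; 2: 1
print(repr(frequency_table([None])))              # '' (no non-null values)
```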

## applies to

- attribute *a* with any scalar [[value type]]
- *relation* with value type of the group CanBeDomainUnit

## conditions

1. The domain of [[argument]] *a* and *relation* must match.

## since version

14.4.0

## example

```
parameter<string> freqLifeStyleCode := frequency_table(City/LifeStyleCode);
// result = "0: 2; 1: 3; 2: 1"

attribute<string> freqLifeStyleCodePerRegion (Region) := frequency_table(City/LifeStyleCode, City/Region_rel);
```

| City/LifeStyleCode | City/Region_rel |
|-------------------:|----------------:|
| 2 | 0 |
| 0 | 1 |
| 1 | 2 |
| 0 | 1 |
| 1 | 3 |
| 1 | null |
| null | 3 |

*domain City, nr of rows = 7*

| **freqLifeStyleCodePerRegion** |
|-------------------------------|
| **"2: 1"** |
| **"0: 2"** |
| **"1: 1"** |
| **"1: 1"** |
| **""** |

*domain Region, nr of rows = 5*

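City 6 (LifeStyleCode = null, Region_rel = 3) is excluded from Region 3's counts. City 5 (Region_rel = null) is excluded from all groups, although its value still counts towards the ungrouped result. Region 4 has no cities, so its result is the empty string.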

## see also

- [[frequency_table_with_null]] - variant that includes null values of *a* in the frequency table
- [[as_unique_list]] - like frequency_table but only lists the distinct values, without the counts
- [[modus]] - returns only the most frequently occurring value
- [[unique_count]] - returns the number of distinct non-null values
74 changes: 74 additions & 0 deletions wiki/frequency_table_with_null.md
@@ -0,0 +1,74 @@
*[[Aggregation functions]] frequency_table_with_null*

## syntax

- frequency_table_with_null(*a*)
- frequency_table_with_null(*a*, *relation*)

## definition

- frequency_table_with_null(*a*) results in a [[parameter]] with a string that lists **all** distinct values of [[attribute]] *a*, including [[null]], together with the number of times each occurs, as value-count pairs separated by "; ".
- frequency_table_with_null(*a*, *relation*) results in an attribute with such strings, one per partition defined by *[[relation]]*. The [[domain unit]] of the resulting attribute is the [[values unit]] of the *relation*. Each partition string contains the value-count pairs for all values of *a* (including null) belonging to that partition.

## description

The result per partition is a string of the form `value1: count1; value2: count2; ...`, where:

- values are listed in ascending order (the order defined by the [[values unit]] of attribute *a*),
- only values with a non-zero count are included,
- null values in *a* are **included** in the frequency table and are shown as `<null>: count`.

This function is identical to [[frequency_table]] except that null values in *a* are counted and included in the result string. Elements mapped to a null partition (null *relation* value) are still excluded from all groups.
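
The difference between a null value and a null partition can be illustrated with a grouped Python sketch. It is not the GeoDMS implementation: the function signature and the `n_groups` argument are hypothetical, `None` stands in for null, and placing the `<null>` entry last is an assumption based on the example output on this page.

```python
from collections import Counter


def frequency_table_with_null(values, relation, n_groups):
    """Per-group frequency strings; None values are counted, rows with a None group are skipped."""
    counts = [Counter() for _ in range(n_groups)]
    nulls = [0] * n_groups
    for value, group in zip(values, relation):
        if group is None:
            continue  # null partition: excluded from all groups
        if value is None:
            nulls[group] += 1
        else:
            counts[group][value] += 1
    tables = []
    for group_counts, null_count in zip(counts, nulls):
        parts = [f"{v}: {c}" for v, c in sorted(group_counts.items())]
        if null_count:
            parts.append(f"<null>: {null_count}")
        tables.append("; ".join(parts))
    return tables


life_style_code = [2, 0, 1, 0, 1, 1, None]
region_rel = [0, 1, 2, 1, 3, None, 3]
print(frequency_table_with_null(life_style_code, region_rel, 5))
# ['2: 1', '0: 2', '1: 1', '1: 1; <null>: 1', '']
```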

## applies to

- attribute *a* with any scalar [[value type]]
- *relation* with value type of the group CanBeDomainUnit

## conditions

1. The domain of [[argument]] *a* and *relation* must match.

## since version

14.4.0

## example

```
parameter<string> freqLifeStyleCodeWithNull := frequency_table_with_null(City/LifeStyleCode);
// result = "0: 2; 1: 3; 2: 1; <null>: 1"

attribute<string> freqLifeStyleCodeWithNullPerRegion (Region) := frequency_table_with_null(City/LifeStyleCode, City/Region_rel);
```

| City/LifeStyleCode | City/Region_rel |
|-------------------:|----------------:|
| 2 | 0 |
| 0 | 1 |
| 1 | 2 |
| 0 | 1 |
| 1 | 3 |
| 1 | null |
| null | 3 |

*domain City, nr of rows = 7*

| **freqLifeStyleCodeWithNullPerRegion** |
|---------------------------------------|
| **"2: 1"** |
| **"0: 2"** |
| **"1: 1"** |
| **"1: 1; &lt;null&gt;: 1"** |
| **""** |

*domain Region, nr of rows = 5*

City 6 (LifeStyleCode = null, Region_rel = 3) is included in Region 3's count as `<null>: 1`. City 5 (Region_rel = null) is excluded from all groups. Region 4 has no cities, so its result is the empty string.

## see also

- [[frequency_table]] - variant that excludes null values of *a* from the frequency table
- [[as_unique_list]] - like frequency_table but only lists the distinct values, without the counts
- [[modus]] - returns only the most frequently occurring value
- [[unique_count]] - returns the number of distinct non-null values