Commit abc18f3

committed
code and documentation improvements
1 parent 5f83d4c commit abc18f3

22 files changed: +632 -599 lines changed

LICENSE.txt

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2025 Barry Drake
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

docs/1_01_front_matter.md

Lines changed: 2 additions & 2 deletions
@@ -1,6 +1,6 @@
 # _front matter_

-Synthorus version 0.0.0a9, built 2025-11-19 20:28:49 (AUS Eastern Summer Time).
+Synthorus version 0.0.0a10, built 2025-11-20 08:35:07 (AUS Eastern Summer Time).
 A pre-release version.

 These pages form a reference for the software known as Synthorus.
@@ -9,7 +9,7 @@ They are intended for users and developers of the software to help orient them
 to key design points and functionality of the software.

 Not all sections of this documentation will be relevant for all readers. As well as covering top-level
-organization and usage of Synthorus, these notebooks also covers some important implementation details.
+organization and usage of Synthorus, these notebooks also cover some important implementation details.

 This documentation is available online at
 [synthorus.readthedocs.io](https://synthorus.readthedocs.io/).

docs/1_01_front_matter_template.md

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@ They are intended for users and developers of the software to help orient them
 to key design points and functionality of the software.

 Not all sections of this documentation will be relevant for all readers. As well as covering top-level
-organization and usage of Synthorus, these notebooks also covers some important implementation details.
+organization and usage of Synthorus, these notebooks also cover some important implementation details.

 This documentation is available online at
 [synthorus.readthedocs.io](https://synthorus.readthedocs.io/).

docs/1_03_installation.ipynb

Lines changed: 6 additions & 6 deletions
@@ -37,10 +37,10 @@
 "start_time": "2025-11-10T01:56:31.203505Z"
 },
 "execution": {
-"iopub.execute_input": "2025-11-19T09:29:04.416172Z",
-"iopub.status.busy": "2025-11-19T09:29:04.415200Z",
-"iopub.status.idle": "2025-11-19T09:29:04.617235Z",
-"shell.execute_reply": "2025-11-19T09:29:04.617235Z"
+"iopub.execute_input": "2025-11-19T21:35:22.665359Z",
+"iopub.status.busy": "2025-11-19T21:35:22.665359Z",
+"iopub.status.idle": "2025-11-19T21:35:22.865666Z",
+"shell.execute_reply": "2025-11-19T21:35:22.865666Z"
 }
 },
 "outputs": [
@@ -51,10 +51,10 @@
 "Entity: patient ['_id_', '_count_', 'age']\n",
 "\n",
 "patient [('_id_', 1), ('_count_', 1), ('age', 'old')]\n",
-"patient [('_id_', 2), ('_count_', 2), ('age', 'young')]\n",
+"patient [('_id_', 2), ('_count_', 2), ('age', 'middle_aged')]\n",
 "patient [('_id_', 3), ('_count_', 3), ('age', 'old')]\n",
 "patient [('_id_', 4), ('_count_', 4), ('age', 'old')]\n",
-"patient [('_id_', 5), ('_count_', 5), ('age', 'old')]\n",
+"patient [('_id_', 5), ('_count_', 5), ('age', 'young')]\n",
 "\n",
 "Finished\n"
 ]

docs/2_01_background.ipynb

Lines changed: 14 additions & 9 deletions
@@ -15,7 +15,7 @@
 "- synthetic data\n",
 "- random variables\n",
 "- probabilistic graphical models\n",
-"- datasets and data sources\n",
+"- datasets and datasources\n",
 "- cross-tables\n",
 "- protecting privacy\n",
 "- entities and relationships\n",
@@ -134,7 +134,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"## Datasets and Data Sources ##\n",
+"## Datasets and Datasources ##\n",
 "\n",
 "Within Synthorus a _dataset_ represents reference data from a system (the system to represent with synthetic data).\n",
 "Although in-principal synthetic data can be used as a dataset, we generally reserve _dataset_ to mean reference data for training PGMs, and reserve _synthetic data_\n",
@@ -149,11 +149,16 @@
 "\n",
 "Within Synthorus dataset values and weights are immutable and cannot be updated.\n",
 "\n",
-"A _data source_ is merely a specification for how to create a dataset. Example data sources include:\n",
-"- a string in CSV format\n",
-"- a parquet file\n",
-"- a database query\n",
-"- a function.\n"
+"A _dataset spec_ is a specification for how to access a dataset. Synthorus provides dataset spec classes for different kinds of datasets, including:\n",
+"- CSV file or inline text (with generalised separators and line spacing)\n",
+"- Table Builder file or inline text (as created by the [Australian Bureau of Statistics](https://www.abs.gov.au/statistics/microdata-tablebuilder/tablebuilder))\n",
+"- Pickled Pandas DataFrame object, as a file\n",
+"- Parquet file\n",
+"- Feather file\n",
+"- A database SQL query (using ODBC or Postgres)\n",
+"- A mathematical function (with a defined domain).\n",
+"\n",
+"A _datasource spec_ (or just _datasource_) is composed of a dataset spec and other parameters. Those parameters support privacy protection and the statistical integrity of data.\n"
 ]
 },
 {
@@ -280,7 +285,7 @@
 "\n",
 "In data modeling, relationship cardinality describes how many instances of one entity can be related to instances of another entity. In general, each side of the relationship cardinality is a range. E.g., the cardinality of entities X and Y might be _a_-_b_:_c_-_d_, meaning that for any instance of Y the minimum number of X instances is _a_ and the maximum is _b_. Similarly, for any instance of X the minimum number of Y instances is _c_ and the maximum is _d_. If the minimum and maximum of a range is the same value, then just that value is written.\n",
 "\n",
-"If a range can be any value, then an asterisk (\\*) can be used,. If the minimum of a range is greater than zero, then only the upper value is marked with an asterisk.\n",
+"If a range can be any value, then an asterisk (\\*) can be used. If the minimum of a range is greater than zero, then only the upper value is marked with an asterisk.\n",
 "\n",
 "Here are some common examples.\n",
 "\n",
@@ -429,7 +434,7 @@
 "\n",
 "For example, the utility of a dataset may depend only on the statistical relationships between small subsets of random variables. Or it may only be concerned with marginal distributions of random variables, or correlations between pairs of random variables.\n",
 "\n",
-"Synthorus avoids the problem of needing large numbers of synthetic data records by using the joint probability distribution from the probabilistic models of the simulator. The probabilistic models guarantee that as the number of synthetic data records increases, the emprical probability distribution approaches the models distribution. This has three advantages.\n",
+"Synthorus avoids the problem of needing large numbers of synthetic data records by using the joint probability distribution from the probabilistic models of the simulator. The probabilistic models guarantee that as the number of synthetic data records increases, the empirical probability distribution approaches the models distribution. This has three advantages.\n",
 "1. It removes effects of sampling errors in the evaluation of utility.\n",
 "2. Evaluation can be performed without any synthetic data being generated.\n",
 "3. Synthorus probabilistic models can be directly and efficiently queried for most standard probabilities and statistics.\n",

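The last hunk of `docs/2_01_background.ipynb` above states that the empirical distribution of sampled records approaches the model's distribution as the number of records grows. The following is a minimal standard-library sketch of that idea, not Synthorus code: the `model` dictionary, the state names (borrowed from the `age` values in the installation notebook output) and the helper functions are all illustrative assumptions.

```python
import random
from collections import Counter

# Hypothetical model distribution over one random variable (not from Synthorus).
model = {"young": 0.2, "middle_aged": 0.3, "old": 0.5}


def empirical_distribution(samples):
    """Return the relative frequency of each state seen in `samples`."""
    counts = Counter(samples)
    total = len(samples)
    return {state: n / total for state, n in counts.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    states = set(p) | set(q)
    return 0.5 * sum(abs(p.get(s, 0.0) - q.get(s, 0.0)) for s in states)


def distance_after_sampling(n, seed=0):
    """Sample `n` records from `model` and measure how far the empirical
    distribution is from the model distribution."""
    rng = random.Random(seed)
    states = list(model)
    weights = [model[s] for s in states]
    samples = rng.choices(states, weights=weights, k=n)
    return total_variation(empirical_distribution(samples), model)


if __name__ == "__main__":
    for n in (10, 1_000, 100_000):
        print(n, round(distance_after_sampling(n), 4))
```

The printed distance shrinks as `n` grows, which is the sampling error that querying the model directly avoids.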
docs/2_02_workflow.ipynb

Lines changed: 30 additions & 30 deletions
@@ -31,14 +31,14 @@
 "Making a model involves providing a model specification. A model specification describes:\n",
 "- model meta-data\n",
 "- privacy protection and other parameter values\n",
-"- data sources for reference data\n",
+"- datasources for reference data\n",
 "- what are the random variables and their states\n",
 "- what are the cross-tables\n",
 "- what are the entities, their fields, and relationships\n",
 "- simulation parameters and values.\n",
 "\n",
 "The steps to make a model are as follows.\n",
-"1. Datasets specified in the model specification will be loaded based on the specified data sources.\n",
+"1. Datasets specified in the model specification will be loaded based on the specified datasources.\n",
 "2. Clean cross-tables are computed.\n",
 "3. Privacy protection is applied to create noisy cross-tables.\n",
 "4. The noisy cross-tables are used to create a probabilistic graphical model (PGM) for each entity.\n",
@@ -47,7 +47,7 @@
 "The result of making a model is a collection of files which are placed into a model definition folder. The files are:\n",
 "1. a JSON definition of the model ('model_spec.json')\n",
 "2. a JSON definition of an index of model components ('model_index.json')\n",
-"2. a JSON definition of the synthetic data simulator ('simulator_spec.json')\n",
+"3. a JSON definition of the synthetic data simulator ('simulator_spec.json')\n",
 "4. Compiled Knowledge PGMs for each entity ('pgms/{_entity_}.py')\n",
 "5. noisy cross-tables, if requested for saving ('noisy_cross_tables/{_cross_table_}.pk')\n",
 "6. clean cross-tables, if requested for saving ('clean_cross_tables/{_cross_table_}.pk')\n",
@@ -121,10 +121,10 @@
 "start_time": "2025-11-11T03:49:36.477856Z"
 },
 "execution": {
-"iopub.execute_input": "2025-11-19T09:29:28.778980Z",
-"iopub.status.busy": "2025-11-19T09:29:28.778980Z",
-"iopub.status.idle": "2025-11-19T09:29:28.788590Z",
-"shell.execute_reply": "2025-11-19T09:29:28.788590Z"
+"iopub.execute_input": "2025-11-19T21:35:50.091245Z",
+"iopub.status.busy": "2025-11-19T21:35:50.091245Z",
+"iopub.status.idle": "2025-11-19T21:35:50.100528Z",
+"shell.execute_reply": "2025-11-19T21:35:50.100528Z"
 }
 },
 "outputs": [
@@ -190,10 +190,10 @@
 "start_time": "2025-11-11T03:49:36.499060Z"
 },
 "execution": {
-"iopub.execute_input": "2025-11-19T09:29:28.790595Z",
-"iopub.status.busy": "2025-11-19T09:29:28.790595Z",
-"iopub.status.idle": "2025-11-19T09:29:29.013056Z",
-"shell.execute_reply": "2025-11-19T09:29:29.013056Z"
+"iopub.execute_input": "2025-11-19T21:35:50.102532Z",
+"iopub.status.busy": "2025-11-19T21:35:50.102532Z",
+"iopub.status.idle": "2025-11-19T21:35:50.320792Z",
+"shell.execute_reply": "2025-11-19T21:35:50.320792Z"
 }
 },
 "outputs": [
@@ -233,23 +233,23 @@
 " }\n",
 " },\n",
 " \"rvs\": {\n",
-" \"X\": {\n",
+" \"Z\": {\n",
 " \"states\": \"infer_distinct\",\n",
 " \"ensure_none\": false\n",
 " },\n",
 " \"Y\": {\n",
 " \"states\": \"infer_distinct\",\n",
 " \"ensure_none\": false\n",
 " },\n",
-" \"Z\": {\n",
+" \"X\": {\n",
 " \"states\": \"infer_distinct\",\n",
 " \"ensure_none\": false\n",
 " }\n",
 " },\n",
 " \"crosstabs\": {\n",
-" \"_X\": {\n",
+" \"_Z\": {\n",
 " \"rvs\": [\n",
-" \"X\"\n",
+" \"Z\"\n",
 " ],\n",
 " \"datasource\": \"xyz\",\n",
 " \"epsilon\": 0.1,\n",
@@ -265,9 +265,9 @@
 " \"min_cell_size\": 0.0,\n",
 " \"max_add_rows\": 1000000\n",
 " },\n",
-" \"_Z\": {\n",
+" \"_X\": {\n",
 " \"rvs\": [\n",
-" \"Z\"\n",
+" \"X\"\n",
 " ],\n",
 " \"datasource\": \"xyz\",\n",
 " \"epsilon\": 0.1,\n",
@@ -281,17 +281,17 @@
 " \"count_field_name\": \"_count_\",\n",
 " \"foreign_field_name\": null,\n",
 " \"fields\": {\n",
-" \"X\": {\n",
+" \"Z\": {\n",
 " \"type\": \"sample\",\n",
-" \"rv_name\": \"X\"\n",
+" \"rv_name\": \"Z\"\n",
 " },\n",
 " \"Y\": {\n",
 " \"type\": \"sample\",\n",
 " \"rv_name\": \"Y\"\n",
 " },\n",
-" \"Z\": {\n",
+" \"X\": {\n",
 " \"type\": \"sample\",\n",
-" \"rv_name\": \"Z\"\n",
+" \"rv_name\": \"X\"\n",
 " }\n",
 " },\n",
 " \"cardinality\": [],\n",
@@ -320,7 +320,7 @@
 "source": [
 "This spec file defines one datasource, \"xyz\" that includes three random variables, \"X\", \"Y\" and \"Z\".\n",
 "\n",
-"System random variable can be explicitly defined in a spec file using an \"rvs\" section. The demo `spec_tiny.py` does not include this section so one is internally created with a random variable defined for all random variables seen in all data sources. Each random variable definition needs to define the possible states of the random variable. Including `states: infer_distinct` at the top level of spec file dictionary means that `states: infer_distinct` will be inherited for every random variable definition. The value `infer_distinct` means that the possible values of a random variable will be defined as \"all distinct values seen for that random variable in the datasources.\"\n",
+"System random variables can be explicitly defined in a spec file using section \"rvs\". The demo `spec_tiny.py` does not include this section so one is internally created with a random variable defined for all random variables seen in all datasources. Each random variable definition needs to define the possible states of the random variable. Including `states: infer_distinct` at the top level of spec file dictionary means that `states: infer_distinct` will be inherited for every random variable definition. The value `infer_distinct` means that the possible values of a random variable will be defined as \"all distinct values seen for that random variable in the datasources.\"\n",
 "\n"
 ]
 },
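The changed paragraph in the hunk above defines `infer_distinct` as "all distinct values seen for that random variable in the datasources". The sketch below illustrates that rule in plain Python. It is an assumption-laden illustration, not the Synthorus implementation: `infer_distinct_states`, the row-of-dicts representation and the sample rows are hypothetical, and keeping first-seen order is a choice made here, not something the diff specifies.

```python
from typing import Iterable, List, Mapping


def infer_distinct_states(
    datasources: Mapping[str, Iterable[Mapping[str, object]]],
    rv_name: str,
) -> List[object]:
    """Collect every distinct value of `rv_name` across all datasources.

    Hypothetical helper: each datasource is modelled as an iterable of
    row dictionaries. Values are returned in first-seen order.
    """
    seen = {}  # dict keys keep insertion order and drop duplicates
    for rows in datasources.values():
        for row in rows:
            if rv_name in row:
                seen.setdefault(row[rv_name], None)
    return list(seen)


# Illustrative data for a datasource named "xyz" with variables X, Y and Z.
datasources = {
    "xyz": [
        {"X": "a", "Y": 1, "Z": "low"},
        {"X": "b", "Y": 1, "Z": "high"},
        {"X": "a", "Y": 2, "Z": "low"},
    ]
}

if __name__ == "__main__":
    for rv in ("X", "Y", "Z"):
        print(rv, infer_distinct_states(datasources, rv))
```

A random variable that never appears in any datasource yields an empty list under this sketch; how Synthorus treats that case is not shown in the diff.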
@@ -354,10 +354,10 @@
 "start_time": "2025-11-11T03:49:37.115213Z"
 },
 "execution": {
-"iopub.execute_input": "2025-11-19T09:29:29.015061Z",
-"iopub.status.busy": "2025-11-19T09:29:29.015061Z",
-"iopub.status.idle": "2025-11-19T09:29:29.019192Z",
-"shell.execute_reply": "2025-11-19T09:29:29.019192Z"
+"iopub.execute_input": "2025-11-19T21:35:50.322911Z",
+"iopub.status.busy": "2025-11-19T21:35:50.322911Z",
+"iopub.status.idle": "2025-11-19T21:35:50.327511Z",
+"shell.execute_reply": "2025-11-19T21:35:50.327511Z"
 }
 },
 "outputs": [
@@ -394,10 +394,10 @@
 "start_time": "2025-11-11T03:49:37.127892Z"
 },
 "execution": {
-"iopub.execute_input": "2025-11-19T09:29:29.021198Z",
-"iopub.status.busy": "2025-11-19T09:29:29.021198Z",
-"iopub.status.idle": "2025-11-19T09:29:29.686709Z",
-"shell.execute_reply": "2025-11-19T09:29:29.686709Z"
+"iopub.execute_input": "2025-11-19T21:35:50.329516Z",
+"iopub.status.busy": "2025-11-19T21:35:50.329516Z",
+"iopub.status.idle": "2025-11-19T21:35:50.992145Z",
+"shell.execute_reply": "2025-11-19T21:35:50.992145Z"
 }
 },
 "outputs": [
