
Commit 0cf438d

add ability to have start and stop dates
* allows for a check of a single week
* continues to support processing a month at a time
* expands support for controlling function through .env file
* provides example .env file
1 parent 597842a commit 0cf438d

3 files changed, +294 -61 lines changed

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
MONGO_CONNECTION_STRING="mongodb://localhost:27017/"
BASE_AZURE_BLOB_URL = "https://storageaccount.blob.core.windows.net/container_name"
OUTPUT_FILE = "invalid-data.json"
# START_DATE = "2024-06-21"
# END_DATE = "2024-06-28"
START_MONTH = "2024-06"
END_MONTH = "2024-06"
MAX_DOCS = 500
Lines changed: 124 additions & 0 deletions
@@ -0,0 +1,124 @@
# analyze_data_synchronization tool

This script quantifies how much of the data in the Cosmos DB is out of sync with the production-definitions data (the source of truth).
It is a diagnostic tool intended to be run on localhost when a problem is suspected. It is not currently run on a regular basis.

## Usage

### Prerequisites

Set up the environment variables that drive how the tool runs. They can be set as system environment variables or in a `.env` file. You can
rename `.env-example` to `.env` and modify it as desired.

- MONGO_CONNECTION_STRING (required) - the connection string to the MongoDB database
- BASE_AZURE_BLOB_URL (required) - the base path including the container
- START_DATE (optional) - the first date to include in the query (default: `""`)
- END_DATE (optional) - the last date to include in the query (default: `""`)
- START_MONTH (optional) - the first month to include in the query (default: `"2024-01"`)
- END_MONTH (optional) - the last month to include in the query (default: `"2024-06"`)
- MAX_DOCS (optional) - the maximum number of documents processed for each month or for the custom date range (default: 5000)
- OUTPUT_FILE (optional) - the file to write the output to (default: `"invalid_data.json"`)

_NOTE: Limiting MAX_DOCS to no more than 5000 allows the script to complete in a reasonable length of time while still giving a
sample large enough to show the scope of the problem._
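
As an illustration of how these variables might be turned into query windows, here is a minimal sketch. It is not taken from `analyze.py`: the function names `load_settings` and `resolve_windows` are hypothetical, and it assumes that a custom date range is used only when both START_DATE and END_DATE are set, with the month range used otherwise.

```python
import calendar
import os
from datetime import date


def load_settings(env=None):
    """Read the settings listed above from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    return {
        # Required: a missing key raises KeyError.
        "mongo_connection_string": env["MONGO_CONNECTION_STRING"],
        "base_azure_blob_url": env["BASE_AZURE_BLOB_URL"],
        # Optional, with the defaults documented above.
        "start_date": env.get("START_DATE", ""),
        "end_date": env.get("END_DATE", ""),
        "start_month": env.get("START_MONTH", "2024-01"),
        "end_month": env.get("END_MONTH", "2024-06"),
        "max_docs": int(env.get("MAX_DOCS", "5000")),
        "output_file": env.get("OUTPUT_FILE", "invalid_data.json"),
    }


def resolve_windows(settings):
    """Return a list of (label, first_day, last_day) windows to query.

    Assumption: the custom range wins only when both dates are set;
    otherwise one window is produced per calendar month, inclusive.
    """
    if settings["start_date"] and settings["end_date"]:
        first = date.fromisoformat(settings["start_date"])
        last = date.fromisoformat(settings["end_date"])
        return [(f"{first}_{last}", first, last)]

    year, month = map(int, settings["start_month"].split("-"))
    end_year, end_month = map(int, settings["end_month"].split("-"))
    windows = []
    while (year, month) <= (end_year, end_month):
        last_day = calendar.monthrange(year, month)[1]  # days in this month
        label = f"{year:04d}-{month:02d}"
        windows.append((label, date(year, month, 1), date(year, month, last_day)))
        # Step to the next month, rolling over the year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return windows
```

With the values from `.env-example` (START_MONTH and END_MONTH both `"2024-06"`, no custom dates), this sketch yields a single window covering June 2024.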

### Set up virtual environment

This is best run in a Python virtual environment. Set up the .venv and install the required dependencies.

```bash
python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install -r requirements.txt
```

### Run the script

```bash
python3 analyze.py
```

## Example

### Example coordinates

```text
composer/packagist/00f100/fcphp-cache/revision/0.1.0.json
```

### Example Mongo document with unused fields removed

```json
{
  "_id": "composer/packagist/00f100/fcphp-cache/0.1.0",
  "_meta": {
    "schemaVersion": "1.6.1",
    "updated": "2019-08-29T02:06:54.498Z"
  },
  "coordinates": {
    "type": "composer",
    "provider": "packagist",
    "namespace": "00f100",
    "name": "fcphp-cache",
    "revision": "0.1.0"
  },
  "licensed": {
    "declared": "MIT",
    "toolScore": {
      "total": 17,
      "declared": 0,
      "discovered": 2,
      "consistency": 0,
      "spdx": 0,
      "texts": 15
    },
    "score": {
      "total": 17,
      "declared": 0,
      "discovered": 2,
      "consistency": 0,
      "spdx": 0,
      "texts": 15
    }
  }
}
```
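
The two examples above suggest how a Mongo `_id` relates to the blob coordinates: the coordinates are the `_id` with a `revision` segment inserted before the last segment and `.json` appended. The following is a minimal sketch of that mapping, inferred from this single example only. The function names are hypothetical, and it assumes the revision itself contains no `/`.

```python
def coordinates_path(doc_id):
    """Map a Mongo `_id` to the blob coordinates shown in the example.

    Assumption (inferred from one example): the revision is the last
    "/"-separated segment of the `_id`.
    """
    prefix, _, revision = doc_id.rpartition("/")
    if not prefix or not revision:
        raise ValueError(f"unexpected _id format: {doc_id!r}")
    return f"{prefix}/revision/{revision}.json"


def blob_url(base_url, doc_id):
    """Join BASE_AZURE_BLOB_URL and the coordinates into a full blob URL."""
    return f"{base_url.rstrip('/')}/{coordinates_path(doc_id)}"
```

For the document above, `coordinates_path("composer/packagist/00f100/fcphp-cache/0.1.0")` returns `composer/packagist/00f100/fcphp-cache/revision/0.1.0.json`.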

### Example Output

The following shows the summary stats and an example of one of the invalid samples. The actual results will contain
all the invalid samples.

```json
{
  "2024-06": {
    "stats": {
      "sample_total": 500,
      "sample_invalid": 6,
      "percent_invalid": "1.2%",
      "total_documents": 86576,
      "total_estimated_invalid": 1039,
      "sample_percent_of_total": "0.58%"
    },
    "sourcearchive/mavencentral/org.apache.kerby/kerby-util/1.0.1": {
      "db": {
        "licensed": null,
        "_meta": {
          "schemaVersion": "1.6.1",
          "updated": "2024-06-13T12:59:21.981Z"
        }
      },
      "blob": {
        "licensed": "Apache-2.0",
        "_meta": {
          "schemaVersion": "1.6.1",
          "updated": "2024-06-13T12:59:31.368Z"
        }
      }
    },
    ...
  }
  ...
}
```
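
The figures in the `stats` block are consistent with a simple extrapolation from the sample to the full month. The sketch below reproduces them. The function name `summarize` is hypothetical, and the rounding rules are an assumption chosen to match the example, not something confirmed by the script.

```python
def summarize(sample_total, sample_invalid, total_documents):
    """Extrapolate the invalid rate seen in a sample to all documents."""
    # Guard against empty samples or empty months to avoid dividing by zero.
    percent_invalid = 100 * sample_invalid / sample_total if sample_total else 0.0
    sample_percent = 100 * sample_total / total_documents if total_documents else 0.0
    estimated_invalid = (
        round(total_documents * sample_invalid / sample_total) if sample_total else 0
    )
    return {
        "sample_total": sample_total,
        "sample_invalid": sample_invalid,
        "percent_invalid": f"{round(percent_invalid, 2)}%",
        "total_documents": total_documents,
        "total_estimated_invalid": estimated_invalid,
        "sample_percent_of_total": f"{round(sample_percent, 2)}%",
    }
```

Calling `summarize(500, 6, 86576)` gives 6 / 500 = 1.2% invalid, an estimated 86576 × 0.012 ≈ 1039 invalid documents, and a sample covering 500 / 86576 ≈ 0.58% of the month, matching the example output.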
