@@ -18,43 +18,53 @@ kernelspec:
## Learning Goals

By the end of this tutorial, you will:

- - Understand the format, partitioning, and schema of this dataset.
- - Be able to query this dataset for likely stars.
+ - Access basic metadata to understand the format and schema of this unified HATS Parquet dataset.
+ - Visualize the HATS partitioning of this dataset.
+ - Query this dataset for likely stars and create a color-magnitude diagram. (Recreate the figure from
+   [Introduction to Euclid Q1 MER catalog](https://caltech-ipac.github.io/irsa-tutorials/tutorials/euclid_access/2_Euclid_intro_MER_catalog.html),
+   this time with *all* likely stars.)

+++
## Introduction

+++

- This notebook demonstrates accesses to a copy of the
+ This notebook demonstrates access to a version of the
[Euclid Q1](https://irsa.ipac.caltech.edu/data/Euclid/docs/overview_q1.html) MER Catalogs
that is in Apache Parquet format, partitioned according to the
Hierarchical Adaptive Tiling Scheme (HATS), and stored in an AWS S3 bucket.
- Parquet is a file format that enables flexible and efficient data access by, among other things,
- supporting the application of both column and row filters when reading the data (very similar to a SQL query)
- so that only the desired data is loaded into memory.

- This is a single parquet dataset which comprises all three MER Catalogs
+ The catalog version accessed here is a single dataset which comprises all three MER Catalogs
-- MER, MER Morphology, and MER Cutouts -- which have been joined by Object ID.
Their schemas (pre-join) can be seen at
[Euclid Final Catalog description](http://st-dm.pages.euclid-sgs.uk/data-product-doc/dmq1/merdpd/dpcards/mer_finalcatalog.html).
Minor modifications were made to the parquet schema to accommodate the join (de-duplicating column names)
and for the HATS standard. These differences are shown below.

+ Parquet is a file format that enables flexible and efficient data access by, among other things,
+ supporting the application of both column and row filters when reading the data (very similar to a SQL query)
+ so that only the desired data is loaded into memory.
+
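To make the column- and row-filter idea concrete, here is a minimal sketch using `pyarrow` (not one of this tutorial's cells; the path and column names are placeholders, not this dataset's):

```python
import pyarrow.dataset as ds

# Open a Parquet dataset (placeholder path; with the right filesystem/credentials
# this could also point at files in an S3 bucket).
dataset = ds.dataset("path/to/some_catalog_parquet/", format="parquet")

# Read only two columns, and only the rows that pass the filter, into memory.
table = dataset.to_table(
    columns=["object_id", "flux"],   # column filter (placeholder names)
    filter=ds.field("flux") > 0,     # row filter, pushed down to the file scan
)
print(table.num_rows)
```
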
HATS is a spatial partitioning scheme based on HEALPix that aims to
produce partitions (files) of roughly equal size.
- This makes them more efficient to work with,
+ This makes the files more efficient to work with,
especially for large-scale analyses and/or parallel processing.
- This notebook demonstrates the basics.
+ It does this by adapting the HEALPix order at which data is partitioned in a given catalog based
+ on the on-sky density of the rows it contains.
+ In other words, data from dense regions of sky will be partitioned at a higher order
+ (i.e., higher resolution; smaller pixel size) than data in sparse regions.
+ HATS-aware Python packages are being developed to take full advantage of the partitioning.
+ In this notebook, we will use the [hats](https://hats.readthedocs.io/) library to visualize the
+ catalog and access the schema, and [lsdb](https://docs.lsdb.io/) to query for all likely stars.
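To see what "higher order" means in terms of pixel size, here is a small back-of-the-envelope sketch using plain HEALPix arithmetic (not a `hats` API call):

```python
import numpy as np

# A HEALPix tiling at order k divides the sky into 12 * 4**k equal-area pixels.
full_sky_deg2 = 4 * np.pi * (180 / np.pi) ** 2  # ~41,253 square degrees

for order in (4, 7, 10):
    npix = 12 * 4**order
    pixel_area = full_sky_deg2 / npix
    print(f"order {order:2d}: {npix:>12,d} pixels, ~{pixel_area:.4f} deg^2 each")
```

Each step up in order quarters the pixel area, so dense sky regions can be split into many small partitions while sparse regions stay in a few large ones.
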
+++

## Installs and imports

```{code-cell}
- # !pip uninstall -y numpy pyerfa  # Helps resolve numpy>=2.0 dependency issues.
- # !pip install 'hats>=0.5' 'lsdb>=0.5' matplotlib numpy s3fs
+ # Uncomment the next line to install dependencies if needed.
+ # !pip install 'hats>=0.5' 'lsdb>=0.5' matplotlib 'numpy>=2.0' 'pyerfa>=2.0.1.3' s3fs
```

```{code-cell}
@@ -74,27 +84,28 @@ If you run into an error that starts with,
make sure you have restarted the kernel since doing `pip install`. Then re-run the cell.
```

+ +++
+
## 1. Setup
```{code-cell}
- # Need UPath for the testing bucket. Otherwise hats will ignore the credentials that Fornax
- # provides under the hood. Will be unnecessary after the dataset is released in a public bucket.
- from upath import UPath
-
# AWS S3 path where this dataset is stored.
s3_bucket = "irsa-fornax-testdata"
s3_key = "EUCLID/q1/mer_catalogue/hats"
- euclid_s3_path = UPath(f"s3://{s3_bucket}/{s3_key}")
-
- # Note: If running from IPAC, you need an anonymous connection. Uncomment the next line.
- # euclid_s3_path = UPath(f"s3://{s3_bucket}/{s3_key}", anon=True)
- ```
-
- We will use [`hats`](https://hats.readthedocs.io/) to visualize the catalog and access the schema.
-
- ```{code-cell}
- # Load the parquet dataset using hats.
- euclid_hats = hats.read_hats(euclid_s3_path)
+ euclid_s3_path = f"s3://{s3_bucket}/{s3_key}"
+
+ # Temporary try/except to handle credentials in different environments before public release.
+ try:
+     # If running from within IPAC's network (maybe VPN'd in with "tunnel-all"),
+     # your IP address acts as your credentials and this should just work.
+     hats.read_hats(euclid_s3_path)
+ except FileNotFoundError:
+     # If running from Fornax, credentials are provided automatically under the hood, but
+     # hats ignores them in the call above and raises a FileNotFoundError.
+     # Construct a UPath which will pick up the credentials.
+     from upath import UPath
+
+     euclid_s3_path = UPath(f"s3://{s3_bucket}/{s3_key}")
```

## 2. Visualize the on-sky density of Q1 Objects and HATS partitions
@@ -105,20 +116,17 @@ Euclid Q1 covers four non-contiguous fields: Euclid Deep Field North (22.9 sq de
We can visualize the Object density in the four fields using `hats`.

```{code-cell}
+ # Load the dataset.
+ euclid_hats = hats.read_hats(euclid_s3_path)
+
# Visualize the on-sky distribution of objects in the Q1 MER Catalog.
hats.inspection.plot_density(euclid_hats)
```

- HATS does this by adjusting the partitioning order (i.e., HEALPix order at which data is partitioned)
- according to the on-sky density of the objects or sources (rows) in the dataset.
- In other words, dense regions are partitioned at a
- higher HEALPix order (smaller pixel size) to reduce the number of objects in those partitions towards the mean;
- vice versa for sparse regions.
-
- We can see this by plotting the partitioning orders.
+ We can see how the on-sky density maps to the HATS partitions by calling `plot_pixels`.

```{code-cell}
- # Visualize the HEALPix order of each partition.
+ # Visualize the HEALPix orders of the dataset partitions.
hats.inspection.plot_pixels(euclid_hats)
```

@@ -128,12 +136,10 @@ hats.inspection.plot_pixels(euclid_hats)
In this section, we query the Euclid Q1 MER catalogs for likely stars and create a color-magnitude diagram (CMD), following
[Introduction to Euclid Q1 MER catalog](https://caltech-ipac.github.io/irsa-tutorials/tutorials/euclid_access/2_Euclid_intro_MER_catalog.html).
- Here, we'll use [`lsdb`](https://docs.lsdb.io/) to query the parquet files that are sitting in an S3 bucket (the intro notebook uses `pyvo` to query the TAP service).
+ Here, we use `lsdb` to query the parquet files sitting in an S3 bucket (the intro notebook uses `pyvo` to query the TAP service).
`lsdb` enables efficient, large-scale queries on HATS catalogs, so let's look at *all* likely stars in Euclid Q1 instead of limiting to 10,000.

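Schematically, the load → query → compute pattern used in this section looks roughly like the sketch below (a sketch only: the star/galaxy column name and threshold are illustrative assumptions, not necessarily the exact cut applied in the cells that follow).

```python
import lsdb

# Lazily open the HATS catalog, selecting only the columns needed.
# (Column names here are illustrative; see the actual cells below.)
stars = lsdb.read_hats(
    euclid_s3_path,
    columns=["OBJECT_ID", "FLUX_VIS_PSF", "POINT_LIKE_PROB"],
).query("POINT_LIKE_PROB > 0.9")

# Nothing has been read yet; .compute() triggers the distributed read/filter
# and returns an in-memory DataFrame.
star_df = stars.compute()
```

`query` accepts a pandas-style expression string, so the filter is written much like a `DataFrame.query` call.
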
- +++
-
- `lsdb` uses Dask for parallelization. Set up the workers.
+ `lsdb` uses Dask for parallelization. So first, set up the workers.

```{code-cell}
client = dask.distributed.Client(
@@ -144,7 +150,7 @@ client = dask.distributed.Client(
The data will be lazy-loaded. This means that commands like `query` are not executed until the data is actually required.

```{code-cell}
- # Load the parquet dataset using lsdb.
+ # Load the dataset.
columns = [
    "TILEID",
    "FLUX_VIS_PSF",
@@ -209,7 +215,9 @@ notebook shows how to work with parquet schemas.
```{code-cell}
# Fetch the pyarrow schema from hats.
+ euclid_hats = hats.read_hats(euclid_s3_path)
schema = euclid_hats.schema
+
print(f"{len(schema)} columns in the combined Euclid Q1 MER Catalogs")
```

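As a quick way to see the de-duplicated (suffixed) column names mentioned in the introduction, one could scan the schema for suffixed fields. This is a hedged sketch that assumes the join suffixes are appended with a hyphen, as in the `RIGHT_ASCENSION-CUTOUTS` example shown further down:

```python
# List schema fields whose names carry a join suffix (e.g., "-CUTOUTS").
suffixed = [name for name in schema.names if "-" in name]
print(f"{len(suffixed)} columns appear to carry a join suffix")
print(suffixed[:10])
```
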
@@ -254,6 +262,6 @@ print(schema.field("RIGHT_ASCENSION-CUTOUTS").metadata)
**Authors:** Troy Raen (Developer; Caltech/IPAC-IRSA) and the IRSA Data Science Team.

- **Updated:** 2025-03-25
+ **Updated:** 2025-03-29

**Contact:** [IRSA Helpdesk](https://irsa.ipac.caltech.edu/docs/help_desk.html) with questions or problems.