Commit cd88285

Apply suggestions from @afaisst and @bsipocz code reviews.
1 parent a8395c7 commit cd88285

1 file changed (+48, −40 lines)

tutorials/parquet-catalog-demos/euclid-hats-parquet.md

Lines changed: 48 additions & 40 deletions
@@ -18,43 +18,53 @@ kernelspec:
 ## Learning Goals
 By the end of this tutorial, you will:
 
-- Understand the format, partitioning, and schema of this dataset.
-- Be able to query this dataset for likely stars.
+- Access basic metadata to understand the format and schema of this unified HATS Parquet dataset.
+- Visualize the HATS partitioning of this dataset.
+- Query this dataset for likely stars and create a color-magnitude diagram. (Recreate the figure from
+  [Introduction to Euclid Q1 MER catalog](https://caltech-ipac.github.io/irsa-tutorials/tutorials/euclid_access/2_Euclid_intro_MER_catalog.html),
+  this time with *all* likely stars.)
 
 +++
 
 ## Introduction
 
 +++
 
-This notebook demonstrates accesses to a copy of the
+This notebook demonstrates accesses to a version of the
 [Euclid Q1](https://irsa.ipac.caltech.edu/data/Euclid/docs/overview_q1.html) MER Catalogs
 that is in Apache Parquet format, partitioned according to the
 Hierarchical Adaptive Tiling Scheme (HATS), and stored in an AWS S3 bucket.
-Parquet is a file format that enables flexible and efficient data access by, among other things,
-supporting the application of both column and row filters when reading the data (very similar to a SQL query)
-so that only the desired data is loaded into memory.
 
-This is a single parquet dataset which comprises all three MER Catalogs
+The catalog version accessed here is a single dataset which comprises all three MER Catalogs
 -- MER, MER Morphology, and MER Cutouts -- which have been joined by Object ID.
 Their schemas (pre-join) can be seen at
 [Euclid Final Catalog description](http://st-dm.pages.euclid-sgs.uk/data-product-doc/dmq1/merdpd/dpcards/mer_finalcatalog.html).
 Minor modifications were made to the parquet schema to accommodate the join (de-duplicating column names)
 and for the HATS standard. These differences are shown below.
 
+Parquet is a file format that enables flexible and efficient data access by, among other things,
+supporting the application of both column and row filters when reading the data (very similar to a SQL query)
+so that only the desired data is loaded into memory.
+
 HATS is a spatial partitioning scheme based on HEALPix that aims to
 produce partitions (files) of roughly equal size.
-This makes them more efficient to work with,
+This makes the files more efficient to work with,
 especially for large-scale analyses and/or parallel processing.
-This notebook demonstrates the basics.
+It does this by adapting the HEALPix order at which data is partitioned in a given catalog based
+on the on-sky density of the rows it contains.
+In other words, data from dense regions of sky will be partitioned at a higher order
+(i.e., higher resolution; smaller pixel size) than data in sparse regions.
+HATS-aware python packages are being developed to take full advantage of the partitioning.
+In this notebook, we will use the [hats](https://hats.readthedocs.io/) library to visualize the
+catalog and access the schema, and [lsdb](https://docs.lsdb.io/) to do a query for all likely stars.
 
 +++
 
 ## Installs and imports
 
 ```{code-cell}
-# !pip uninstall -y numpy pyerfa # Helps resolve numpy>=2.0 dependency issues.
-# !pip install 'hats>=0.5' 'lsdb>=0.5' matplotlib numpy s3fs
+# # Uncomment the next line to install dependencies if needed.
+# !pip install 'hats>=0.5' 'lsdb>=0.5' matplotlib 'numpy>=2.0' 'pyerfa>=2.0.1.3' s3fs
 ```
 
 ```{code-cell}
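As context for the Parquet description that this hunk rearranges: column and row filters are applied at read time, so only the requested data ever reaches memory. A minimal sketch of that pattern with `pyarrow` (not part of the tutorial; the file path, column names, and cut below are placeholders, not the Euclid Q1 schema):

```python
# Hypothetical sketch of Parquet column/row filtering; path and columns are placeholders.
import pyarrow.parquet as pq

table = pq.read_table(
    "example_catalog.parquet",                  # placeholder path
    columns=["OBJECT_ID", "FLUX_VIS_PSF"],      # column filter: read only these columns
    filters=[("POINT_LIKE_PROB", ">=", 0.99)],  # row filter: applied while reading
)
print(table.num_rows, "rows,", table.num_columns, "columns loaded")
```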
@@ -74,27 +84,28 @@ If you run into an error that starts with,
 make sure you have restarted the kernel since doing `pip install`. Then re-run the cell.
 ```
 
++++
+
 ## 1. Setup
 
 ```{code-cell}
-# Need UPath for the testing bucket. Otherwise hats will ignore the credentials that Fornax
-# provides under the hood. Will be unnecessary after the dataset is released in a public bucket.
-from upath import UPath
-
 # AWS S3 path where this dataset is stored.
 s3_bucket = "irsa-fornax-testdata"
 s3_key = "EUCLID/q1/mer_catalogue/hats"
-euclid_s3_path = UPath(f"s3://{s3_bucket}/{s3_key}")
-
-# Note: If running from IPAC, you need an anonymous connection. Uncomment the next line.
-# euclid_s3_path = UPath(f"s3://{s3_bucket}/{s3_key}", anon=True)
-```
-
-We will use [`hats`](https://hats.readthedocs.io/) to visualize the catalog and access the schema.
-
-```{code-cell}
-# Load the parquet dataset using hats.
-euclid_hats = hats.read_hats(euclid_s3_path)
+euclid_s3_path = f"s3://{s3_bucket}/{s3_key}"
+
+# Temporary try/except to handle credentials in different environments before public release.
+try:
+    # If running from within IPAC's network (maybe VPN'd in with "tunnel-all"),
+    # your IP address acts as your credentials and this should just work.
+    hats.read_hats(euclid_s3_path)
+except FileNotFoundError:
+    # If running from Fornax, credentials are provided automatically under the hood, but
+    # hats ignores them in the call above and raises a FileNotFoundError.
+    # Construct a UPath which will pick up the credentials.
+    from upath import UPath
+
+    euclid_s3_path = UPath(f"s3://{s3_bucket}/{s3_key}")
 ```
 
 ## 2. Visualize the on-sky density of Q1 Objects and HATS partitions
@@ -105,20 +116,17 @@ Euclid Q1 covers four non-contiguous fields: Euclid Deep Field North (22.9 sq de
 We can visualize the Object density in the four fields using `hats`.
 
 ```{code-cell}
+# Load the dataset.
+euclid_hats = hats.read_hats(euclid_s3_path)
+
 # Visualize the on-sky distribution of objects in the Q1 MER Catalog.
 hats.inspection.plot_density(euclid_hats)
 ```
 
-HATS does this by adjusting the partitioning order (i.e., HEALPix order at which data is partitioned)
-according to the on-sky density of the objects or sources (rows) in the dataset.
-In other words, dense regions are partitioned at a
-higher HEALPix order (smaller pixel size) to reduce the number of objects in those partitions towards the mean;
-vice versa for sparse regions.
-
-We can see this by plotting the partitioning orders.
+We can see how the on-sky density maps to the HATS partitions by calling `plot_pixels`.
 
 ```{code-cell}
-# Visualize the HEALPix order of each partition.
+# Visualize the HEALPix orders of the dataset partitions.
 hats.inspection.plot_pixels(euclid_hats)
 ```
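For readers unfamiliar with the HEALPix orders that `plot_pixels` displays: each step up in order splits every pixel into four, so pixel area shrinks by a factor of four per order. A quick back-of-the-envelope sketch (plain arithmetic, independent of `hats`; the orders chosen are illustrative):

```python
# Approximate HEALPix pixel area at a few orders; a full-sky map at order k has 12 * 4**k pixels.
FULL_SKY_DEG2 = 41_253  # approximate area of the full sky in square degrees
for order in (4, 6, 8, 10):
    npix = 12 * 4**order
    print(f"order {order:2d}: {npix:>10,d} pixels, ~{FULL_SKY_DEG2 / npix:.4f} deg^2 per pixel")
```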

@@ -128,12 +136,10 @@ hats.inspection.plot_pixels(euclid_hats)
 
 In this section, we query the Euclid Q1 MER catalogs for likely stars and create a color-magnitude diagram (CMD), following
 [Introduction to Euclid Q1 MER catalog](https://caltech-ipac.github.io/irsa-tutorials/tutorials/euclid_access/2_Euclid_intro_MER_catalog.html).
-Here, we'll use [`lsdb`](https://docs.lsdb.io/) to query the parquet files that are sitting in an S3 bucket (the intro notebook uses `pyvo` to query the TAP service).
+Here, we use `lsdb` to query the parquet files that are sitting in an S3 bucket (the intro notebook uses `pyvo` to query the TAP service).
 `lsdb` enables efficient, large-scale queries on HATS catalogs, so let's look at *all* likely stars in Euclid Q1 instead of limiting to 10,000.
 
-+++
-
-`lsdb` uses Dask for parallelization. Set up the workers.
+`lsdb` uses Dask for parallelization. So first, set up the workers.
 
 ```{code-cell}
 client = dask.distributed.Client(
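The `dask.distributed.Client(` call is truncated by the diff context, so the arguments the tutorial actually uses are not visible in this hunk. For orientation only, a generic local-cluster setup might look like the sketch below; the worker counts and memory limit are placeholders, not the tutorial's values.

```python
# Illustrative Dask client setup; these values are placeholders, not the tutorial's.
import dask.distributed

client = dask.distributed.Client(
    n_workers=4,            # number of local worker processes
    threads_per_worker=2,   # threads per worker
    memory_limit="4GiB",    # memory cap per worker
)
print(client)
# ... run the lsdb query ...
client.close()  # shut down the workers when finished
```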
@@ -144,7 +150,7 @@ client = dask.distributed.Client(
 The data will be lazy-loaded. This means that commands like `query` are not executed until the data is actually required.
 
 ```{code-cell}
-# Load the parquet dataset using lsdb.
+# Load the dataset.
 columns = [
     "TILEID",
     "FLUX_VIS_PSF",
@@ -209,7 +215,9 @@ notebook shows how to work with parquet schemas.
 
 ```{code-cell}
 # Fetch the pyarrow schema from hats.
+euclid_hats = hats.read_hats(euclid_s3_path)
 schema = euclid_hats.schema
+
 print(f"{len(schema)} columns in the combined Euclid Q1 MER Catalogs")
 ```
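The `schema` fetched here is a standard `pyarrow.Schema`, so the usual pyarrow inspection methods apply. A small sketch (the `RIGHT_ASCENSION-CUTOUTS` field name comes from the final hunk below; everything else is generic pyarrow usage):

```python
# Generic pyarrow.Schema inspection; `schema` comes from euclid_hats.schema above.
for name in schema.names[:5]:  # first few column names and their types
    print(name, schema.field(name).type)

# Column-level metadata is attached to individual fields, e.g. one of the
# de-duplicated columns produced by the MER/Morphology/Cutouts join.
print(schema.field("RIGHT_ASCENSION-CUTOUTS").metadata)
```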

@@ -254,6 +262,6 @@ print(schema.field("RIGHT_ASCENSION-CUTOUTS").metadata)
 
 **Authors:** Troy Raen (Developer; Caltech/IPAC-IRSA) and the IRSA Data Science Team.
 
-**Updated:** 2025-03-25
+**Updated:** 2025-03-29
 
 **Contact:** [IRSA Helpdesk](https://irsa.ipac.caltech.edu/docs/help_desk.html) with questions or problems.
