Open petroleum datasets in Parquet format for data science and machine learning applications.
Production metrics from the Equinor Volve oil field (2007-2016):
- 7 wells with daily and monthly production records
- 15,634 daily measurements - pressure, temperature, oil/gas/water volumes
- 526 monthly aggregates - production volumes in Sm3
Well log data from 108 wells on the Norwegian Continental Shelf:
- 29 columns of petrophysical measurements per well
- Log curves: Gamma Ray (GR), Density (RHOB), Porosity (NPHI), Resistivity, Sonic
- Lithofacies classifications for machine learning applications
Monthly oil and gas production for ~85,418 wells in Argentina (2006–present), sourced from the Secretaría de Energía public datasets:
- wells.parquet — static well master (~85K rows, Spanish column names)
- well_operator_history.parquet — slowly-changing operator transfers
- well_events.parquet — operational state transitions
- monthly_production/ — hive-partitioned by anio (year), ~17.6M rows total
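Hive partitioning encodes the partition column in the directory name, so every row in a file under anio=2023/ carries anio = 2023. Here is a minimal sketch of how such a path maps to column values. The file name part-0.parquet is hypothetical and the helper is illustrative, not part of the dataset tooling.

```python
from pathlib import PurePosixPath


def hive_partition_values(relative_path: str) -> dict[str, str]:
    """Extract key=value directory segments from a hive-partitioned path."""
    values = {}
    # Every component except the final file name may be a key=value pair.
    for segment in PurePosixPath(relative_path).parts[:-1]:
        key, sep, value = segment.partition("=")
        if sep:  # plain directories carry no partition value
            values[key] = value
    return values


# Hypothetical file name; only the anio=<year>/ directory comes from the layout above.
example = hive_partition_values("monthly_production/anio=2023/part-0.parquet")
print(example)
```

DuckDB performs the same mapping when read_parquet is called with hive_partitioning = true, which is why anio can be filtered on without being stored inside the files.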
Aggregate 2023 production by basin, joining wells to the partitioned
monthly time series. The static host serves no directory listing, so the
partition files are discovered from the _files.json manifest rather than
globbed (ADR-0004):
import json, urllib.request, duckdb
base = 'https://huggingface.co/datasets/sumpalabs/petrodb/resolve/main/argentina/monthly_production/'
# The manifest lists the partition files as paths relative to base
manifest = json.load(urllib.request.urlopen(base + '_files.json'))
# Keep only the files in the 2023 partition
urls = [base + p for p in manifest if p.startswith('anio=2023/')]
result = duckdb.sql("""
SELECT w.cuenca,
SUM(m.prod_pet) AS oil_m3,
SUM(m.prod_gas) AS gas_mm3
FROM 'https://huggingface.co/datasets/sumpalabs/petrodb/resolve/main/argentina/wells.parquet' w
JOIN read_parquet(?, hive_partitioning = true) m USING (idpozo)
WHERE m.anio = 2023
GROUP BY w.cuenca
ORDER BY oil_m3 DESC
""", params=[urls]).df()
Full per-column English docs (Spanish column identifiers preserved), the
four-bucket rationale, and three more canonical query patterns live in
parquet/argentina/README.md.
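The manifest filter used for 2023 extends naturally to a span of years. The sketch below assumes, as the example query does, that _files.json is a JSON list of relative paths beginning with anio=<year>/. The helper name and the sample file names are hypothetical.

```python
def urls_for_years(base: str, manifest: list[str], first: int, last: int) -> list[str]:
    """Return full URLs for partition files whose year lies in [first, last]."""
    prefixes = tuple(f"anio={year}/" for year in range(first, last + 1))
    # str.startswith accepts a tuple, so one pass covers every requested year.
    return [base + path for path in manifest if path.startswith(prefixes)]


# Hypothetical manifest entries, for illustration only.
sample_manifest = [
    "anio=2021/data_0.parquet",
    "anio=2022/data_0.parquet",
    "anio=2023/data_0.parquet",
]
base = "https://huggingface.co/datasets/sumpalabs/petrodb/resolve/main/argentina/monthly_production/"
urls = urls_for_years(base, sample_manifest, 2022, 2023)
print(len(urls))
```

The resulting list can be passed to read_parquet(?, hive_partitioning = true) in the same way as the single-year list, with the WHERE clause widened to match (for example m.anio BETWEEN 2022 AND 2023).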
Labelled 1-Hz sensor-data windows from the Petrobras 3W dataset, sliced
into per-Instance Parquet files. Pinned at upstream git tag v.1.70.0
(dataset version 2.0.0). This release publishes the
event-class lookup, the real-Well master, the full Instance catalog, and
the per-Instance Observations time-series (hive-partitioned by event class).
Measure the labelled-data balance across the corpus from the catalog alone (no Observations scan needed):
import duckdb
base = 'https://huggingface.co/datasets/sumpalabs/petrodb/resolve/main/petrobras_3w'
result = duckdb.sql(f"""
SELECT
et.event_class,
et.description,
COUNT(*) AS n_instances,
SUM(i.n_rows) AS n_observations
FROM '{base}/instances.parquet' i
JOIN '{base}/event_types.parquet' et
ON et.event_class = i.event_class
GROUP BY et.event_class, et.description
ORDER BY et.event_class
""").df()
Full per-column English docs (including the 27-sensor glossary mirrored
from upstream dataset.ini) live in
parquet/petrobras_3w/README.md. Upstream
source: https://github.com/petrobras/3W.git (CC BY 4.0).
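Once the catalog query has returned, the balance can be summarised as each class's share of instances, for example to derive class weights for training. This is a minimal sketch in plain Python. The counts are invented placeholders, not figures from the dataset, and inverse-frequency weighting is one common choice rather than something the dataset prescribes.

```python
def class_weights(counts: dict[int, int]) -> dict[int, float]:
    """Inverse-frequency weights: total / (n_classes * count) per class."""
    total = sum(counts.values())
    n_classes = len(counts)
    # Classes with zero instances get no weight instead of a division error.
    return {cls: total / (n_classes * n) for cls, n in counts.items() if n > 0}


# Placeholder n_instances per event_class (NOT real 3W figures).
counts = {0: 600, 1: 300, 2: 100}
total = sum(counts.values())
shares = {cls: n / total for cls, n in counts.items()}
weights = class_weights(counts)
print(shares)
print(weights)
```

With the DataFrame returned by the catalog query, the counts mapping could be built as dict(zip(result["event_class"], result["n_instances"])).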
Browse and download files at: https://dev-petrodb.ocortez.com
Query directly with DuckDB (no download required):
Data lives on Hugging Face (sumpalabs/petrodb); its resolve URLs honour HTTP Range so DuckDB fetches only the row groups a query needs.
import json, urllib.request, duckdb
HF = "https://huggingface.co/datasets/sumpalabs/petrodb/resolve/main"
conn = duckdb.connect()
# Query Volve production data
volve = conn.execute(f"""
SELECT
w.wellbore_name,
SUM(d.oil_volume) as total_oil,
SUM(d.gas_volume) as total_gas
FROM '{HF}/volve/daily_production.parquet' d
JOIN '{HF}/volve/wells.parquet' w
ON d.npd_wellbore_code = w.npd_wellbore_code
GROUP BY w.wellbore_name
ORDER BY total_oil DESC
""").fetchdf()
# Query Force 2020 well logs (one Parquet file per well)
force = conn.execute(f"""
SELECT
WELL,
AVG(GR) as avg_gamma_ray,
AVG(RHOB) as avg_density,
COUNT(*) as samples
FROM '{HF}/force_2020/wells/15-9-13.parquet'
GROUP BY WELL
""").fetchdf()
# Query all 108 wells at once — discover files from the manifest, never a glob
# (the static host has no directory listing, see docs/adr/0004-...)
base = f"{HF}/force_2020/wells/"
urls = [base + n for n in json.load(urllib.request.urlopen(base + "_files.json"))]
all_wells = conn.execute("""
SELECT WELL, FORMATION, COUNT(*) as samples
FROM read_parquet(?)
WHERE FORMATION IS NOT NULL
GROUP BY WELL, FORMATION
ORDER BY WELL, samples DESC
""", [urls]).fetchdf()
Or download files locally and query:
import duckdb
conn = duckdb.connect()
# Query local Volve files
result = conn.execute("""
SELECT * FROM 'parquet/volve/daily_production.parquet'
WHERE date BETWEEN '2008-01-01' AND '2008-12-31'
""").fetchdf()
# Query local Force 2020 files
well_data = conn.execute("""
SELECT * FROM 'parquet/force_2020/wells/15-9-13.parquet'
WHERE DEPTH_MD > 3000
""").fetchdf()
parquet/
├── volve/ # Volve production data
│ ├── daily_production.parquet
│ ├── monthly_production.parquet
│ ├── wells.parquet
│ └── schema.json
└── force_2020/ # FORCE 2020 well logs
└── wells/ # 108 well files
├── 15-9-13.parquet
├── 34-10-16_R.parquet
└── ... (106 more)
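The local parquet/ tree appears to mirror the remote layout, so the same relative path can address either copy. Below is a small sketch of a resolver that prefers a local file and falls back to the Hugging Face URL. The function name and the local_root default are assumptions for illustration.

```python
from pathlib import Path

HF = "https://huggingface.co/datasets/sumpalabs/petrodb/resolve/main"


def parquet_source(relative_path: str, local_root: str = "parquet") -> str:
    """Return a local path if the file exists there, else its remote URL."""
    local = Path(local_root) / relative_path
    if local.is_file():
        return local.as_posix()
    # Assumes the remote layout matches the local tree shown above.
    return f"{HF}/{relative_path}"


source = parquet_source("volve/daily_production.parquet")
print(source)
```

The returned string can be interpolated into a DuckDB FROM clause, so a query runs unchanged whether or not the files have been downloaded.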
- Equinor - Volve field dataset
- FORCE (Norwegian Oil and Gas Association) and Xeek - FORCE 2020 ML Competition dataset
Both datasets are provided for research and educational purposes.
See original license terms: