A Spark connector for efficiently parsing and reading NetCDF files at scale with Apache Spark. This project leverages the DataSource V2 API to integrate NetCDF file reading in a distributed and performant way. It uses the NetCDF-Java library to read data from NetCDF files.
- Custom Schema Support: Define the schema for NetCDF variables.
- Partition Handling: Automatically manages partitions for large NetCDF files.
- Scalable Performance: Optimized for distributed computing with Spark.
- Storage Compatibility: This connector supports reading NetCDF files from:
- Local file systems (tested).
- Amazon S3, see Dataset URLs for configuration (tested).
- Java: Version 11+
- Apache Spark: Version 3.5.x
- Scala: Versions 2.12 or 2.13
- Dependency Management: SBT, Maven, or similar
- Unidata repository: Add the Unidata repository; see Using netCDF-Java Maven Artifacts
- Transforming multi-dimensional data into tabular form.
- Processing climate and oceanographic data.
- Analyzing multi-dimensional scientific datasets.
- Batch processing of NetCDF files.
Loading data from a NetCDF file into a DataFrame requires that the variables to extract share at least one common dimension.
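For instance, with a hypothetical file layout (the variable and dimension names below are illustrative only), `temperature` and `humidity` share all of their dimensions and can be read into the same DataFrame, while a variable defined only on an unrelated dimension would have to be loaded separately:

```scala
import org.apache.spark.sql.types._

// Hypothetical layout of example.nc (illustrative only):
//   temperature(time, lat, lon)
//   humidity(time, lat, lon)
//   station_name(station)
// temperature and humidity share the time/lat/lon dimensions, so one read can
// cover both; station_name shares no dimension with them and needs its own read.
val sharedDimSchema = StructType(Seq(
  StructField("temperature", FloatType),
  StructField("humidity", FloatType)
))
```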
To integrate the NetCDF Spark connector into your project, add the following dependency to your preferred build tool configuration.
Add the following line to your `build.sbt` file:

```scala
libraryDependencies += "io.github.rejeb" %% "netcdf-spark-parser" % "1.0.0"
```
Include the following dependency in the `<dependencies>` section of your `pom.xml` file:

```xml
<dependency>
    <groupId>io.github.rejeb</groupId>
    <artifactId>netcdf-spark-parser_2.13</artifactId>
    <version>1.0.0</version>
</dependency>
```

Note: Change `_2.13` to `_2.12` if your project uses Scala 2.12 instead of 2.13.
For Gradle, add this dependency to the `dependencies` block of your `build.gradle` file:

```groovy
dependencies {
    implementation 'io.github.rejeb:netcdf-spark-parser_2.13:1.0.0'
}
```

Hint: Ensure that the Scala version in the artifact matches your project setup (e.g., `_2.12` or `_2.13`).
The connector requires an explicitly defined schema to map NetCDF variables. Here is an example schema definition:

```scala
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("temperature", FloatType),
  StructField("humidity", FloatType),
  StructField("timestamp", StringType),
  StructField("metadata", ArrayType(StringType))
))
```
Here is how to load a NetCDF file into a DataFrame:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("NetCDF File Reader").master("local[*]").getOrCreate()

val df = spark.read.format("netcdf")
  .schema(schema)
  .option("path", "/path/to/your/netcdf-file.nc")
  .load()

df.show()
```
| Option | Description | Required | Default |
|---|---|---|---|
| `path` | Path to the NetCDF file | Yes | None |
| `partition.size` | Rows per partition, to tune parallelism | No | 20,000 rows |
| `dimensions.to.ignore` | Comma-separated list of dimensions to ignore | No | None |
Example with options:

```scala
val df = spark
  .read
  .format("netcdf")
  .schema(schema)
  .option("path", "/path/to/file.nc")
  .option("partition.size", 50000)
  .option("dimensions.to.ignore", "dim1,dim2")
  .load()
```
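Reading from Amazon S3 (see Storage Compatibility above) works the same way: point the `path` option at an object-store URI. The exact URI form and credential setup follow the NetCDF-Java Dataset URLs documentation referenced earlier; the sketch below assumes credentials are already configured and uses a placeholder bucket and key:

```scala
// Sketch only: the URI scheme and credential setup depend on your deployment;
// consult the NetCDF-Java "Dataset URLs" documentation for the exact form.
val s3Df = spark
  .read
  .format("netcdf")
  .schema(schema)
  .option("path", "s3://my-bucket/path/to/file.nc") // placeholder URI
  .load()
```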
Here is a complete example:

```scala
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("temperature", FloatType),
  StructField("humidity", FloatType),
  StructField("timestamp", StringType),
  StructField("metadata", ArrayType(StringType))
))

val df = spark
  .read
  .format("netcdf")
  .schema(schema)
  .option("path", "/data/example.nc")
  .load()

df.printSchema()
df.show()
```
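Once loaded, the result is an ordinary Spark DataFrame, so the usual transformations apply. A small sketch (the column names come from the schema above; the aggregation itself is just an illustration):

```scala
import org.apache.spark.sql.functions.{avg, col}

// Average temperature per timestamp, using the columns defined in the schema.
df.where(col("temperature").isNotNull)
  .groupBy("timestamp")
  .agg(avg("temperature").alias("avg_temperature"))
  .show()
```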
- Schema Inference: Not supported; you must define the schema explicitly.
- Write Operations: Currently, writing to NetCDF files is not supported.
- Common Dimensions: Too many shared dimensions, or a large Cartesian product between them, can cause the parser to fail during partitioning and data reading.
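If you hit this limit, one possible workaround (a sketch only; whether it helps depends on your data, and the dimension name below is purely illustrative) is to drop dimensions your selected variables do not need via the `dimensions.to.ignore` option documented above:

```scala
// "bounds" stands in for a dimension that inflates the Cartesian product
// but is not required by the selected variables (illustrative only).
val slimDf = spark
  .read
  .format("netcdf")
  .schema(schema)
  .option("path", "/data/example.nc")
  .option("dimensions.to.ignore", "bounds")
  .load()
```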
Contributions are welcome! To contribute:
- Fork the project
- Create a feature branch (`git checkout -b feature/my-feature`)
- Commit your changes (`git commit -am 'Add my feature'`)
- Push to your branch (`git push origin feature/my-feature`)
- Create a Pull Request
This project is licensed under the MIT License. See the LICENSE file for details.