A Spark connector for efficiently parsing and reading NetCDF files at scale with Apache Spark. This project leverages the DataSource V2 API to integrate NetCDF file reading in a distributed and performant way. It uses the NetCDF-Java library to read data from NetCDF files.
- Custom Schema Support: Define the schema for NetCDF variables.
- Partition Handling: Automatically manages partitions for large NetCDF files.
- Scalable Performance: Optimized for distributed computing with Spark.
- Storage Compatibility: This connector supports reading NetCDF files from:
- Local file systems (tested).
- Amazon S3, see Dataset URLs for configuration (tested).
- Java: Version 11+
- Apache Spark: Version 3.5.x
- Scala: Versions 2.12 or 2.13
- Dependency Management: SBT, Maven, or similar
- Unidata repository: Add the Unidata repository; see Using netCDF-Java Maven Artifacts
- Transforming multi-dimensional data into tabular form.
- Processing climate and oceanographic data.
- Analyzing multi-dimensional scientific datasets.
- Batch processing of NetCDF files.
Loading data from a NetCDF file into a DataFrame requires that the variables to extract share at least one common dimension.
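For instance, with a hypothetical file layout (the variable and dimension names below are illustrative only), `temperature` and `humidity` share all of their dimensions and can be read into the same DataFrame, while a variable defined only on an unrelated dimension would have to be loaded separately:

```scala
import org.apache.spark.sql.types._

// Hypothetical layout of example.nc (illustrative only):
//   temperature(time, lat, lon)
//   humidity(time, lat, lon)
//   station_name(station)
// temperature and humidity share the time/lat/lon dimensions, so one read can
// cover both; station_name shares no dimension with them and needs its own read.
val sharedDimSchema = StructType(Seq(
  StructField("temperature", FloatType),
  StructField("humidity", FloatType)
))
```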
To integrate the NetCDF Spark connector into your project, add the following dependency to your preferred build tool configuration.
Add the following line to your `build.sbt` file:

```scala
libraryDependencies += "io.github.rejeb" %% "netcdf-spark-parser" % "1.0.0"
```
Include the following dependency in the `<dependencies>` section of your `pom.xml` file:

```xml
<dependency>
    <groupId>io.github.rejeb</groupId>
    <artifactId>netcdf-spark-parser_2.13</artifactId>
    <version>1.0.0</version>
</dependency>
```

Note: Change `_2.13` to `_2.12` if your project uses Scala 2.12 instead of 2.13.
For Gradle, add this dependency to the `dependencies` block of your `build.gradle` file:

```groovy
dependencies {
    implementation 'io.github.rejeb:netcdf-spark-parser_2.13:1.0.0'
}
```

Hint: Ensure that the Scala version in the artifact matches your project setup (e.g., `_2.12` or `_2.13`).
The connector requires an explicitly defined schema to map NetCDF variables. Here is an example schema definition:

```scala
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("temperature", FloatType),
  StructField("humidity", FloatType),
  StructField("timestamp", StringType),
  StructField("metadata", ArrayType(StringType))
))
```
Here is how to load a NetCDF file into a DataFrame:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("NetCDF File Reader").master("local[*]").getOrCreate()

val df = spark.read.format("netcdf")
  .schema(schema)
  .option("path", "/path/to/your/netcdf-file.nc")
  .load()

df.show()
```
| Option | Description | Required | Default |
|---|---|---|---|
| `path` | Path to the NetCDF file | Yes | None |
| `partition.size` | Rows per partition, to tune parallelism | No | 20,000 rows |
| `dimensions.to.ignore` | Comma-separated list of dimensions to ignore | No | None |
Example with options:

```scala
val df = spark
  .read
  .format("netcdf")
  .schema(schema)
  .option("path", "/path/to/file.nc")
  .option("partition.size", 50000)
  .option("dimensions.to.ignore", "dim1,dim2")
  .load()
```
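Reading from Amazon S3 (see Storage Compatibility above) works the same way: point the `path` option at an object-store URI. The exact URI form and credential setup follow the NetCDF-Java Dataset URLs documentation referenced earlier; the sketch below assumes credentials are already configured and uses a placeholder bucket and key:

```scala
// Sketch only: the URI scheme and credential setup depend on your deployment;
// consult the NetCDF-Java "Dataset URLs" documentation for the exact form.
val s3Df = spark
  .read
  .format("netcdf")
  .schema(schema)
  .option("path", "s3://my-bucket/path/to/file.nc") // placeholder URI
  .load()
```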
Here is a complete example:

```scala
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("temperature", FloatType),
  StructField("humidity", FloatType),
  StructField("timestamp", StringType),
  StructField("metadata", ArrayType(StringType))
))

val df = spark
  .read
  .format("netcdf")
  .schema(schema)
  .option("path", "/data/example.nc")
  .load()

df.printSchema()
df.show()
```
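Once loaded, the result is an ordinary Spark DataFrame, so the usual transformations apply. A small sketch (the column names come from the schema above; the aggregation itself is just an illustration):

```scala
import org.apache.spark.sql.functions.{avg, col}

// Average temperature per timestamp, using the columns defined in the schema.
df.where(col("temperature").isNotNull)
  .groupBy("timestamp")
  .agg(avg("temperature").alias("avg_temperature"))
  .show()
```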
- Schema Inference: Not supported; you must define the schema explicitly.
- Write Operations: Currently, writing to NetCDF files is not supported.
- Common Dimensions: Too many shared dimensions, or a large Cartesian product between them, can cause the parser to fail during partitioning and data reading.
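If you hit this limit, one possible workaround (a sketch only; whether it helps depends on your data, and the dimension name below is purely illustrative) is to drop dimensions your selected variables do not need via the `dimensions.to.ignore` option documented above:

```scala
// "bounds" stands in for a dimension that inflates the Cartesian product
// but is not required by the selected variables (illustrative only).
val slimDf = spark
  .read
  .format("netcdf")
  .schema(schema)
  .option("path", "/data/example.nc")
  .option("dimensions.to.ignore", "bounds")
  .load()
```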
Contributions are welcome! To contribute:
- Fork the project
- Create a feature branch (`git checkout -b feature/my-feature`)
- Commit your changes (`git commit -am 'Add my feature'`)
- Push to your branch (`git push origin feature/my-feature`)
- Create a Pull Request
This project is licensed under the MIT License. See the LICENSE file for details.