
NetCDF Spark Parser


A Spark connector for efficiently parsing and reading NetCDF files at scale with Apache Spark. The connector is built on the DataSource V2 API to read NetCDF files in a distributed, performant way, and uses the NetCDF Java library to decode the files themselves.


🚀 Features

  • Custom Schema Support: Define the schema for NetCDF variables.
  • Partition Handling: Automatically manages partitions for large NetCDF files.
  • Scalable Performance: Optimized for distributed computing with Spark.
  • Storage Compatibility: Reads NetCDF files from:
    • Local file systems (tested).
    • Amazon S3; see Dataset URLs for configuration (tested).
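For S3, credentials and the s3a filesystem need to be configured on the Spark session before loading. A minimal sketch, assuming credentials come from the environment; the bucket and object key are hypothetical, and the exact options depend on your Hadoop/AWS setup (see Dataset URLs):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical setup: credentials are read from the environment and the
// s3a filesystem addresses the bucket. Adjust to your deployment.
val spark = SparkSession.builder()
  .appName("NetCDF from S3")
  .config("spark.hadoop.fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
  .config("spark.hadoop.fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))
  .getOrCreate()

val df = spark.read.format("netcdf")
  .schema(schema)                                  // schema defined as in Usage below
  .option("path", "s3a://my-bucket/data/ocean.nc") // hypothetical bucket and key
  .load()
```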

📋 Requirements

  • Java: Version 11+
  • Apache Spark: Version 3.5.x
  • Scala: Version 2.12 or 2.13
  • Dependency Management: SBT, Maven, or similar
  • Unidata repository: Add the Unidata repository to your build; see Using netCDF-Java Maven Artifacts
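For SBT, the Unidata repository can be added as a resolver. The URL below is the standard Unidata artifacts endpoint at the time of writing; verify it against the "Using netCDF-Java Maven Artifacts" page:

```scala
// build.sbt -- resolver for netCDF-Java artifacts
resolvers += "Unidata" at "https://artifacts.unidata.ucar.edu/repository/unidata-all/"
```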

🧰 Use Cases

  • Transforming multi-dimensional data into tabular form.
  • Processing climate and oceanographic data.
  • Analyzing multi-dimensional scientific datasets.
  • Batch processing of NetCDF files.

📖 Usage

Loading data from a NetCDF file into a DataFrame requires that the variables to extract share at least one common dimension.

Add Dependency to Your Project

To integrate the NetCDF Spark connector into your project, add the following dependency to your preferred build tool configuration.

Using SBT

Add the following line to your build.sbt file:

libraryDependencies += "io.github.rejeb" %% "netcdf-spark-parser" % "1.0.0"

Using Maven

Include the following dependency in the <dependencies> section of your pom.xml file:

<dependency>
    <groupId>io.github.rejeb</groupId>
    <artifactId>netcdf-spark-parser_2.13</artifactId>
    <version>1.0.0</version>
</dependency>

Note: Change _2.13 to _2.12 if your project uses Scala 2.12 instead of 2.13.

Using Gradle

For Gradle, add this dependency to the dependencies block of your build.gradle file:

dependencies {
    implementation 'io.github.rejeb:netcdf-spark-parser_2.13:1.0.0'
}

Hint: Ensure that the Scala version in the artifact matches your project setup (e.g., _2.12 or _2.13).


Define Your NetCDF Schema

The connector cannot infer a schema from the file, so you must define one explicitly to map NetCDF variables to columns. Here is an example schema definition:

import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("temperature", FloatType),
  StructField("humidity", FloatType),
  StructField("timestamp", StringType),
  StructField("metadata", ArrayType(StringType))
))

Load NetCDF Files

Here is how to load a NetCDF file into a DataFrame:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("NetCDF File Reader")
  .master("local[*]")
  .getOrCreate()

val df = spark.read.format("netcdf")
  .schema(schema)
  .option("path", "/path/to/your/netcdf-file.nc")
  .load()

df.show()

Configuration Options

Option                 Description                                     Required   Default
path                   Path to the NetCDF file                         Yes        None
partition.size         Rows per partition, to tune parallelism         No         20,000
dimensions.to.ignore   Comma-separated list of dimensions to ignore    No         None

Example with options:

val df = spark
        .read
        .format("netcdf")
        .schema(schema)
        .option("path", "/path/to/file.nc")
        .option("partition.size", 50000)
        .option("dimensions.to.ignore", "dim1,dim2")
        .load()

Full Sample Pipeline Example

Here is a complete example:

import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("temperature", FloatType),
  StructField("humidity", FloatType),
  StructField("timestamp", StringType),
  StructField("metadata", ArrayType(StringType))
))

val df = spark
        .read
        .format("netcdf")
        .schema(schema)
        .option("path", "/data/example.nc")
        .load()

df.printSchema()
df.show()
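Once loaded, the DataFrame behaves like any other Spark source, so a batch pipeline can filter, aggregate, and persist it. A minimal sketch using the columns from the schema above; the threshold and output path are hypothetical:

```scala
import org.apache.spark.sql.functions._

// Average humidity for warm readings, persisted as Parquet for downstream jobs.
df.filter(col("temperature") > 25.0f)
  .groupBy(col("timestamp"))
  .agg(avg(col("humidity")).as("avg_humidity"))
  .write
  .mode("overwrite")
  .parquet("/data/output/humidity-by-timestamp") // hypothetical output path
```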

⚠️ Limitations

  • Schema Inference: Not supported; you must define the schema explicitly.
  • Write Operations: Currently, writing to NetCDF files is not supported.
  • Common Dimensions: Too many shared dimensions, or a large Cartesian product between them, can cause the parser to fail during partitioning and data reading.
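When the Cartesian product of shared dimensions is the problem, one mitigation (assuming the offending dimension is not needed in the output) is to exclude it with dimensions.to.ignore and use smaller partitions. The dimension name below is hypothetical:

```scala
// Sketch: exclude a high-cardinality dimension ("ensemble" is a placeholder)
// and shrink partitions to keep each task's row count manageable.
val df = spark.read.format("netcdf")
  .schema(schema)
  .option("path", "/path/to/file.nc")
  .option("dimensions.to.ignore", "ensemble")
  .option("partition.size", 10000)
  .load()
```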

🤝 Contributing

Contributions are welcome! To contribute:

  1. Fork the project
  2. Create a feature branch (git checkout -b feature/my-feature)
  3. Commit your changes (git commit -am 'Add my feature')
  4. Push to your branch (git push origin feature/my-feature)
  5. Create a Pull Request

📄 License

This project is licensed under the MIT License. See the LICENSE file for details.
