
Commit b3e8a35

update readme
1 parent c9bdbb2 commit b3e8a35

3 files changed (+30, -22 lines)

README.md

Lines changed: 27 additions & 20 deletions
@@ -2,64 +2,71 @@
 [![pypi](https://img.shields.io/pypi/v/pyspark-data-sources.svg?color=blue)](https://pypi.org/project/pyspark-data-sources/)

-This repository showcases custom Spark data sources built using the new [**Python Data Source API**](https://issues.apache.org/jira/browse/SPARK-44076) for the upcoming Apache Spark 4.0 release.
+This repository showcases custom Spark data sources built using the new [**Python Data Source API**](https://spark.apache.org/docs/4.0.0/api/python/tutorial/sql/python_data_source.html) introduced in Apache Spark 4.0.
 For an in-depth understanding of the API, please refer to the [API source code](https://github.com/apache/spark/blob/master/python/pyspark/sql/datasource.py).
 Note this repo is **demo only**; it is not intended for production use.
 Contributions and feedback are welcome to help improve the examples.

 ## Installation
 ```
-pip install pyspark-data-sources[all]
+pip install pyspark-data-sources
 ```

 ## Usage
-Make sure you use pyspark 4.0. You can install pyspark 4.0 [preview version](https://pypi.org/project/pyspark/4.0.0.dev2/)
+Make sure you have pyspark >= 4.0.0 installed.

 ```
-pip install "pyspark[connect]==4.0.0.dev2"
+pip install pyspark
 ```

 Or use [Databricks Runtime 15.4 LTS](https://docs.databricks.com/aws/en/release-notes/runtime/15.4lts) or above, or [Databricks Serverless](https://docs.databricks.com/aws/en/compute/serverless/).

-Try the data sources!
-
 ```python
-from pyspark_datasources.github import GithubDataSource
+from pyspark_datasources.fake import FakeDataSource

 # Register the data source
-spark.dataSource.register(GithubDataSource)
+spark.dataSource.register(FakeDataSource)

-spark.read.format("github").load("apache/spark").show()
+spark.read.format("fake").load().show()
 ```
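The new usage snippet reads with the fake source's default schema. The same source can also take a user-supplied schema; here is a minimal sketch, assuming column names are resolved to Faker provider methods such as `name` and `company` (an assumption about FakeDataSource's behavior, not shown in this diff):

```python
# Sketch only: assumes FakeDataSource resolves each column name to a
# matching Faker provider method (e.g. Faker().name(), Faker().company()),
# and that an active SparkSession `spark` is available.
from pyspark_datasources.fake import FakeDataSource

spark.dataSource.register(FakeDataSource)

# Request two columns via a DDL schema string; each field would be
# filled by the Faker provider of the same name.
spark.read.format("fake").schema("name string, company string").load().show()
```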
+## Example Data Sources
+
+| Data Source | Short Name | Description | Dependencies |
+|-------------|------------|-------------|--------------|
+| [GithubDataSource](pyspark_datasources/github.py) | `github` | Read pull requests from a Github repository | None |
+| [FakeDataSource](pyspark_datasources/fake.py) | `fake` | Generate fake data using the `Faker` library | None |
+| [StockDataSource](pyspark_datasources/stock.py) | `stock` | Read stock data from Alpha Vantage | None |
+| [SimpleJsonDataSource](pyspark_datasources/simplejson.py) | `simplejson` | Read JSON data from a file | `databricks-sdk` |
+
 See more here: https://allisonwang-db.github.io/pyspark-data-sources/.

+## Official Data Sources
+
+For production use, consider these official data source implementations built with the Python Data Source API:
+
+| Data Source | Repository | Description | Features |
+|-------------|------------|-------------|----------|
+| **HuggingFace Datasets** | [@huggingface/pyspark_huggingface](https://github.com/huggingface/pyspark_huggingface) | Production-ready Spark Data Source for 🤗 Hugging Face Datasets | • Stream datasets as Spark DataFrames<br>• Select subsets/splits with filters<br>• Authentication support<br>• Save DataFrames to Hugging Face |
+
 ## Contributing
-We welcome and appreciate any contributions to enhance and expand the custom data sources. If you're interested in contributing:
+We welcome and appreciate any contributions to enhance and expand the custom data sources:

 - **Add New Data Sources**: Want to add a new data source using the Python Data Source API? Submit a pull request or open an issue.
 - **Suggest Enhancements**: If you have ideas to improve a data source or the API, we'd love to hear them!
 - **Report Bugs**: Found something that doesn't work as expected? Let us know by opening an issue.

-**Need help or have questions?** Don't hesitate to open a new issue, and we'll do our best to assist you.

 ## Development
-
+### Environment Setup
 ```
 poetry install --all-extras
 poetry shell
 ```

-### Install PySpark from the latest Spark master
-- Clone the Apache Spark repo: `git clone git@github.com:apache/spark.git`
-- Build Spark: `build/sbt clean package`
-- Build PySpark: `cd python/packaging/classic && python setup.py sdist`
-- Install PySpark: `poetry run pip install <path-to-spark-repo>/python/dist/pyspark-4.1.0.dev0.tar.gz`
-
-### Build docs
+### Build Docs
 ```
 mkdocs serve
 ```
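The "Add New Data Sources" bullet in the diff above is the heart of the contribution workflow. For orientation, a minimal sketch of what such a source looks like with the Python Data Source API: subclass `DataSource` and `DataSourceReader` from `pyspark.sql.datasource`, then register the class. The `RangeDataSource` class, its `myrange` short name, and the `n` option are hypothetical illustrations, not part of this repo; an active SparkSession `spark` is assumed.

```python
from pyspark.sql.datasource import DataSource, DataSourceReader
from pyspark.sql.types import StructType


class RangeDataSource(DataSource):
    """Hypothetical example source that emits the rows 0..n-1."""

    @classmethod
    def name(cls):
        return "myrange"  # short name used with spark.read.format(...)

    def schema(self):
        return "id int"  # DDL string (a StructType also works)

    def reader(self, schema: StructType):
        return RangeReader(self.options)


class RangeReader(DataSourceReader):
    def __init__(self, options):
        # Option values arrive as strings from .option(...) calls.
        self.n = int(options.get("n", 10))

    def read(self, partition):
        # Yield one tuple per row, matching the declared schema.
        for i in range(self.n):
            yield (i,)


spark.dataSource.register(RangeDataSource)
spark.read.format("myrange").option("n", "3").load().show()
```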

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@ packages = [
 python = ">=3.9,<=3.12"
 pyarrow = ">=11.0.0"
 requests = "^2.31.0"
-faker = {version = "^23.1.0", optional = true}
+faker = "^23.1.0"
 mkdocstrings = {extras = ["python"], version = "^0.24.0"}
 datasets = {version = "^2.17.0", optional = true}
 databricks-sdk = {version = "^0.28.0", optional = true}
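This change promotes `faker` from an optional extra to a required dependency, which is what lets the README's plain `pip install pyspark-data-sources` work without the `[all]` extra. For context, fake rows come from ordinary Faker provider calls; a sketch, assuming FakeDataSource's default columns are name/date/zipcode/state (an assumption, not shown in this diff):

```python
from faker import Faker

faker = Faker()

# One provider call per field; a reader can yield one such tuple per row.
row = (faker.name(), faker.date(), faker.zipcode(), faker.state())
print(row)  # e.g. ('Jane Doe', '2021-07-15', '02134', 'Massachusetts')
```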

pyspark_datasources/huggingface.py

Lines changed: 2 additions & 1 deletion
@@ -4,7 +4,7 @@

 class HuggingFaceDatasets(DataSource):
     """
-    A DataSource for reading HuggingFace Datasets in Spark.
+    An example data source for reading HuggingFace Datasets in Spark.

     This data source allows reading public datasets from the HuggingFace Hub directly into Spark
     DataFrames. The schema is automatically inferred from the dataset features. The split can be
@@ -14,6 +14,7 @@ class HuggingFaceDatasets(DataSource):

     Notes:
     -----
+    - Please use the official HuggingFace Datasets API: https://github.com/huggingface/pyspark_huggingface.
     - The HuggingFace `datasets` library is required to use this data source. Make sure it is installed.
     - If the schema is automatically inferred, it will use string type for all fields.
     - Currently it can only be used with public datasets. Private or gated ones are not supported.
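Even with the new note steering users to the official package, the in-repo class remains a demo for public datasets. A minimal usage sketch; the `huggingface` short name and the `imdb` dataset path are assumptions, not shown in this diff:

```python
from pyspark_datasources.huggingface import HuggingFaceDatasets

spark.dataSource.register(HuggingFaceDatasets)

# Load a public dataset from the HuggingFace Hub; auto-inferred
# schemas use string type for every field (per the docstring above).
spark.read.format("huggingface").load("imdb").show()
```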
