
Commit b3e8a35

update readme
1 parent c9bdbb2 commit b3e8a35

3 files changed (+30, -22 lines)

README.md

Lines changed: 27 additions & 20 deletions
@@ -2,64 +2,71 @@
 [![pypi](https://img.shields.io/pypi/v/pyspark-data-sources.svg?color=blue)](https://pypi.org/project/pyspark-data-sources/)

-This repository showcases custom Spark data sources built using the new [**Python Data Source API**](https://issues.apache.org/jira/browse/SPARK-44076) for the upcoming Apache Spark 4.0 release.
+This repository showcases custom Spark data sources built using the new [**Python Data Source API**](https://spark.apache.org/docs/4.0.0/api/python/tutorial/sql/python_data_source.html) introduced in Apache Spark 4.0.
 For an in-depth understanding of the API, please refer to the [API source code](https://github.com/apache/spark/blob/master/python/pyspark/sql/datasource.py).
 Note this repo is **demo only**; it is not intended for production use.
 Contributions and feedback are welcome to help improve the examples.

 ## Installation
 ```
-pip install pyspark-data-sources[all]
+pip install pyspark-data-sources
 ```

 ## Usage
-Make sure you use pyspark 4.0. You can install pyspark 4.0 [preview version](https://pypi.org/project/pyspark/4.0.0.dev2/)
+Make sure you have pyspark >= 4.0.0 installed.

 ```
-pip install "pyspark[connect]==4.0.0.dev2"
+pip install pyspark
 ```

 Or use [Databricks Runtime 15.4 LTS](https://docs.databricks.com/aws/en/release-notes/runtime/15.4lts) or above, or [Databricks Serverless](https://docs.databricks.com/aws/en/compute/serverless/).

-Try the data sources!
-
 ```python
-from pyspark_datasources.github import GithubDataSource
+from pyspark_datasources.fake import FakeDataSource

 # Register the data source
-spark.dataSource.register(GithubDataSource)
+spark.dataSource.register(FakeDataSource)

-spark.read.format("github").load("apache/spark").show()
+spark.read.format("fake").load().show()
 ```
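The new usage snippet reads with the fake source's default schema. The same source can also take a user-supplied schema; here is a minimal sketch, assuming column names are resolved to Faker provider methods such as `name` and `company` (an assumption about FakeDataSource's behavior, not shown in this diff):

```python
# Sketch only: assumes FakeDataSource resolves each column name to a
# matching Faker provider method (e.g. Faker().name(), Faker().company()),
# and that an active SparkSession `spark` is available.
from pyspark_datasources.fake import FakeDataSource

spark.dataSource.register(FakeDataSource)

# Request two columns via a DDL schema string; each field would be
# filled by the Faker provider of the same name.
spark.read.format("fake").schema("name string, company string").load().show()
```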
+## Example Data Sources
+
+| Data Source | Short Name | Description | Dependencies |
+|-------------|------------|-------------|--------------|
+| [GithubDataSource](pyspark_datasources/github.py) | `github` | Read pull requests from a Github repository | None |
+| [FakeDataSource](pyspark_datasources/fake.py) | `fake` | Generate fake data using the `Faker` library | None |
+| [StockDataSource](pyspark_datasources/stock.py) | `stock` | Read stock data from Alpha Vantage | None |
+| [SimpleJsonDataSource](pyspark_datasources/simplejson.py) | `simplejson` | Read JSON data from a file | `databricks-sdk` |
+
 See more here: https://allisonwang-db.github.io/pyspark-data-sources/.

+## Official Data Sources
+
+For production use, consider these official data source implementations built with the Python Data Source API:
+
+| Data Source | Repository | Description | Features |
+|-------------|------------|-------------|----------|
+| **HuggingFace Datasets** | [@huggingface/pyspark_huggingface](https://github.com/huggingface/pyspark_huggingface) | Production-ready Spark Data Source for 🤗 Hugging Face Datasets | • Stream datasets as Spark DataFrames<br>• Select subsets/splits with filters<br>• Authentication support<br>• Save DataFrames to Hugging Face |
+
 ## Contributing
-We welcome and appreciate any contributions to enhance and expand the custom data sources. If you're interested in contributing:
+We welcome and appreciate any contributions to enhance and expand the custom data sources:

 - **Add New Data Sources**: Want to add a new data source using the Python Data Source API? Submit a pull request or open an issue.
 - **Suggest Enhancements**: If you have ideas to improve a data source or the API, we'd love to hear them!
 - **Report Bugs**: Found something that doesn't work as expected? Let us know by opening an issue.

-**Need help or have questions?** Don't hesitate to open a new issue, and we'll do our best to assist you.

 ## Development
-
+### Environment Setup
 ```
 poetry install --all-extras
 poetry shell
 ```

-### Install PySpark from the latest Spark master
-- Clone the Apache Spark repo: `git clone git@github.com:apache/spark.git`
-- Build Spark: `build/sbt clean package`
-- Build PySpark: `cd python/packaging/classic && python setup.py sdist`
-- Install PySpark: `poetry run pip install <path-to-spark-repo>/python/dist/pyspark-4.1.0.dev0.tar.gz`
-
-### Build docs
+### Build Docs
 ```
 mkdocs serve
 ```
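The "Add New Data Sources" bullet in the diff above is the heart of the contribution workflow. For orientation, a minimal sketch of what such a source looks like with the Python Data Source API: subclass `DataSource` and `DataSourceReader` from `pyspark.sql.datasource`, then register the class. The `RangeDataSource` class, its `myrange` short name, and the `n` option are hypothetical illustrations, not part of this repo; an active SparkSession `spark` is assumed.

```python
from pyspark.sql.datasource import DataSource, DataSourceReader
from pyspark.sql.types import StructType


class RangeDataSource(DataSource):
    """Hypothetical example source that emits the rows 0..n-1."""

    @classmethod
    def name(cls):
        return "myrange"  # short name used with spark.read.format(...)

    def schema(self):
        return "id int"  # DDL string (a StructType also works)

    def reader(self, schema: StructType):
        return RangeReader(self.options)


class RangeReader(DataSourceReader):
    def __init__(self, options):
        # Option values arrive as strings from .option(...) calls.
        self.n = int(options.get("n", 10))

    def read(self, partition):
        # Yield one tuple per row, matching the declared schema.
        for i in range(self.n):
            yield (i,)


spark.dataSource.register(RangeDataSource)
spark.read.format("myrange").option("n", "3").load().show()
```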

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@ packages = [
 python = ">=3.9,<=3.12"
 pyarrow = ">=11.0.0"
 requests = "^2.31.0"
-faker = {version = "^23.1.0", optional = true}
+faker = "^23.1.0"
 mkdocstrings = {extras = ["python"], version = "^0.24.0"}
 datasets = {version = "^2.17.0", optional = true}
 databricks-sdk = {version = "^0.28.0", optional = true}
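This change promotes `faker` from an optional extra to a required dependency, which is what lets the README's plain `pip install pyspark-data-sources` work without the `[all]` extra. For context, fake rows come from ordinary Faker provider calls; a sketch, assuming FakeDataSource's default columns are name/date/zipcode/state (an assumption, not shown in this diff):

```python
from faker import Faker

faker = Faker()

# One provider call per field; a reader can yield one such tuple per row.
row = (faker.name(), faker.date(), faker.zipcode(), faker.state())
print(row)  # e.g. ('Jane Doe', '2021-07-15', '02134', 'Massachusetts')
```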

pyspark_datasources/huggingface.py

Lines changed: 2 additions & 1 deletion
@@ -4,7 +4,7 @@

 class HuggingFaceDatasets(DataSource):
     """
-    A DataSource for reading HuggingFace Datasets in Spark.
+    An example data source for reading HuggingFace Datasets in Spark.

     This data source allows reading public datasets from the HuggingFace Hub directly into Spark
     DataFrames. The schema is automatically inferred from the dataset features. The split can be
@@ -14,6 +14,7 @@ class HuggingFaceDatasets(DataSource):

     Notes:
     -----
+    - Please use the official HuggingFace Datasets API: https://github.com/huggingface/pyspark_huggingface.
     - The HuggingFace `datasets` library is required to use this data source. Make sure it is installed.
     - If the schema is automatically inferred, it will use string type for all fields.
     - Currently it can only be used with public datasets. Private or gated ones are not supported.
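Even with the new note steering users to the official package, the in-repo class remains a demo for public datasets. A minimal usage sketch; the `huggingface` short name and the `imdb` dataset path are assumptions, not shown in this diff:

```python
from pyspark_datasources.huggingface import HuggingFaceDatasets

spark.dataSource.register(HuggingFaceDatasets)

# Load a public dataset from the HuggingFace Hub; auto-inferred
# schemas use string type for every field (per the docstring above).
spark.read.format("huggingface").load("imdb").show()
```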
