[pyspark-data-sources on PyPI](https://pypi.org/project/pyspark-data-sources/)
This repository showcases custom Spark data sources built using the new [**Python Data Source API**](https://spark.apache.org/docs/4.0.0/api/python/tutorial/sql/python_data_source.html) introduced in Apache Spark 4.0.
For an in-depth understanding of the API, please refer to the [API source code](https://github.com/apache/spark/blob/master/python/pyspark/sql/datasource.py).
Note: this repo is for **demo purposes only** and is not intended for production use.
Contributions and feedback are welcome to help improve the examples.

## Installation

```
pip install pyspark-data-sources
```
|
15 | 15 |
|
16 | 16 | ## Usage
|
17 |
| -Make sure you use pyspark 4.0. You can install pyspark 4.0 [preview version](https://pypi.org/project/pyspark/4.0.0.dev2/) |
| 17 | +Make sure you have pyspark >= 4.0.0 installed. |
18 | 18 |
|
19 | 19 | ```
|
20 |
| -pip install "pyspark[connect]==4.0.0.dev2" |
| 20 | +pip install pyspark |
21 | 21 | ```

Or use [Databricks Runtime 15.4 LTS](https://docs.databricks.com/aws/en/release-notes/runtime/15.4lts) or above, or [Databricks Serverless](https://docs.databricks.com/aws/en/compute/serverless/).
|
24 | 24 |
|
25 | 25 |
|
26 |
| -Try the data sources! |
27 |
| - |
28 | 26 | ```python
|
29 |
| -from pyspark_datasources.github import GithubDataSource |
| 27 | +from pyspark_datasources.fake import FakeDataSource |
30 | 28 |
|
31 | 29 | # Register the data source
|
32 |
| -spark.dataSource.register(GithubDataSource) |
| 30 | +spark.dataSource.register(FakeDataSource) |
33 | 31 |
|
34 |
| -spark.read.format("github").load("apache/spark").show() |
| 32 | +spark.read.format("fake").load().show() |
35 | 33 | ```

## Example Data Sources

| Data Source | Short Name | Description | Dependencies |
|-------------|------------|-------------|--------------|
| [GithubDataSource](pyspark_datasources/github.py) | `github` | Read pull requests from a Github repository | None |
| [FakeDataSource](pyspark_datasources/fake.py) | `fake` | Generate fake data using the `Faker` library | `faker` |
| [StockDataSource](pyspark_datasources/stock.py) | `stock` | Read stock data from Alpha Vantage | None |
| [SimpleJsonDataSource](pyspark_datasources/simplejson.py) | `simplejson` | Write JSON data to Databricks DBFS | `databricks-sdk` |
See more here: https://allisonwang-db.github.io/pyspark-data-sources/.

## Official Data Sources

For production use, consider these official data source implementations built with the Python Data Source API:

| Data Source | Repository | Description | Features |
|-------------|------------|-------------|----------|
| **HuggingFace Datasets** | [@huggingface/pyspark_huggingface](https://github.com/huggingface/pyspark_huggingface) | Production-ready Spark data source for 🤗 Hugging Face Datasets | • Stream datasets as Spark DataFrames<br>• Select subsets/splits with filters<br>• Authentication support<br>• Save DataFrames to Hugging Face |

## Contributing

We welcome and appreciate any contributions to enhance and expand the custom data sources:

- **Add New Data Sources**: Want to add a new data source using the Python Data Source API? Submit a pull request or open an issue.
- **Suggest Enhancements**: If you have ideas to improve a data source or the API, we'd love to hear them!
- **Report Bugs**: Found something that doesn't work as expected? Let us know by opening an issue.

## Development

### Environment Setup

```
poetry install --all-extras
poetry shell
```

### Build Docs

```
mkdocs serve
```