Commit 372d3c7

update docs
1 parent 7f114d3 commit 372d3c7

3 files changed: +98 -134 lines changed

README.md

Lines changed: 18 additions & 13 deletions
@@ -1,10 +1,10 @@
-# pyspark-data-sources
+# PySpark Data Sources
 
 [![pypi](https://img.shields.io/pypi/v/pyspark-data-sources.svg?color=blue)](https://pypi.org/project/pyspark-data-sources/)
 
 This repository showcases custom Spark data sources built using the new [**Python Data Source API**](https://spark.apache.org/docs/4.0.0/api/python/tutorial/sql/python_data_source.html) introduced in Apache Spark 4.0.
 For an in-depth understanding of the API, please refer to the [API source code](https://github.com/apache/spark/blob/master/python/pyspark/sql/datasource.py).
-Note this repo is **demo only** and please be aware that it is not intended for production use.
+Note this repo is demo only and please be aware that it is not intended for production use.
 Contributions and feedback are welcome to help improve the examples.
 
 
@@ -30,26 +30,31 @@ from pyspark_datasources.fake import FakeDataSource
 spark.dataSource.register(FakeDataSource)
 
 spark.read.format("fake").load().show()
+
+# For streaming data generation
+spark.readStream.format("fake").load().writeStream.format("console").start()
 ```
 
 ## Example Data Sources
 
-| Data Source | Short Name | Description | Dependencies |
-|-------------|------------|-------------|--------------|
-| [GithubDataSource](pyspark_datasources/github.py) | `github` | Read pull requests from a Github repository | None |
-| [FakeDataSource](pyspark_datasources/fake.py) | `fake` | Generate fake data using the `Faker` library | None |
-| [StockDataSource](pyspark_datasources/stock.py) | `stock` | Read stock data from Alpha Vantage | None |
-| [SimpleJsonDataSource](pyspark_datasources/simplejson.py) | `simplejson` | Read JSON data from a file | `databricks-sdk` |
+| Data Source | Short Name | Description | Dependencies |
+|-------------------------------------------------------------------------|----------------|-----------------------------------------------|-----------------------|
+| [GithubDataSource](pyspark_datasources/github.py) | `github` | Read pull requests from a Github repository | None |
+| [FakeDataSource](pyspark_datasources/fake.py) | `fake` | Generate fake data using the `Faker` library | `faker` |
+| [StockDataSource](pyspark_datasources/stock.py) | `stock` | Read stock data from Alpha Vantage | None |
+| [GoogleSheetsDataSource](pyspark_datasources/googlesheets.py) | `googlesheets` | Read table from public Google Sheets | None |
+| [KaggleDataSource](pyspark_datasources/kaggle.py) | `kaggle` | Read datasets from Kaggle | `kagglehub`, `pandas` |
+| [SimpleJsonDataSource](pyspark_datasources/simplejson.py) | `simplejson` | Write JSON data to Databricks DBFS | `databricks-sdk` |
 
 See more here: https://allisonwang-db.github.io/pyspark-data-sources/.
 
 ## Official Data Sources
 
 For production use, consider these official data source implementations built with the Python Data Source API:
 
-| Data Source | Repository | Description | Features |
-|-------------|------------|-------------|----------|
-| **HuggingFace Datasets** | [@huggingface/pyspark_huggingface](https://github.com/huggingface/pyspark_huggingface) | Production-ready Spark Data Source for 🤗 Hugging Face Datasets | • Stream datasets as Spark DataFrames<br>• Select subsets/splits with filters<br>• Authentication support<br>• Save DataFrames to Hugging Face<br> |
+| Data Source | Repository | Description | Features |
+|--------------------------|-----------------------------------------------------------------------------------------------|----------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------|
+| **HuggingFace Datasets** | [@huggingface/pyspark_huggingface](https://github.com/huggingface/pyspark_huggingface) | Production-ready Spark Data Source for 🤗 Hugging Face Datasets | • Stream datasets as Spark DataFrames<br>• Select subsets/splits with filters<br>• Authentication support<br>• Save DataFrames to Hugging Face<br> |
 
 ## Contributing
 We welcome and appreciate any contributions to enhance and expand the custom data sources.:
@@ -62,8 +67,8 @@ We welcome and appreciate any contributions to enhance and expand the custom dat
 ## Development
 ### Environment Setup
 ```
-poetry install --all-extras
-poetry shell
+poetry install
+poetry env activate
 ```
 
 ### Build Docs

docs/index.md

Lines changed: 7 additions & 4 deletions
@@ -17,12 +17,15 @@ pip install pyspark-data-sources[all]
 ## Usage
 
 ```python
-from pyspark_datasources import GithubDataSource
+from pyspark_datasources.fake import FakeDataSource
 
 # Register the data source
-spark.dataSource.register(GithubDataSource)
+spark.dataSource.register(FakeDataSource)
 
-spark.read.format("github").load("apache/spark").show()
+spark.read.format("fake").load().show()
+
+# For streaming data generation
+spark.readStream.format("fake").load().writeStream.format("console").start()
 ```
 
 
@@ -34,6 +37,6 @@ spark.read.format("github").load("apache/spark").show()
 | [FakeDataSource](./datasources/fake.md) | `fake` | Generate fake data using the `Faker` library | `faker` |
 | [HuggingFaceDatasets](./datasources/huggingface.md) | `huggingface` | Read datasets from the HuggingFace Hub | `datasets` |
 | [StockDataSource](./datasources/stock.md) | `stock` | Read stock data from Alpha Vantage | None |
-| [SimpleJsonDataSource](./datasources/simplejson.md) | `simplejson` | Read JSON data from a file | `databricks-sdk` |
+| [SimpleJsonDataSource](./datasources/simplejson.md) | `simplejson` | Write JSON data to Databricks DBFS | `databricks-sdk` |
 | [GoogleSheetsDataSource](./datasources/googlesheets.md) | `googlesheets` | Read table from public Google Sheets document | None |
 | [KaggleDataSource](./datasources/kaggle.md) | `kaggle` | Read datasets from Kaggle | `kagglehub`, `pandas` |
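
The usage snippets in this diff register an existing data source and read from it by its short name. For context, here is a minimal sketch of what a batch data source might look like under the Python Data Source API (Spark 4.0+). It is not taken from this repository: `CounterDataSource`, `CounterReader`, the `counter` short name, and the `numRows` option are hypothetical names chosen for illustration. The `try`/`except` fallback is also an assumption, added only so the sketch can be imported and inspected without PySpark installed.

```python
try:
    # Available in PySpark 4.0 and later.
    from pyspark.sql.datasource import DataSource, DataSourceReader
except ImportError:
    # Stand-ins (assumption): only here so the sketch imports without PySpark.
    class DataSource:
        def __init__(self, options):
            self.options = options

    class DataSourceReader:
        pass


class CounterReader(DataSourceReader):
    """Hypothetical reader that yields (id, label) tuples."""

    def __init__(self, options):
        # Options arrive as strings, so convert explicitly.
        self.num_rows = int(options.get("numRows", 3))

    def read(self, partition):
        # Each yielded tuple becomes one row matching the declared schema.
        for i in range(self.num_rows):
            yield (i, f"row-{i}")


class CounterDataSource(DataSource):
    """Hypothetical batch data source registered under the short name 'counter'."""

    @classmethod
    def name(cls):
        # Short name used in spark.read.format(...).
        return "counter"

    def schema(self):
        # DDL-formatted schema string.
        return "id int, label string"

    def reader(self, schema):
        return CounterReader(self.options)
```

With a Spark 4.0+ session, such a class would be registered the same way as in the snippets above, for example `spark.dataSource.register(CounterDataSource)` followed by `spark.read.format("counter").option("numRows", 5).load().show()`.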
