Commit 1b5ee59

add code block filename background color

1 parent a161156 commit 1b5ee59

3 files changed: +154 -0 lines changed

Lines changed: 131 additions & 0 deletions
@@ -0,0 +1,131 @@
---
authors:
  - copdips
categories:
  - python
  - spark
  - database
comments: true
date:
  created: 2024-11-14
---

# PySpark database connectors

## General

Use `spark.jars` to add local ODBC/JDBC drivers to PySpark, and use `spark.jars.packages` to add remote ones: PySpark will download the packages from the Maven repository.

For `spark-shell`: https://docs.snowflake.com/en/user-guide/spark-connector-install#installing-additional-packages-if-needed

```python linenums="1" hl_lines="5 9"
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.config(
        "spark.jars",
        "/home/xiang/src/sqljdbc_12.8/enu/jars/mssql-jdbc-12.8.1.jre11.jar",
    )
    .config(
        "spark.jars.packages",
        "net.snowflake:snowflake-jdbc:3.13.22,net.snowflake:spark-snowflake_2.12:2.12.0-spark_3.4",
    )
    .getOrCreate()
)
```

!!! note "In Databricks, normally we don't need to add ODBC/JDBC drivers manually, as we can configure the cluster (in the Advanced Options tab) to install the drivers automatically."

!!! note "Pay attention to the compatibility among the Java version, the Spark version, the PySpark version, and the JDBC driver version."
    Normally, JDK 11 works well with Spark 3.4 and 3.5 (and with the matching PySpark versions), but as of 2024 many JDBC drivers are not yet compatible with Spark 3.5, so staying on Spark 3.4 is a safe choice.

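To quickly check which Spark and Java versions a session actually runs, here is a minimal sketch (note that `sparkContext._jvm` is an internal Py4J gateway, so treat it as a debugging aid only):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Spark (and hence PySpark) version, e.g. "3.4.3"
print(spark.version)

# Java version seen by the driver JVM, e.g. "11.0.24"
print(spark.sparkContext._jvm.java.lang.System.getProperty("java.version"))
```
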
## Microsoft SQL Server

1. Download the JDBC driver from [Microsoft](https://learn.microsoft.com/en-us/sql/connect/jdbc/download-microsoft-jdbc-driver-for-sql-server). Suppose it's downloaded to `~/src/sqljdbc_12.8.1.0_enu.tar.gz`.
2. `cd ~/src && tar -xvf sqljdbc_12.8.1.0_enu.tar.gz`
3. Add the JDBC driver to the Spark jars in the PySpark code:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.config(
        "spark.jars",
        "/home/xiang/src/sqljdbc_12.8/enu/jars/mssql-jdbc-12.8.1.jre11.jar",
    )
    .getOrCreate()
)

"""
# or:

spark = (
    SparkSession.builder
    .config(
        "spark.jars.packages",
        "com.microsoft.sqlserver:mssql-jdbc:12.8.1.jre11",
        # old, no longer maintained driver: "com.microsoft.azure:spark-mssql-connector_2.12:1.3.0-BETA",
    )
    .getOrCreate()
)

# or:

spark = (
    SparkSession.builder
    .config(
        "spark.driver.extraClassPath",
        "/home/xiang/src/sqljdbc_12.8/enu/jars/mssql-jdbc-12.8.1.jre11.jar",
    )
    .config(
        "spark.executor.extraClassPath",
        "/home/xiang/src/sqljdbc_12.8/enu/jars/mssql-jdbc-12.8.1.jre11.jar",
    )
    .getOrCreate()
)
"""

# connection placeholders, replace with real values
url = "jdbc:sqlserver://localhost:1433;databaseName=my_database"
user = "user"
password = "password"

df = (
    spark.read.format("jdbc")
    .options(
        # ! the `driver` parameter is not needed in a Databricks environment;
        # here it is for local testing, and it only works when the jar is specified
        # via spark.jars, not via spark.driver.extraClassPath / spark.executor.extraClassPath
        driver="com.microsoft.sqlserver.jdbc.SQLServerDriver",
        url=url,
        dbtable="dbo.my_table",
        authentication="SqlPassword",
        user=user,
        password=password,
    )
    .load()
)
```
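
Writing back to SQL Server goes through the mirror-image `DataFrameWriter` API with the same driver; a minimal sketch reusing the `df`, `url`, `user`, and `password` placeholders from the block above (the target table name is hypothetical):

```python
(
    df.write.format("jdbc")
    .options(
        driver="com.microsoft.sqlserver.jdbc.SQLServerDriver",
        url=url,
        dbtable="dbo.my_table_copy",  # hypothetical target table
        user=user,
        password=password,
    )
    .mode("append")  # or "overwrite"
    .save()
)
```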

## Snowflake

Ref: <https://docs.snowflake.com/en/user-guide/spark-connector-install#step-4-configure-the-local-spark-cluster-or-amazon-emr-hosted-spark-environment>

```python
# https://docs.databricks.com/en/connect/external-systems/snowflake.html
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config(
        # download the JDBC driver on the fly from Maven
        "spark.jars.packages",
        "net.snowflake:snowflake-jdbc:3.13.22,net.snowflake:spark-snowflake_2.12:2.12.0-spark_3.4",
    )
    .getOrCreate()
)

sf_params = {
    "sfURL": "account_name.snowflakecomputing.com",
    "sfUser": "user",
    "sfPassword": "password",
    "sfDatabase": "database",
    "sfSchema": "schema",
    "sfWarehouse": "warehouse",
    "sfRole": "role",
}
query = "select * from database.schema.table"
df = spark.read.format("snowflake").options(**sf_params).option("query", query).load()
```
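
Writing to Snowflake uses the same connector; a minimal sketch reusing `sf_params` and the `df` from above (the target table name is a placeholder):

```python
(
    df.write.format("snowflake")
    .options(**sf_params)
    .option("dbtable", "database.schema.target_table")  # placeholder
    .mode("append")
    .save()
)
```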
Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
---
authors:
  - copdips
categories:
  - python
  - spark
  - databricks
  - performance
comments: true
date:
  created: 2024-12-14
---

# Database performance

## Query optimization

- [Query databases using JDBC](https://docs.databricks.com/en/connect/external-systems/jdbc.html#) (see the sketch below)
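
For instance, pushing a query down with the JDBC `query` option lets the database run the filter or aggregation so Spark only fetches the (small) result set; a minimal sketch with placeholder connection values:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (
    spark.read.format("jdbc")
    # placeholder connection values
    .option("url", "jdbc:sqlserver://localhost:1433;databaseName=my_database")
    .option("user", "user")
    .option("password", "password")
    # the database executes the aggregation, Spark only receives the result
    .option("query", "SELECT status, COUNT(*) AS n FROM dbo.orders GROUP BY status")
    .load()
)
```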

docs/stylesheets/extra.css

Lines changed: 5 additions & 0 deletions
```diff
@@ -118,6 +118,11 @@ a.md-nav__link:hover {
   border-bottom-right-radius: 0; /* bottom-right border radius */
   margin-bottom: 5px;
   text-align: center; /* center the text */
+  background-color: #edf3ff; /* background color */
+
+}
+[data-md-color-scheme="slate"] .highlight span.filename {
+  background-color: #232c3f; /* background color */
 }
 
 .highlight span.filename+pre>code {
```