
Commit a0e0fbc

Merge pull request #19 from oracle/2.6.8_docs
Adds Data Flow Studio section to the docs.
2 parents 88dd9b8 + 6e0d794


10 files changed: +336 −27 lines changed


docs/source/index.rst

Lines changed: 12 additions & 9 deletions
@@ -2,12 +2,20 @@
 .. meta::
     :description lang=en:
         Oracle Accelerated Data Science SDK (ORACLE-ADS)
-        is a Python library that is part of the Oracle Cloud Infrastructure Data Science service. ORACLE-ADS is the client
-        library and CLI for Machine learning engineers to work with Cloud Infrastructure (CPU and GPU VMs, Storage etc, Spark) for Data, Models,
+        is a Python library that is part of the Oracle Cloud Infrastructure Data Science service. ORACLE-ADS is the client
+        library and CLI for Machine learning engineers to work with Cloud Infrastructure (CPU and GPU VMs, Storage etc, Spark) for Data, Models,
         Notebooks, Pipelines and Jobs.
 
 Oracle Accelerated Data Science SDK (ADS)
 =========================================
+|PyPI|_ |Python|_ |Notebook Examples|_
+
+.. |PyPI| image:: https://img.shields.io/pypi/v/oracle-ads.svg
+.. _PyPI: https://pypi.org/project/oracle-ads/
+.. |Python| image:: https://img.shields.io/pypi/pyversions/oracle-ads.svg?style=plastic
+.. _Python: https://pypi.org/project/oracle-ads/
+.. |Notebook Examples| image:: https://img.shields.io/badge/docs-notebook--examples-blue
+.. _Notebook Examples: https://github.com/oracle-samples/oci-data-science-ai-samples/tree/master/notebook_examples
 
 .. toctree::
     :hidden:
@@ -81,11 +89,6 @@ Oracle Accelerated Data Science SDK (ADS)
 
     python3 -m pip install oracle-ads
 
-|PyPI|_
-
-.. |PyPI| image:: https://img.shields.io/pypi/v/oracle-ads.svg
-.. _PyPI: https://pypi.org/project/oracle-ads/
-
 
 .. admonition:: Source Code
 
@@ -138,7 +141,7 @@ Load data from Object Storage
 
 
 
-Load data from Autonomous DB
+Load data from Autonomous DB
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 This example uses SQL injection safe binding variables.
@@ -165,7 +168,7 @@ This example uses SQL injection safe binding variables.
     connection_parameters=connection_parameters,
 )
 
-More Examples
+More Examples
 ~~~~~~~~~~~~~
 
 See :doc:`quick start<user_guide/quick_start/quick_start>` guide for more examples
Lines changed: 0 additions & 1 deletion
@@ -1,3 +1,2 @@
 * Data Catalog requires policies to be set in IAM. Refer to the Data Catalog documentation on how to `setup policies <https://docs.oracle.com/en-us/iaas/data-catalog/using/metastore.htm>`__.
-* The ``spark-defaults.conf`` file needs to be :ref:`configured<configuration-spark_defaults_conf>`.
 
Lines changed: 2 additions & 3 deletions
@@ -1,4 +1,3 @@
 * DataFlow requires a bucket to store the logs, and a data warehouse bucket. Refer to the Data Flow documentation for `setting up storage <https://docs.cloud.oracle.com/en-us/iaas/data-flow/using/dfs_getting_started.htm#set_up_storage>`_.
-* DataFlow requires policies to be set in IAM to access resources to manage and run applications. Refer to the Data Flow documentation on how to `setup policies <https://docs.cloud.oracle.com/en-us/iaas/data-flow/using/dfs_getting_started.htm#policy_set_up>`__.
-* DataFlow natively supports conda packs published to OCI Object Storage. Ensure the Data Flow Resource has read access to the bucket or path of your published conda pack, and that the spark version >= 3 when running your Data Flow Application.
-* The ``core-site.xml`` file needs to be :ref:`configured<configuration-core_site_xml>`.
+* DataFlow requires policies to be set in IAM to access resources to manage and run applications/sessions. Refer to the Data Flow documentation on how to `setup policies <https://docs.cloud.oracle.com/en-us/iaas/data-flow/using/dfs_getting_started.htm#policy_set_up>`__.
+* DataFlow natively supports conda packs published to OCI Object Storage. Ensure the Data Flow Resource has read access to the bucket or path of your published conda pack, and that the spark version >= 3 when running your Data Flow Application/Session.
Lines changed: 298 additions & 0 deletions
@@ -0,0 +1,298 @@

####################
OCI Data Flow Studio
####################

This section demonstrates how to run interactive Spark workloads on a long-lasting `Oracle Cloud Infrastructure Data Flow <https://docs.oracle.com/iaas/data-flow/using/home.htm>`__ cluster through `Apache Livy <https://livy.apache.org/>`__ integration.

**Data Flow Studio allows you to:**

* Run Spark code against a Data Flow remote Spark cluster
* Create a Data Flow Spark session with SparkContext and HiveContext against a Data Flow remote Spark cluster
* Capture the output of Spark queries as a local Pandas data frame to interact easily with other Python libraries (e.g. matplotlib)

**Key Features & Benefits:**

* Data Flow sessions support auto-scaling Data Flow cluster capabilities
* Data Flow sessions support the use of conda environments as customizable Spark runtime environments

**Limitations:**

* Data Flow sessions can last up to 7 days or 10,080 mins (maxDurationInMinutes).
* Data Flow sessions can only be accessed through OCI Data Science Notebook Sessions.
* Not all SparkMagic commands are currently supported. To see the full list, run the ``%help`` command in a notebook cell.

**Notebook Examples:**

* `Introduction to the Oracle Cloud Infrastructure Data Flow Studio <https://github.com/oracle-samples/oci-data-science-ai-samples/blob/master/notebook_examples/pyspark-data_flow_studio-introduction.ipynb>`__
* `Spark NLP within Oracle Cloud Infrastructure Data Flow Studio <https://github.com/oracle-samples/oci-data-science-ai-samples/blob/master/notebook_examples/pyspark-data_flow_studio-spark_nlp.ipynb>`__

Prerequisite
============

Data Flow Sessions are accessible through the following conda environment:

* PySpark 3.2 and Data Flow 1.0 (pyspark32_p38_cpu_v1)

You can customize **pyspark32_p38_cpu_v1**, publish it, and use it as a runtime environment for a Data Flow Session.

Policies
********

Data Flow requires policies to be set in IAM to access resources to manage and run sessions. Refer to the `Data Flow Studio Policies <https://docs.oracle.com/en-us/iaas/data-flow/using/dfs_getting_started.htm#policies-data-flow-studio>`__ documentation on how to set up policies.

Quick Start
===========

.. code-block:: python

    import ads
    ads.set_auth("resource_principal")

    %load_ext dataflow.magics

    %create_session -l python -c '{\
        "compartmentId":"<compartment_id>",\
        "displayName":"TestDataFlowSession",\
        "sparkVersion":"3.2.1",\
        "driverShape":"VM.Standard.E4.Flex",\
        "executorShape":"VM.Standard.E4.Flex",\
        "numExecutors":1,\
        "driverShapeConfig":{"ocpus":1,"memoryInGBs":16},\
        "executorShapeConfig":{"ocpus":1,"memoryInGBs":16},\
        "logsBucketUri" : "oci://<bucket_name>@<namespace>/"}'

    %%spark
    print(sc.version)

Data Flow Spark Magic
=====================

Data Flow Spark Magic is used for interactively working with remote Spark clusters through Livy, a Spark REST server, in Jupyter notebooks. It is a JupyterLab extension that you need to activate in your notebook.

.. code-block:: python

    %load_ext dataflow.magics

Use the ``%help`` command to get a list of all the available commands, along with a list of their arguments and example calls.

.. code-block:: python

    %help

.. admonition:: Tip

    To access the docstrings of any magic command and figure out what arguments to provide, simply add ``?`` at the end of the command. For instance: ``%create_session?``

Create Session
**************

**Example command for Flex shapes**

To create a new Data Flow cluster session, use the ``%create_session`` magic command.

.. code-block:: python

    %create_session -l python -c '{\
        "compartmentId":"<compartment_id>",\
        "displayName":"TestDataFlowSession",\
        "sparkVersion":"3.2.1",\
        "driverShape":"VM.Standard.E4.Flex",\
        "executorShape":"VM.Standard.E4.Flex",\
        "numExecutors":1,\
        "driverShapeConfig":{"ocpus":1,"memoryInGBs":16},\
        "executorShapeConfig":{"ocpus":1,"memoryInGBs":16},\
        "logsBucketUri" : "oci://<bucket_name>@<namespace>/"}'

**Example command for Spark dynamic allocation (aka auto-scaling)**

To help you save resources and reduce time on management, Spark `dynamic allocation <https://docs.oracle.com/iaas/data-flow/using/dynamic-alloc-about.htm#dynamic-alloc-about>`__ is now enabled in Data Flow. You can define a Data Flow cluster based on a range of executors, instead of just a fixed number of executors. Spark provides a mechanism to dynamically adjust the resources the application occupies based on the workload. The application might relinquish resources if they are no longer used and request them again later when there is demand.

.. code-block:: python

    %create_session -l python -c '{\
        "compartmentId":"<compartment_id>",\
        "displayName":"TestDataFlowSession",\
        "sparkVersion":"3.2.1",\
        "driverShape":"VM.Standard.E4.Flex",\
        "executorShape":"VM.Standard.E4.Flex",\
        "numExecutors":1,\
        "driverShapeConfig":{"ocpus":1,"memoryInGBs":16},\
        "executorShapeConfig":{"ocpus":1,"memoryInGBs":16},\
        "logsBucketUri" : "oci://<bucket_name>@<namespace>/",\
        "configuration":{\
            "spark.dynamicAllocation.enabled":"true",\
            "spark.dynamicAllocation.shuffleTracking.enabled":"true",\
            "spark.dynamicAllocation.minExecutors":"1",\
            "spark.dynamicAllocation.maxExecutors":"4",\
            "spark.dynamicAllocation.executorIdleTimeout":"60",\
            "spark.dynamicAllocation.schedulerBacklogTimeout":"60",\
            "spark.dataflow.dynamicAllocation.quotaPolicy":"min"}}'

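Once the session is running, you can optionally confirm that the dynamic allocation settings were applied by reading them back from the remote Spark configuration. This is a minimal sketch added for illustration, assuming the session above was created successfully:

.. code-block:: python

    %%spark
    # Read the dynamic allocation settings back from the remote Spark context
    # to confirm that the "configuration" block above took effect.
    print(sc.getConf().get("spark.dynamicAllocation.enabled"))
    print(sc.getConf().get("spark.dynamicAllocation.maxExecutors"))
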
**Example command with third-party libraries**

Data Flow Sessions support `custom dependencies <https://docs.oracle.com/iaas/data-flow/using/third-party-libraries.htm>`__ in the form of Python wheels or virtual environments. You might want to make native code or other assets available within your Spark runtime. The dependencies can be attached by using the ``archiveUri`` attribute.

.. code-block:: python

    %create_session -l python -c '{\
        "compartmentId":"<compartment_id>",\
        "displayName":"TestDataFlowSession",\
        "sparkVersion":"3.2.1",\
        "driverShape":"VM.Standard.E4.Flex",\
        "executorShape":"VM.Standard.E4.Flex",\
        "numExecutors":1,\
        "driverShapeConfig":{"ocpus":1,"memoryInGBs":16},\
        "executorShapeConfig":{"ocpus":1,"memoryInGBs":16},\
        "archiveUri":"oci://<bucket_name>@<namespace>/<zip_archive>",\
        "logsBucketUri" : "oci://<bucket_name>@<namespace>/"}'

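To confirm that the attached archive is visible to the cluster, you can try importing something it contains. The package name below is purely illustrative and not part of the original example; replace it with a module that your archive actually ships:

.. code-block:: python

    %%spark
    # Illustrative only: replace `my_custom_package` with a module bundled
    # in the archive referenced by "archiveUri".
    import my_custom_package
    print(my_custom_package.__file__)
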
**Example command with the Data Catalog Hive Metastore**

The `Data Catalog Hive Metastore <https://docs.oracle.com/iaas/data-catalog/using/metastore.htm>`__ provides schema definitions for objects in structured and unstructured data assets. Use the ``metastoreId`` to access the Data Catalog Metastore.

.. code-block:: python

    %create_session -l python -c '{\
        "compartmentId":"<compartment_id>",\
        "displayName":"TestDataFlowSession",\
        "sparkVersion":"3.2.1",\
        "driverShape":"VM.Standard.E4.Flex",\
        "executorShape":"VM.Standard.E4.Flex",\
        "numExecutors":1,\
        "driverShapeConfig":{"ocpus":1,"memoryInGBs":16},\
        "executorShapeConfig":{"ocpus":1,"memoryInGBs":16},\
        "metastoreId": "<ocid1.datacatalogmetastore...>",\
        "logsBucketUri" : "oci://<bucket_name>@<namespace>/"}'

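Once a session is attached to the metastore, Spark SQL can see the databases and tables it defines. A minimal check, assuming the session above was created with a valid ``metastoreId``:

.. code-block:: python

    %%spark
    # List the databases registered in the attached Data Catalog Metastore.
    spark.sql("SHOW DATABASES").show()
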
**Example command with the published conda environment**

You can use a published conda environment as a Data Flow runtime environment.

* `Creating a Custom Conda Environment <https://docs.oracle.com/iaas/data-science/using/conda_create_conda_env.htm>`__
* `How to create a new conda environment in OCI Data Science <https://blogs.oracle.com/ai-and-datascience/post/creating-a-new-conda-environment-from-scratch-in-oci-data-science>`__
* `Publishing a Conda Environment to an Object Storage Bucket in Your Tenancy <https://docs.oracle.com/en-us/iaas/data-science/using/conda_publishs_object.htm#:~:text=You%20can%20publish%20a%20conda%20environment%20that%20you%20have%20installed,persist%20them%20across%20notebook%20sessions.>`__

The path to the published conda environment can be copied from the `Environment Explorer <https://docs.oracle.com/iaas/data-science/using/conda_viewing.htm>`__.

Example path: ``oci://<your-bucket>@<your-tenancy-namespace>/conda_environments/cpu/PySpark 3.2 and Data Flow/1.0/pyspark32_p38_cpu_v1#conda``

.. code-block:: python

    %create_session -l python -c '{\
        "compartmentId":"<compartment_id>",\
        "displayName":"TestDataFlowSession",\
        "sparkVersion":"3.2.1",\
        "driverShape":"VM.Standard.E4.Flex",\
        "executorShape":"VM.Standard.E4.Flex",\
        "numExecutors":1,\
        "driverShapeConfig":{"ocpus":1,"memoryInGBs":16},\
        "executorShapeConfig":{"ocpus":1,"memoryInGBs":16},\
        "logsBucketUri" : "oci://<bucket_name>@<namespace>/",\
        "configuration":{\
            "spark.archives": "oci://<your-bucket>@<your-tenancy-namespace>/conda_environments/cpu/PySpark 3.2 and Data Flow/1.0/pyspark32_p38_cpu_v1#conda"}}'

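As an optional sanity check after the session starts, you can confirm which Python interpreter the remote Spark driver is using; with a published environment attached via ``spark.archives``, it should come from that environment. This sketch relies only on the standard library:

.. code-block:: python

    %%spark
    # Print the Python version and executable in use on the remote driver to
    # verify that the environment referenced by "spark.archives" was picked up.
    import sys
    print(sys.version)
    print(sys.executable)
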
Update Session
**************

You can modify the configuration of your running session using the ``%update_session`` command. For example, Data Flow sessions can last up to 7 days or 10,080 mins (168 hours) (**maxDurationInMinutes**) and have a default idle timeout value of 480 mins (8 hours) (**idleTimeoutInMinutes**). Only those two values can be updated on a running cluster without re-creating the cluster.

.. code-block:: python

    %update_session -i '{"maxDurationInMinutes": 1440, "idleTimeoutInMinutes": 420}'

Configure Session
*****************

The existing session can be reconfigured with the ``%configure_session`` command. The new configuration will be applied the next time the session is started. Use the force flag ``-f`` to immediately drop and recreate the running cluster session.

.. code-block:: python

    %configure_session -f -i '{\
        "driverShape":"VM.Standard.E4.Flex",\
        "executorShape":"VM.Standard.E4.Flex",\
        "numExecutors":2,\
        "driverShapeConfig":{"ocpus":1,"memoryInGBs":16},\
        "executorShapeConfig":{"ocpus":1,"memoryInGBs":16}}'

Stop Session
************

To stop the current session, use the ``%stop_session`` magic command. You don't need to provide any arguments for this command. The current active cluster will be stopped. All data in memory will be lost.

.. code-block:: python

    %stop_session

Activate Session
****************

To re-activate the existing session, use the ``%activate_session`` magic command. The ``application_id`` can be taken from the console UI.

.. code-block:: python

    %activate_session -l python -c '{\
        "compartmentId":"<compartment_id>",\
        "displayName":"TestDataFlowSession",\
        "applicationId":"<application_id>"}'

Use Existing Session
********************

To connect to an existing session, use the ``%use_session`` magic command.

.. code-block:: python

    %use_session -s "<application_id>"

Basic Spark Usage Examples
==========================

A SparkContext (``sc``) and HiveContext (``sqlContext``) are automatically created in the session cluster. The magic commands include the ``%%spark`` command to run Spark commands in the cluster. You can access information about the Spark application, define a dataframe where results are to be stored, modify the configuration, and so on.

The ``%%spark`` magic command comes with a number of parameters that allow you to interact with the Data Flow Spark cluster. Any cell content that starts with the ``%%spark`` command will be executed in the remote Spark cluster.

Check the Spark context version:

.. code-block:: python

    %%spark
    print(sc.version)

A toy example of how to use ``sc`` in a Data Flow Spark Magic cell:

.. code-block:: python

    %%spark
    numbers = sc.parallelize([4, 3, 2, 1])
    print(f"First element of numbers is {numbers.first()}")
    print(f"The RDD, numbers, has the following description\n{numbers.toDebugString()}")

Spark SQL
*********

Using the ``-c sql`` option allows you to run Spark SQL commands in a cell. In this section, the `NYC Taxi and Limousine Commission (TLC) Data <https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page>`__ dataset is used. The size of the dataset is around **35GB**.

The next cell reads the dataset into a Spark dataframe and saves it as a temporary view that is used to demonstrate Spark SQL:

.. code-block:: python

    %%spark
    df_nyc_tlc = spark.read.parquet("oci://hosted-ds-datasets@bigdatadatasciencelarge/nyc_tlc/201[1,2,3,4,5,6,7,8]/**/data.parquet", header=False, inferSchema=True)
    df_nyc_tlc.show()
    df_nyc_tlc.createOrReplaceTempView("nyc_tlc")

The following cell uses the ``-c sql`` option to tell Data Flow Spark Magic that the contents of the cell is Spark SQL. The ``-o <variable>`` option takes the results of the Spark SQL operation and stores it in the defined variable. In this case, ``df_nyc_tlc`` will be a Pandas dataframe that is available to be used in the notebook.

.. code-block:: python

    %%spark -c sql -o df_nyc_tlc
    SELECT vendor_id, passenger_count, trip_distance, payment_type FROM nyc_tlc LIMIT 1000;

Check the result:

.. code-block:: python

    print(type(df_nyc_tlc))
    df_nyc_tlc.head()
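
Because ``df_nyc_tlc`` is now a local Pandas dataframe, it can be passed to regular Python libraries in the notebook. The following plot is an illustrative sketch, not part of the original example, and assumes matplotlib is available in your notebook conda environment:

.. code-block:: python

    # Runs locally in the notebook, not on the Data Flow cluster.
    import matplotlib.pyplot as plt

    df_nyc_tlc["trip_distance"].plot(kind="hist", bins=50)
    plt.title("Trip distance for the 1,000 sampled rows")
    plt.xlabel("miles")
    plt.show()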

docs/source/user_guide/apachespark/spark.rst

Lines changed: 2 additions & 1 deletion
@@ -14,11 +14,12 @@ Apache Spark
 
 .. toctree::
     :maxdepth: 2
-
+
     quickstart
     setup-installation
     dataflow
     spark-defaults_conf
     ../data_catalog_metastore/index
+    dataflow-spark-magic
     ../data_flow/legacy_dataflow

docs/source/user_guide/cli/authentication.rst

Lines changed: 2 additions & 2 deletions
@@ -21,7 +21,7 @@ You can choose to use the resource principal to authenticate while using the Acc
 
 .. code-block:: python
 
-    import ads
+    import ads
     ads.set_auth(auth='resource_principal')
     compartment_id = os.environ['NB_SESSION_COMPARTMENT_OCID']
     pc = ProjectCatalog(compartment_id=compartment_id)
@@ -39,7 +39,7 @@ Use API Key setup when you are working from a local workstation or on platform w
 
 This is the default method of authentication. You can also authenticate as your own personal IAM user by creating or uploading OCI configuration and API key files inside your notebook session environment. The OCI configuration file contains the necessary credentials to authenticate your user against the model catalog and other OCI services like Object Storage. The example notebook, `api_keys.ipynb` demonstrates how to create these files.
 
-You can follow the steps in `api_keys.ipynb <https://github.com/oracle-samples/oci-data-science-ai-samples/blob/master/ads_notebooks/api_keys.ipynb>` for step by step instruction on setting up API Keys.
+You can follow the steps in `api_keys.ipynb <https://github.com/oracle-samples/oci-data-science-ai-samples/blob/master/notebook_examples/api_keys.ipynb>` for step by step instruction on setting up API Keys.
 
 .. note::
     If you already have an OCI configuration file (``config``) and associated keys, you can upload them directly to the ``/home/datascience/.oci`` directory using the JupyterLab **Upload Files** or the drag-and-drop option.
