
Commit a0e0fbc

Merge pull request #19 from oracle/2.6.8_docs
Adds Data Flow Studio section to the docs.
2 parents 88dd9b8 + 6e0d794


10 files changed: +336 −27 lines changed


docs/source/index.rst

Lines changed: 12 additions & 9 deletions
@@ -2,12 +2,20 @@
 .. meta::
     :description lang=en:
         Oracle Accelerated Data Science SDK (ORACLE-ADS)
-        is a Python library that is part of the Oracle Cloud Infrastructure Data Science service. ORACLE-ADS is the client
-        library and CLI for Machine learning engineers to work with Cloud Infrastructure (CPU and GPU VMs, Storage etc, Spark) for Data, Models,
+        is a Python library that is part of the Oracle Cloud Infrastructure Data Science service. ORACLE-ADS is the client
+        library and CLI for Machine learning engineers to work with Cloud Infrastructure (CPU and GPU VMs, Storage etc, Spark) for Data, Models,
         Notebooks, Pipelines and Jobs.
 
 Oracle Accelerated Data Science SDK (ADS)
 =========================================
+|PyPI|_ |Python|_ |Notebook Examples|_
+
+.. |PyPI| image:: https://img.shields.io/pypi/v/oracle-ads.svg
+.. _PyPI: https://pypi.org/project/oracle-ads/
+.. |Python| image:: https://img.shields.io/pypi/pyversions/oracle-ads.svg?style=plastic
+.. _Python: https://pypi.org/project/oracle-ads/
+.. |Notebook Examples| image:: https://img.shields.io/badge/docs-notebook--examples-blue
+.. _Notebook Examples: https://github.com/oracle-samples/oci-data-science-ai-samples/tree/master/notebook_examples
 
 .. toctree::
     :hidden:
@@ -81,11 +89,6 @@ Oracle Accelerated Data Science SDK (ADS)
 
     python3 -m pip install oracle-ads
 
-|PyPI|_
-
-.. |PyPI| image:: https://img.shields.io/pypi/v/oracle-ads.svg
-.. _PyPI: https://pypi.org/project/oracle-ads/
-
 
 .. admonition:: Source Code
 
@@ -138,7 +141,7 @@ Load data from Object Storage
 
 
 
-Load data from Autonomous DB
+Load data from Autonomous DB
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 This example uses SQL injection safe binding variables.
@@ -165,7 +168,7 @@ This example uses SQL injection safe binding variables.
     connection_parameters=connection_parameters,
 )
 
-More Examples
+More Examples
 ~~~~~~~~~~~~~
 
 See :doc:`quick start<user_guide/quick_start/quick_start>` guide for more examples
Lines changed: 0 additions & 1 deletion
@@ -1,3 +1,2 @@
 * Data Catalog requires policies to be set in IAM. Refer to the Data Catalog documentation on how to `setup policies <https://docs.oracle.com/en-us/iaas/data-catalog/using/metastore.htm>`__.
-* The ``spark-defaults.conf`` file needs to be :ref:`configured<configuration-spark_defaults_conf>`.
 
Lines changed: 2 additions & 3 deletions
@@ -1,4 +1,3 @@
 * DataFlow requires a bucket to store the logs, and a data warehouse bucket. Refer to the Data Flow documentation for `setting up storage <https://docs.cloud.oracle.com/en-us/iaas/data-flow/using/dfs_getting_started.htm#set_up_storage>`_.
-* DataFlow requires policies to be set in IAM to access resources to manage and run applications. Refer to the Data Flow documentation on how to `setup policies <https://docs.cloud.oracle.com/en-us/iaas/data-flow/using/dfs_getting_started.htm#policy_set_up>`__.
-* DataFlow natively supports conda packs published to OCI Object Storage. Ensure the Data Flow Resource has read access to the bucket or path of your published conda pack, and that the spark version >= 3 when running your Data Flow Application.
-* The ``core-site.xml`` file needs to be :ref:`configured<configuration-core_site_xml>`.
+* DataFlow requires policies to be set in IAM to access resources to manage and run applications/sessions. Refer to the Data Flow documentation on how to `setup policies <https://docs.cloud.oracle.com/en-us/iaas/data-flow/using/dfs_getting_started.htm#policy_set_up>`__.
+* DataFlow natively supports conda packs published to OCI Object Storage. Ensure the Data Flow Resource has read access to the bucket or path of your published conda pack, and that the spark version >= 3 when running your Data Flow Application/Session.
Lines changed: 298 additions & 0 deletions
@@ -0,0 +1,298 @@

####################
OCI Data Flow Studio
####################

This section demonstrates how to run interactive Spark workloads on a long-lasting `Oracle Cloud Infrastructure Data Flow <https://docs.oracle.com/iaas/data-flow/using/home.htm>`__ cluster through `Apache Livy <https://livy.apache.org/>`__ integration.

**Data Flow Studio allows you to:**

* Run Spark code against a Data Flow remote Spark cluster
* Create a Data Flow Spark session with SparkContext and HiveContext against a Data Flow remote Spark cluster
* Capture the output of Spark queries as a local Pandas data frame to interact easily with other Python libraries (e.g. matplotlib)

**Key Features & Benefits:**

* Data Flow sessions support auto-scaling Data Flow cluster capabilities
* Data Flow sessions support the use of conda environments as customizable Spark runtime environments

**Limitations:**

* Data Flow sessions can last up to 7 days or 10,080 mins (maxDurationInMinutes).
* Data Flow sessions can only be accessed through OCI Data Science Notebook Sessions.
* Not all SparkMagic commands are currently supported. To see the full list, run the ``%help`` command in a notebook cell.

**Notebook Examples:**

* `Introduction to the Oracle Cloud Infrastructure Data Flow Studio <https://github.com/oracle-samples/oci-data-science-ai-samples/blob/master/notebook_examples/pyspark-data_flow_studio-introduction.ipynb>`__
* `Spark NLP within Oracle Cloud Infrastructure Data Flow Studio <https://github.com/oracle-samples/oci-data-science-ai-samples/blob/master/notebook_examples/pyspark-data_flow_studio-spark_nlp.ipynb>`__

Prerequisite
============

Data Flow Sessions are accessible through the following conda environment:

* PySpark 3.2 and Data Flow 1.0 (pyspark32_p38_cpu_v1)

You can customize **pyspark32_p38_cpu_v1**, publish it, and use it as a runtime environment for a Data Flow Session.

Policies
********

Data Flow requires policies to be set in IAM to access resources to manage and run sessions. Refer to the `Data Flow Studio Policies <https://docs.oracle.com/en-us/iaas/data-flow/using/dfs_getting_started.htm#policies-data-flow-studio>`__ documentation on how to set up policies.

Quick Start
===========

.. code-block:: python

    import ads
    ads.set_auth("resource_principal")

    %load_ext dataflow.magics

    %create_session -l python -c '{\
        "compartmentId":"<compartment_id>",\
        "displayName":"TestDataFlowSession",\
        "sparkVersion":"3.2.1",\
        "driverShape":"VM.Standard.E4.Flex",\
        "executorShape":"VM.Standard.E4.Flex",\
        "numExecutors":1,\
        "driverShapeConfig":{"ocpus":1,"memoryInGBs":16},\
        "executorShapeConfig":{"ocpus":1,"memoryInGBs":16},\
        "logsBucketUri" : "oci://<bucket_name>@<namespace>/"}'

    %%spark
    print(sc.version)

Data Flow Spark Magic
=====================

Data Flow Spark Magic is used for interactively working with remote Spark clusters through Livy, a Spark REST server, in Jupyter notebooks. It is a JupyterLab extension that you need to activate in your notebook.

.. code-block:: python

    %load_ext dataflow.magics

Use the ``%help`` command to get a list of all the available commands, along with a list of their arguments and example calls.

.. code-block:: python

    %help

.. admonition:: Tip

    To access the docstrings of any magic command and figure out what arguments to provide, simply add ``?`` at the end of the command. For instance: ``%create_session?``

Create Session
**************

**Example command for Flex shapes**

To create a new Data Flow cluster session, use the ``%create_session`` magic command.

.. code-block:: python

    %create_session -l python -c '{\
        "compartmentId":"<compartment_id>",\
        "displayName":"TestDataFlowSession",\
        "sparkVersion":"3.2.1",\
        "driverShape":"VM.Standard.E4.Flex",\
        "executorShape":"VM.Standard.E4.Flex",\
        "numExecutors":1,\
        "driverShapeConfig":{"ocpus":1,"memoryInGBs":16},\
        "executorShapeConfig":{"ocpus":1,"memoryInGBs":16},\
        "logsBucketUri" : "oci://<bucket_name>@<namespace>/"}'

**Example command for Spark dynamic allocation (aka auto-scaling)**

To help you save resources and reduce time on management, Spark `dynamic allocation <https://docs.oracle.com/iaas/data-flow/using/dynamic-alloc-about.htm#dynamic-alloc-about>`__ is now enabled in Data Flow. You can define a Data Flow cluster based on a range of executors, instead of just a fixed number of executors. Spark provides a mechanism to dynamically adjust the resources the application occupies based on the workload. The application might relinquish resources if they are no longer used and request them again later when there is demand.

.. code-block:: python

    %create_session -l python -c '{\
        "compartmentId":"<compartment_id>",\
        "displayName":"TestDataFlowSession",\
        "sparkVersion":"3.2.1",\
        "driverShape":"VM.Standard.E4.Flex",\
        "executorShape":"VM.Standard.E4.Flex",\
        "numExecutors":1,\
        "driverShapeConfig":{"ocpus":1,"memoryInGBs":16},\
        "executorShapeConfig":{"ocpus":1,"memoryInGBs":16},\
        "logsBucketUri" : "oci://<bucket_name>@<namespace>/",\
        "configuration":{\
            "spark.dynamicAllocation.enabled":"true",\
            "spark.dynamicAllocation.shuffleTracking.enabled":"true",\
            "spark.dynamicAllocation.minExecutors":"1",\
            "spark.dynamicAllocation.maxExecutors":"4",\
            "spark.dynamicAllocation.executorIdleTimeout":"60",\
            "spark.dynamicAllocation.schedulerBacklogTimeout":"60",\
            "spark.dataflow.dynamicAllocation.quotaPolicy":"min"}}'

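Once the session is running, you can optionally confirm that the dynamic allocation settings were applied by reading them back from the remote Spark configuration. This is a minimal sketch added for illustration, assuming the session above was created successfully:

.. code-block:: python

    %%spark
    # Read the dynamic allocation settings back from the remote Spark context
    # to confirm that the "configuration" block above took effect.
    print(sc.getConf().get("spark.dynamicAllocation.enabled"))
    print(sc.getConf().get("spark.dynamicAllocation.maxExecutors"))
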
**Example command with third-party libraries**

Data Flow Sessions support `custom dependencies <https://docs.oracle.com/iaas/data-flow/using/third-party-libraries.htm>`__ in the form of Python wheels or virtual environments. You might want to make native code or other assets available within your Spark runtime. The dependencies can be attached by using the ``archiveUri`` attribute.

.. code-block:: python

    %create_session -l python -c '{\
        "compartmentId":"<compartment_id>",\
        "displayName":"TestDataFlowSession",\
        "sparkVersion":"3.2.1",\
        "driverShape":"VM.Standard.E4.Flex",\
        "executorShape":"VM.Standard.E4.Flex",\
        "numExecutors":1,\
        "driverShapeConfig":{"ocpus":1,"memoryInGBs":16},\
        "executorShapeConfig":{"ocpus":1,"memoryInGBs":16},\
        "archiveUri":"oci://<bucket_name>@<namespace>/<zip_archive>",\
        "logsBucketUri" : "oci://<bucket_name>@<namespace>/"}'

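To confirm that the attached archive is visible to the cluster, you can try importing something it contains. The package name below is purely illustrative and not part of the original example; replace it with a module that your archive actually ships:

.. code-block:: python

    %%spark
    # Illustrative only: replace `my_custom_package` with a module bundled
    # in the archive referenced by "archiveUri".
    import my_custom_package
    print(my_custom_package.__file__)
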
**Example command with the Data Catalog Hive Metastore**

The `Data Catalog Hive Metastore <https://docs.oracle.com/iaas/data-catalog/using/metastore.htm>`__ provides schema definitions for objects in structured and unstructured data assets. Use the ``metastoreId`` to access the Data Catalog Metastore.

.. code-block:: python

    %create_session -l python -c '{\
        "compartmentId":"<compartment_id>",\
        "displayName":"TestDataFlowSession",\
        "sparkVersion":"3.2.1",\
        "driverShape":"VM.Standard.E4.Flex",\
        "executorShape":"VM.Standard.E4.Flex",\
        "numExecutors":1,\
        "driverShapeConfig":{"ocpus":1,"memoryInGBs":16},\
        "executorShapeConfig":{"ocpus":1,"memoryInGBs":16},\
        "metastoreId": "<ocid1.datacatalogmetastore...>",\
        "logsBucketUri" : "oci://<bucket_name>@<namespace>/"}'

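Once a session is attached to the metastore, Spark SQL can see the databases and tables it defines. A minimal check, assuming the session above was created with a valid ``metastoreId``:

.. code-block:: python

    %%spark
    # List the databases registered in the attached Data Catalog Metastore.
    spark.sql("SHOW DATABASES").show()
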
**Example command with the published conda environment**

You can use a published conda environment as a Data Flow runtime environment.

* `Creating a Custom Conda Environment <https://docs.oracle.com/iaas/data-science/using/conda_create_conda_env.htm>`__
* `How to create a new conda environment in OCI Data Science <https://blogs.oracle.com/ai-and-datascience/post/creating-a-new-conda-environment-from-scratch-in-oci-data-science>`__
* `Publishing a Conda Environment to an Object Storage Bucket in Your Tenancy <https://docs.oracle.com/en-us/iaas/data-science/using/conda_publishs_object.htm#:~:text=You%20can%20publish%20a%20conda%20environment%20that%20you%20have%20installed,persist%20them%20across%20notebook%20sessions.>`__

The path to the published conda environment can be copied from the `Environment Explorer <https://docs.oracle.com/iaas/data-science/using/conda_viewing.htm>`__.

Example path: ``oci://<your-bucket>@<your-tenancy-namespace>/conda_environments/cpu/PySpark 3.2 and Data Flow/1.0/pyspark32_p38_cpu_v1#conda``

.. code-block:: python

    %create_session -l python -c '{\
        "compartmentId":"<compartment_id>",\
        "displayName":"TestDataFlowSession",\
        "sparkVersion":"3.2.1",\
        "driverShape":"VM.Standard.E4.Flex",\
        "executorShape":"VM.Standard.E4.Flex",\
        "numExecutors":1,\
        "driverShapeConfig":{"ocpus":1,"memoryInGBs":16},\
        "executorShapeConfig":{"ocpus":1,"memoryInGBs":16},\
        "logsBucketUri" : "oci://<bucket_name>@<namespace>/",\
        "configuration":{\
            "spark.archives": "oci://<your-bucket>@<your-tenancy-namespace>/conda_environments/cpu/PySpark 3.2 and Data Flow/1.0/pyspark32_p38_cpu_v1#conda"}}'

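As an optional sanity check after the session starts, you can confirm which Python interpreter the remote Spark driver is using; with a published environment attached via ``spark.archives``, it should come from that environment. This sketch relies only on the standard library:

.. code-block:: python

    %%spark
    # Print the Python version and executable in use on the remote driver to
    # verify that the environment referenced by "spark.archives" was picked up.
    import sys
    print(sys.version)
    print(sys.executable)
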
Update Session
**************

You can modify the configuration of your running session using the ``%update_session`` command. For example, Data Flow sessions can last up to 7 days or 10,080 mins (168 hours) (**maxDurationInMinutes**) and have a default idle timeout value of 480 mins (8 hours) (**idleTimeoutInMinutes**). Only those two values can be updated on a running cluster without re-creating the cluster.

.. code-block:: python

    %update_session -i '{"maxDurationInMinutes": 1440, "idleTimeoutInMinutes": 420}'

Configure Session
*****************

The existing session can be reconfigured with the ``%configure_session`` command. The new configuration will be applied the next time the session is started. Use the force flag ``-f`` to immediately drop and recreate the running cluster session.

.. code-block:: python

    %configure_session -f -i '{\
        "driverShape":"VM.Standard.E4.Flex",\
        "executorShape":"VM.Standard.E4.Flex",\
        "numExecutors":2,\
        "driverShapeConfig":{"ocpus":1,"memoryInGBs":16},\
        "executorShapeConfig":{"ocpus":1,"memoryInGBs":16}}'

Stop Session
************

To stop the current session, use the ``%stop_session`` magic command. You don't need to provide any arguments for this command. The current active cluster will be stopped. All data in memory will be lost.

.. code-block:: python

    %stop_session

Activate Session
****************

To re-activate the existing session, use the ``%activate_session`` magic command. The ``application_id`` can be taken from the console UI.

.. code-block:: python

    %activate_session -l python -c '{\
        "compartmentId":"<compartment_id>",\
        "displayName":"TestDataFlowSession",\
        "applicationId":"<application_id>"}'

Use Existing Session
********************

To connect to an existing session, use the ``%use_session`` magic command.

.. code-block:: python

    %use_session -s "<application_id>"

Basic Spark Usage Examples
==========================

A SparkContext (``sc``) and HiveContext (``sqlContext``) are automatically created in the session cluster. The magic commands include the ``%%spark`` command to run Spark commands in the cluster. You can access information about the Spark application, define a dataframe where results are to be stored, modify the configuration, and so on.

The ``%%spark`` magic command comes with a number of parameters that allow you to interact with the Data Flow Spark cluster. Any cell content that starts with the ``%%spark`` command will be executed in the remote Spark cluster.

Check the Spark context version:

.. code-block:: python

    %%spark
    print(sc.version)

A toy example of how to use ``sc`` in a Data Flow Spark Magic cell:

.. code-block:: python

    %%spark
    numbers = sc.parallelize([4, 3, 2, 1])
    print(f"First element of numbers is {numbers.first()}")
    print(f"The RDD, numbers, has the following description\n{numbers.toDebugString()}")

Spark SQL
*********

Using the ``-c sql`` option allows you to run Spark SQL commands in a cell. In this section, the `NYC Taxi and Limousine Commission (TLC) Data <https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page>`__ dataset is used. The size of the dataset is around **35GB**.

The next cell reads the dataset into a Spark dataframe and saves it as a temporary view that is used to demonstrate Spark SQL:

.. code-block:: python

    %%spark
    df_nyc_tlc = spark.read.parquet("oci://hosted-ds-datasets@bigdatadatasciencelarge/nyc_tlc/201[1,2,3,4,5,6,7,8]/**/data.parquet", header=False, inferSchema=True)
    df_nyc_tlc.show()
    df_nyc_tlc.createOrReplaceTempView("nyc_tlc")

The following cell uses the ``-c sql`` option to tell Data Flow Spark Magic that the contents of the cell is Spark SQL. The ``-o <variable>`` option takes the results of the Spark SQL operation and stores it in the defined variable. In this case, ``df_nyc_tlc`` will be a Pandas dataframe that is available to be used in the notebook.

.. code-block:: python

    %%spark -c sql -o df_nyc_tlc
    SELECT vendor_id, passenger_count, trip_distance, payment_type FROM nyc_tlc LIMIT 1000;

Check the result:

.. code-block:: python

    print(type(df_nyc_tlc))
    df_nyc_tlc.head()
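
Because ``df_nyc_tlc`` is now a local Pandas dataframe, it can be passed to regular Python libraries in the notebook. The following plot is an illustrative sketch, not part of the original example, and assumes matplotlib is available in your notebook conda environment:

.. code-block:: python

    # Runs locally in the notebook, not on the Data Flow cluster.
    import matplotlib.pyplot as plt

    df_nyc_tlc["trip_distance"].plot(kind="hist", bins=50)
    plt.title("Trip distance for the 1,000 sampled rows")
    plt.xlabel("miles")
    plt.show()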

docs/source/user_guide/apachespark/spark.rst

Lines changed: 2 additions & 1 deletion
@@ -14,11 +14,12 @@ Apache Spark
 
 .. toctree::
     :maxdepth: 2
-
+
     quickstart
     setup-installation
     dataflow
     spark-defaults_conf
     ../data_catalog_metastore/index
+    dataflow-spark-magic
     ../data_flow/legacy_dataflow

docs/source/user_guide/cli/authentication.rst

Lines changed: 2 additions & 2 deletions
@@ -21,7 +21,7 @@ You can choose to use the resource principal to authenticate while using the Acc
 
 .. code-block:: python
 
-    import ads
+    import ads
     ads.set_auth(auth='resource_principal')
     compartment_id = os.environ['NB_SESSION_COMPARTMENT_OCID']
     pc = ProjectCatalog(compartment_id=compartment_id)
@@ -39,7 +39,7 @@ Use API Key setup when you are working from a local workstation or on platform w
 
 This is the default method of authentication. You can also authenticate as your own personal IAM user by creating or uploading OCI configuration and API key files inside your notebook session environment. The OCI configuration file contains the necessary credentials to authenticate your user against the model catalog and other OCI services like Object Storage. The example notebook, `api_keys.ipynb` demonstrates how to create these files.
 
-You can follow the steps in `api_keys.ipynb <https://github.com/oracle-samples/oci-data-science-ai-samples/blob/master/ads_notebooks/api_keys.ipynb>` for step by step instruction on setting up API Keys.
+You can follow the steps in `api_keys.ipynb <https://github.com/oracle-samples/oci-data-science-ai-samples/blob/master/notebook_examples/api_keys.ipynb>` for step by step instruction on setting up API Keys.
 
 .. note::
     If you already have an OCI configuration file (``config``) and associated keys, you can upload them directly to the ``/home/datascience/.oci`` directory using the JupyterLab **Upload Files** or the drag-and-drop option.
