## Getting Started with Spark-tk
Spark-tk is supported in TAP versions 0.7.3 and 0.7.4.
Spark-tk is an analytics toolkit library that is compatible with Apache Spark. It provides APIs for Python and Scala. This page explains how to use Spark-tk in TAP.
Visit https://github.com/trustedanalytics/spark-tk for additional information.
The easiest way to get started with Spark-tk on TAP is within a Jupyter notebook, as follows:
- First, create a Jupyter notebook.
- Open your Jupyter instance and navigate to /examples/tklibs/sparktk/README.ipynb
- The README notebook demonstrates how to create a TkContext for Spark-tk and contains some simple Spark-tk code (a minimal sketch appears below).
The other example notebooks show how to use Datacatalog, Frame, Latent Dirichlet Allocation, and Logistic Regression with Spark-tk.
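For reference, here is a minimal sketch of the kind of code the README notebook walks through: creating a TkContext and a small frame. The sample data, column names, and the inspect call are illustrative, not copied from the notebook itself.

```python
import sparktk

# Create a TkContext; on TAP this picks up the cluster's Spark configuration.
tc = sparktk.TkContext()

# Build a small frame from in-memory rows with an explicit schema.
# The rows and schema here are made-up sample values.
frame = tc.frame.create(
    data=[[1, "a"], [2, "b"], [3, "c"]],
    schema=[("id", int), ("letter", str)]
)

# Show the first few rows to confirm the frame was created.
print(frame.inspect())
```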
More information about Spark is available on the Apache Spark website (https://spark.apache.org/).
### Accessing a terminal from Jupyter
- From the Jupyter dashboard, select the New button located in the upper right.
- Select Terminal from the submenu to open a new terminal within Jupyter.
You can enter CLI commands in the terminal window.
## Troubleshooting Tips
Q: I am using Spark-tk and want to save files/export models to my local file system instead of HDFS. How do I do that?
The SparkContext created by TkContext follows the system's current Spark configuration. If your system defaults to HDFS, but you want to use a local file system instead, include use_local_fs=True when creating your TkContext, as follows:
import sparktk
tc = sparktk.TkContext(use_local_fs=True)
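For illustration, here is a hedged end-to-end sketch of saving to the local file system. It assumes the frame save and tc.load calls behave as in the spark-tk Frame API, and the path sandbox/my_frame is just an example.

```python
import sparktk

# use_local_fs=True makes paths resolve against the local file system
# instead of HDFS (per the note above).
tc = sparktk.TkContext(use_local_fs=True)

# Example frame with made-up data.
frame = tc.frame.create(
    data=[[1, 0.5], [2, 1.5]],
    schema=[("id", int), ("value", float)]
)

# Save the frame to a local path rather than HDFS (path is illustrative).
frame.save("sandbox/my_frame")

# Later, load it back through the same TkContext.
restored = tc.load("sandbox/my_frame")
```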
If you are switching from ATK to Spark-tk, additional information is available at https://github.com/trustedanalytics/platform-wiki-0.7/wiki/Switching-from-Analytics-Toolkit-to-spark-tk-Library.