
0.7.0 Release Notes

Marek Slesicki edited this page Jul 27, 2016 · 1 revision


New Features

Time Series Analysis

Time series analysis is introduced. It comprises methods for analyzing time series data in order to extract meaningful statistics and other characteristics of the data. Specifically, two of the most popular time series algorithms are now available:

  • ARIMA: AutoRegressive Integrated Moving Average
  • ARX: AutoRegressive with eXogenous variables

Time series analysis enables the analysis of data sequences made of successive measurements taken at equally spaced points across a time interval. This is typical of IoT use cases, where one needs to predict events based on seasonal patterns (daily, weekly, monthly, etc.) and sensor measurements.
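
The autoregressive idea behind both ARIMA and ARX can be illustrated with a minimal AR(1) fit. This is a plain-NumPy sketch of the concept, not the TAP API:

```python
import numpy as np

# Generate a synthetic AR(1) series: x[t] = 0.8 * x[t-1] + noise,
# sampled at equal time intervals as described above.
rng = np.random.RandomState(0)
n = 500
x = np.zeros(n)
for t in range(1, n):
    x[t] = 0.8 * x[t - 1] + rng.normal(scale=0.1)

# Estimate the AR coefficient by least squares on lagged pairs.
phi = np.dot(x[:-1], x[1:]) / np.dot(x[:-1], x[:-1])

# One-step-ahead forecast from the last observation.
forecast = phi * x[-1]
print(round(phi, 2))  # close to the true coefficient 0.8
```

ARIMA additionally handles differencing and moving-average terms, and ARX adds exogenous inputs, but both rest on this regression-on-the-past idea.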

Scoring Pipeline

Scoring Pipeline simplifies the deployment of machine learning models, trained by data scientists during model development, to production, where application developers use them to build analytics-driven applications. It uses Python User Defined Functions (UDFs) to perform data processing and feature extraction on an incoming stream of observations, followed by the Scoring Engine, which produces prediction results based on a previously trained model.
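
Conceptually, the pipeline chains a Python UDF stage with a scoring stage. The following is an illustrative sketch of that flow; the function names and the hard-coded model are hypothetical, not the TAP API:

```python
# Hypothetical sketch of the processing-then-scoring flow described above.
# In the real pipeline, Python UDFs feed the Scoring Engine, which loads
# a previously trained model rather than the hard-coded weights used here.

def extract_features(observation):
    """UDF stage: turn a raw observation into model-ready features."""
    return [observation["temperature"], observation["humidity"] / 100.0]

def score(features, weights=(0.5, 2.0), bias=-12.0):
    """Scoring stage: apply a (here: hard-coded) linear model."""
    return sum(w * f for w, f in zip(weights, features)) + bias

# An incoming observation from the production stream:
obs = {"temperature": 21.5, "humidity": 60}
prediction = score(extract_features(obs))
```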

PySpark

PySpark is the Python interface to Spark. The PySpark terminal is accessible to end users through the Jupyter notebook. The popular Anaconda Python distribution for Python 2.7 is also available through PySpark, along with native Spark libraries. Spark-Submit is supported in both YARN Client and Cluster modes. However, AT is not available through PySpark.

Note: PySpark is a preview feature, as it only works when Kerberos security is turned off.

OrientDB

OrientDB version 2.2 (as of May 2016) is available in the TAP Marketplace. It is the graph database for TAP Analytics: it supports importing graphs (Parquet Graph) from OrientDB to perform analytics and exporting the resulting graphs back to OrientDB. Users can also browse the data using the OrientDB dashboard.

Data Import Scheduler

Users can schedule a data import from a PostgreSQL database. The scheduler enables users to define the data source, provide credentials, and specify which table to import, when to start the import, and how frequently to update the data later on. The data is imported into HDFS and is available for use by TAP services and analytics.
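
A schedule definition might look like the following JSON sketch. All field names here are illustrative, not the actual scheduler API:

```json
{
  "source": "postgresql://db.example.com:5432/sales",
  "credentials": {"user": "importer", "password": "<secret>"},
  "table": "orders",
  "start": "2016-08-01T02:00:00Z",
  "frequency": "daily",
  "target": "hdfs:///org/imports/orders"
}
```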

Apache Gearpump

Apache Gearpump is a lightweight real-time big data streaming engine. Apache Gearpump is inspired by recent advances in the Akka framework and a desire to improve on existing streaming frameworks. Gearpump is now integrated with TAP so that users can deploy Apache Gearpump applications directly from TAP console.

This real-time big data streaming engine is now better integrated with TAP 0.7:

  • SSO is implemented so that a user can seamlessly move between the TAP Console and the GearPump Dashboard without requiring a second login,
  • One-step DAG deployment from the TAP Console enables users to upload JARs and configuration files, collect TAP’s service metadata, enter user-defined parameters, and push applications to the Gearpump cluster.

The latest version of Gearpump (0.8.0) is used, which improves the resilience of YARN deployment to timeouts and resource shortages.

New plans were also introduced so that users can choose the size of the cluster to be created.

Finally, we are proud to announce that Gearpump is now under the Apache Software Foundation incubation umbrella (http://gearpump.apache.org/).

EnableIoT ready for TAP

EnableIoT is now ready for the new TAP and Gearpump 0.8.0. The code is now open source and includes the installation scripts that allow TAP users to add the EnableIoT platform to their organization.

A new Rule Engine, running on Gearpump, has been created to collect and analyze the time series data used in EnableIoT.

Application Broker

A new UI option is now provided to make applications available in the marketplace. After pushing an application to TAP and making it available in the Applications menu, a user can select the application and publish it as a marketplace item as well. This enables users to share and reuse applications within an organization.

Additionally, the user is now able to configure any internal component of the group being provisioned. Previously, only the root application was configurable. Now, by passing a JSON object and providing the proper namespaces, all dependencies can be populated with their configuration values.
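
For illustration, a JSON object using namespaces to address the root application and its internal dependencies might look like the following. The key names below are hypothetical, not the broker's actual schema:

```json
{
  "name": "my-analytics-stack",
  "configuration": {
    "root/memory": "1G",
    "database/plan": "shared",
    "cache/ttl_seconds": "300"
  }
}
```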

Version Tracking

TAP Console exposes versions of core platform components (CDH, CF and core TAP artifacts).

Moreover, there is an option for administrators to access snapshots (taken on a schedule) of all artifacts used in TAP. Snapshots include information about running applications, services in the marketplace, and services in CDH.

Automated Health Check

A TAP platform administrator can execute a set of automated tests that examine the core areas of a platform installation. The test suite, initially deployed with the TAP platform, validates the correctness of the installation and, when run again later, the overall health of the TAP instance.

Improvements

Simplified Deployment

The new simplified deployment procedure greatly streamlines input information gathering and provides full automation of provisioning of all platform components.

TAP can be deployed on Amazon Web Services (AWS) using the AWS CloudFormation (CF) template or on OpenStack using OpenStack Heat Orchestration Template (HOT).

The first step of TAP provisioning uses CF/HOT to gather all installation and configuration parameters and then to create a base TAP infrastructure. Once the infrastructure is in place, provisioning of the entire software platform is executed automatically; Ansible playbooks are used as a homogeneous deployment and configuration automation technology.

The deployment of the application layer has been migrated to a new Python tool called ‘apployer’. It introduces greater robustness along with easier deployment configuration. Additionally, the new deployment tool produces easy-to-analyze logs and simplifies troubleshooting.

Improved Analytics Performance for Small-Medium Data Sets

Users can now work more flexibly with small-to-medium-sized datasets, which are typical when working with sample data and iterating through model creation.

Furthermore, TAP Analytics users now have more data about job progress and programmatic control to release YARN resources, improving the efficiency of their resource usage.

Finally, based on internal testing, the typical performance seen on smaller data sets was up to 4x better on representative jobs run before and after these changes were introduced.

Enhanced TAP Security Model

The TAP Security Model segregates data and access between organizations in TAP. Users can use TAP for different departments or groups of collaborators within their enterprise while guaranteeing that no data is shared between these departments or groups.

Every call regarding organization-level entities (create/delete an org, add/remove a user to/from an org) is intercepted by the auth-gateway, translated, and propagated to Hadoop.

The model consists of multiple brokers (hdfs, hive, yarn, hbase, kerberos, and zookeeper) that provide the configuration to be used by applications wishing to integrate with the TAP security model.

Moreover, the security model supports translating the authentication/authorization mechanisms used in Cloud Foundry to Hadoop: it supports SSO between the CF OAuth and the HDFS Kerberos security mechanisms.

AT Security Integration

The HDFS broker was extended with two additional plans (create-user-directory, get-user-directory) that ensure a technical user is created for an HDFS directory within an organization. The first plan registers a new user in UAA, creates a directory in HDFS, and then returns the credentials necessary to authenticate on the platform (OAuth2). The second plan is used only to retrieve the technical user (if one exists) that can access the specified HDFS directory.

To further automate this process, the application broker supports passing parameters to the service instances it creates. In other words, we can specify that a scoring engine should be started with an instance of get-user-directory and pass the path to an HDFS directory as an argument. During instantiation, the scoring engine is injected with the credentials of the technical user that can access that particular HDFS directory.
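
As a hypothetical illustration (the field names below are not the actual broker API), the parameters passed when creating the scoring engine instance could look like:

```json
{
  "name": "scoring-engine-instance",
  "dependencies": [
    {
      "service": "hdfs",
      "plan": "get-user-directory",
      "parameters": {"path": "/org/my-org/models/fraud-model"}
    }
  ]
}
```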

Key features provided:

  • data isolation at organization level (see TAP Security Model)
  • seamless integration of artifact/model access between AT and Scoring Engine deployed in the same organization
  • user-authenticated submission of YARN jobs

UI/UX Changes

Many changes have been introduced, including:

  • automatic logout from the WebUI after a long period of inactivity,
  • new icons in the marketplace (Yarn, Jupyter, Cassandra),
  • the status of asynchronously created services is now shown in the Console,
  • links to the documentation and community site have been added,
  • users can register and deregister their applications in the marketplace via the WebUI,
  • and other minor improvements and changes.

Automatic Deployment of Demo/Sample Apps

A few of the TAP sample applications can now be automatically deployed along with all their dependencies using a simple Python script. This is useful both for automation and as a starting point for people new to TAP and Cloud Foundry. Further work will include adding the sample applications to a regular TAP deployment.

CDH Upgrade

CDH has been upgraded to version 5.5.1.

Cloud Foundry Upgrade

CF has been upgraded to version 222. See Cloud Foundry release notes for more information.

Browser compatibility

The following web browsers are compatible with TAP 0.7.0 release:

  • Google Chrome
  • Firefox

Fixed Problems & Issues

  1. TRACS-83: Problem in CloudFoundry: provisioning frequently fails during buildpack_php compilation
  2. TRACS-82: Transfer submission fails with 500 Internal Server Error right after deployment

Known Defects

  1. TRACS-137: Shell script to finish the installation doesn't work
  2. TRACS-138: cdh step failed due to arcadia issue with retrieving package
  3. TRACS-139: CDH ansible on new deployment occasionally fails on SSH errors
  4. TRACS-140: Cannot push kafka2hdfs on environments with kerberos
  5. TRACS-141: Creating kubernetes instances fails if cluster is during creation/deletion state
  6. TRACS-143: Kubernetes service exposing not working
