57 changes: 29 additions & 28 deletions docs/user/ppl/cmd/ad.rst
@@ -10,41 +10,43 @@ ad (deprecated by ml command)


Description
============
===========
| The ``ad`` command applies the Random Cut Forest (RCF) algorithm from the ml-commons plugin to the search result returned by a PPL command. Based on the input, the command uses one of two RCF algorithms: fixed in time RCF for processing time-series data, and batch RCF for processing non-time-series data.


Fixed In Time RCF For Time-series Data Command Syntax
=====================================================
ad <number_of_trees> <shingle_size> <sample_size> <output_after> <time_decay> <anomaly_rate> <time_field> <date_format> <time_zone>
Syntax
======

* number_of_trees(integer): optional. Number of trees in the forest. The default value is 30.
* shingle_size(integer): optional. A shingle is a consecutive sequence of the most recent records. The default value is 8.
* sample_size(integer): optional. The sample size used by stream samplers in this forest. The default value is 256.
* output_after(integer): optional. The number of points required by stream samplers before results are returned. The default value is 32.
* time_decay(double): optional. The decay factor used by stream samplers in this forest. The default value is 0.0001.
* anomaly_rate(double): optional. The anomaly rate. The default value is 0.005.
* time_field(string): mandatory. It specifies the time field for RCF to use as time-series data.
* date_format(string): optional. It's used for formatting time_field field. The default formatting is "yyyy-MM-dd HH:mm:ss".
* time_zone(string): optional. It's used for setting time zone for time_field filed. The default time zone is UTC.
* category_field(string): optional. It specifies the category field used to group inputs. Each category will be independently predicted.
Fixed In Time RCF For Time-series Data
--------------------------------------
ad [number_of_trees] [shingle_size] [sample_size] [output_after] [time_decay] [anomaly_rate] <time_field> [date_format] [time_zone] [category_field]

* number_of_trees: optional. Number of trees in the forest. **Default:** 30.
* shingle_size: optional. A shingle is a consecutive sequence of the most recent records. **Default:** 8.
* sample_size: optional. The sample size used by stream samplers in this forest. **Default:** 256.
* output_after: optional. The number of points required by stream samplers before results are returned. **Default:** 32.
* time_decay: optional. The decay factor used by stream samplers in this forest. **Default:** 0.0001.
* anomaly_rate: optional. The anomaly rate. **Default:** 0.005.
* time_field: mandatory. Specifies the time field for RCF to use as time-series data.
* date_format: optional. Used for formatting time_field. **Default:** "yyyy-MM-dd HH:mm:ss".
* time_zone: optional. Used for setting time zone for time_field. **Default:** "UTC".
* category_field: optional. Specifies the category field used to group inputs. Each category will be independently predicted.
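
As a hedged illustration of the syntax above (the ``nyc_taxi`` index and its ``value`` and ``timestamp`` fields are assumptions for the sketch, not part of this documentation), a fixed in time RCF query could look like::

    PPL> source=nyc_taxi | fields value, timestamp | ad time_field='timestamp'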

Batch RCF for Non-time-series Data Command Syntax
=================================================
ad <number_of_trees> <sample_size> <output_after> <training_data_size> <anomaly_score_threshold>
Batch RCF For Non-time-series Data
----------------------------------
ad [number_of_trees] [sample_size] [output_after] [training_data_size] [anomaly_score_threshold] [category_field]

* number_of_trees(integer): optional. Number of trees in the forest. The default value is 30.
* sample_size(integer): optional. Number of random samples given to each tree from the training data set. The default value is 256.
* output_after(integer): optional. The number of points required by stream samplers before results are returned. The default value is 32.
* training_data_size(integer): optional. The default value is the size of your training data set.
* anomaly_score_threshold(double): optional. The threshold of anomaly score. The default value is 1.0.
* category_field(string): optional. It specifies the category field used to group inputs. Each category will be independently predicted.
* number_of_trees: optional. Number of trees in the forest. **Default:** 30.
* sample_size: optional. Number of random samples given to each tree from the training data set. **Default:** 256.
* output_after: optional. The number of points required by stream samplers before results are returned. **Default:** 32.
* training_data_size: optional. **Default:** size of your training data set.
* anomaly_score_threshold: optional. The threshold of anomaly score. **Default:** 1.0.
* category_field: optional. Specifies the category field used to group inputs. Each category will be independently predicted.
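
For the batch variant, a similarly hedged sketch (again assuming a ``nyc_taxi`` index with a numeric ``value`` field; omitting ``time_field`` corresponds to the non-time-series syntax above)::

    PPL> source=nyc_taxi | fields value | ad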

Example 1: Detecting events in New York City from taxi ridership data with time-series data
===========================================================================================

The example trains an RCF model and uses the model to detect anomalies in the time-series ridership data.
This example trains an RCF model and uses the model to detect anomalies in the time-series ridership data.

PPL query::

@@ -59,7 +61,7 @@
Example 2: Detecting events in New York City from taxi ridership data with time-series data independently with each category
============================================================================================================================

The example trains an RCF model and uses the model to detect anomalies in the time-series ridership data with multiple category values.
This example trains an RCF model and uses the model to detect anomalies in the time-series ridership data with multiple category values.

PPL query::

@@ -76,7 +78,7 @@
Example 3: Detecting events in New York City from taxi ridership data with non-time-series data
===============================================================================================

The example trains an RCF model and uses the model to detect anomalies in the non-time-series ridership data.
This example trains an RCF model and uses the model to detect anomalies in the non-time-series ridership data.

PPL query::

@@ -91,7 +93,7 @@
Example 4: Detecting events in New York City from taxi ridership data with non-time-series data independently with each category
================================================================================================================================

The example trains an RCF model and uses the model to detect anomalies in the non-time-series ridership data with multiple category values.
This example trains an RCF model and uses the model to detect anomalies in the non-time-series ridership data with multiple category values.

PPL query::

@@ -108,4 +110,3 @@
Limitations
===========
The ``ad`` command can only work with ``plugins.calcite.enabled=false``.
It means ``ad`` command cannot work together with new PPL commands/functions introduced in 3.0.0 and above.
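
To satisfy this requirement, the setting can be updated through the plugin settings API, as sketched below (adjust the host and port to your cluster)::

    >> curl -H 'Content-Type: application/json' -X PUT localhost:9200/_plugins/_query/settings -d '{
      "transient" : {
        "plugins.calcite.enabled" : false
      }
    }'
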
22 changes: 9 additions & 13 deletions docs/user/ppl/cmd/append.rst
@@ -1,6 +1,6 @@
=========
======
append
=========
======

.. rubric:: Table of contents

@@ -10,22 +10,18 @@ append


Description
============
| Using ``append`` command to append the result of a sub-search and attach it as additional rows to the bottom of the input search results (The main search).
The command aligns columns with the same field names and types. For different column fields between the main search and sub-search, NULL values are filled in the respective rows.

Version
=======
3.3.0
===========
| The ``append`` command appends the result of a sub-search and attaches it as additional rows to the bottom of the input search results (the main search).
| The command aligns columns with the same field names and types. Columns that exist in only one of the two searches are filled with NULL values in the rows coming from the other.

Syntax
============
======
append <sub-search>

* sub-search: mandatory. Executes PPL commands as a secondary search.
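
As a hedged sketch (the ``accounts`` index, its ``age``, ``gender``, and ``state`` fields, and the bracketed sub-search form are assumptions for illustration), appending a count aggregation to a main aggregation might look like::

    PPL> source=accounts | stats sum(age) as sum by gender, state | append [ source=accounts | stats count() by gender ];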

Example 1: Append rows from a count aggregation to existing search result
===============================================================
=========================================================================

This example appends rows from "count by gender" to "sum by gender, state".

@@ -45,7 +41,7 @@
+----------+--------+-------+------------+

Example 2: Append rows with merged column names
====================================================================================
===============================================

This example appends rows from "sum by gender" to "sum by gender, state" with merged column of same field name and type.

@@ -65,7 +61,7 @@
+-----+--------+-------+

Example 3: Append rows with column type conflict
=============================================
================================================

This example shows how column type conflicts are handled when appending results. Same name columns with different types will generate two different columns in appended result.

42 changes: 7 additions & 35 deletions docs/user/ppl/cmd/appendcol.rst
@@ -11,47 +11,15 @@ appendcol

Description
============
| (Experimental)
| (From 3.1.0)
| Using ``appendcol`` command to append the result of a sub-search and attach it alongside with the input search results (The main search).

Version
=======
3.1.0
The ``appendcol`` command appends the result of a sub-search and attaches it alongside the input search results (the main search).

Syntax
============
======
appendcol [override=<boolean>] <sub-search>

* override=<boolean>: optional. Boolean field to specify should result from main-result be overwritten in the case of column name conflict.
* override=<boolean>: optional. Boolean flag specifying whether the result from the main search should be overwritten in the case of a column name conflict. **Default:** false.
* sub-search: mandatory. Executes PPL commands as a secondary search. The sub-search uses the same data specified in the source clause of the main search as its input.

Configuration
=============
This command requires Calcite enabled.

Enable Calcite::

>> curl -H 'Content-Type: application/json' -X PUT localhost:9200/_plugins/_query/settings -d '{
"transient" : {
"plugins.calcite.enabled" : true
}
}'

Result set::

{
"acknowledged": true,
"persistent": {
"plugins": {
"calcite": {
"enabled": "true"
}
}
},
"transient": {}
}

Example 1: Append a count aggregation to existing search result
===============================================================

@@ -103,6 +71,8 @@
Example 3: Append multiple sub-search results
=============================================

This example shows how to chain multiple appendcol commands to add columns from different sub-searches.

PPL query::

PPL> source=employees | fields name, dept, age | appendcol [ stats avg(age) as avg_age ] | appendcol [ stats max(age) as max_age ];
@@ -124,6 +94,8 @@
Example 4: Override case of column name conflict
================================================

This example demonstrates the override option when column names conflict between main search and sub-search.

PPL query::

PPL> source=employees | stats avg(age) as agg by dept | appendcol override=true [ stats max(age) as agg by dept ];