Open
105 commits
3221f07
Assign type to symbol in TestRemoveUnreferencedScalarLateralNodes
losipiuk Jun 29, 2017
85d057f
Build instruction with maven wrapper
Lewuathe Jun 27, 2017
81646eb
Add doc for statistics
losipiuk Jun 1, 2017
ae0c58f
Remove display-only stats tests
kokosing Jun 1, 2017
138984c
Rename cost concept to stats
losipiuk Jun 22, 2017
ab71df0
Test for cost/stats calculation done by CoefficientBasedStatsCalculator
kokosing Jun 1, 2017
f020651
Allow passing external QueryRunner to RuleTester
losipiuk Jun 1, 2017
6dbe22c
Add rule to push down table constraints
rschlussel-zz Jun 1, 2017
910a464
Add ATQ test for table constraints infinite loop bug
rschlussel-zz Jun 27, 2017
1f1aadf
Remove Filter above TableScan logic from CoefficientBasedStatsCalculator
losipiuk Jun 28, 2017
4be5065
Change StatsCalculator API to use Lookup for computing child costs
rschlussel-zz Jun 1, 2017
c48f317
Use doubles instead Estimates in PlanNodeStatsEstimate
losipiuk Jun 14, 2017
a50b939
Add precondition checks in PlanNodeStatsEstimate constructor
losipiuk Jun 26, 2017
92becb5
Estimate default outputSizeInBytes based on outputRows
pnowojski Jun 1, 2017
34f047d
Introduce CostCalculator interface
pnowojski Jun 1, 2017
73d4bbc
Print cost estimate in Explain
pnowojski Jun 1, 2017
5511d8b
Refactor getQueryMaxMemory session property getter
pnowojski Jun 1, 2017
ebe1caf
Introduce CostComparator
pnowojski Jun 1, 2017
2d24854
Replace nulls count with nulls fraction in column statistics
losipiuk Jun 1, 2017
576b5aa
Introduce range column statistics
losipiuk Jun 1, 2017
d09d4f4
Return low/high value in show stats
losipiuk Jun 1, 2017
9219ea3
Expose min/max value in statistics for Hive tables
losipiuk Jun 1, 2017
f8cc1c9
Change argument type in TpchMetadata.getPrestoType to TpchColumn
ArturGajowy Jun 1, 2017
7694cd1
Add Constraint.alwaysFalse() method
ArturGajowy Jun 1, 2017
1cfc472
Add TestTpchMetadata
ArturGajowy Jun 1, 2017
8fb8c16
Add Optionals.{checkPresent,withBoth,combine} util methods
ArturGajowy Jun 1, 2017
88786fc
Add Types.{checkType,checkSameTypes,tryCast} util methods
ArturGajowy Jun 1, 2017
bd8980d
Add RecordTpchTableStatsTool for recording stats summaries in .json
ArturGajowy Jun 1, 2017
1355036
Add statistics recordings for tpch.{tiny,sf1}
ArturGajowy Jun 1, 2017
8e4f4db
Add EstimateAssertion
ArturGajowy Jun 1, 2017
f66eee2
Support column stats in TPCH connector
ArturGajowy Jun 1, 2017
831b5f6
Introduce DomainConverted
fiedukow May 29, 2017
daa16b6
Add clearRanges to ColumnStatistics.builder()
losipiuk Jun 2, 2017
bd958e4
Add Symbol statistics to PlanNodeStatsEstimate
losipiuk Jun 14, 2017
baefcf5
Add mapping functions to SymbolStatsEstimate
kokosing Jun 22, 2017
7887421
Remove explicit data_size from PlanNodeStatsEstimate
losipiuk Jun 2, 2017
7a0947e
Add OutputNode support to PlanBuilder
losipiuk Jun 12, 2017
323d050
Add ColumnStatistics.UNKNOWN_COLUMN_STATISTICS
losipiuk Jun 1, 2017
06d0c26
Support column stats in StatisticsAssertion
losipiuk Jun 5, 2017
b8d3d06
Add PlanNodeStatisticsAssertion
losipiuk Jun 8, 2017
3c95d64
Add support for LocalQueryRunner in StatisticsAssertion
losipiuk Jun 6, 2017
ea3146f
Add ComposableStatsCalculator
losipiuk May 30, 2017
3f67801
Allow using ComposableStatsCalculator instead of CoefficientBasedStat…
losipiuk May 30, 2017
910a806
Add stats calculator unit testing framework
losipiuk Jun 12, 2017
292e217
Add TestTpchLocalStats stub
losipiuk Jun 6, 2017
973990e
Add FilterStatsCalculator
fiedukow Jun 21, 2017
453bf81
Add stats calculation for FilterNode comparisons
fiedukow Jun 30, 2017
ecc1219
Add stats calculation for FilterNode boolean expressions
fiedukow Jun 30, 2017
afc5195
Add stats calculation for FilterNode logical operations
fiedukow Jun 30, 2017
408e859
Add stats calculation for FilterNode is (not) null expression
fiedukow Jun 30, 2017
2155c51
Add stats calculation for FilterNode comparison related operators
fiedukow Jun 30, 2017
f22af55
Add ScalarStatsCalculator
losipiuk Jun 14, 2017
f589217
Add OutputStatsRule
losipiuk May 31, 2017
5e2e1ba
Add support for low/high values in StatisticsAssertion
losipiuk Jun 7, 2017
c61de10
Add TableScanStatsRule
losipiuk May 30, 2017
52ba88c
Do not check symobol statistics in PlanStatsMatcher
losipiuk Jun 29, 2017
7a1bba6
TestStatsCalculator tests new version of calculator
losipiuk Jun 29, 2017
7117fed
Add rule for computing stats for ValuesNode
losipiuk May 30, 2017
034c4d1
Add LimitStatsRule (TODO)
losipiuk May 30, 2017
ee15131
Add EnforceSingleRowStatsRule (TODO)
losipiuk Jun 27, 2017
1632140
Implement ExchangeStatsRule for single source
losipiuk May 30, 2017
bda3f04
Add ProjectStatsRule (TODO)
losipiuk May 30, 2017
e64c6d6
Add rule for computing stats for FilterNode
fiedukow May 30, 2017
341c307
Add verifyExactColumnStatistics utility method to StatisticsAssertion
sopel39 Jun 8, 2017
9cab20c
Add scalar stats estimation for SymbolReference
losipiuk Jun 1, 2017
8bba910
Add scalar stats estimation for Literal
losipiuk Jun 14, 2017
65c8709
Use new stats calculator by default
fiedukow Jun 16, 2017
b0e80ea
Use nearlyEquals for estimate comparison
kokosing Jun 19, 2017
b578f09
Add scalar stats estimation for Cast
losipiuk Jun 13, 2017
87439d3
Add scalar stats estimation for ArithmeticBinaryExpression
kokosing Jun 20, 2017
e9dda83
Add scalar stats estimation for CoalesceExpression
kokosing Jun 20, 2017
9e63fa6
Add SymbolStatsAssertion.isEqualTo
losipiuk Jun 20, 2017
1089b02
Add test for symbol reference stats calculation
losipiuk Jun 20, 2017
916bbe5
Add JoinStatsRule to support equi-conditions and filters
sopel39 May 31, 2017
c259cd5
Make StatsCalculaterTester to be closeable
kokosing Jun 22, 2017
0ce7ac1
Add statistic estimation for simple AggregationNode
kokosing Jun 22, 2017
dee93b8
Introduce stats Normalizer
kokosing Jun 27, 2017
4f02296
Cap distinct values count to ouptut rows count
kokosing Jun 27, 2017
243cf81
Cap distinct values count to type domain range length
kokosing Jun 27, 2017
527c17f
Ensure all output symbols have stats estimates
kokosing Jun 27, 2017
1957412
Move pattern matching to separate package
kokosing Jun 27, 2017
1e9e61b
Use TreeTraverser in MatchingEngine
kokosing Jun 29, 2017
bd0b055
Check interfaces of given object when checking pattern
kokosing Jun 29, 2017
838afb8
Use Set collection to store stats estimation rules
kokosing Jun 28, 2017
c970f54
Use pattern matching in ComposableStatsCalculator
kokosing Jun 28, 2017
d7c7837
Add UnionStatsRule
kokosing Jul 3, 2017
dd17a30
Support multi source exchange
kokosing Jul 1, 2017
8747d7c
fixup! Add stats calculation for FilterNode comparisons
kokosing Jul 6, 2017
4246439
fixup! Add JoinStatsRule to support equi-conditions and filters
kokosing Jul 6, 2017
448f580
fixup! Add UnionStatsRule
kokosing Jul 6, 2017
243aa54
Add IntersectStatsRule
kokosing Jul 6, 2017
731ed0a
fixup! Introduce CostCalculator interface
ArturGajowy Jul 6, 2017
3d77f96
fixup! Introduce CostComparator
ArturGajowy Jul 6, 2017
e0a7d42
fixup! Introduce CostCalculator interface
ArturGajowy Jul 6, 2017
b1382b6
fixup! Introduce CostCalculator interface
ArturGajowy Jul 6, 2017
48b519c
Add TestCostCalculator.CostAssertionBuilder#cpu/network/memoryUnknown()
ArturGajowy Jul 6, 2017
21d782c
Make unknown costs the default in PlanNodeCostEstimate.Builder
ArturGajowy Jul 6, 2017
0c8b9b7
Test that CostCalculator successfully returns unknown costs for no stats
ArturGajowy Jul 6, 2017
ee9f2e8
fixup! Remove explicit data_size from PlanNodeStatsEstimate
ArturGajowy Jul 6, 2017
1d67379
Introduce caching cost and stats calculator
sopel39 Jul 4, 2017
6eb0374
Remove @ThreadSafe annotation from CostCalculator interface
sopel39 Jul 4, 2017
7428f72
Add pcollections library dependency
sopel39 Jul 4, 2017
3a22e63
SEE TODO: Use HashTreePMap in PlanNodeStatsEstimate to reduce map cop…
sopel39 Jul 4, 2017
356b8a8
Remove TestMemoBasedLookup as there is no MemoBasedLookup anymore
sopel39 Jul 10, 2017
0379997
fixup! Introduce CostCalculator interface
sopel39 Jul 11, 2017
4 changes: 2 additions & 2 deletions README.md
@@ -15,13 +15,13 @@ See the [User Manual](https://prestodb.io/docs/current/) for deployment instruct

Presto is a standard Maven project. Simply run the following command from the project root directory:

- mvn clean install
+ ./mvnw clean install

On the first build, Maven will download all the dependencies from the internet and cache them in the local repository (`~/.m2/repository`), which can take a considerable amount of time. Subsequent builds will be faster.

Presto has a comprehensive set of unit tests that can take several minutes to run. You can disable the tests when building:

- mvn clean install -DskipTests
+ ./mvnw clean install -DskipTests

## Running Presto in your IDE

6 changes: 6 additions & 0 deletions pom.xml
@@ -588,6 +588,12 @@
<version>42.0.0</version>
</dependency>

<dependency>
<groupId>org.pcollections</groupId>
<artifactId>pcollections</artifactId>
<version>2.1.2</version>
</dependency>

<dependency>
<groupId>org.antlr</groupId>
<artifactId>antlr4-runtime</artifactId>
1 change: 1 addition & 0 deletions presto-docs/src/main/sphinx/index.rst
@@ -15,6 +15,7 @@ Presto Documentation
language
sql
migration
optimizer
develop
release

9 changes: 9 additions & 0 deletions presto-docs/src/main/sphinx/optimizer.rst
@@ -0,0 +1,9 @@
***************
Query Optimizer
***************

.. toctree::
:maxdepth: 1

optimizer/statistics
optimizer/cost-in-explain
37 changes: 37 additions & 0 deletions presto-docs/src/main/sphinx/optimizer/cost-in-explain.rst
@@ -0,0 +1,37 @@
===============
Cost in EXPLAIN
===============

During planning, the cost associated with each node of the plan is computed based on the root table statistics
for the tables in the query. This calculated cost is printed as part of the output of an ``EXPLAIN`` statement.

Cost information is displayed in the plan tree using the format ``{rows: XX, bytes: XX}``. ``rows`` refers to the
expected number of rows output by each plan node during execution. ``bytes`` refers to the expected size of the
data output by each plan node in bytes. If any of the values is not known, a ``?`` is printed.

For example:

.. code-block:: none

presto:default> EXPLAIN SELECT comment FROM nation_with_column_stats WHERE nationkey > 3

- Output[comment] => [comment:varchar(152)] {rows: ?, bytes: ?}
- RemoteExchange[GATHER] => comment:varchar(152) {rows: 12, bytes: ?}
- ScanFilterProject[table = hive:hive:default:nation_with_column_stats,
originalConstraint = ("nationkey" > BIGINT '3'),
filterPredicate = ("nationkey" > BIGINT '3')] => [comment:varchar(152)] {rows: 25, bytes: ?}/{rows: 12, bytes: ?}/{rows: 12, bytes: ?}
LAYOUT: hive
nationkey := HiveColumnHandle{clientId=hive, name=nationkey, hiveType=bigint, hiveColumnIndex=0, columnType=REGULAR}
comment := HiveColumnHandle{clientId=hive, name=comment, hiveType=varchar(152), hiveColumnIndex=3, columnType=REGULAR}
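
As a rough illustration of the ``{rows: XX, bytes: XX}`` format, the sketch below renders one estimate and prints ``?`` for an unknown value. It is not Presto's actual printer: the class and method names are hypothetical, and representing an unknown value as ``Double.NaN`` is an assumption made for this example.

```java
// Illustrative sketch only; names are hypothetical and not part of Presto.
final class CostFormatSketch {
    // Assumption: an unknown estimate is represented as Double.NaN.
    static String formatValue(double value) {
        return Double.isNaN(value) ? "?" : String.valueOf(Math.round(value));
    }

    // Renders one cost structure in the {rows: XX, bytes: XX} format.
    static String formatEstimate(double rows, double bytes) {
        return "{rows: " + formatValue(rows) + ", bytes: " + formatValue(bytes) + "}";
    }

    public static void main(String[] args) {
        // Known row count, unknown size in bytes.
        System.out.println(formatEstimate(12, Double.NaN)); // prints {rows: 12, bytes: ?}
    }
}
```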

Generally there is only one cost printed for each plan node.
However, when a ``Scan`` operator is combined with a ``Filter`` and/or ``Project`` operator, then multiple cost structures will be printed,
each corresponding to an individual logical part of the combined meta-operator.
For example, for a ``ScanFilterProject`` operator three cost structures will be printed.

* the first will correspond to the ``Scan`` part of the operator
* the second will correspond to the ``Filter`` part of the operator
* the third will correspond to the ``Project`` part of the operator
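
For readers who post-process ``EXPLAIN`` output, the following sketch splits the slash-separated cost structures of a ``ScanFilterProject`` line and labels each part. It is only a sketch: the class name and the label array are hypothetical, and it assumes exactly the three-part layout described above. Other combinations (for example ``Scan`` with only a ``Filter``) would need their own labels.

```java
// Illustrative sketch only; names are hypothetical and not part of Presto.
final class CombinedCostSketch {
    // Labels for the logical parts of a combined ScanFilterProject operator.
    static final String[] PARTS = {"Scan", "Filter", "Project"};

    // Splits "{...}/{...}/{...}" on each slash that sits between "}" and "{".
    static String[] splitCosts(String printed) {
        return printed.split("(?<=\\})/(?=\\{)");
    }

    public static void main(String[] args) {
        String printed = "{rows: 25, bytes: ?}/{rows: 12, bytes: ?}/{rows: 12, bytes: ?}";
        String[] costs = splitCosts(printed);
        for (int i = 0; i < costs.length; i++) {
            System.out.println(PARTS[i] + ": " + costs[i]);
        }
    }
}
```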

Estimated cost is also printed in ``EXPLAIN ANALYZE`` in addition to actual runtime statistics.

146 changes: 146 additions & 0 deletions presto-docs/src/main/sphinx/optimizer/statistics.rst
@@ -0,0 +1,146 @@
================
Table Statistics
================

Presto supports statistics-based optimizations for queries. For a query to take advantage of these optimizations,
Presto must have statistical information for the tables in that query.

Table statistics are provided to the query planner by connectors.
Currently the only connector that supports statistics is the :doc:`/connector/hive`.

Table Layouts
-------------

Statistics are exposed to the query planner by a table layout. A table layout represents a subset of a table's data
and contains information about the organizational properties of that data (like sort order and bucketing).

The number of table layouts available for a table and the details of those table layouts are specific to each connector.
Using the Hive connector as an example:

* Non-partitioned tables have just one table layout representing all data in the table
* Partitioned tables have a family of table layouts. Each set of partitions to be scanned represents one table layout.
Presto will try to pick a table layout consisting of the smallest number of partitions based on filtering predicates
from the query.
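
The idea of narrowing a partitioned table to the partitions that can match a predicate can be sketched as follows. This is a simplified illustration, not the Hive connector's implementation: it assumes a single hypothetical integer partitioning key and a predicate of the form ``key > bound``, and all names are made up for the example.

```java
// Illustrative sketch only; names are hypothetical and not part of Presto.
final class LayoutPruningSketch {
    // Returns the partition key values that satisfy "key > lowerBound".
    // The returned set of partitions plays the role of one table layout.
    static int[] prune(int[] partitionValues, int lowerBound) {
        int count = 0;
        for (int value : partitionValues) {
            if (value > lowerBound) {
                count++;
            }
        }
        int[] kept = new int[count];
        int next = 0;
        for (int value : partitionValues) {
            if (value > lowerBound) {
                kept[next++] = value;
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        int[] kept = prune(new int[] {1, 5, 12, 20}, 10);
        System.out.println(kept.length + " of 4 partitions need to be scanned"); // prints 2 of 4 ...
    }
}
```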

Available Statistics
--------------------

Currently, the following statistics are available in Presto:

* For the table:

* **row count**: the total number of rows for the table layout

* For each column in a table:

* **data size**: the data size that needs to be read
* **nulls fraction**: the fraction of null values
* **distinct value count**: the number of distinct values
* **low value**: the smallest value in the column
* **high value**: the largest value in the column


The set of statistics available for a particular query depends on the connector being used and can also vary by table or
even by table layout. For example, the Hive connector does not currently provide statistics on data size.

Displaying Table Statistics
---------------------------

Table statistics can be displayed via the Presto SQL interface using the ``SHOW STATS`` command.
There are two flavors of the command:

* ``SHOW STATS FOR <table_name>`` will show statistics for the table layout representing all data in the table
* ``SHOW STATS FOR (SELECT <column_list|*> FROM <table_name> WHERE <filtering_condition>)``
will show statistics for the table layout of ``<table_name>`` representing the subset of data that remains after applying
the given filtering condition. Both the column list and the filtering condition used in the ``WHERE`` clause can reference table columns.

In both cases, the ``SHOW STATS`` command outputs two types of rows.
For each column in the table there is a row with ``column_name`` equal to the name of that column.
These rows expose column-related statistics for a table (data size, nulls fraction, distinct values count, low value, high value).
Additionally, there is one row with ``NULL`` as the ``column_name``. This row contains statistics for the whole table layout; for now this is just the row count.

For example:

.. code-block:: none

presto:default> SHOW STATS FOR nation;

column_name | data_size | distinct_values_count | nulls_fraction | row_count | low_value | high_value
-------------+-----------+-----------------------+----------------+-----------+--------------------+--------------------
regionkey | NULL | 5.0 | 0.0 | NULL | 0 | 4
name | NULL | 25.0 | 0.0 | NULL | ALGERIA | VIETNAM
comment | NULL | 25.0 | 0.0 | NULL | haggle. carefu... | y final package...
nationkey | NULL | 25.0 | 0.0 | NULL | 0 | 24
NULL | NULL | NULL | NULL | 25.0 | NULL | NULL
(5 rows)


presto:default> SHOW STATS FOR (SELECT * FROM nation WHERE nationkey > 10);

column_name | data_size | distinct_values_count | nulls_fraction | row_count | low_value | high_value
-------------+-----------+-----------------------+----------------+-----------+--------------------+--------------------
regionkey | NULL | 5.0 | 0.0 | NULL | 0 | 4
name | NULL | 9.0 | 0.0 | NULL | IRAN | VIETNAM
comment | NULL | 14.0 | 0.0 | NULL | pending excuse... | y final package...
nationkey | NULL | 3.0 | 0.0 | NULL | 10 | 24
NULL | NULL | NULL | NULL | 25.0 | NULL | NULL
(5 rows)

If the provided ``SELECT`` filters out all of the partitions (all table layouts),
``SHOW STATS`` returns no statistics, which is represented as in the example below.

.. code-block:: none

presto:default> SHOW STATS FOR (SELECT * FROM nation WHERE nationkey > 999);

column_name
-------------
NULL
(1 row)
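
When consuming ``SHOW STATS`` output programmatically, the table-level row can be recognized by its ``NULL`` ``column_name``. The sketch below shows that lookup over plain arrays. It is a hypothetical helper, not a Presto API: a real client would read these cells from its result set, and the array-based input is an assumption made to keep the example self-contained.

```java
// Illustrative sketch only; names are hypothetical and not part of Presto.
final class ShowStatsRowsSketch {
    // columnNames[i] is the column_name cell of row i (null for the summary row).
    // rowCounts[i] is the row_count cell of row i (null when it is not reported).
    // Returns the table-level row count, or null if it is unknown.
    static Double tableRowCount(String[] columnNames, Double[] rowCounts) {
        for (int i = 0; i < columnNames.length; i++) {
            if (columnNames[i] == null) {
                return rowCounts[i];
            }
        }
        return null; // no summary row present
    }

    public static void main(String[] args) {
        String[] names = {"regionkey", "name", "comment", "nationkey", null};
        Double[] counts = {null, null, null, null, 25.0};
        System.out.println(tableRowCount(names, counts)); // prints 25.0
    }
}
```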

Note that currently, providing a ``column_list`` instead of ``*`` in the ``SELECT`` does not change the output table.

For example:

.. code-block:: none

presto:default> SHOW STATS FOR (SELECT comment FROM nation WHERE nationkey > 10);

column_name | data_size | distinct_values_count | nulls_fraction | row_count | low_value | high_value
-------------+-----------+-----------------------+----------------+-----------+--------------------+--------------------
regionkey | NULL | 5.0 | 0.0 | NULL | 0 | 4
name | NULL | 9.0 | 0.0 | NULL | IRAN | VIETNAM
comment | NULL | 14.0 | 0.0 | NULL | pending excuse... | y final package...
nationkey | NULL | 3.0 | 0.0 | NULL | 10 | 24
NULL | NULL | NULL | NULL | 25.0 | NULL | NULL
(5 rows)


Updating Statistics For Hive Tables
-----------------------------------

For the Hive connector, Presto uses the statistics that are managed by Hive and exposed via the Hive metastore API.
Depending on the Hive configuration, table statistics may not be updated automatically.

If statistics are not updated automatically, the user needs to trigger a statistics update via the Hive CLI.

The following command can be used in the Hive CLI to update table statistics for non-partitioned table ``t``::

hive> ANALYZE TABLE t COMPUTE STATISTICS FOR COLUMNS

For partitioned tables, partitioning information must be specified in the command.
Assuming table ``t`` has two partitioning keys ``a`` and ``b``, the following command would
update the table statistics for all partitions::

hive> ANALYZE TABLE t PARTITION (a, b) COMPUTE STATISTICS FOR COLUMNS

It is also possible to update statistics for just a subset of partitions.
This command will update statistics for all partitions for which partitioning key ``a`` is equal to ``1``::

hive> ANALYZE TABLE t PARTITION (a=1, b) COMPUTE STATISTICS FOR COLUMNS

And this command will update statistics for just one partition::

hive> ANALYZE TABLE t PARTITION (a=1, b=5) COMPUTE STATISTICS FOR COLUMNS
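
The commands above differ only in their ``PARTITION`` clause, so a script that refreshes statistics could assemble them as sketched below. This helper is hypothetical and not part of Presto or Hive. It inserts the table name, keys and values verbatim without any quoting or escaping, so it is only suitable for trusted input.

```java
// Illustrative sketch only; names are hypothetical and not part of Presto or Hive.
final class AnalyzeStatementSketch {
    // keys: partitioning key names in order (empty for a non-partitioned table).
    // values: a fixed value per key, or null to cover every value of that key.
    static String analyzeColumns(String table, String[] keys, String[] values) {
        StringBuilder sql = new StringBuilder("ANALYZE TABLE ").append(table);
        if (keys.length > 0) {
            sql.append(" PARTITION (");
            for (int i = 0; i < keys.length; i++) {
                if (i > 0) {
                    sql.append(", ");
                }
                sql.append(keys[i]);
                if (values[i] != null) {
                    sql.append('=').append(values[i]);
                }
            }
            sql.append(')');
        }
        return sql.append(" COMPUTE STATISTICS FOR COLUMNS").toString();
    }

    public static void main(String[] args) {
        // All partitions where a = 1, for any value of b.
        System.out.println(analyzeColumns("t", new String[] {"a", "b"}, new String[] {"1", null}));
    }
}
```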

For documentation on Hive's statistics mechanism see https://cwiki.apache.org/confluence/display/Hive/StatsDev
@@ -169,6 +169,6 @@ public HiveMetadata create()
partitionUpdateCodec,
typeTranslator,
prestoVersion,
- new MetastoreHiveStatisticsProvider(typeManager, metastore));
+ new MetastoreHiveStatisticsProvider(typeManager, metastore, timeZone));
}
}