Commit df77f37

docs: Clarify HLL in extraction precedence (apache#63723)
1 parent 0b43077 commit df77f37

2 files changed: +12 -1 lines changed

devel-common/src/sphinx_exts/templates/openlineage.rst.jinja2

Lines changed: 3 additions & 0 deletions
@@ -69,6 +69,9 @@ the integration can go further. Besides recording which assets were read or writ
 it may also extract the executed SQL text, external query/job IDs. For each query a separate pair of child OpenLineage
 events is emitted.

+For details on when hook-level lineage is attached to the OpenLineage event and how it interacts with
+extractors and inlets/outlets, see :ref:`extraction_precedence:openlineage`.
+
 .. important::
     The level of detail captured varies between hooks and methods. Some may only report dataset information, while others
     expose SQL text, query IDs and more. Review the hook implementation to confirm what lineage data is available.

providers/openlineage/docs/guides/developer.rst

Lines changed: 9 additions & 1 deletion
@@ -41,7 +41,15 @@ it's important to keep in mind the order in which OpenLineage looks for lineage

 1. **Extractor** - check if there is a custom Extractor specified for Operator class name. Any custom Extractor registered by the user will take precedence over default Extractors defined in Airflow Provider source code (f.e. BashExtractor).
 2. **OpenLineage methods** - if there is no Extractor explicitly specified for Operator class name, DefaultExtractor is used, that looks for OpenLineage methods in Operator.
-3. **Inlets and Outlets** - if there are no OpenLineage methods defined in the Operator, inlets and outlets are checked.
+3. **Hook Level Lineage** - when extractor or Openlineage methods return no inputs and no outputs, hook lineage is merged
+   with any other metadata produced (e.g. run facets, job facets). When neither extractor nor Openlineage methods
+   are present, hook lineage is used directly as the full lineage result. In both cases it takes precedence over inlets
+   and outlets.
+4. **Inlets and Outlets** - only consulted as a last resort when all of the above yield no datasets. This step
+   attempts to convert inlets and outlets into OpenLineage input/output datasets, which has limited support.
+   Note that inlets and outlets defined as Airflow Assets are always included in the ``airflow`` run facet
+   (under ``task.inlets`` / ``task.outlets``) regardless of whether this conversion succeeds —
+   so the ``airflow`` run facet is the most reliable place to look for inlet/outlet information.

 If all the above options are missing, no lineage data is extracted from the Operator. You will still receive OpenLineage events
 enriched with things like general Airflow facets, proper event time and type, but the inputs/outputs will be empty
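
Reviewer note: the precedence order described in this hunk can be sketched in a few lines of Python. This is only an illustration of the documented rules, not the provider's implementation. The ``Lineage`` class, the ``resolve_lineage`` function and all parameter names are hypothetical. Two details are assumptions the text does not settle: facets from the extractor or OpenLineage methods win on key collisions, and those facets are also kept when the inlets/outlets fallback supplies the datasets.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Lineage:
    """Hypothetical result container, not a class from the provider."""

    inputs: list[str] = field(default_factory=list)
    outputs: list[str] = field(default_factory=list)
    facets: dict[str, object] = field(default_factory=dict)  # run/job facets

    def has_datasets(self) -> bool:
        return bool(self.inputs or self.outputs)


def resolve_lineage(
    extracted: Optional[Lineage],       # steps 1-2; None when neither is present
    hook: Optional[Lineage],            # step 3: hook-level lineage
    inlets_outlets: Optional[Lineage],  # step 4: converted inlets/outlets
) -> Lineage:
    # Steps 1-2: an extractor / OpenLineage-method result with datasets wins.
    if extracted is not None and extracted.has_datasets():
        return extracted

    # Step 3 is tried before step 4, so hook lineage beats inlets/outlets.
    for fallback in (hook, inlets_outlets):
        if fallback is None or not fallback.has_datasets():
            continue
        if extracted is None:
            # No extractor and no OpenLineage methods: use the fallback as-is.
            return fallback
        # Datasets come from the fallback, other metadata from `extracted`.
        # Assumption: `extracted` facets win on key collisions.
        merged_facets = {**fallback.facets, **extracted.facets}
        return Lineage(list(fallback.inputs), list(fallback.outputs), merged_facets)

    # Nothing yielded datasets: inputs/outputs stay empty, facets are kept.
    return extracted if extracted is not None else Lineage()
```

In this sketch, an extractor that returns only facets combined with a hook that reports datasets yields the hook's datasets plus the extractor's facets, and inlets/outlets are consulted only when the hook reports nothing.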
