Contributor

@fresh-borzoni fresh-borzoni commented Jan 11, 2026

Adds two new methods for column projection:

1. Table.new_log_scanner_with_projection(column_indices: List[int])
   Project columns by index (C++ parity)
   Example: scanner = table.new_log_scanner_with_projection([0, 2, 4])

2. Table.new_log_scanner_with_column_names(column_names: List[str])
   Project columns by name (Python-specific, more idiomatic!)
   Example: scanner = table.new_log_scanner_with_column_names(['id', 'name', 'email'])

Both methods create a LogScanner that reads only the specified columns, which improves performance by reducing data transfer and processing overhead.

The implementation leverages the core scanner.project() and scanner.project_by_name() APIs.
Error handling validates column indices/names before creating the scanner.
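To make the validation step concrete, here is a minimal Python sketch of the kind of checks described above. It is illustrative only: `validate_indices`, `resolve_names`, and the three-column schema are hypothetical and are not part of the bindings, and the real checks live in the Rust layer and may differ.

```python
from typing import List, Sequence


def validate_indices(num_columns: int, column_indices: Sequence[int]) -> List[int]:
    """Hypothetical helper: reject any index outside [0, num_columns)."""
    bad = [i for i in column_indices if not 0 <= i < num_columns]
    if bad:
        raise ValueError(
            f"Column index out of range: {bad} (table has {num_columns} columns)"
        )
    return list(column_indices)


def resolve_names(schema_columns: Sequence[str], column_names: Sequence[str]) -> List[int]:
    """Hypothetical helper: map column names to positions, rejecting unknown names."""
    index_of = {name: i for i, name in enumerate(schema_columns)}
    unknown = [name for name in column_names if name not in index_of]
    if unknown:
        raise ValueError(
            f"Unknown column(s): {unknown}; available: {list(schema_columns)}"
        )
    return [index_of[name] for name in column_names]


# Assumed schema order, for illustration only.
schema = ["id", "name", "score"]
by_index = validate_indices(len(schema), [0, 1])
by_name = resolve_names(schema, ["name", "score"])
```

Failing before the scanner is created means a typo in a column name surfaces as a clear error instead of as an empty or malformed result later on.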

Closes #149

Contributor

@leekeiabstraction leekeiabstraction left a comment

Thank you for the PR! Left a couple of comments. PTAL

Comment on lines +123 to +125
            let rust_scanner = table_scan.create_log_scanner().map_err(|e| {
                FlussError::new_err(format!("Failed to create log scanner: {e}"))
            })?;
Contributor

We seem to use both FlussError and PyErr within this file; for example, lines 72 to 75 use PyErr. Can you clarify when each should be used?

            let rust_scanner = table_scan.create_log_scanner().map_err(|e| {
                PyErr::new::<pyo3::exceptions::PyRuntimeError, _>(format!(
                    "Failed to create log scanner: {e:?}"
                ))
            })?;

Contributor Author

We should use FlussError; this is a leftover.

Thank you for catching this 👍

Comment on lines +184 to +198
# Project specific columns by index (C++ parity)
print("\n1. Projection by index [0, 1] (id, name):")
scanner_index = await table.new_log_scanner_with_projection([0, 1])
scanner_index.subscribe(None, None)
df_projected = scanner_index.to_pandas()
print(df_projected.head())
print(f" Projected {df_projected.shape[1]} columns: {list(df_projected.columns)}")

# Project specific columns by name (Python-specific, more idiomatic!)
print("\n2. Projection by name ['name', 'score'] (Pythonic):")
scanner_names = await table.new_log_scanner_with_column_names(["name", "score"])
scanner_names.subscribe(None, None)
df_named = scanner_names.to_pandas()
print(df_named.head())
print(f" Projected {df_named.shape[1]} columns: {list(df_named.columns)}")
Contributor

Should the polling part also be included (as with C++ example)?

Contributor Author

I'm not sure I get what you mean.

to_pandas() polls internally, but if we're talking about adding a separate polling API to the Python bindings, we can add it.

Let's file an issue for it, as it's orthogonal to the column projection feature.
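For readers unfamiliar with the pattern, "polls internally" can be pictured as a loop that keeps fetching batches until none are left. The sketch below is a toy model only: `FakeScanner`, its `poll()` method, and the stop-on-empty-batch condition are all assumptions made for illustration, and they do not describe the actual to_pandas() implementation or any existing Python API.

```python
from typing import List


class FakeScanner:
    """Hypothetical stand-in for a log scanner; not the Fluss API."""

    def __init__(self, batches: List[List[dict]]) -> None:
        self._batches = list(batches)

    def poll(self) -> List[dict]:
        # Return the next batch of rows, or an empty list once caught up.
        return self._batches.pop(0) if self._batches else []


def drain(scanner: FakeScanner) -> List[dict]:
    """Poll repeatedly and accumulate rows until an empty batch is returned."""
    rows: List[dict] = []
    while True:
        batch = scanner.poll()
        if not batch:  # assumed stop condition for this sketch
            break
        rows.extend(batch)
    return rows


scanner = FakeScanner([[{"id": 1}, {"id": 2}], [{"id": 3}]])
all_rows = drain(scanner)
```

A separate polling API would expose something like the `poll()` step directly, leaving the caller to decide when to stop.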

Contributor Author

@fresh-borzoni fresh-borzoni Jan 11, 2026

Created issue #152.

Contributor

Ah, I didn't realise that to_pandas() polls. Thank you for the clarification.

@fresh-borzoni
Contributor Author

@leekeiabstraction Thank you for the review.

Addressed the comments. PTAL 🙏

Successfully merging this pull request may close these issues.

Python bindings missing scanner projection support