diff --git a/web/pandas/about/roadmap.md b/web/pandas/about/roadmap.md
index 278143c01e7dc..3f1dc171daf2e 100644
--- a/web/pandas/about/roadmap.md
+++ b/web/pandas/about/roadmap.md
@@ -34,143 +34,3 @@ For more information about PDEPs, and how to submit one, please refer to
 {% endfor %}
-
-## Roadmap points pending a PDEP
-
-
-
-### Extensibility
-
-Pandas `extending.extension-types` allow
-for extending NumPy types with custom data types and array storage.
-Pandas uses extension types internally, and provides an interface for
-3rd-party libraries to define their own custom data types.
-
-Many parts of pandas still unintentionally convert data to a NumPy
-array. These problems are especially pronounced for nested data.
-
-We'd like to improve the handling of extension arrays throughout the
-library, making their behavior more consistent with the handling of
-NumPy arrays. We'll do this by cleaning up pandas' internals and
-adding new methods to the extension array interface.
-
-### Apache Arrow interoperability
-
-[Apache Arrow](https://arrow.apache.org) is a cross-language development
-platform for in-memory data. The Arrow logical types are closely aligned
-with typical pandas use cases.
-
-We'd like to provide better-integrated support for Arrow memory and
-data types within pandas. This will let us take advantage of its I/O
-capabilities and provide for better interoperability with other
-languages and libraries using Arrow.
-
-### Decoupling of indexing and internals
-
-The code for getting and setting values in pandas' data structures
-needs refactoring. In particular, we must clearly separate code that
-converts keys (e.g., the argument to `DataFrame.loc`) to positions from
-code that uses these positions to get or set values. This is related to
-the proposed BlockManager rewrite. Currently, the BlockManager sometimes
-uses label-based, rather than position-based, indexing. We propose that
-it should only work with positional indexing, and the translation of
-keys to positions should be entirely done at a higher level.
-
-Indexing is a complicated API with many subtleties. This refactor will require care
-and attention. The following principles should inspire refactoring of indexing code and
-should result on cleaner, simpler, and more performant code.
-
-1. Label indexing must never involve looking in an axis twice for the same label(s).
-This implies that any validation step must either:
-
- * limit validation to general features (e.g. dtype/structure of the key/index), or
- * reuse the result for the actual indexing.
-
-2. Indexers must never rely on an explicit call to other indexers.
-For instance, it is OK to have some internal method of `.loc` call some
-internal method of `__getitem__` (or of their common base class),
-but never in the code flow of `.loc` should `the_obj[something]` appear.
-
-3. Execution of positional indexing must never involve labels (as currently, sadly, happens).
-That is, the code flow of a getter call (or a setter call in which the right hand side is non-indexed)
-to `.iloc` should never involve the axes of the object in any way.
-
-4. Indexing must never involve accessing/modifying values (i.e., act on `._data` or `.values`) more than once.
-The following steps must hence be clearly decoupled:
-
- * find positions we need to access/modify on each axis
- * (if we are accessing) derive the type of object we need to return (dimensionality)
- * actually access/modify the values
- * (if we are accessing) construct the return object
-
-5. As a corollary to the decoupling between 4.i and 4.iii, any code which deals on how data is stored
-(including any combination of handling multiple dtypes, and sparse storage, categoricals, third-party types)
-must be independent from code that deals with identifying affected rows/columns,
-and take place only once step 4.i is completed.
-
- * In particular, such code should most probably not live in `pandas/core/indexing.py`
- * ... and must not depend in any way on the type(s) of axes (e.g. no `MultiIndex` special cases)
-
-6. As a corollary to point 1.i, `Index` (sub)classes must provide separate methods for any desired validity check of label(s) which does not involve actual lookup,
-on the one side, and for any required conversion/adaptation/lookup of label(s), on the other.
-
-7. Use of trial and error should be limited, and anyway restricted to catch only exceptions
-which are actually expected (typically `KeyError`).
-
- * In particular, code should never (intentionally) raise new exceptions in the `except` portion of a `try... exception`
-
-8. Any code portion which is not specific to setters and getters must be shared,
-and when small differences in behavior are expected (e.g. getting with `.loc` raises for
-missing labels, setting still doesn't), they can be managed with a specific parameter.
-
-### Numba-accelerated operations
-
-[Numba](https://numba.pydata.org) is a JIT compiler for Python code.
-We'd like to provide ways for users to apply their own Numba-jitted
-functions where pandas accepts user-defined functions (for example,
-`Series.apply`,
-`DataFrame.apply`,
-`DataFrame.applymap`, and in groupby and
-window contexts). This will improve the performance of
-user-defined-functions in these operations by staying within compiled
-code.
-
-### Documentation improvements
-
-We'd like to improve the content, structure, and presentation of the
-pandas documentation. Some specific goals include
-
-- Overhaul the HTML theme with a modern, responsive design
- (`15556`)
-- Improve the "Getting Started" documentation, designing and writing
- learning paths for users different backgrounds (e.g. brand new to
- programming, familiar with other languages like R, already familiar
- with Python).
-- Improve the overall organization of the documentation and specific
- subsections of the documentation to make navigation and finding
- content easier.
-
-### Performance monitoring
-
-Pandas uses [airspeed velocity](https://asv.readthedocs.io/en/stable/)
-to monitor for performance regressions. ASV itself is a fabulous tool,
-but requires some additional work to be integrated into an open source
-project's workflow.
-
-The [asv-runner](https://github.com/asv-runner) organization, currently
-made up of pandas maintainers, provides tools built on top of ASV. We
-have a physical machine for running a number of project's benchmarks,
-and tools managing the benchmark runs and reporting on results.
-
-We'd like to fund improvements and maintenance of these tools to
-
-- Be more stable. Currently, they're maintained on the nights and
- weekends when a maintainer has free time.
-- Tune the system for benchmarks to improve stability, following
-
-- Build a GitHub bot to request ASV runs *before* a PR is merged.
- Currently, the benchmarks are only run nightly.