Commit 406ad32

Incorporate feedback and revise
1 parent 82110ac commit 406ad32

2 files changed

+32
-8
lines changed

2024-landscape.md

Lines changed: 15 additions & 4 deletions
@@ -7,8 +7,10 @@ site:
77
# 2024 Landscape Analysis
88

99
Python is widely adopted in data science, and its use for statistics is expanding rapidly---particularly in education and applied research.
10-
The statistical ecosystem in Python is currently anchored by four major libraries:
10+
The statistical ecosystem in Python is currently anchored by six major libraries:
1111

12+
- [numpy](https://www.numpy.org/), which provides fast, flexible array and numerical operations and underpins nearly all statistical and scientific computing in Python.
13+
- [pandas](https://www.pandas.org/), which offers intuitive, high-performance data structures for tabular and time series data, making data cleaning, wrangling, and exploration straightforward and efficient.
1214
- [scipy.stats](https://docs.scipy.org/doc/scipy/reference/stats.html), which provides a comprehensive suite of probability distributions, summary statistics, and basic statistical tests.
1315
- [statsmodels](https://www.statsmodels.org/), which offers tools for econometrics, classical statistics, and statistical modeling---including linear and generalized linear models, time series analysis, and hypothesis testing.
1416
- [scikit-learn](https://scikit-learn.org/), which is best known for machine learning but also supports some statistical modeling, offering a consistent API for predictive analytics and data preprocessing.
@@ -21,7 +23,7 @@ Libraries like scikit-learn are especially valued for their clean, consistent in
2123
While there are many smaller, specialized packages available, the ecosystem remains dominated by these large, general-purpose libraries.
2224
This concentration of resources ensures stability and quality but can also limit the visibility and adoption of innovative or niche statistical tools.
2325
As Python's role in statistics continues to grow, fostering a more diverse and accessible ecosystem will be key to meeting the evolving needs of educators, researchers, and practitioners.
24-
This will also require increased statistics methods developers' participation in the core packages.
26+
This will also require increased participation from statistics methods developers in the core packages.
2527

2628
# Relationship to Other Languages
2729

@@ -38,7 +40,7 @@ The R ecosystem also benefits from substantial contributions from statistics met
3840
| Core Libraries | [scipy.stats](https://docs.scipy.org/doc/scipy/reference/stats.html), [statsmodels](https://www.statsmodels.org/), [scikit-learn](https://scikit-learn.org/) | [base R](https://www.r-project.org/), [tidyverse](https://www.tidyverse.org/), many CRAN packages |
3941
| User Experience | Fragmented, less cohesive | Cohesive, tidyverse pipelines, RStudio |
4042
| Teaching Resources | Improving, but less abundant | Extensive, beginner-friendly |
41-
| Community | Large, less connected in statistics | Strong, statistics-focused, welcoming |
43+
| Community | Large, but less connected in statistics | Strong, statistics-focused, welcoming |
4244
| Package Development | High barriers, less modularity | Easy, many small packages, dev tools |
4345
| Interoperability | Needs improvement (data structures, APIs) | Strong within tidyverse, RStudio |
4446
| Branding | Data science/machine learning focus | Statistics-focused |
@@ -57,10 +59,19 @@ Despite Python's strengths, several challenges remain.
5759
- **Fragmentation**: The ecosystem is fragmented, with major libraries (e.g., statsmodels vs. scikit-learn) adopting incompatible APIs and workflows, leading to confusion for users and students.
5860
- **User Experience**: There is no central landing place or unified entry point for statistics in Python, unlike R's [tidyverse](https://www.tidyverse.org/) or RStudio, making it harder for newcomers to get started.
5961
- **Interoperability**: Data structures (such as those from [pandas](https://pandas.pydata.org/) and [NumPy](https://numpy.org/)) do not always work seamlessly across libraries, requiring conversions and leading to unpredictable function outputs compared to R's tidyverse pipelines.
62+
Moreover, some statistical methods consume the results of other statistical subroutines (e.g., a multiple testing adjustment might be applied to the results of several different tests).
63+
At the moment, there is limited support for composing statistical methods as subroutines of one another.
6064
- **Teaching Resources**: Python lacks the abundance of user-friendly, statistics-focused tutorials and case studies found in the R community.
6165
- **Contributor Barriers**: Contributing to core libraries can be difficult due to high standards and lack of modularity.
6266
Small, specialized packages exist but are less visible and less widely used than in R.
63-
- **Statistical Methods Coverage**: Some advanced or niche statistical methods are missing or hard to find, especially compared to R's vast [CRAN](https://cran.r-project.org/) repository.
67+
- **Statistical Methods Coverage**: Support for basic methods could be improved; moreover, Python's support for advanced or niche statistical methodology generally lags behind what is available in R's vast [CRAN](https://cran.r-project.org/) repository.
68+
- **Comprehensive tooling for statistical analysis**: Data analysts using statistical methods need more than just the `p`-value for a statistical test or a coefficient for a regression model.
69+
There are well-established numerical and visual diagnostics that accompany many statistical methods, but these typically have limited support in existing packages.
70+
Moreover, analysts need to communicate their results through a variety of media, and there is often minimal communication support built into Python statistical software.
71+
- **Abstracting the core computation from the statistical methodology**: Many computations required in statistics (e.g., solving the optimization problem associated with a generalized linear model) have a variety of algorithmic options.
72+
While most statistical packages implement only one or two algorithms, there is rarely one "right" algorithm for every scenario.
73+
Depending on the size of the data, available hardware, analysis needs, etc., there can be multiple algorithms an analyst might want to use.
74+
Many Python statistical software packages tightly couple the core computation with the rest of the methodology, which makes it difficult to provide better computational approaches.
6475
- **Community and Culture**: The Python statistics community is less cohesive and connected than R's, which benefits from a strong identity and established events.
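
The interoperability point about using one method's results as input to another can be illustrated with a small sketch. It assumes NumPy and SciPy are installed, uses simulated data, and defines a hypothetical helper, `benjamini_hochberg`, written here for illustration rather than taken from any package. The point is only that each test returns its own result object, so the analyst has to extract the p-values by hand before an adjustment can be applied.

```python
import numpy as np
from scipy import stats


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Hypothetical helper written for this sketch, not a library function.
    """
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Scale the k-th smallest p-value by m / k.
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity from the largest p-value downwards, then cap at 1.
    adjusted_sorted = np.minimum(np.minimum.accumulate(scaled[::-1])[::-1], 1.0)
    adjusted = np.empty(m)
    adjusted[order] = adjusted_sorted
    return adjusted


# Simulated two-group data (illustrative only).
rng = np.random.default_rng(0)
control = rng.normal(loc=0.0, scale=1.0, size=50)
treated = rng.normal(loc=0.5, scale=1.0, size=50)

# Each test returns a different result type; only `.pvalue` is shared.
results = {
    "t-test": stats.ttest_ind(control, treated),
    "Mann-Whitney U": stats.mannwhitneyu(control, treated),
    "Kolmogorov-Smirnov": stats.ks_2samp(control, treated),
}

# Manual glue code: collect the p-values, then adjust them together.
raw_p = np.array([res.pvalue for res in results.values()])
adjusted_p = benjamini_hochberg(raw_p)

for name, raw, adj in zip(results, raw_p, adjusted_p):
    print(f"{name}: raw p = {raw:.4f}, adjusted p = {adj:.4f}")
```

The glue code between the tests and the adjustment is short here, but it is written by the analyst rather than provided by a shared interface, which is the gap the bullet describes.
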
6576

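
The idea of separating the core computation from the statistical methodology can be sketched as follows. This is a minimal illustration under stated assumptions, not how any existing package is organized: it substitutes ordinary least squares for the generalized linear model mentioned above to keep the example short, and the names `fit_linear_model`, `solve_normal_equations`, and `solve_qr` are hypothetical. Only NumPy is assumed.

```python
from typing import Callable

import numpy as np

# A solver takes a design matrix and a response and returns coefficients.
Solver = Callable[[np.ndarray, np.ndarray], np.ndarray]


def solve_normal_equations(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Solve (X'X) beta = X'y directly; less stable if X is ill-conditioned."""
    return np.linalg.solve(X.T @ X, X.T @ y)


def solve_qr(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Solve via a reduced QR decomposition of X; more stable numerically."""
    q, r = np.linalg.qr(X)
    return np.linalg.solve(r, q.T @ y)


def fit_linear_model(X: np.ndarray, y: np.ndarray, solver: Solver = solve_qr) -> dict:
    """Hypothetical methodology layer: it builds the design matrix and the
    outputs, and delegates the numerical work to whichever solver is passed."""
    design = np.column_stack([np.ones(len(X)), X])  # add an intercept column
    beta = solver(design, y)
    residuals = y - design @ beta
    return {"coefficients": beta, "residuals": residuals}


# Simulated data (illustrative only): y = 1 + 2*x1 - 3*x2 + noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = 1.0 + X @ np.array([2.0, -3.0]) + rng.normal(scale=0.1, size=200)

fit_qr = fit_linear_model(X, y, solver=solve_qr)
fit_ne = fit_linear_model(X, y, solver=solve_normal_equations)
print(fit_qr["coefficients"])
print(fit_ne["coefficients"])
```

Because the solver is just an argument, a different algorithm (for example, one suited to larger data or different hardware) could be swapped in without touching the code that builds the design matrix or computes residuals.
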
6677
# Conclusion

about.md

Lines changed: 17 additions & 4 deletions
@@ -6,8 +6,21 @@ site:
66

77
# About
88

9-
The Statistical Python project was launched with support from a [grant from the NSF](https://nsf.elsevierpure.com/en/projects/pose-phase-1-an-open-source-ecosystem-for-statistical-python), titled "POSE: Phase I: An open-source ecosystem for statistical Python."
10-
We are now completing Phase I, which has centered on scoping activities to inform the transition into a sustainable open-source ecosystem.
11-
During this phase, we conducted interviews with stakeholders across the statistical and scientific Python communities, engaged with related domain-stack OSEs to learn from their experiences, and organized a workshop to gather input on community needs and technical directions.
9+
The Statistical Python project was launched with support from a [grant from the NSF](https://nsf.elsevierpure.com/en/projects/pose-phase-1-an-open-source-ecosystem-for-statistical-python), titled _"POSE: Phase I: An open-source ecosystem for statistical Python."_
10+
We are now completing Phase I, which has focused on scoping activities to inform the transition into a sustainable open-source ecosystem.
11+
During this phase, we conducted interviews with stakeholders across the statistical and scientific Python communities, engaged with related domain-specific OSEs to learn from their experiences, led group discussions at national and international conferences, and organized a workshop to gather input on community needs and technical priorities.
1212

13-
Based on our [2024 Landscape Analysis](2024-landscape), we ...
13+
## Audience / Target Groups
14+
15+
We help:
16+
17+
- **Educators** teach statistics using a comprehensive, free computational ecosystem with clear user interfaces and accessible learning materials.
18+
- **Researchers** produce reliable results through an extensive collection of well-engineered and tested computational libraries, featuring intuitive APIs and comprehensive documentation.
19+
- **Method developers** share their innovations easily with a wide audience through standardized packaging and distribution channels.
20+
- **Practicing statisticians and data scientists** access powerful tools to compute results efficiently, without the friction of switching between different software ecosystems.
21+
22+
We foster a sustainable ecosystem, aiming to attract statisticians who actively participate in developing the tools they use daily.
23+
24+
## Landscape Analysis
25+
26+
Read our [2024 Landscape Analysis](2024-landscape), a primary output from our Phase I activities.
