> DataFrames on AWS

[](https://pypi.org/project/awswrangler/)
[](https://pypi.org/project/awswrangler/)
[](https://pypi.org/project/awswrangler/)
[](https://aws-data-wrangler.readthedocs.io/en/latest/?badge=latest)
[](https://pypi.org/project/awswrangler/)
[](http://isitmaintained.com/project/awslabs/aws-data-wrangler "Average time to resolve an issue")
[](https://opensource.org/licenses/Apache-2.0)

## [Read the Docs](https://aws-data-wrangler.readthedocs.io)

## [Read the Tutorials](https://github.com/awslabs/aws-data-wrangler/tree/master/tutorials)
- [Catalog & Metadata](https://github.com/awslabs/aws-data-wrangler/blob/master/tutorials/catalog_and_metadata.ipynb)
- [Athena Nested](https://github.com/awslabs/aws-data-wrangler/blob/master/tutorials/athena_nested.ipynb)
- [S3 Write Modes](https://github.com/awslabs/aws-data-wrangler/blob/master/tutorials/s3_write_modes.ipynb)

## Contents
- [Use Cases](#Use-Cases)
- [Installation](#Installation)
- [Examples](#Examples)
- [Diving Deep](#Diving-Deep)
- [Step By Step](#Step-By-Step)
- [Contributing](#Contributing)

## Use Cases

### Pandas

| FROM | TO | Features |
|------|----|----------|
| Pandas DataFrame | Amazon S3 | Parquet, CSV, Partitions, Parallelism, Overwrite/Append/Partitions-Upsert modes,<br>KMS Encryption, Glue Metadata (Athena, Spectrum, Spark, Hive, Presto) |
| Amazon S3 | Pandas DataFrame | Parquet (Pushdown filters), CSV, Partitions, Parallelism,<br>KMS Encryption, Multiple files |
| Amazon Athena | Pandas DataFrame | Workgroups, S3 output path, Encryption, and two different engines:<br><br>- ctas_approach=False **->** Batching for memory-restricted environments<br>- ctas_approach=True **->** Blazing fast, parallelism and enhanced data types |
| Pandas DataFrame | Amazon Redshift | Blazing fast, using parallel Parquet on S3 behind the scenes<br>Append/Overwrite/Upsert modes |
| Amazon Redshift | Pandas DataFrame | Blazing fast, using parallel Parquet on S3 behind the scenes |
| Pandas DataFrame | Amazon Aurora | Supported engines: MySQL, PostgreSQL<br>Blazing fast, using parallel CSV on S3 behind the scenes<br>Append/Overwrite modes |
| Amazon Aurora | Pandas DataFrame | Supported engines: MySQL<br>Blazing fast, using parallel CSV on S3 behind the scenes |
| CloudWatch Logs Insights | Pandas DataFrame | Query results |
| Glue Catalog | Pandas DataFrame | List tables and get table details. Good fit for Jupyter Notebooks. |
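
For example, a minimal sketch of the Pandas round trip above (DataFrame -> S3/Glue -> Athena -> DataFrame). It assumes the module-level `wr.pandas.to_parquet` / `wr.pandas.read_sql_athena` helpers from the 0.x API surface; the bucket, database, and table names are placeholders:

```python
import awswrangler as wr
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "name": ["foo", "boo"]})

# Write the DataFrame as partitioned Parquet on S3 and register it in the Glue Catalog
wr.pandas.to_parquet(
    dataframe=df,
    database="my_database",           # placeholder Glue database
    path="s3://my-bucket/my_table/",  # placeholder S3 path
    partition_cols=["name"],
    mode="overwrite",
)

# Read it back through Athena; ctas_approach=True is the fast, parallel engine
df2 = wr.pandas.read_sql_athena(
    sql="SELECT * FROM my_table",
    database="my_database",
    ctas_approach=True,
)
```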

### PySpark

| FROM | TO | Features |
|------|----|----------|
| PySpark DataFrame | Amazon Redshift | Blazing fast, using parallel Parquet on S3 behind the scenes<br>Append/Overwrite/Upsert modes |
| PySpark DataFrame | Glue Catalog | Register a Parquet or CSV DataFrame in the Glue Catalog |
| Nested PySpark DataFrame | Flat PySpark DataFrames | Flatten structs and break up arrays into child tables |
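
A minimal sketch of the PySpark helpers, assuming the 0.x `wr.spark.create_glue_table` and `wr.spark.flatten` signatures; the S3 path, database, and partition columns are placeholders:

```python
import awswrangler as wr
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("s3://my-bucket/source/")  # placeholder source path

# Register the DataFrame's Parquet files as a table in the Glue Catalog
wr.spark.create_glue_table(
    dataframe=df,
    file_format="parquet",
    partition_by=["year", "month"],  # placeholder partition columns
    path="s3://my-bucket/source/",
    database="my_database",          # placeholder Glue database
)

# Break up a nested DataFrame (structs/arrays) into flat child DataFrames
flattened = wr.spark.flatten(dataframe=df)
```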

### General

| Feature | Details |
|---------|---------|
| List S3 objects | e.g. `wr.s3.list_objects("s3://...")` |
| Delete S3 objects | Parallel |
| Delete listed S3 objects | Parallel |
| Delete NOT listed S3 objects | Parallel |
| Copy listed S3 objects | Parallel |
| Get the size of S3 objects | Parallel |
| Get CloudWatch Logs Insights query results | |
| Load partitions on Athena/Glue table | Through "MSCK REPAIR TABLE" |
| Create EMR cluster | "For humans" |
| Terminate EMR cluster | "For humans" |
| Get EMR cluster state | "For humans" |
| Submit EMR step(s) | "For humans" |
| Get EMR step state | "For humans" |
| Query Athena to receive python primitives | Returns `Iterable[Dict[str, Any]]` |
| Load and unzip SageMaker job outputs | |
| Load and unzip SageMaker models | |
| Dump Amazon Redshift as Parquet files on S3 | |
| Dump Amazon Aurora as CSV files on S3 | Only for the MySQL engine |
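
A minimal sketch of a few of the general helpers. Only `wr.s3.list_objects` is taken verbatim from the table above; `wr.s3.delete_objects` and `wr.athena.repair_table` are assumed helper names, and the bucket, database, and table values are placeholders:

```python
import awswrangler as wr

# List every object under a prefix (parallelized behind the scenes)
keys = wr.s3.list_objects("s3://my-bucket/my-prefix/")

# Delete everything under the same prefix (assumed helper name)
wr.s3.delete_objects(path="s3://my-bucket/my-prefix/")

# Load new partitions into the Glue Catalog, i.e. "MSCK REPAIR TABLE" (assumed helper name)
wr.athena.repair_table(database="my_database", table="my_table")
```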

## Installation