> Utility belt to handle data on AWS.

- [![Release](https://img.shields.io/badge/release-0.0.9-brightgreen.svg)](https://pypi.org/project/awswrangler/)
+ [![Release](https://img.shields.io/badge/release-0.0.10-brightgreen.svg)](https://pypi.org/project/awswrangler/)
[![Downloads](https://img.shields.io/pypi/dm/awswrangler.svg)](https://pypi.org/project/awswrangler/)
[![Python Version](https://img.shields.io/badge/python-3.6%20%7C%203.7-brightgreen.svg)](https://pypi.org/project/awswrangler/)
[![Documentation Status](https://readthedocs.org/projects/aws-data-wrangler/badge/?version=latest)](https://aws-data-wrangler.readthedocs.io/en/latest/?badge=latest)
- [![Coverage](https://img.shields.io/badge/coverage-83%25-brightgreen.svg)](https://pypi.org/project/awswrangler/)
+ [![Coverage](https://img.shields.io/badge/coverage-87%25-brightgreen.svg)](https://pypi.org/project/awswrangler/)
[![Average time to resolve an issue](http://isitmaintained.com/badge/resolution/awslabs/aws-data-wrangler.svg)](http://isitmaintained.com/project/awslabs/aws-data-wrangler "Average time to resolve an issue")
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)

### PySpark
* PySpark -> Redshift (Parallel)
- * Register Glue table from Dataframe stored on S3 (NEW :star:)
+ * Register Glue table from Dataframe stored on S3
+ * Flatten nested DataFrames (NEW :star:)

### General
* List S3 objects (Parallel)
* Copy listed S3 objects (Parallel)
* Get the size of S3 objects (Parallel)
* Get CloudWatch Logs Insights query results
- * Load partitions on Athena/Glue table (repair table) (NEW :star:)
+ * Load partitions on Athena/Glue table (repair table)
+ * Create EMR cluster (For humans) (NEW :star:)
+ * Terminate EMR cluster (NEW :star:)
+ * Get EMR cluster state (NEW :star:)
+ * Submit EMR step (For humans) (NEW :star:)
+ * Get EMR step state (NEW :star:)

## Installation
@@ -195,6 +201,16 @@ session.spark.create_glue_table(dataframe=dataframe,
                                database="my_database")
```
+ #### Flatten nested PySpark DataFrame
+
+ ```py3
+ session = awswrangler.Session(spark_session=spark)
+ dfs = session.spark.flatten(df=df_nested)
+ for name, df_flat in dfs:
+     print(name)
+     df_flat.show()
+ ```
+

### General
#### Deleting a bunch of S3 objects (parallel)
@@ -221,6 +237,51 @@ session = awswrangler.Session()
session.athena.repair_table(database="db_name", table="tbl_name")
```
+ #### Create EMR cluster
+
+ ```py3
+ session = awswrangler.Session()
+ cluster_id = session.emr.create_cluster(
+     cluster_name="wrangler_cluster",
+     logging_s3_path=f"s3://BUCKET_NAME/emr-logs/",
+     emr_release="emr-5.27.0",
+     subnet_id="SUBNET_ID",
+     emr_ec2_role="EMR_EC2_DefaultRole",
+     emr_role="EMR_DefaultRole",
+     instance_type_master="m5.xlarge",
+     instance_type_core="m5.xlarge",
+     instance_type_task="m5.xlarge",
+     instance_ebs_size_master=50,
+     instance_ebs_size_core=50,
+     instance_ebs_size_task=50,
+     instance_num_on_demand_master=1,
+     instance_num_on_demand_core=1,
+     instance_num_on_demand_task=1,
+     instance_num_spot_master=0,
+     instance_num_spot_core=1,
+     instance_num_spot_task=1,
+     spot_bid_percentage_of_on_demand_master=100,
+     spot_bid_percentage_of_on_demand_core=100,
+     spot_bid_percentage_of_on_demand_task=100,
+     spot_provisioning_timeout_master=5,
+     spot_provisioning_timeout_core=5,
+     spot_provisioning_timeout_task=5,
+     spot_timeout_to_on_demand_master=True,
+     spot_timeout_to_on_demand_core=True,
+     spot_timeout_to_on_demand_task=True,
+     python3=True,
+     spark_glue_catalog=True,
+     hive_glue_catalog=True,
+     presto_glue_catalog=True,
+     bootstraps_paths=None,
+     debugging=True,
+     applications=["Hadoop", "Spark", "Ganglia", "Hive"],
+     visible_to_all_users=True,
+     key_pair_name=None,
+ )
+ print(cluster_id)
+ ```
+
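+ #### Terminate EMR cluster, get states and submit steps
+
+ The other new EMR helpers listed above ship without snippets here. A minimal sketch, assuming `session.emr` exposes `get_cluster_state`, `submit_step`, `get_step_state` and `terminate_cluster` with the parameter names shown (names inferred from the feature list, not a confirmed API):
+
+ ```py3
+ session = awswrangler.Session()
+
+ # Check the cluster state (assumed helper; returns an EMR state string such as "WAITING")
+ print(session.emr.get_cluster_state(cluster_id=cluster_id))
+
+ # Submit a step (assumed signature: cluster_id, name, command)
+ step_id = session.emr.submit_step(
+     cluster_id=cluster_id,
+     name="wrangler_step",
+     command="spark-submit --deploy-mode cluster s3://BUCKET_NAME/script.py",
+ )
+
+ # Check the step state (assumed helper and parameters)
+ print(session.emr.get_step_state(cluster_id=cluster_id, step_id=step_id))
+
+ # Terminate the cluster when finished
+ session.emr.terminate_cluster(cluster_id=cluster_id)
+ ```
+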
## Diving Deep