
> Utility belt to handle data on AWS.

[![Release](https://img.shields.io/badge/release-0.0.10-brightgreen.svg)](https://pypi.org/project/awswrangler/)
[![Downloads](https://img.shields.io/pypi/dm/awswrangler.svg)](https://pypi.org/project/awswrangler/)
[![Python Version](https://img.shields.io/badge/python-3.6%20%7C%203.7-brightgreen.svg)](https://pypi.org/project/awswrangler/)
[![Documentation Status](https://readthedocs.org/projects/aws-data-wrangler/badge/?version=latest)](https://aws-data-wrangler.readthedocs.io/en/latest/?badge=latest)
[![Coverage](https://img.shields.io/badge/coverage-87%25-brightgreen.svg)](https://pypi.org/project/awswrangler/)
[![Average time to resolve an issue](http://isitmaintained.com/badge/resolution/awslabs/aws-data-wrangler.svg)](http://isitmaintained.com/project/awslabs/aws-data-wrangler "Average time to resolve an issue")
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)

### PySpark
* PySpark -> Redshift (Parallel)
* Register Glue table from Dataframe stored on S3
* Flatten nested DataFrames (NEW :star:)

### General
* List S3 objects (Parallel)
* Copy listed S3 objects (Parallel)
* Get the size of S3 objects (Parallel)
* Get CloudWatch Logs Insights query results
* Load partitions on Athena/Glue table (repair table)
* Create EMR cluster (For humans) (NEW :star:)
* Terminate EMR cluster (NEW :star:)
* Get EMR cluster state (NEW :star:)
* Submit EMR step (For humans) (NEW :star:)
* Get EMR step state (NEW :star:)

## Installation

```py3
session.spark.create_glue_table(dataframe=dataframe,
                                database="my_database")
```

#### Flatten nested PySpark DataFrame

```py3
session = awswrangler.Session(spark_session=spark)
dfs = session.spark.flatten(df=df_nested)
for name, df_flat in dfs:
    print(name)
    df_flat.show()
```

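For intuition, here is a minimal pure-Python sketch (not awswrangler code) of the struct-flattening idea: nested fields collapse into dotted column names. Arrays, which the DataFrame version handles by producing additional DataFrames, are out of scope for this sketch.

```py3
def flatten_record(record, prefix=""):
    """Recursively collapse nested dicts into dotted column names —
    a plain-Python illustration of flattening a nested schema."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten_record(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat

nested = {"id": 1, "address": {"city": "Seattle", "geo": {"lat": 47.6, "lon": -122.3}}}
print(flatten_record(nested))
# {'id': 1, 'address.city': 'Seattle', 'address.geo.lat': 47.6, 'address.geo.lon': -122.3}
```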
### General

#### Deleting a bunch of S3 objects (parallel)

#### Load partitions on Athena/Glue table (repair table)

```py3
session = awswrangler.Session()
session.athena.repair_table(database="db_name", table="tbl_name")
```

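Repairing the table discovers Hive-style `key=value` directories under the table's S3 path and registers them as partitions. A small standalone sketch (the helper name is ours, not awswrangler's) of how partition values are encoded in object keys:

```py3
def partition_values(s3_key: str) -> dict:
    """Extract Hive-style partition values (e.g. year=2019/month=10/)
    from an S3 object key — the layout that repair_table discovers."""
    values = {}
    for segment in s3_key.split("/"):
        if "=" in segment:
            key, value = segment.split("=", 1)
            values[key] = value
    return values

print(partition_values("my_table/year=2019/month=10/data.parquet"))
# {'year': '2019', 'month': '10'}
```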
#### Create EMR cluster

```py3
session = awswrangler.Session()
cluster_id = session.emr.create_cluster(
    cluster_name="wrangler_cluster",
    logging_s3_path="s3://BUCKET_NAME/emr-logs/",
    emr_release="emr-5.27.0",
    subnet_id="SUBNET_ID",
    emr_ec2_role="EMR_EC2_DefaultRole",
    emr_role="EMR_DefaultRole",
    instance_type_master="m5.xlarge",
    instance_type_core="m5.xlarge",
    instance_type_task="m5.xlarge",
    instance_ebs_size_master=50,
    instance_ebs_size_core=50,
    instance_ebs_size_task=50,
    instance_num_on_demand_master=1,
    instance_num_on_demand_core=1,
    instance_num_on_demand_task=1,
    instance_num_spot_master=0,
    instance_num_spot_core=1,
    instance_num_spot_task=1,
    spot_bid_percentage_of_on_demand_master=100,
    spot_bid_percentage_of_on_demand_core=100,
    spot_bid_percentage_of_on_demand_task=100,
    spot_provisioning_timeout_master=5,
    spot_provisioning_timeout_core=5,
    spot_provisioning_timeout_task=5,
    spot_timeout_to_on_demand_master=True,
    spot_timeout_to_on_demand_core=True,
    spot_timeout_to_on_demand_task=True,
    python3=True,
    spark_glue_catalog=True,
    hive_glue_catalog=True,
    presto_glue_catalog=True,
    bootstraps_paths=None,
    debugging=True,
    applications=["Hadoop", "Spark", "Ganglia", "Hive"],
    visible_to_all_users=True,
    key_pair_name=None,
)
print(cluster_id)
```

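The `spot_bid_percentage_of_on_demand_*` arguments express each Spot bid as a percentage of the instance's on-demand price. A tiny illustrative helper (not part of awswrangler; the price is a made-up example) shows the arithmetic:

```py3
def spot_bid(on_demand_price: float, percentage: int) -> float:
    """Spot bid derived as a percentage of the on-demand hourly price,
    mirroring the spot_bid_percentage_of_on_demand_* parameters."""
    return round(on_demand_price * percentage / 100.0, 4)

# Hypothetical on-demand hourly rate; 100% bids the full on-demand price.
print(spot_bid(0.192, 100))  # 0.192
print(spot_bid(0.192, 50))   # 0.096
```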
## Diving Deep
