Allow user to remove broadcast variables when they are no longer used #771
RongGu wants to merge 2 commits into mesos:master
Thank you for your pull request. An admin will review this request soon.
Can we rename this function to remove, and toClearSource to releaseSource?
2. Add a parameter to determine whether block managers report the broadcast block to the master.
Hi @RongGu, AFAIK Spark already has a time-based automatic cleanup mechanism in HttpBroadcast when spark.cleaner.ttl is enabled, and this can mostly clean up the JobConf in HadoopRDD. But this mechanism has an issue with Spark Streaming (https://spark-project.atlassian.net/browse/STREAMING-38?jql=project%20%3D%20STREAMING). It would be a great help to clean up broadcast variables automatically by tracking memory rather than by time.
Hi @jerryshao, thanks for your comment. An automatic memory cleaner for broadcast variables would be nice. Nevertheless, the purpose of this patch is to give users an API for removing broadcast variables. The two do not conflict in essence. For memory cleanup, the lesson I have learned is that no program-monitoring mechanism beats having users clear the memory explicitly, where that is possible. GC cannot always run in time, and it has overhead costs. Moreover, in this case it is hard to determine whether users will still need a broadcast variable, so a TTL may lead to errors, as the Spark Streaming issue describes. On the other hand, leaving large unused broadcast variables in memory is a problem, and users have no means to handle it. Therefore, we provide users with an explicit method for removing broadcast variables.
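To illustrate the trade-off discussed here, below is a minimal sketch of a time-based cleaner next to explicit removal. It is a generic toy model, not Spark's cleaner and not a reproduction of the linked Spark Streaming issue; `TtlStore` and all of its methods are hypothetical names introduced only for this example.

```scala
import scala.collection.mutable

// Hypothetical time-based store; this is NOT Spark's cleaner implementation.
final class TtlStore[V](ttlSeconds: Long) {
  // id -> (value, creation time in seconds)
  private val entries = mutable.Map.empty[Long, (V, Long)]

  def put(id: Long, value: V, nowSeconds: Long): Unit =
    entries(id) = (value, nowSeconds)

  def get(id: Long): Option[V] = entries.get(id).map(_._1)

  // Time-based cleanup: drops every entry older than the TTL,
  // whether or not some job still needs it.
  def clean(nowSeconds: Long): Unit = {
    val expired = entries.collect {
      case (id, (_, created)) if nowSeconds - created > ttlSeconds => id
    }
    expired.foreach(id => entries.remove(id))
  }

  // Explicit cleanup: the caller decides when the value is no longer needed.
  def remove(id: Long): Unit = entries -= id
}

object TtlVersusExplicit {
  def main(args: Array[String]): Unit = {
    val store = new TtlStore[String](ttlSeconds = 10L)
    store.put(1L, "lookup-table", nowSeconds = 0L)
    store.clean(nowSeconds = 11L)
    // A long-running job that reads entry 1 after this point finds nothing.
    println(store.get(1L))
  }
}
```

With a TTL, the store cannot tell a forgotten value from one that a long-running job still reads, so the choice of TTL is a guess. With `remove`, the decision stays with the caller, who knows when the value is done.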
In Spark, users can create broadcast variables to share read-only variables across tasks or operations, especially when they are large. However, Spark currently does not allow users to remove those variables within one SparkContext. This becomes a major issue for long-running Shark servers, which use a single SparkContext. To address this issue, this patch allows users to remove broadcast variables when they are no longer used. To remove a broadcast variable, users only need to call the Broadcast.rm(toClearSource:Boolean) method, and the copies of the broadcast variable on the slaves will be deleted. If toClearSource is set to true, the data source (e.g., the file used by HttpServer) will be deleted too.
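The behaviour described above can be sketched with a small in-memory model. This is only an illustration under stated assumptions, not the code in this patch: `WorkerStore`, `SourceStore`, and `ToyBroadcast` are hypothetical stand-ins for the slaves' block storage, the driver-side data source, and the broadcast handle. Only the `rm(toClearSource: Boolean)` signature is taken from the description.

```scala
import scala.collection.mutable

// Hypothetical stand-in for one slave's local block storage.
final class WorkerStore {
  private val blocks = mutable.Map.empty[Long, Any]
  def put(id: Long, value: Any): Unit = blocks(id) = value
  def drop(id: Long): Unit = blocks -= id
  def contains(id: Long): Boolean = blocks.contains(id)
}

// Hypothetical stand-in for the driver-side source (e.g. files served over HTTP).
final class SourceStore {
  private val files = mutable.Set.empty[Long]
  def publish(id: Long): Unit = files += id
  def delete(id: Long): Unit = files -= id
  def has(id: Long): Boolean = files.contains(id)
}

// Toy broadcast handle that models the rm(toClearSource) semantics.
final class ToyBroadcast[T](val id: Long, value: T,
                            workers: Seq[WorkerStore], source: SourceStore) {
  // On creation, publish the source and place a copy on every worker.
  source.publish(id)
  workers.foreach(_.put(id, value))

  def rm(toClearSource: Boolean): Unit = {
    // Always delete the copies held by the workers.
    workers.foreach(_.drop(id))
    // Delete the source only when the caller asks for it.
    if (toClearSource) source.delete(id)
  }
}

object BroadcastRemovalSketch {
  def main(args: Array[String]): Unit = {
    val workers = Seq(new WorkerStore, new WorkerStore)
    val source = new SourceStore
    val table = new ToyBroadcast(1L, Array(1, 2, 3), workers, source)
    table.rm(toClearSource = false) // worker copies gone, source kept
    table.rm(toClearSource = true)  // source removed as well
    println(s"workers hold block: ${workers.exists(_.contains(1L))}, source kept: ${source.has(1L)}")
  }
}
```

In this model, keeping the source after `rm(false)` means workers could fetch the value again later, while `rm(true)` releases everything. Whether the patch behaves the same way in that respect is an assumption of the sketch.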