Spark preprocessor optimization #123

basavaraj29 · 2022-11-07T18:54:35Z

removing id assignment for edges
using zipWithIndex instead of repartition(1) and windowing
partitionBy([src_bucket, dst_bucket])

todo:

custom binary writer to eliminate intermediate csv
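
For context on the `partitionBy([src_bucket, dst_bucket])` item above, here is a minimal pure-Python sketch of grouping edges by a (source bucket, destination bucket) pair. The PR does not say how bucket ids are derived, so the equal-sized id ranges, the function name `bucket_edges`, and its parameters are assumptions for illustration only, not the code in this change.

```python
from collections import defaultdict
import math


def bucket_edges(edges, num_nodes, num_partitions):
    """Group (src, dst) edges by their (src_bucket, dst_bucket) pair."""
    # Assumption: node ids are contiguous in [0, num_nodes) and every
    # node partition covers an equal-sized range of ids.
    partition_size = math.ceil(num_nodes / num_partitions)
    buckets = defaultdict(list)
    for src, dst in edges:
        key = (src // partition_size, dst // partition_size)
        buckets[key].append((src, dst))
    return dict(buckets)


edges = [(0, 5), (1, 2), (4, 0), (5, 5)]
grouped = bucket_edges(edges, num_nodes=6, num_partitions=2)
```

Keying on both bucket ids means each output group holds exactly the edges between one pair of node partitions, which is presumably why the write is partitioned on both columns.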

shivaram · 2022-11-07T18:57:55Z

This is great. Do we have any numbers on how much this improves pre-processing?

basavaraj29 · 2022-11-07T19:19:23Z

On the Freebase 86m dataset, the Spark preprocessor previously took ~70 minutes; it now takes ~10 minutes. I guess the id assignment for the edges (which we don't really need) was the major bottleneck. Also, we were previously triggering a repartition(1) on both the nodes and relations dataframes. We have now replaced that with Spark's library function zipWithIndex.
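
To illustrate why this helps, the following is a minimal pure-Python model of zipWithIndex-style id assignment: count the rows in each partition, turn the counts into starting offsets, then number each partition independently. It is a sketch of the general idea rather than the code in this PR or Spark's implementation; the function name `zip_with_index` and the list-of-lists stand-in for partitions are assumptions.

```python
from itertools import accumulate


def zip_with_index(partitions):
    """Assign contiguous ids across partitions without merging them."""
    # Pass 1: only the per-partition row counts are gathered centrally.
    counts = [len(part) for part in partitions]
    # An exclusive prefix sum gives each partition its starting id.
    offsets = [0] + list(accumulate(counts))[:-1]
    # Pass 2: each partition numbers its own rows independently.
    return [
        [(item, offset + i) for i, item in enumerate(part)]
        for part, offset in zip(partitions, offsets)
    ]


node_partitions = [["a", "b"], ["c"], [], ["d", "e"]]
indexed = zip_with_index(node_partitions)
```

In contrast, repartition(1) followed by a window function moves every row onto a single partition before numbering, so the id assignment cannot run in parallel.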

thodrek · 2022-11-07T20:54:23Z

Excellent work!

removing id assignment to edges, using zipwithindex instead of repati…

cbc9b0a

…tion(1) and windowing

basavaraj29 requested a review from JasonMoho November 7, 2022 18:54

fixed lint issues

dabdabf

partition by both src and dst bucket id

8e1a31f
