Add benchmark for PC and GES causal discovery algorithms on simulated Linear Gaussian data #7
Conversation
@ankurankan can you please review this?
@Vanshitaaa20 Sorry for the delay in reviewing this. I still haven't had a chance to look at it. In the meantime, I would suggest restructuring your code similar to the code that we currently have for the Conditional Independence test benchmarks. So, the script should write a CSV file with all the results, and the frontend should read in the CSV and make the plots. @Nimish-4 @RudraCodesForU Would any of you be interested in reviewing this?
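The suggested split could look like the following sketch for the reading side (the file path and column names are assumptions, not part of this PR):

```python
import csv
from collections import defaultdict

# Hypothetical results file written by the benchmark script.
RESULTS_CSV = "causalbench/results/causal_discovery.csv"

def summarize(path):
    """Read the benchmark CSV and average SHD per algorithm.

    Expected columns (hypothetical): trial, algorithm, shd.
    """
    totals, counts = defaultdict(float), defaultdict(int)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            totals[row["algorithm"]] += float(row["shd"])
            counts[row["algorithm"]] += 1
    return {algo: totals[algo] / counts[algo] for algo in totals}
```

The frontend would then feed these per-algorithm averages into its plotting code instead of re-running the benchmark.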
num_trials = 10
shd_pc_list = []
shd_ges_list = []
@Vanshitaaa20, add the algorithm equations to the script in docstring form, like in the benchmarking script.
from pgmpy.factors.continuous import LinearGaussianCPD
from pgmpy.models import LinearGaussianBayesianNetwork as LGBN
@Vanshitaaa20, add definitions of GES and PC in docstring form in the script.
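A module docstring along these lines would cover it (a sketch; the wording is illustrative, not the final text for the PR):

```python
"""Benchmark PC and GES causal discovery on simulated linear Gaussian data.

Algorithms
----------
PC: A constraint-based algorithm. It starts from a complete undirected
graph, removes edges using conditional independence tests, and then
orients the remaining edges to obtain a CPDAG.

GES (Greedy Equivalence Search): A score-based algorithm. It greedily
adds edges (forward phase) and then removes edges (backward phase) over
equivalence classes, maximizing a model score such as BIC.
"""
```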
added
@ankurankan, I will be reviewing it, since the benchmarking and web part had been coded by me.
print(" SHD (PC):", shd_pc)
print(" SHD (GES):", shd_ges)
Add a custom CSV export instead of print, storing the results in the causalbench/results folder. Perform these changes first. @Vanshitaaa20
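The print calls could be replaced by something like this sketch (the exact filename and column layout are assumptions):

```python
import csv
import os

def write_results(rows, out_dir="causalbench/results"):
    """Append benchmark rows to a CSV in the results folder.

    Each row is a dict with hypothetical keys: trial, algorithm, shd.
    """
    os.makedirs(out_dir, exist_ok=True)
    out_path = os.path.join(out_dir, "causal_discovery.csv")
    new_file = not os.path.exists(out_path)
    with open(out_path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["trial", "algorithm", "shd"])
        if new_file:
            writer.writeheader()  # header only on first write
        writer.writerows(rows)
    return out_path
```

For example, `write_results([{"trial": i, "algorithm": "PC", "shd": shd_pc}])` inside the trial loop.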
benchmarks/causal_discovery.py (Outdated)
dag.add_node(f"X_{i}")
return dag
def compute_shd_direct(true_dag, learned_dag) -> int:
We have an implementation of SHD at pgmpy.metrics.SHD.
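For intuition, SHD counts the edge additions, deletions, and reversals needed to turn one DAG into the other (conventions vary slightly; here a reversal counts as one operation). A minimal illustration over directed edge sets, not pgmpy's implementation:

```python
def shd(true_edges, learned_edges):
    """Structural Hamming Distance over directed edge sets.

    A reversed edge, (u, v) vs (v, u), counts once, not twice.
    """
    true_edges, learned_edges = set(true_edges), set(learned_edges)
    diff = true_edges ^ learned_edges          # edges present in only one graph
    reversals = {(u, v) for (u, v) in diff if (v, u) in diff}
    # Each reversal contributes two entries to `reversals`; count it once.
    return len(diff) - len(reversals) // 2
```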
@Vanshitaaa20 Any updates on this?
I replaced the custom SHD function with pgmpy.metrics.SHD as suggested. However, I'm getting an error during execution:
It seems that true_dag and learned_dag might have different sets of nodes. Should I align their node sets explicitly before computing SHD, or is there a preferred way to handle this?
@Vanshitaaa20 It could also potentially be a bug in the SHD method. Is it possible for you to print out the models that are resulting in the error? We can then run SHD on them and debug the issue.
Sure, I'll print the true_dag and learned_dag and share the ones causing the error so we can debug SHD on them directly. Will update soon.
@Vanshitaaa20 Make sure to print both the edge list and the node list, as printing only edges would leave out any disconnected nodes. Also, feel free to ask any questions on Discord if you get stuck anywhere.
@ankurankan Sorry for the delay! I updated the SHD benchmarking code to print both the edge list and node list for easier debugging and alignment verification, and also added a skip_trial mechanism so that any trial raising a ValueError is skipped gracefully without interrupting the benchmarking run.
benchmarks/causal_discovery.py (Outdated)
for i in range(num_nodes):
    dag.add_node(f"X_{i}")
Why do we need to add the nodes here? Doesn't the get_random method already give the DAG on the specified number of nodes?
benchmarks/causal_discovery.py (Outdated)
lgbn = LGBN(true_dag.edges())
lgbn.add_nodes_from(true_dag.nodes())
for node in true_dag.nodes():
    parents = list(lgbn.get_parents(node))
    beta = [0.0] + list(np.random.uniform(0.5, 1.5, size=len(parents)))
    cpd = LinearGaussianCPD(variable=node, beta=beta, std=1, evidence=parents)
    lgbn.add_cpds(cpd)
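For reference, the snippet above amounts to sampling from a linear Gaussian structural equation model: each node is its intercept plus a weighted sum of its parents plus unit-variance Gaussian noise. A numpy-only sketch of the same generative process (node names and coefficients are illustrative):

```python
import numpy as np

def simulate_linear_gaussian(parents, betas, n=1000, rng=None):
    """Ancestrally sample a linear Gaussian SEM.

    parents: dict node -> list of parent nodes (given in topological
             order here for simplicity).
    betas:   dict node -> (intercept, [one coefficient per parent]).
    """
    rng = np.random.default_rng(rng)
    data = {}
    for node, pa in parents.items():
        b0, coefs = betas[node]
        mean = b0 + sum(c * data[p] for c, p in zip(coefs, pa))
        data[node] = mean + rng.normal(0.0, 1.0, size=n)  # std = 1, as in the PR
    return data

# Example: X0 -> X1 with coefficient 1.0
samples = simulate_linear_gaussian(
    parents={"X0": [], "X1": ["X0"]},
    betas={"X0": (0.0, []), "X1": (0.0, [1.0])},
    n=5000,
    rng=0,
)
```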
LinearGaussianBayesianNetwork has a get_random method that should give a full randomly generated model.
benchmarks/causal_discovery.py (Outdated)
data = lgbn.simulate(n=1000)

# PC Estimation
try:
No need to do a try-except. Better to let it fail; it will help in detecting bugs.
benchmarks/causal_discovery.py (Outdated)
continue

# GES Estimation
try:
Same here.
benchmarks/causal_discovery.py (Outdated)
# Ensure node alignment
all_nodes = sorted(set(true_dag.nodes()).union(
    set(learned_dag_pc.nodes())).union(set(learned_dag_ges.nodes())))
true_dag.add_nodes_from(all_nodes)
learned_dag_pc.add_nodes_from(all_nodes)
learned_dag_ges.add_nodes_from(all_nodes)
Why is this required? Both the true model and the learned dag would have the same nodes/variables as in the dataset. Right?
benchmarks/causal_discovery.py (Outdated)
# Compute SHD using built-in method
try:
    shd_pc = SHD(true_dag, learned_dag_pc)
Try-except should only be used when we know in what situation the code will throw an error, and that is the expected behavior. Not required here.
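In generic Python terms, the guideline is: catch only the specific exception you expect and can handle, and let everything else propagate so bugs surface. An illustration (not code from this PR):

```python
def parse_port(value):
    """Return the port as int, falling back to a default only for the
    one failure mode we expect: a non-numeric string.
    """
    try:
        return int(value)
    except ValueError:  # expected: the user typed e.g. "abc"
        return 8080
    # Anything else (e.g. TypeError for None) propagates, exposing bugs.
```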
@Vanshitaaa20, like @ankurankan said, remove the try-except blocks and let the test cases fail, so that we can identify the potential causes and fix them.
@ankurankan @RudraCodesForU I previously tried fixing the error by explicitly aligning nodes across the graphs, but that didn't seem to help, so I removed that part. I also removed the try-except block so the actual error shows up directly. Could you please help me identify the real cause of this issue?
I noticed that SHD throws an error. For example, in one trial:
@Vanshitaaa20 The error is because in Line 52 and Line 57, you are constructing
This PR introduces a new benchmark to evaluate the performance of PC and GES structure learning algorithms on Linear Gaussian Bayesian networks.