I think there's a bug in the IPCW weights used for Uno's C-index.
In pillar/datasets/nlst_utils.py, get_censoring_dist does:
kmf.fit(times, event_observed)
where event_observed = d["y"] (1 = cancer within follow-up, 0 = no cancer). This fits the KM to the disease process, so the resulting curve is the event survival S(t), not the censoring survival G(t).
Then in pillar/metrics/survival.py, _handle_pairs uses it as:
weight = 1.0 / (censoring_dist[str(int(truth_time))] ** 2)
which gives 1/S(t)^2 instead of the 1/G(t)^2 that Uno's formula calls for.
I believe the fix is just flipping the event indicator in the KM fit to event_observed = 1 - d["y"] so it estimates the censoring process instead.
I noticed this while comparing dev vs test AUCs across years: the gap grows with follow-up time, which pointed at the monitor metric as a source of bias toward longer-horizon risk. That said, in a screening cohort like NLST where cancer is rare, S(t) stays high (~0.95) while G(t) likely drops faster, so the fix would probably increase late-year weights rather than decrease them. The fix is still correct per Uno, but the magnitude shift seems worth flagging. As far as I can tell, this code was adapted directly from Sybil, which has the same behavior.
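To make the S(t)-vs-G(t) point concrete, here is a minimal hand-rolled product-limit sketch (not the repo's lifelines `kmf`; the cohort data is made up) showing that fitting KM on d["y"] tracks event survival S(t), fitting on the flipped indicator 1 - d["y"] tracks censoring survival G(t), and that with rare events 1/G(t)^2 exceeds 1/S(t)^2 as claimed:

```python
import numpy as np

def km_survival(times, events):
    """Minimal Kaplan-Meier product-limit estimator.

    Returns {t: S(t)} at each unique time, where `events` marks which
    subjects experienced the modeled event (1) vs. were censored (0).
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    surv, s = {}, 1.0
    for t in np.unique(times):
        at_risk = np.sum(times >= t)               # still under observation at t
        d = np.sum((times == t) & (events == 1))   # modeled events at t
        s *= 1.0 - d / at_risk
        surv[t] = s
    return surv

# Toy screening-like cohort: 10 subjects, 1 cancer (rare event),
# everyone else censored at staggered follow-up times.
times = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
y     = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]    # 1 = cancer observed (d["y"])

S = km_survival(times, y)                   # current code: event survival
G = km_survival(times, [1 - e for e in y])  # proposed fix: censoring survival

t_event = 2                                 # the one observed event time
print(f"S({t_event}) = {S[t_event]:.3f}")   # stays high: 0.875
print(f"G({t_event}) = {G[t_event]:.3f}")   # drops faster: 0.700
print(f"1/S^2 = {1 / S[t_event] ** 2:.3f}, 1/G^2 = {1 / G[t_event] ** 2:.3f}")
```

(Uno's estimator technically uses the left limit G(t-) in the denominator; this sketch ignores that detail since the point is only the S-vs-G swap and the direction of the weight change.)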
Am I reading this right or is the current implementation intentional?