Closed
Description of the bug
I am trying to build an image for the latest Boltz version (2.2.1). To do so, I updated the Dockerfile as follows:
FROM python:3.12-slim
LABEL authors="Ziad Al-Bkhetan <ziad.albkhetan@gmail.com>" \
      title="nfcore/proteinfold_boltz" \
      Version="1.2.0dev" \
      description="Docker image containing all software requirements to run boltz using the nf-core/proteinfold pipeline"
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
        build-essential \
        procps \
    && apt-get autoremove -y \
    && rm -rf /var/lib/apt/lists/*
# triton is pinned to 3.3.0 because the error message said to make sure 3.3.0 is used
RUN pip install --no-cache-dir \
    boltz==2.2.1 \
    triton==3.3.0 \
    cuequivariance_ops_cu12==0.7.0 \
    cuequivariance_ops_torch_cu12==0.7.0 \
    cuequivariance_torch==0.7.0
The cuequivariance_* and triton packages are installed explicitly because Boltz complains otherwise.
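To confirm that the image really contains the pinned versions, a check along these lines could be run inside the container. This is only a sketch: the `PINS` mapping mirrors the Dockerfile above, `find_mismatches` is a hypothetical helper name, and it assumes the distributions can be looked up under these names through `importlib.metadata`.

```python
from importlib import metadata

# Pins copied from the Dockerfile above.
PINS = {
    "boltz": "2.2.1",
    "triton": "3.3.0",
    "cuequivariance_ops_cu12": "0.7.0",
    "cuequivariance_ops_torch_cu12": "0.7.0",
    "cuequivariance_torch": "0.7.0",
}


def find_mismatches(pins, get_version=metadata.version):
    """Return {name: (expected, found)} for every pin that is not satisfied.

    `found` is None when the distribution is not installed at all.
    """
    mismatches = {}
    for name, expected in pins.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    # Prints nothing when every pin matches the installed version.
    for name, (expected, found) in find_mismatches(PINS).items():
        print(f"{name}: expected {expected}, found {found}")
```

An empty result would suggest the problem is not a version drift inside the image, which points back at the GPU/driver side.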
When Boltz is run, it throws this exception:
^^^^^^
File "/usr/local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/boltz/model/layers/pairformer.py", line 243, in forward
z = z + dropout * self.tri_mul_out(
^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/boltz/model/layers/triangular_mult.py", line 92, in forward
return kernel_triangular_mult(
^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 838, in _fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/boltz/model/layers/triangular_mult.py", line 22, in kernel_triangular_mult
from cuequivariance_torch.primitives.triangle import triangle_multiplicative_update
File "/usr/local/lib/python3.12/site-packages/cuequivariance_torch/__init__.py", line 26, in <module>
from .primitives.transpose import TransposeSegments, TransposeIrrepsLayout
File "/usr/local/lib/python3.12/site-packages/cuequivariance_torch/primitives/transpose.py", line 183, in <module>
from cuequivariance_ops_torch import segmented_transpose
File "/usr/local/lib/python3.12/site-packages/cuequivariance_ops_torch/__init__.py", line 39, in <module>
from cuequivariance_ops_torch.fused_layer_norm_torch import layer_norm_transpose
File "/usr/local/lib/python3.12/site-packages/cuequivariance_ops_torch/fused_layer_norm_torch.py", line 17, in <module>
from cuequivariance_ops.triton import (
File "/usr/local/lib/python3.12/site-packages/cuequivariance_ops/triton/__init__.py", line 24, in <module>
from .tuning_decorator import autotune_aot
File "/usr/local/lib/python3.12/site-packages/cuequivariance_ops/triton/tuning_decorator.py", line 17, in <module>
from .cache_manager import get_cache_manager
File "/usr/local/lib/python3.12/site-packages/cuequivariance_ops/triton/cache_manager.py", line 255, in <module>
cache_manager = CacheManager()
^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/cuequivariance_ops/triton/cache_manager.py", line 110, in __init__
self.gpu_information = get_gpu_information()
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/cuequivariance_ops/triton/cache_manager.py", line 71, in get_gpu_information
gpu_core_count = pynvml.nvmlDeviceGetNumGpuCores(handle)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/pynvml.py", line 5872, in nvmlDeviceGetNumGpuCores
_nvmlCheckReturn(ret)
File "/usr/local/lib/python3.12/site-packages/pynvml.py", line 1061, in _nvmlCheckReturn
raise NVMLError(ret)
pynvml.NVMLError_Unknown: Unknown Error
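The traceback shows the error being raised while `cuequivariance_torch` is imported (the import chain instantiates `CacheManager`, which queries NVML), so it should be reproducible without running a Boltz prediction at all. A minimal sketch, assuming the same container started with `--nv`; `try_import` is a hypothetical helper and not part of any of these packages:

```python
import importlib


def try_import(module_name):
    """Import a module by name and report the outcome instead of raising.

    Returns (True, None) on success, or (False, "<ExceptionType>: <message>")
    when the import itself raises.
    """
    try:
        importlib.import_module(module_name)
    except Exception as exc:  # in the traceback, NVMLError is raised at import time
        return False, f"{type(exc).__name__}: {exc}"
    return True, None


if __name__ == "__main__":
    # Module names taken from the traceback, innermost first.
    for name in (
        "pynvml",
        "cuequivariance_ops.triton",
        "cuequivariance_ops_torch",
        "cuequivariance_torch",
    ):
        ok, error = try_import(name)
        print(f"{name}: {'ok' if ok else error}")
```

The first module in this list that fails would narrow down where the NVML query is triggered.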
The drivers installed on the HPC system are:
NVIDIA-SMI 575.57.08 Driver Version: 575.57.08 CUDA Version: 12.9
Steps to reproduce the error
apptainer pull --disable-cache --name quay.io-nf-core-proteinfold_boltz-test.img docker://quay.io/nf-core/proteinfold_boltz:test > /dev/null
apptainer shell --nv /users/cn/jespinosa/nxf_singuarlity_cachedir/quay.io-nf-core-proteinfold_boltz-test.img
python - << 'EOF'
import pynvml
pynvml.nvmlInit()
h = pynvml.nvmlDeviceGetHandleByIndex(0)
print("Name:", pynvml.nvmlDeviceGetName(h))
print("SM cores:", pynvml.nvmlDeviceGetNumGpuCores(h))
EOF
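A slightly extended variant of the script above could show whether only `nvmlDeviceGetNumGpuCores` fails or whether other NVML queries fail as well. This is a sketch, not a verified diagnosis: `probe_nvml` is a hypothetical helper that takes the NVML module as an argument, and besides the calls already used above it relies only on `nvmlSystemGetDriverVersion` and `nvmlShutdown`.

```python
def probe_nvml(nvml, index=0):
    """Call several NVML queries one by one and record each outcome.

    `nvml` is the imported pynvml module (or any object exposing the same
    functions). Returns a dict mapping a label to the query's value, or to
    the string "ERROR <ExceptionType>: <message>" if that query raised.
    """
    results = {}

    def attempt(label, func, *args):
        """Run one query; record its value or error and return True on success."""
        try:
            results[label] = func(*args)
        except Exception as exc:
            results[label] = f"ERROR {type(exc).__name__}: {exc}"
            return False
        return True

    if not attempt("init", nvml.nvmlInit):
        return results
    try:
        attempt("driver_version", nvml.nvmlSystemGetDriverVersion)
        if attempt("handle", nvml.nvmlDeviceGetHandleByIndex, index):
            handle = results.pop("handle")  # opaque handle, not worth printing
            attempt("name", nvml.nvmlDeviceGetName, handle)
            attempt("num_gpu_cores", nvml.nvmlDeviceGetNumGpuCores, handle)
    finally:
        nvml.nvmlShutdown()
    return results


if __name__ == "__main__":
    try:
        import pynvml
    except ImportError:
        print("pynvml is not installed")
    else:
        for label, value in probe_nvml(pynvml).items():
            print(f"{label}: {value}")
```

Running it both inside the container and directly on the host (if `pynvml` is available there) would help tell a container issue apart from a driver issue.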
Labels
bug (Something isn't working)