Cylon container fails #679

@AymenFJA

Hello @nirandaperera and Cylon team,

I was testing the Cylon container with Kubernetes on AWS, on a cluster with a multi-node MPI environment.

I tested Cylon on 1 and 2 nodes (each node has 128 cores and 16 GB of memory per core, 2048 GB per node in total). Both runs worked fine when executing a join operation on ~35M rows with the following script: https://github.com/cylondata/cylon/blob/main/summit/scripts/cylon_scaling.py

The command line that I used:

mpirun -n 256 cylon_scaling.py -s w -n 35000000

I then repeated the same setup with 3 and 4 nodes:

mpirun -n 384 cylon_scaling.py -s w -n 35000000
mpirun -n 512 cylon_scaling.py -s w -n 35000000
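
The rank counts in these commands are the node count multiplied by the 128 cores per node. A minimal sketch of that arithmetic, assuming one MPI rank per core; the helper name `mpirun_command` is hypothetical and not part of Cylon or Open MPI:

```python
import shlex

CORES_PER_NODE = 128   # from the issue: each node has 128 cores
ROWS = 35_000_000      # from the issue: ~35M rows


def mpirun_command(nodes, cores_per_node=CORES_PER_NODE, rows=ROWS):
    """Build the mpirun invocation, assuming one rank per core (hypothetical helper)."""
    ranks = nodes * cores_per_node
    argv = ["mpirun", "-n", str(ranks),
            "cylon_scaling.py", "-s", "w", "-n", str(rows)]
    return shlex.join(argv)


if __name__ == "__main__":
    # Print the command for each node count mentioned in the issue.
    for nodes in (2, 3, 4):
        print(f"{nodes} nodes: {mpirun_command(nodes)}")
```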

And I started getting the following error:

[cylon-join-worker-1][[60663,1],135][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],135][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],135][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],135][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],135][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],135][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],135][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],135][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],135][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],135][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],135][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],135][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],135][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],135][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],135][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],135][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],135][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],135][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],135][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],135][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],135][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],135][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],135][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) fail
[cylon-join-worker-1][[60663,1],146][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],146][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],146][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],146][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],146][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],146][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],146][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],146][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],146][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(258) failed: Bad file descriptor (9)
[cylon-join-worker-1:17259] *** Process received signal ***
[cylon-join-worker-1:17259] Signal: Segmentation fault (11)
[cylon-join-worker-1:17259] Signal code: Address not mapped (1)
[cylon-join-worker-1:17259] Failing at address: (nil)
[cylon-join-worker-1:17259] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7fe751c04090]
[cylon-join-worker-1:17259] *** End of error message ***

[cylon-join-worker-1:17186] *** Process received signal ***
[cylon-join-worker-1:17186] Signal: Segmentation fault (11)
[cylon-join-worker-1:17186] Signal code: Address not mapped (1)
[cylon-join-worker-1:17186] Failing at address: 0x18
[cylon-join-worker-1:17186] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7f50e0825090]
[cylon-join-worker-1:17186] [ 1] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_btl_tcp.so(mca_btl_tcp_endpoint_send+0x609)[0x7f50dbcdbfa9]
[cylon-join-worker-1:17186] [ 2] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pml_ob1.so(mca_pml_ob1_send_request_start_copy+0x1a1)[0x7f50db6d8bc1]
[cylon-join-worker-1:17186] [ 3] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pml_ob1.so(mca_pml_ob1_isend+0x482)[0x7f50db6ca3a2]
[cylon-join-worker-1:17186] [ 4] /lib/x86_64-linux-gnu/libmpi.so.40(MPI_Isend+0x12d)[0x7f50dd28893d]
[cylon-join-worker-1:17186] [ 5] /cylon/build/lib/libcylon.so.0.6.0(_ZN5cylon10MPIChannel13progressSendsEv+0x15a)[0x7f5025e8703a]
[cylon-join-worker-1:17186] [ 6] /cylon/build/lib/libcylon.so.0.6.0(_ZN5cylon8AllToAll10isCompleteEv+0x233)[0x7f5025e8d103]
[cylon-join-worker-1:17186] [ 7] /cylon/build/lib/libcylon.so.0.6.0(_ZN5cylon13ArrowAllToAll10isCompleteEv+0x76a)[0x7f5025bb069a]
[cylon-join-worker-1:17186] [ 8] /cylon/build/lib/libcylon.so.0.6.0(+0x4ed43c)[0x7f5025ecc43c]
[cylon-join-worker-1:17186] [ 9] /cylon/build/lib/libcylon.so.0.6.0(+0x4ee323)[0x7f5025ecd323]
[cylon-join-worker-1:17186] [10] /cylon/build/lib/libcylon.so.0.6.0(_ZN5cylon15DistributedJoinERKSt10shared_ptrINS_5TableEES4_RKNS_4join6config10JoinConfigERS2_+0x8a)[0x7f5025ece00a]
[cylon-join-worker-1:17186] [11] /cylon/ENV/lib/python3.8/site-packages/pycylon-0+untagged.1302.g44a27a6-py3.8-linux-x86_64.egg/pycylon/data/table.cpython-38-x86_64-linux-gnu.so(+0x75c02)[0x7f50db67ec02]
[cylon-join-worker-1:17186] [12] /cylon/ENV/bin/python3(PyCFunction_Call+0x59)[0x5f6939]
[cylon-join-worker-1:17186] [13] /cylon/ENV/bin/python3(_PyObject_MakeTpCall+0x296)[0x5f7506]
[cylon-join-worker-1:17186] [14] /cylon/ENV/bin/python3(_PyEval_EvalFrameDefault+0x6259)[0x571019]
[cylon-join-workerl address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-launcher:00001] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[cylon-join-launcher:00001] 24 more processes have sent help message help-mpi-btl-tcp.txt / socket flag fail
[cylon-join-launcher:00001] 93 more processes have sent help message help-mpi-btl-tcp.txt / peer hung up
[cylon-join-launcher:00001] 11 more processes have sent help message help-mpi-btl-tcp.txt / client connect fail
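
One possible reading of the `bind on local address (192.168.99.12:0) failed: Address already in use (98)` lines, offered as a hypothesis and not a confirmed diagnosis: binding to port 0 asks the kernel for an ephemeral port, so the failure may mean the worker's IP has run out of them. The sketch below estimates an upper bound on cross-node TCP connections per node for an all-to-all exchange and compares it with the ephemeral port range. It assumes one rank per core, that all 128 ranks on a node share one IP address, that only cross-node pairs use TCP, and the common Linux default range of 32768 to 60999; the actual range on the workers should be read from `/proc/sys/net/ipv4/ip_local_port_range`. All function names are hypothetical.

```python
def parse_port_range(text):
    """Parse the contents of /proc/sys/net/ipv4/ip_local_port_range ("low high")."""
    low, high = (int(field) for field in text.split())
    return low, high


def ephemeral_ports(low, high):
    """Number of ports in the inclusive range [low, high]."""
    return high - low + 1


def cross_node_pairs(nodes, ranks_per_node=128):
    """Upper bound on rank pairs with exactly one endpoint on a given node.

    Assumes every local rank talks to every remote rank (all-to-all).
    """
    remote_ranks = (nodes - 1) * ranks_per_node
    return ranks_per_node * remote_ranks


if __name__ == "__main__":
    # Assumed default; replace with the real file contents from a worker pod.
    low, high = parse_port_range("32768\t60999\n")
    available = ephemeral_ports(low, high)
    for nodes in (2, 3, 4):
        needed = cross_node_pairs(nodes)
        verdict = "fits within" if needed <= available else "exceeds"
        print(f"{nodes} nodes: up to {needed} connections per node, "
              f"{verdict} {available} ephemeral ports")
```

Under these assumptions the 2-node run stays below the port range while the 3-node and 4-node runs can exceed it, which matches where the failures start. The real number of ports consumed depends on which side initiates each connection, so this is only a rough bound.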

Any help here would be appreciated.
