mpirun -n 256 cylon_scaling.py -s w -n 35000000
mpirun -n 384 cylon_scaling.py -s w -n 35000000
mpirun -n 512 cylon_scaling.py -s w -n 35000000
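The rank counts in these commands (256, 384, 512) are consistent with 2, 3, and 4 nodes at 128 ranks per node. The sketch below rebuilds the same command lines from a node count. It assumes one MPI rank per core; `mpirun_command` is a hypothetical helper written for this report, not part of Cylon or Open MPI.

```python
# Assumption: one MPI rank per core, with 128 cores on each node.
CORES_PER_NODE = 128
ROWS = 35_000_000


def mpirun_command(nodes, cores_per_node=CORES_PER_NODE, rows=ROWS):
    """Build the mpirun command line for a given number of nodes."""
    ranks = nodes * cores_per_node
    return f"mpirun -n {ranks} cylon_scaling.py -s w -n {rows}"


# Node counts matching the three commands above.
commands = [mpirun_command(n) for n in (2, 3, 4)]
for command in commands:
    print(command)
```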
[cylon-join-worker-1][[60663,1],135][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],135][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],135][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],135][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],135][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],135][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],135][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],135][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],135][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],135][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],135][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],135][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],135][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],135][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],135][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],135][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],135][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],135][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],135][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],135][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],135][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],135][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],135][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) fail
[cylon-join-worker-1][[60663,1],146][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],146][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],146][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],146][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],146][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],146][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],146][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],146][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],146][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(258) failed: Bad file descriptor (9)
[cylon-join-worker-1:17259] *** Process received signal ***
[cylon-join-worker-1:17259] Signal: Segmentation fault (11)
[cylon-join-worker-1:17259] Signal code: Address not mapped (1)
[cylon-join-worker-1:17259] Failing at address: (nil)
[cylon-join-worker-1:17259] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7fe751c04090]
[cylon-join-worker-1:17259] *** End of error message ***
[cylon-join-worker-1:17186] *** Process received signal ***
[cylon-join-worker-1:17186] Signal: Segmentation fault (11)
[cylon-join-worker-1:17186] Signal code: Address not mapped (1)
[cylon-join-worker-1:17186] Failing at address: 0x18
[cylon-join-worker-1:17186] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7f50e0825090]
[cylon-join-worker-1:17186] [ 1] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_btl_tcp.so(mca_btl_tcp_endpoint_send+0x609)[0x7f50dbcdbfa9]
[cylon-join-worker-1:17186] [ 2] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pml_ob1.so(mca_pml_ob1_send_request_start_copy+0x1a1)[0x7f50db6d8bc1]
[cylon-join-worker-1:17186] [ 3] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pml_ob1.so(mca_pml_ob1_isend+0x482)[0x7f50db6ca3a2]
[cylon-join-worker-1:17186] [ 4] /lib/x86_64-linux-gnu/libmpi.so.40(MPI_Isend+0x12d)[0x7f50dd28893d]
[cylon-join-worker-1:17186] [ 5] /cylon/build/lib/libcylon.so.0.6.0(_ZN5cylon10MPIChannel13progressSendsEv+0x15a)[0x7f5025e8703a]
[cylon-join-worker-1:17186] [ 6] /cylon/build/lib/libcylon.so.0.6.0(_ZN5cylon8AllToAll10isCompleteEv+0x233)[0x7f5025e8d103]
[cylon-join-worker-1:17186] [ 7] /cylon/build/lib/libcylon.so.0.6.0(_ZN5cylon13ArrowAllToAll10isCompleteEv+0x76a)[0x7f5025bb069a]
[cylon-join-worker-1:17186] [ 8] /cylon/build/lib/libcylon.so.0.6.0(+0x4ed43c)[0x7f5025ecc43c]
[cylon-join-worker-1:17186] [ 9] /cylon/build/lib/libcylon.so.0.6.0(+0x4ee323)[0x7f5025ecd323]
[cylon-join-worker-1:17186] [10] /cylon/build/lib/libcylon.so.0.6.0(_ZN5cylon15DistributedJoinERKSt10shared_ptrINS_5TableEES4_RKNS_4join6config10JoinConfigERS2_+0x8a)[0x7f5025ece00a]
[cylon-join-worker-1:17186] [11] /cylon/ENV/lib/python3.8/site-packages/pycylon-0+untagged.1302.g44a27a6-py3.8-linux-x86_64.egg/pycylon/data/table.cpython-38-x86_64-linux-gnu.so(+0x75c02)[0x7f50db67ec02]
[cylon-join-worker-1:17186] [12] /cylon/ENV/bin/python3(PyCFunction_Call+0x59)[0x5f6939]
[cylon-join-worker-1:17186] [13] /cylon/ENV/bin/python3(_PyObject_MakeTpCall+0x296)[0x5f7506]
[cylon-join-worker-1:17186] [14] /cylon/ENV/bin/python3(_PyEval_EvalFrameDefault+0x6259)[0x571019]
[cylon-join-workerl address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-launcher:00001] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[cylon-join-launcher:00001] 24 more processes have sent help message help-mpi-btl-tcp.txt / socket flag fail
[cylon-join-launcher:00001] 93 more processes have sent help message help-mpi-btl-tcp.txt / peer hung up
[cylon-join-launcher:00001] 11 more processes have sent help message help-mpi-btl-tcp.txt / client connect fail
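Because output from many ranks is interleaved in the log above, it can help to tally which ranks hit which Open MPI TCP BTL code path. The following is a minimal sketch: `summarize` and its regular expression are illustrative helpers written for this report, and the sample lines are copied from the log above.

```python
import re
from collections import Counter

# Matches prefixes such as:
#   [host][[60663,1],135][btl_tcp_endpoint.c:730:function_name]
# finditer is used so that two messages fused on one line are both counted.
PREFIX_RE = re.compile(
    r"\[([^\[\]]+)\]\[\[(\d+),(\d+)\],(\d+)\]\[([^\[\]:]+):(\d+):([^\[\]]+)\]"
)


def summarize(log_text):
    """Count occurrences of each (rank, function) pair in the log text."""
    counts = Counter()
    for match in PREFIX_RE.finditer(log_text):
        rank = int(match.group(4))
        function = match.group(7)
        counts[(rank, function)] += 1
    return counts


sample = (
    "[cylon-join-worker-1][[60663,1],135][btl_tcp_endpoint.c:730:"
    "mca_btl_tcp_endpoint_start_connect] bind on local address "
    "(192.168.99.12:0) failed: Address already in use (98)\n"
    "[cylon-join-worker-1][[60663,1],146][btl_tcp_endpoint.c:730:"
    "mca_btl_tcp_endpoint_start_connect] bind on local address "
    "(192.168.99.12:0) failed: Address already in use (98) "
    "[cylon-join-worker-1][[60663,1],146][btl_tcp.c:559:"
    "mca_btl_tcp_recv_blocking] recv(258) failed: Bad file descriptor (9)\n"
)

summary = summarize(sample)
for (rank, function), count in sorted(summary.items()):
    print(f"rank {rank}: {function} x{count}")
```

Running it over the full captured output instead of `sample` would show whether the bind failures are confined to a few ranks or spread across the whole worker pod.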
Hello @nirandaperera and Cylon team,
I was testing the Cylon container with Kubernetes on AWS. I have a multi-node MPI environment set up on the cluster.
I tested Cylon with 1 and 2 nodes (each node has 128 cores and 16 GB of memory per core, for a total of 2048 GB per node). Both runs worked fine when executing
a join operation with ~35M rows using the following script: https://github.com/cylondata/cylon/blob/main/summit/scripts/cylon_scaling.py. The command line that I used:
I repeated the same setup but this time with 3 or 4 nodes:
And I started getting the following error:
Any help here would be appreciated.