I set up two A8 VMs in an availability set running SLES-HPC 12 (following the tutorial here: https://azure.microsoft.com/en-us/documentation/articles/virtual-machines-linux-cluster-rdma/).
When I run the Intel MPI pingpong test, I get DAPL errors:
azureUser@sshvm0:~> /opt/intel/impi/5.0.3.048/bin64/mpirun -hosts 10.0.0.4,10.0.0.5 -ppn 1 -n 2 -env I_MPI_FABRICS=shm:dapl -env I_MPI_DYNAMIC_CONNECTION=0 -env I_MPI_DAPL_PROVIDER=ofa-v2-ib0 /opt/intel/impi/5.0.3.048/bin64/IMB-MPI1 pingpong
sshvm1:d28:bef0eb40: 12930 us(12930 us): dapl_rdma_accept: ERR -1 Input/output error
sshvm1:d28:bef0eb40: 12946 us(16 us): DAPL ERR accept Input/output error
[1:10.0.0.5][../../src/mpid/ch3/channels/nemesis/netmod/dapl/dapl_conn_rc.c:622] error(0x40000): ofa-v2-ib0: could not accept DAPL connection request: DAT_INTERNAL_ERROR()
Assertion failed in file ../../src/mpid/ch3/channels/nemesis/netmod/dapl/dapl_conn_rc.c at line 622: 0
internal ABORT - process 0
Similar errors when running one of the OSU MPI benchmarks:
azureUser@sshvm0:~> /opt/intel/impi/5.0.3.048/bin64/mpirun -hosts 10.0.0.4,10.0.0.5 -ppn 1 -n 2 -env I_MPI_FABRICS=shm:dapl -env I_MPI_DYNAMIC_CONNECTION=0 -env I_MPI_DAPL_PROVIDER=ofa-v2-ib0 /opt/intel/impi/5.0.3.048/bin64/IMB-MPI1 pingpong
sshvm1:d28:bef0eb40: 12930 us(12930 us): dapl_rdma_accept: ERR -1 Input/output error
sshvm1:d28:bef0eb40: 12946 us(16 us): DAPL ERR accept Input/output error
[1:10.0.0.5][../../src/mpid/ch3/channels/nemesis/netmod/dapl/dapl_conn_rc.c:622] error(0x40000): ofa-v2-ib0: could not accept DAPL connection request: DAT_INTERNAL_ERROR()
Assertion failed in file ../../src/mpid/ch3/channels/nemesis/netmod/dapl/dapl_conn_rc.c at line 622: 0
internal ABORT - process 0
Any tips on how to start debugging this? Thanks.