cuD-PDLP#1391
Conversation
…he cycle seems to be fixed, cuopt compiles
+ style too
compiles and runs
|
/ok to test 7b6f96a |
|
/ok to test b7d4d91 |
|
/ok to test 89c8878 |
|
/ok to test 2fc3add |
| {CUOPT_MIP_OBJECTIVE_STEP, &mip_settings.objective_step, 0, 1, 1}, | ||
| {CUOPT_NUM_GPUS, &pdlp_settings.num_gpus, 1, 2, 1}, | ||
| {CUOPT_NUM_GPUS, &mip_settings.num_gpus, 1, 2, 1}, | ||
| {CUOPT_DISTRIBUTED_PDLP_NUM_GPUS, &pdlp_settings.distributed_pdlp_num_gpus, -1, 576, -1}, |
There was a problem hiding this comment.
Wont the setup break if we call with 576 GPUs in practice since we only support single process?
| {CUOPT_PRESOLVE_FILE, &mip_settings.presolve_file, ""}, | ||
| {CUOPT_PRESOLVE_FILE, &pdlp_settings.presolve_file, ""} | ||
| {CUOPT_PRESOLVE_FILE, &pdlp_settings.presolve_file, ""}, | ||
| {CUOPT_MULTI_GPU_PARTITION_FILE, &pdlp_settings.multi_gpu_partition_file, ""}, |
There was a problem hiding this comment.
Can you remind me what is the use case for this? Was it just for debugging or are we confident this will be useful in the future?
|
|
||
| // 3. Construct one shard per rank, pinned to its device. Ownership of each | ||
| // communicator moves into its shard. | ||
| CUOPT_LOG_INFO("distributed_pdlp: building %d shard solver(s) ...", nb_parts); |
There was a problem hiding this comment.
We need to be careful with every thing we decide to log in the final product, could you please share what a full run with logs look like?
| devices[r], std::move(rank_data[r]), std::move(comms[r]), mps, sub_solver_settings)); | ||
| } | ||
| auto shard_build_t1 = std::chrono::high_resolution_clock::now(); | ||
| CUOPT_LOG_INFO("distributed_pdlp: shard build done in %.3f s", |
| // Step 2: a single NCCL group with matched ncclSend / ncclRecv across all | ||
| // (rank, peer) pairs, receiving into each shard's halo region. | ||
| template <typename ShardBufAccess> | ||
| void halo_exchange_var_shard(ShardBufAccess&& buf_access) |
There was a problem hiding this comment.
Commenting it here but this is general to all functions in this file: if you have a .cu version of this file, was there a specific reason regarding putting the implementation in the hpp rather than in the cu directly?
| template <typename i_t, typename f_t> | ||
| std::vector<i_t> partition_loader_t<i_t, f_t>::parse_distributed_pdlp_partition_file( | ||
| std::string const& file) | ||
| { |
There was a problem hiding this comment.
Same comment as on the parameter: will we need to keep this partition file logic in the final product?
| namespace cuopt::mathematical_optimization::pdlp { | ||
|
|
||
| template <typename i_t, typename f_t> | ||
| std::vector<i_t> dummy_partitioner_t<i_t, f_t>::partition( |
There was a problem hiding this comment.
In practice I don't think this is used anywhere in the code so unless you want to keep it for future reference/test/benchmark I think it should be removed
| "kaminpar_partitioner: A_t.row_offsets size mismatch (expected nb_vars+1)"); | ||
| cuopt_expects(A_cols.size() == A_t_cols.size(), | ||
| error_type_t::ValidationError, | ||
| "kaminpar_partitioner: A and A_t nnz mismatch"); |
There was a problem hiding this comment.
any requirement regarding the sorting of the cols? If yes that should be checked
| auto t1 = std::chrono::high_resolution_clock::now(); | ||
| const double dt = std::chrono::duration<double>(t1 - t0).count(); | ||
|
|
||
| CUOPT_LOG_INFO( |
There was a problem hiding this comment.
same remark regarding logging
| namespace cuopt::mathematical_optimization::pdlp { | ||
|
|
||
| // Non-owning view of a host CSR matrix (A or A_t). | ||
| template <typename i_t, typename f_t> |
There was a problem hiding this comment.
if this is non owning was there a reason to not use span instead of vector pointers?
| // Store as std::unique_ptr in any container. | ||
|
|
||
| int device_id; | ||
| rmm::cuda_stream stream; |
There was a problem hiding this comment.
Any reason why to explicetly create an rmm::cuda steram here while there is one in the raft handler?
|
|
||
| for (i_t i = 0; i < rank_data.owned_var_size; ++i) { | ||
| const auto g = rank_data.local_to_global_var[i]; | ||
| h_obj[i] = maximize ? -g_obj[g] : g_obj[g]; |
There was a problem hiding this comment.
should this maximization handling be done here and on at each problem level? are we currently testing if distributed correctly works on maximization problem?
|
|
||
| // Inject this shard's unscaled buffers into op_problem_scaled (distributed | ||
| // scaling runs later and will scale them). | ||
| auto& scaled = sub_pdlp->get_op_problem_scaled(); |
There was a problem hiding this comment.
Wont pdlp solver already fill those through its constructor?
| rmm::device_uvector<i_t> idx(send_to_peer.size(), stream_view); | ||
| rmm::device_uvector<f_t> buf(send_to_peer.size(), stream_view); | ||
| if (!send_to_peer.empty()) { | ||
| raft::copy(idx.data(), send_to_peer.data(), send_to_peer.size(), stream_view); |
There was a problem hiding this comment.
You can't start an async copy on local rmm::device_uvector then go out of scope. See my linked slack conversation by message
| } | ||
|
|
||
| // Row inf-norm of the scaled matrix, over the row-major matrix: each row is | ||
| // reduced from its own nonzeros. (Owns the complete row in distributed PDLP.) |
There was a problem hiding this comment.
we have discussed this via slack messages: we should document why we now use two kernels, one for rows and one for cols (not a big cost for single gpu but major time same for distributed)
| : (kind == partitioner_kind_t::KaMinPar) ? "kaminpar" | ||
| : "unknown"; | ||
| CUOPT_LOG_INFO( | ||
| "Partitioning %d constraints + %d variables into %d part(s) using the %s " |
|
/ok to test b85d44c |
Implemented metis-partitionned multi-GPU PDLP.
To run PDLP using multi-GPU run :
./cpp/build/cuopt_cli ../path/to/file.mps --method 1 --use-distributed-pdlp true --presolve 0, the exact number of GPUs used can be set with--distributed-pdlp-num-gpus nAll benchmarking results against D-PDLP and single GPU CuOpt can be found in this spreadsheet
Here is the bottom line of the results
On 8 NVLINKed B200 :
against CuOpt :
against D-PDLP
to note: the speedups against D-PDLP are computed with NVLS_SHARP=0 disabling a feature that could give them a speedup from 1.1x to 1.75x I am looking with the compute-lab team to make it work
closes #891