Skip to content

cuD-PDLP#1391

Open
Bubullzz wants to merge 157 commits into
NVIDIA:mainfrom
Bubullzz:cuD-PDLP
Open

cuD-PDLP#1391
Bubullzz wants to merge 157 commits into
NVIDIA:mainfrom
Bubullzz:cuD-PDLP

Conversation

@Bubullzz

@Bubullzz Bubullzz commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Implemented metis-partitionned multi-GPU PDLP.

To run PDLP using multi-GPU run :
./cpp/build/cuopt_cli ../path/to/file.mps --method 1 --use-distributed-pdlp true --presolve 0, the exact number of GPUs used can be set with --distributed-pdlp-num-gpus n

All benchmarking results against D-PDLP and single GPU CuOpt can be found in this spreadsheet

Here is the bottom line of the results
On 8 NVLINKed B200 :

against CuOpt :

  • speedup : at least 2.5x and up to 7.08x (tsp-gaia-10m.mps)
  • memory footprint : ~8x on most instances

against D-PDLP

  • speedup: slower on most instances but faster on the bigger ones (psr_100, tsp-gaia-10m, ELMOD_876_10_noVEname). getting up to a 2x speedup on ELMOD_876_10_noVEname.
  • memory footprint : they consistently have a better memory footprint than we do but on the bigger instances it does not go over 20% extra footprint

to note: the speedups against D-PDLP are computed with NVLS_SHARP=0 disabling a feature that could give them a speedup from 1.1x to 1.75x I am looking with the compute-lab team to make it work

closes #891

Bubullzz added 30 commits May 7, 2026 15:07
@Bubullzz

Copy link
Copy Markdown
Contributor Author

/ok to test 7b6f96a

@Bubullzz

Copy link
Copy Markdown
Contributor Author

/ok to test b7d4d91

@Bubullzz

Copy link
Copy Markdown
Contributor Author

/ok to test 89c8878

@Bubullzz

Copy link
Copy Markdown
Contributor Author

/ok to test 2fc3add

@Kh4ster Kh4ster requested review from Kh4ster and removed request for hlinsen July 1, 2026 07:57
{CUOPT_MIP_OBJECTIVE_STEP, &mip_settings.objective_step, 0, 1, 1},
{CUOPT_NUM_GPUS, &pdlp_settings.num_gpus, 1, 2, 1},
{CUOPT_NUM_GPUS, &mip_settings.num_gpus, 1, 2, 1},
{CUOPT_DISTRIBUTED_PDLP_NUM_GPUS, &pdlp_settings.distributed_pdlp_num_gpus, -1, 576, -1},

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wont the setup break if we call with 576 GPUs in practice since we only support single process?

{CUOPT_PRESOLVE_FILE, &mip_settings.presolve_file, ""},
{CUOPT_PRESOLVE_FILE, &pdlp_settings.presolve_file, ""}
{CUOPT_PRESOLVE_FILE, &pdlp_settings.presolve_file, ""},
{CUOPT_MULTI_GPU_PARTITION_FILE, &pdlp_settings.multi_gpu_partition_file, ""},

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you remind me what is the use case for this? Was it just for debugging or are we confident this will be useful in the future?


// 3. Construct one shard per rank, pinned to its device. Ownership of each
// communicator moves into its shard.
CUOPT_LOG_INFO("distributed_pdlp: building %d shard solver(s) ...", nb_parts);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to be careful with every thing we decide to log in the final product, could you please share what a full run with logs look like?

devices[r], std::move(rank_data[r]), std::move(comms[r]), mps, sub_solver_settings));
}
auto shard_build_t1 = std::chrono::high_resolution_clock::now();
CUOPT_LOG_INFO("distributed_pdlp: shard build done in %.3f s",

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

// Step 2: a single NCCL group with matched ncclSend / ncclRecv across all
// (rank, peer) pairs, receiving into each shard's halo region.
template <typename ShardBufAccess>
void halo_exchange_var_shard(ShardBufAccess&& buf_access)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commenting it here but this is general to all functions in this file: if you have a .cu version of this file, was there a specific reason regarding putting the implementation in the hpp rather than in the cu directly?

template <typename i_t, typename f_t>
std::vector<i_t> partition_loader_t<i_t, f_t>::parse_distributed_pdlp_partition_file(
std::string const& file)
{

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as on the parameter: will we need to keep this partition file logic in the final product?

namespace cuopt::mathematical_optimization::pdlp {

template <typename i_t, typename f_t>
std::vector<i_t> dummy_partitioner_t<i_t, f_t>::partition(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In practice I don't think this is used anywhere in the code so unless you want to keep it for future reference/test/benchmark I think it should be removed

"kaminpar_partitioner: A_t.row_offsets size mismatch (expected nb_vars+1)");
cuopt_expects(A_cols.size() == A_t_cols.size(),
error_type_t::ValidationError,
"kaminpar_partitioner: A and A_t nnz mismatch");

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any requirement regarding the sorting of the cols? If yes that should be checked

auto t1 = std::chrono::high_resolution_clock::now();
const double dt = std::chrono::duration<double>(t1 - t0).count();

CUOPT_LOG_INFO(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same remark regarding logging

namespace cuopt::mathematical_optimization::pdlp {

// Non-owning view of a host CSR matrix (A or A_t).
template <typename i_t, typename f_t>

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if this is non owning was there a reason to not use span instead of vector pointers?

// Store as std::unique_ptr in any container.

int device_id;
rmm::cuda_stream stream;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason why to explicetly create an rmm::cuda steram here while there is one in the raft handler?


for (i_t i = 0; i < rank_data.owned_var_size; ++i) {
const auto g = rank_data.local_to_global_var[i];
h_obj[i] = maximize ? -g_obj[g] : g_obj[g];

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this maximization handling be done here and on at each problem level? are we currently testing if distributed correctly works on maximization problem?


// Inject this shard's unscaled buffers into op_problem_scaled (distributed
// scaling runs later and will scale them).
auto& scaled = sub_pdlp->get_op_problem_scaled();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wont pdlp solver already fill those through its constructor?

rmm::device_uvector<i_t> idx(send_to_peer.size(), stream_view);
rmm::device_uvector<f_t> buf(send_to_peer.size(), stream_view);
if (!send_to_peer.empty()) {
raft::copy(idx.data(), send_to_peer.data(), send_to_peer.size(), stream_view);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can't start an async copy on local rmm::device_uvector then go out of scope. See my linked slack conversation by message

}

// Row inf-norm of the scaled matrix, over the row-major matrix: each row is
// reduced from its own nonzeros. (Owns the complete row in distributed PDLP.)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have discussed this via slack messages: we should document why we now use two kernels, one for rows and one for cols (not a big cost for single gpu but major time same for distributed)

Comment thread cpp/src/pdlp/pdlp.cu
: (kind == partitioner_kind_t::KaMinPar) ? "kaminpar"
: "unknown";
CUOPT_LOG_INFO(
"Partitioning %d constraints + %d variables into %d part(s) using the %s "

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same remark regarding logs

@Bubullzz

Bubullzz commented Jul 1, 2026

Copy link
Copy Markdown
Contributor Author

/ok to test b85d44c

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

feature request New feature or request non-breaking Introduces a non-breaking change pdlp

3 participants