cuD-PDLP by Bubullzz · Pull Request #1391 · NVIDIA/cuopt

Bubullzz · 2026-06-04T15:24:15Z

Implemented METIS-partitioned multi-GPU PDLP.

To run PDLP on multiple GPUs:
./cpp/build/cuopt_cli ../path/to/file.mps --method 1 --use-distributed-pdlp true --presolve 0
The exact number of GPUs used can be set with --distributed-pdlp-num-gpus n.

All benchmarking results against D-PDLP and single-GPU cuOpt can be found in this spreadsheet.

Here is the bottom line of the results, on 8 NVLink-connected B200 GPUs.

Against cuOpt:

speedup: at least 2.5x and up to 7.08x (tsp-gaia-10m.mps)
memory footprint: ~8x on most instances

Against D-PDLP:

speedup: slower on most instances but faster on the bigger ones (psr_100, tsp-gaia-10m, ELMOD_876_10_noVEname), with up to a 2x speedup on ELMOD_876_10_noVEname
memory footprint: D-PDLP consistently has a smaller memory footprint than we do, but on the bigger instances our extra footprint does not exceed 20%

Note: the speedups against D-PDLP were measured with NVLS_SHARP=0, which disables a feature that could give D-PDLP a 1.1x to 1.75x speedup. I am working with the compute-lab team to make it work.

closes #891

Bubullzz · 2026-06-30T15:15:29Z

/ok to test 7b6f96a

Bubullzz · 2026-06-30T15:20:22Z

/ok to test b7d4d91

Bubullzz · 2026-06-30T21:44:00Z

/ok to test 89c8878

Bubullzz · 2026-06-30T21:55:35Z

/ok to test 2fc3add

Kh4ster · 2026-07-01T08:03:31Z

    {CUOPT_MIP_OBJECTIVE_STEP, &mip_settings.objective_step, 0, 1, 1},
    {CUOPT_NUM_GPUS, &pdlp_settings.num_gpus, 1, 2, 1},
    {CUOPT_NUM_GPUS, &mip_settings.num_gpus, 1, 2, 1},
+    {CUOPT_DISTRIBUTED_PDLP_NUM_GPUS, &pdlp_settings.distributed_pdlp_num_gpus, -1, 576, -1},


Won't the setup break in practice if we call it with 576 GPUs, since we only support a single process?
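One way to address this concern would be to validate the requested count against the devices actually visible to the process. The following is a minimal host-only sketch, assuming that -1 means "use every visible GPU" (the default in the diff suggests this but does not confirm it). `resolve_num_gpus` is a hypothetical helper, not a cuOpt API, and the visible-device count would in practice come from something like cudaGetDeviceCount.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical helper (not a cuOpt API): turn the user-facing setting into a
// usable GPU count for a single-process run.
// Assumption: requested == -1 means "use every visible GPU".
inline int resolve_num_gpus(int requested, int visible_devices)
{
  assert(visible_devices >= 1);  // caller already queried the CUDA runtime
  if (requested == -1) { return visible_devices; }
  if (requested < 1 || requested > visible_devices) {
    throw std::invalid_argument("distributed-pdlp-num-gpus must be -1 or in [1, " +
                                std::to_string(visible_devices) + "], got " +
                                std::to_string(requested));
  }
  return requested;
}
```

With this kind of check, a request for 576 GPUs on an 8-GPU node fails early with a clear message instead of breaking later during shard construction.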

Kh4ster · 2026-07-01T08:08:41Z

    {CUOPT_PRESOLVE_FILE, &mip_settings.presolve_file, ""},
-    {CUOPT_PRESOLVE_FILE, &pdlp_settings.presolve_file, ""}
+    {CUOPT_PRESOLVE_FILE, &pdlp_settings.presolve_file, ""},
+    {CUOPT_MULTI_GPU_PARTITION_FILE, &pdlp_settings.multi_gpu_partition_file, ""},


Can you remind me what the use case for this is? Was it just for debugging, or are we confident this will be useful in the future?

Kh4ster · 2026-07-01T08:14:29Z

+
+  // 3. Construct one shard per rank, pinned to its device. Ownership of each
+  //    communicator moves into its shard.
+  CUOPT_LOG_INFO("distributed_pdlp: building %d shard solver(s) ...", nb_parts);


We need to be careful with everything we decide to log in the final product. Could you please share what a full run with logs looks like?

Kh4ster · 2026-07-01T08:14:52Z

+      devices[r], std::move(rank_data[r]), std::move(comms[r]), mps, sub_solver_settings));
+  }
+  auto shard_build_t1 = std::chrono::high_resolution_clock::now();
+  CUOPT_LOG_INFO("distributed_pdlp: shard build done in %.3f s",


Kh4ster · 2026-07-01T08:23:34Z

+  // Step 2: a single NCCL group with matched ncclSend / ncclRecv across all
+  // (rank, peer) pairs, receiving into each shard's halo region.
+  template <typename ShardBufAccess>
+  void halo_exchange_var_shard(ShardBufAccess&& buf_access)


Commenting here, but this applies to all functions in this file: if you have a .cu version of this file, was there a specific reason for putting the implementation in the .hpp rather than directly in the .cu?

Kh4ster · 2026-07-01T08:24:52Z

+template <typename i_t, typename f_t>
+std::vector<i_t> partition_loader_t<i_t, f_t>::parse_distributed_pdlp_partition_file(
+  std::string const& file)
+{


Same comment as on the parameter: will we need to keep this partition file logic in the final product?

Kh4ster · 2026-07-01T08:27:29Z

+namespace cuopt::mathematical_optimization::pdlp {
+
+template <typename i_t, typename f_t>
+std::vector<i_t> dummy_partitioner_t<i_t, f_t>::partition(


In practice I don't think this is used anywhere in the code, so unless you want to keep it for future reference, tests, or benchmarks, I think it should be removed.

Kh4ster · 2026-07-01T08:37:45Z

+                "kaminpar_partitioner: A_t.row_offsets size mismatch (expected nb_vars+1)");
+  cuopt_expects(A_cols.size() == A_t_cols.size(),
+                error_type_t::ValidationError,
+                "kaminpar_partitioner: A and A_t nnz mismatch");


Is there any requirement regarding the sorting of the column indices? If yes, that should be checked.
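If sorted column indices do turn out to be a requirement, the check could look roughly like the host-side sketch below. `csr_columns_sorted` is a hypothetical name, and whether the partitioner actually needs sorted rows is exactly the open question here, so this is only an illustration of the validation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Returns true when the column indices inside every CSR row are in
// non-decreasing order. row_offsets has nb_rows + 1 entries.
inline bool csr_columns_sorted(const std::vector<int>& row_offsets,
                               const std::vector<int>& cols)
{
  assert(!row_offsets.empty());
  assert(static_cast<std::size_t>(row_offsets.back()) == cols.size());
  for (std::size_t r = 0; r + 1 < row_offsets.size(); ++r) {
    if (!std::is_sorted(cols.begin() + row_offsets[r], cols.begin() + row_offsets[r + 1])) {
      return false;
    }
  }
  return true;
}
```

The result could then feed a cuopt_expects call next to the existing size checks, once for A and once for A_t.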

Kh4ster · 2026-07-01T08:43:48Z

+  auto t1         = std::chrono::high_resolution_clock::now();
+  const double dt = std::chrono::duration<double>(t1 - t0).count();
+
+  CUOPT_LOG_INFO(


same remark regarding logging

Kh4ster · 2026-07-01T08:47:35Z

+namespace cuopt::mathematical_optimization::pdlp {
+
+// Non-owning view of a host CSR matrix (A or A_t).
+template <typename i_t, typename f_t>


If this is non-owning, was there a reason not to use span instead of vector pointers?

Kh4ster · 2026-07-01T08:54:45Z

+  // Store as std::unique_ptr in any container.
+
+  int device_id;
+  rmm::cuda_stream stream;


Any reason to explicitly create an rmm::cuda_stream here when there is already one in the raft handle?

Kh4ster · 2026-07-01T08:57:35Z

+
+  for (i_t i = 0; i < rank_data.owned_var_size; ++i) {
+    const auto g   = rank_data.local_to_global_var[i];
+    h_obj[i]       = maximize ? -g_obj[g] : g_obj[g];


Should this maximization handling be done here, or at each problem level? Are we currently testing that distributed PDLP works correctly on maximization problems?
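For the testing question, the mapping in the snippet is small enough to be exercised in isolation. The sketch below mirrors it as a host-only function; `build_local_objective` is a hypothetical name, and it assumes a maximization problem is solved as the minimization of the negated objective, which is what the sign flip in the diff suggests.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Builds the local objective of one shard from the global objective.
// local_to_global_var[i] is the global index of local variable i.
// Assumption: maximize c^T x is handled as minimize (-c)^T x.
inline std::vector<double> build_local_objective(const std::vector<double>& g_obj,
                                                 const std::vector<int>& local_to_global_var,
                                                 bool maximize)
{
  std::vector<double> h_obj(local_to_global_var.size());
  for (std::size_t i = 0; i < local_to_global_var.size(); ++i) {
    const int g = local_to_global_var[i];
    assert(g >= 0 && static_cast<std::size_t>(g) < g_obj.size());
    h_obj[i] = maximize ? -g_obj[g] : g_obj[g];
  }
  return h_obj;
}
```

A regression test along these lines would compare the minimize and maximize outputs on the same shard and check that they differ only in sign.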

Kh4ster · 2026-07-01T09:05:59Z

+
+  // Inject this shard's unscaled buffers into op_problem_scaled (distributed
+  // scaling runs later and will scale them).
+  auto& scaled = sub_pdlp->get_op_problem_scaled();


Won't the PDLP solver already fill those through its constructor?

Kh4ster · 2026-07-01T09:16:51Z

+      rmm::device_uvector<i_t> idx(send_to_peer.size(), stream_view);
+      rmm::device_uvector<f_t> buf(send_to_peer.size(), stream_view);
+      if (!send_to_peer.empty()) {
+        raft::copy(idx.data(), send_to_peer.data(), send_to_peer.size(), stream_view);


You can't start an async copy on a local rmm::device_uvector and then let it go out of scope. See the Slack conversation I linked by message.

Kh4ster · 2026-07-01T12:27:06Z

 }

+// Row inf-norm of the scaled matrix, over the row-major matrix: each row is
+// reduced from its own nonzeros. (Owns the complete row in distributed PDLP.)


We discussed this over Slack: we should document why we now use two kernels, one for rows and one for columns (not a big cost for single GPU but a major time save for distributed).
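As a starting point for that documentation, here is a host-only sketch of the idea as I read it from the code comment: the same per-row reduction is applied once to A and once to A^T, so every reduction only touches one complete row that the shard owns. The real implementation is a pair of CUDA kernels; `csr_row_inf_norm` is a hypothetical name used only for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Inf-norm of every row of a CSR matrix: max |value| over the row's nonzeros.
// Applied to A this gives the row norms; applied to A^T (also stored as CSR)
// it gives the column norms of A.
inline std::vector<double> csr_row_inf_norm(const std::vector<int>& row_offsets,
                                            const std::vector<double>& values)
{
  assert(!row_offsets.empty());
  std::vector<double> norms(row_offsets.size() - 1, 0.0);
  for (std::size_t r = 0; r + 1 < row_offsets.size(); ++r) {
    for (int k = row_offsets[r]; k < row_offsets[r + 1]; ++k) {
      norms[r] = std::max(norms[r], std::abs(values[k]));
    }
  }
  return norms;
}
```

For A = [[1, -4], [0, 2]], calling it on A yields the row norms and calling it on the CSR of A^T yields the column norms, with no scattered updates across rows in either pass.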

Kh4ster · 2026-07-01T13:22:53Z

+                            : (kind == partitioner_kind_t::KaMinPar) ? "kaminpar"
+                                                                     : "unknown";
+    CUOPT_LOG_INFO(
+      "Partitioning %d constraints + %d variables into %d part(s) using the %s "


Same remark regarding logs

Bubullzz · 2026-07-01T14:55:25Z

/ok to test b85d44c

Bubullzz added 30 commits May 7, 2026 15:07

first commit !! added multi_gpu_partition file to solver settings

1e0bd53

slowly skeletonning

978d17b

better shard.cuh

dd0c0ef

wip

2037eca

added a bit of skeleton. Forward declared pdlp_solver in shard.hpp, t…

0f62eff

…he cycle seems to be fixed, cuopt compiles

still wip but going well

d89c85a

cursor broke everything grrr

5534ff0

partition loader now partition loads

dd935c5

big advancements ayo ! We can soon start working on imlementing the s…

09eb20b

…olver !!!

added pre loop setup need to manage boxing

b5ebfd2

+ style too

added distributed transform

0965a60

added semicolon and existing runtime error enum

d4d1cab

added } and fixed cuot_expects in partition loader

6659dd9

small bug fixes

b2ed271

a version that compiles #heheha 😎😎😎😎

50d16ce

removed use of engine:transaform

359d9f4

added multi-gpu SpMV #heheha

910a49a

transformed a transform. it compiles hehe

76c0b3f

updated take step for distributed. compiles but doesnt run. will chec…

5ec7138

…k on main

Merge branch 'main' into cuD-PDLP

1f02afd

support spmvop on multi-gpu

de19f38

compile ready

0030a6c

can run now

172ebc2

passing all tests, good merge

23d0798

fixed the errors hihi, finished distributed part for compte_fixed_error

30881ce

style

c33faf2

now manage halpern update in multi-gpu pdlp

98e0ce6

small fix to calls of multi_gpu_engine_ and scale/unscale solutions.

84128bf

compiles and runs

comments

abe4dd2

added is multi gpu to pdhg

5c41497

updated moved includes

b7d4d91

Bubullzz added 3 commits June 30, 2026 08:24

added host undo presolve

cdd9a4d

cleaned presolve and distributed works with presolve

2531c24

re-enabled pre solver in pdlp__test

89c8878

style

2fc3add

Kh4ster requested review from Kh4ster and removed request for hlinsen July 1, 2026 07:57

Kh4ster reviewed Jul 1, 2026

Merge branch 'main' into cuD-PDLP

b85d44c

silence kaminpar warnings

c5a4c5c

Bubullzz wants to merge 157 commits into NVIDIA:main from Bubullzz:cuD-PDLP

3 participants
