disperse_d

A Kubernetes-native control plane that turns raw GPUs across providers into consistent AI clusters running Slurm, Ray and other industry-standard ML tools.


Built for the Hard Parts of AI Infrastructure.

disperse_d automates the hardest parts of running Slurm- or Ray-based AI clusters across distributed GPU infrastructure: provisioning schedulers, moving data, enforcing governance, and operating clusters consistently on Kubernetes.

Pre-Flight Infrastructure Validation

Deep infrastructure checks before cluster deployment. Validates networking, storage, GPUs, and runtime dependencies — and automatically fixes common issues.
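The check-and-fix loop described above can be outlined in a few lines. This is a minimal sketch, not disperse_d's implementation: each named check reports a result, and any failing check with a registered automatic fix is retried once after the fix runs.

```python
# Illustrative only: a pre-flight pass that runs named checks and retries
# any failing check once after applying a registered automatic fix.
def preflight(checks, fixes):
    """checks: name -> callable returning (ok, detail); fixes: name -> callable."""
    results = {}
    for name, check in checks.items():
        ok, detail = check()
        if not ok and name in fixes:
            fixes[name]()            # attempt the automatic fix
            ok, detail = check()     # re-run the check once
        results[name] = (ok, detail)
    return results

# Hypothetical example: a networking check that fails until the MTU is raised.
state = {"mtu": 1400}
checks = {"networking": lambda: (state["mtu"] >= 9000, state["mtu"])}
fixes = {"networking": lambda: state.update(mtu=9000)}
results = preflight(checks, fixes)
```

Checks without a registered fix simply report their failure, which is what surfaces as a blocked deployment.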

Automated Cluster Provisioning

Turn any Kubernetes GPU fleet into a production AI cluster. disperse_d deploys schedulers like Slurm or Ray and configures networking, policies, and runtimes automatically.

Unified Multi-Provider Control

Manage clusters across clouds and regions from one control plane. Standardize provisioning and operations via UI, API, or Terraform.

High-Performance Data Mover

High-throughput dataset transfer from S3-compatible storage to AI clusters. On-demand synchronization ensures training jobs always have the data they need.

Governance & Capacity Allocation

Allocate GPU capacity across teams with quotas and fair-share policies. Integrates with LDAP and Active Directory for enterprise identity.

Your AI infrastructure across providers, managed through unified Kubernetes-native APIs.

disperse_d is built for infrastructure teams. Provision clusters, synchronize data, monitor health, and govern capacity through a simple API or CLI.

Integrate disperse_d into existing automation workflows, internal platforms, or Terraform infrastructure.

Provision an AI cluster

Turn any Kubernetes-backed GPU fleet into a production-ready AI cluster. disperse_d deploys standard schedulers like Slurm or Ray and configures networking, policies, and runtime automatically.

Sync data to compute

Move datasets from any S3-compatible storage directly to your AI clusters. Fast, repeatable data synchronization ensures training jobs always have the data they need.
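Repeatable synchronization hinges on an integrity-verification step like the one shown in the demo transcript. A minimal sketch of that idea, independent of disperse_d's actual transfer engine: after objects land on cluster storage, their checksums are compared against a source manifest and mismatched keys are reported for re-transfer.

```python
import hashlib
from pathlib import Path

# Illustrative sketch of post-transfer integrity verification: compare each
# synced file's SHA-256 digest against the source manifest.
def sha256_of(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # read in 1 MiB chunks
            h.update(chunk)
    return h.hexdigest()

def verify(manifest, root):
    """manifest maps object keys to expected hex digests; returns mismatched keys."""
    return [key for key, digest in manifest.items()
            if sha256_of(Path(root) / key) != digest]
```

An empty result means the dataset on the cluster matches the source and jobs can start against it safely.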

Capacity governance

Define organizational hierarchy, sync it from LDAP or Active Directory, and assign fair-share GPU allocations across teams and projects.
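As a rough illustration of what a fair-share weight means in practice, weights translate into proportional slices of the GPU pool. The 40 for /research/foundation-models mirrors the demo below; the sibling nodes and the 128-GPU pool size are hypothetical.

```python
# Illustrative only: translate fair-share weights into GPU counts.
def split_fair_share(total_gpus, weights):
    """Divide a GPU pool proportionally to each node's fair-share weight."""
    total = sum(weights.values())
    return {node: total_gpus * w // total  # floor division; remainder stays pooled
            for node, w in weights.items()}

weights = {"/research/foundation-models": 40,   # matches the demo
           "/research/applied-ml": 35,          # hypothetical siblings
           "/platform/inference": 25}
shares = split_fair_share(128, weights)
```

Real fair-share schedulers treat these as scheduling priorities rather than hard partitions, so idle capacity can still be borrowed across teams.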

Cluster operations

Built-in health checks continuously validate GPUs, nodes, and runtime stability. disperse_d automatically identifies failing nodes, cordons them, and drains workloads to keep clusters healthy.
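The cordon-then-drain behaviour can be sketched as a simple policy loop. This is purely illustrative, with a fake health probe standing in for the real checks:

```python
# Illustrative remediation policy: unhealthy nodes are cordoned (no new
# workloads scheduled) and then drained (running workloads evicted).
def remediate(nodes, is_healthy):
    actions = []
    for node in nodes:
        if not is_healthy(node):
            actions.append(("cordon", node))
            actions.append(("drain", node))
    return actions

# A fake probe where only worker-17 is failing, matching the demo transcript.
actions = remediate([f"worker-{i}" for i in range(1, 33)],
                    lambda n: n != "worker-17")
```

Cordoning before draining matters: it prevents the scheduler from placing new work on a node that is about to be emptied.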

$ |
--name slurm-prod-01 \
--kubeconfig eks-cluster-01.yaml
* Running pre-flight infrastructure checks...done
* Validating GPU runtime and networking...done
* Deploying Slurm control plane...done
* Configuring scheduling policies...done
* Setting up cluster health checks...done
AI cluster ready → slurm-prod-01
scheduler: slurm | nodes: 32 | gpu: a100 | status: healthy
$ |
--source s3://research-datasets \
--cluster slurm-prod-01 \
--target /datasets/research
Validating source credentials...ok
Resolving target cluster storage...ok
Planning parallel transfer...ok
Syncing 18,432 objects...done
Running integrity verification...done
Dataset ready on cluster → slurm-prod-01:/datasets/research
source: s3://research-datasets | objects: 18,432 | throughput: 21.8 GB/s | status: ready
$ |
--source ldap://ldap.big-research.com/
Connecting to LDAP directory...ok
Importing hierarchy...done
Organization created → id: corp-ai
$ disperse_d allocation set \
--org corp-ai \
--node /research/foundation-models \
--fairshare 40
Applying fair-share allocation...done
Allocation updated → /research/foundation-models = 40%
$ disperse_d governance sync \
--org corp-ai \
--cluster slurm-prod-01
Syncing hierarchy and policies...done
Governance synchronized → cluster: slurm-prod-01
$ |
--cluster slurm-prod-01 \
--follow true
[10:42:11] Health check failed on node worker-17 (gpu runtime)
[10:42:12] Applying remediation policy: cordon node...done
[10:42:15] Draining workloads from worker-17...done
[10:42:21] Marking node for break-fix workflow...done
Cluster stable
nodes: 32 | unhealthy: 0 | remediated: 1 | status: healthy