medicis123 on Hacker News

1

Show HN: Stop GPU pods placement getting bottlenecked by reserved VRAM

We have built a GPU runtime for Nvidia GPUs that can run multiple development, experimental, and inference workloads per GPU, with safe overcommit of VRAM, dynamic fractional allocation of GPU cores, and deduplication of weights in VRAM.
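To illustrate why reserved VRAM can bottleneck placement, here is a minimal sketch that is not WoolyAI's scheduler. It assumes a plain first-fit placement, and the pod names, the 80 GB capacity, and the `used_gb` figures are all hypothetical. It compares how many GPUs are needed when each pod's full reservation counts against capacity versus when only its observed usage does.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    name: str
    reserved_gb: float  # VRAM the pod requests up front
    used_gb: float      # VRAM the pod actually touches (hypothetical measurement)


def place(pods, gpu_capacity_gb, key):
    """First-fit placement; `key` picks which VRAM figure counts against capacity."""
    gpus = []  # each entry: [free_gb, [pod names]]
    for pod in pods:
        need = key(pod)
        for gpu in gpus:
            if gpu[0] >= need:
                gpu[0] -= need
                gpu[1].append(pod.name)
                break
        else:
            # No existing GPU has room, so open a new one.
            gpus.append([gpu_capacity_gb - need, [pod.name]])
    return [names for _, names in gpus]


pods = [
    Pod("notebook-a", reserved_gb=40, used_gb=6),
    Pod("notebook-b", reserved_gb=40, used_gb=10),
    Pod("inference-c", reserved_gb=40, used_gb=22),
    Pod("experiment-d", reserved_gb=40, used_gb=12),
]

by_reservation = place(pods, 80, key=lambda p: p.reserved_gb)
by_usage = place(pods, 80, key=lambda p: p.used_gb)
print(len(by_reservation), "GPUs when placing by reserved VRAM")
print(len(by_usage), "GPUs when placing by observed usage")
```

The sketch ignores what makes overcommit "safe" in practice: handling the case where several pods grow toward their reservations at the same time.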

We are looking for teams to give it a try.

Details on getting a trial license: https://www.woolyai.com

2 | medicis123 | 3mo ago | 0

2

A New Approach to GPU Sharing: Deterministic, SLA-Based GPU Kernel Scheduling

Most GPU “sharing” solutions today (MIG, time-slicing, vGPU, etc.) still behave like partitions: you split the GPU or rotate workloads. That helps a bit, but it still leaves huge portions of the GPU idle and introduces jitter when multiple jobs compete.

We’ve come up with a different model, similar to how operating systems schedule tasks. Instead of carving up the GPU, we run multiple ML jobs inside a single shared GPU context and schedule their kernels directly. No slices, no preemption windows — just a deterministic, SLA-style kernel scheduler deciding which job’s kernels run when.

This results in the GPU behaving more like an always-on compute fabric than a dedicated device. SMs stay busy, memory stays warm, and high-priority jobs still get predictable latency. More details at https://woolyai.com/blog/a-new-approach-to-gpu-kernel-scheduling-for-higher-utilization/. Check out our technology at https://www.woolyai.com.

1 | medicis123 | 6mo ago | 0
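As a rough illustration of deterministic, share-based kernel ordering, the sketch below uses stride scheduling, a well-known proportional-share technique. The post does not say which algorithm WoolyAI uses, so treat this as a stand-in. The job names, kernel names, and `shares` values are hypothetical, and "running" a kernel is reduced to appending it to an ordered list.

```python
import heapq


def schedule(jobs, shares):
    """Interleave kernels from several jobs using stride scheduling.

    jobs:   {job_name: [kernel_name, ...]} in launch order
    shares: {job_name: positive int}; a larger share means more frequent turns
    """
    big = 10_000
    # Heap entries are (pass_value, job_name); the name breaks ties,
    # so the same inputs always produce the same order.
    heap = [(big // shares[name], name) for name in sorted(jobs) if jobs[name]]
    heapq.heapify(heap)
    cursor = {name: 0 for name in jobs}
    order = []
    while heap:
        pass_value, name = heapq.heappop(heap)
        order.append((name, jobs[name][cursor[name]]))
        cursor[name] += 1
        if cursor[name] < len(jobs[name]):
            # Advance this job's pass by its stride; small strides come back sooner.
            heapq.heappush(heap, (pass_value + big // shares[name], name))
    return order


jobs = {
    "inference": ["k0", "k1", "k2", "k3"],
    "training": ["t0", "t1"],
}
shares = {"inference": 3, "training": 1}

for job, kernel in schedule(jobs, shares):
    print(job, kernel)
```

With a 3:1 share, the inference job gets roughly three turns for each training turn, and each job's kernels keep their launch order. A real scheduler would also have to account for kernel duration and SM occupancy, which this sketch leaves out.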

3

Show HN: Disaggregating GPU compute from CPU in ML job execution to scale GPUs

(woolyai.com)

1 | medicis123 | 6mo ago | 0

4

Show HN: Run PyTorch on CPU boxes, offload kernels to remote GPUs

We have opened the WoolyAI GPU hypervisor trial to all.

https://woolyai.com/signup/

- Higher GPU utilization & lower cost: Pack many jobs per GPU with WoolyAI's server-side scheduler, VRAM deduplication, and SLO-aware controls.
- GPU portability: Run the same ML container on NVIDIA and AMD backends with no code changes.
- Hardware flexibility: Develop and run on CPU-only machines; execute kernels on your remote GPU pool.

1 | medicis123 | 8mo ago | 0
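The VRAM deduplication idea can be sketched with content hashing. This is an assumption about the general technique, not a description of WoolyAI's implementation. `WeightStore` is a hypothetical in-process stand-in for device memory, and the byte strings are placeholders for real weight tensors.

```python
import hashlib


class WeightStore:
    """Toy stand-in for device memory that keeps one copy per unique weight blob."""

    def __init__(self):
        self._blobs = {}     # digest -> bytes
        self._refcount = {}  # digest -> number of current holders

    def load(self, blob: bytes) -> str:
        """Store `blob` unless identical content is already resident; return a handle."""
        digest = hashlib.sha256(blob).hexdigest()
        if digest not in self._blobs:
            self._blobs[digest] = blob
            self._refcount[digest] = 0
        self._refcount[digest] += 1
        return digest

    def release(self, digest: str) -> None:
        """Drop one reference; free the blob when no holder remains."""
        self._refcount[digest] -= 1
        if self._refcount[digest] == 0:
            del self._blobs[digest]
            del self._refcount[digest]

    def resident_bytes(self) -> int:
        return sum(len(b) for b in self._blobs.values())


store = WeightStore()
base_model = bytes(1024)  # placeholder for shared base-model weights
adapter_a = b"a" * 64     # placeholder for per-job weights
adapter_b = b"b" * 64

# Two jobs load the same base model plus their own adapter.
handles = [
    store.load(base_model), store.load(adapter_a),
    store.load(base_model), store.load(adapter_b),
]
print(store.resident_bytes(), "bytes resident for", len(handles), "loads")
```

The reference count matters for the sharing case: the base model stays resident until the last job using it releases its handle.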

5

Running Nvidia CUDA PyTorch container project/pipelines on AMD with no changes

Hi, I wanted to share a cool feature we built in the WoolyAI GPU hypervisor: it lets users run their existing Nvidia CUDA PyTorch/vLLM projects and pipelines on AMD GPUs without any modifications. ML researchers can transparently consume GPUs from a heterogeneous cluster of Nvidia and AMD GPUs, MLOps teams don't need to maintain separate pipelines or runtime dependencies, and the ML team can scale capacity easily. Please share feedback; we are also signing up beta users. https://youtu.be/MTM61CB2IZc

1 | medicis123 | 9mo ago | 0

6

GPU-accelerated code on CPU-only environments - Remote GPU Kernel Execution

(youtube.com)

1 | medicis123 | 9mo ago | 1

7

Sharing base model in GPU VRAM across multiple inference stack process [video]

(youtube.com)

7 | medicis123 | 9mo ago | 1

8

Sharing actual GPU core and VRAM utilization metrics for query on 10 LLM models

(woolyai.com)

1 | medicis123 | 1y ago | 1

9

Show HN: WoolyAI-CUDA Abstraction Layer to Decouple Kernel Shader Exec on GPU

(woolyai.com)

4 | medicis123 | 1y ago | 0

10

Locally delivered and centrally managed macOS envs for privileged access setup

(veertu.com)

1 | medicis123 | 6y ago | 0

11

Shopify scaling iOS CI with Anka

(engineering.shopify.com)

1 | medicis123 | 7y ago | 0

