TL;DR

A developer built a Linux kernel module that enables consumer AMD mini PCs with USB4/Thunderbolt ports to emulate InfiniBand devices. This allows high-speed, low-latency communication suitable for AI training and inference across home setups, bypassing enterprise networking gear.

A developer has created an experimental Linux kernel module that enables consumer AMD mini PCs equipped with USB4/Thunderbolt ports to emulate InfiniBand devices, achieving high-speed RDMA communication suitable for AI workloads at home. This breakthrough could significantly reduce the need for enterprise networking gear in AI training and inference setups.

The developer built a custom kernel module that makes USB4/Thunderbolt ports on AMD mini PCs appear as InfiniBand devices, allowing remote direct memory access (RDMA) over consumer hardware. The experimental setup demonstrated bidirectional data transfer speeds of approximately 95 Gb/s with latency around 7 microseconds, enabling workloads like tensor-parallel inference and Fully Sharded Data Parallel (FSDP) training to run across multiple consumer machines.
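To put the reported figures in perspective, here is a rough back-of-envelope sketch of how long a single message might take at about 95 Gb/s and 7 µs latency. It uses the simple model "time = latency + size / bandwidth". The function name and the choice to treat 95 Gb/s as the usable rate for one transfer are assumptions made for illustration, not something the project documents.

```python
def transfer_time_us(payload_bytes: int,
                     bandwidth_gbps: float = 95.0,
                     latency_us: float = 7.0) -> float:
    """Estimate transfer time in microseconds as latency plus serialization time.

    Defaults come from the figures reported in the article; applying them to a
    single one-way transfer is an assumption made for this sketch.
    """
    bits = payload_bytes * 8
    serialization_us = bits / (bandwidth_gbps * 1e9) * 1e6
    return latency_us + serialization_us


if __name__ == "__main__":
    # Hypothetical message sizes: a small control message, 1 MiB, and 256 MiB.
    for size in (4 * 1024, 1024 ** 2, 256 * 1024 ** 2):
        print(f"{size:>12} bytes: {transfer_time_us(size):12.1f} us")
```

Under this model, small messages are dominated by the fixed latency and large ones by bandwidth, which is why both numbers matter for distributed training traffic.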

In practical tests, the setup completed a MiniMax inference run too large to fit on a single machine and cut FSDP training time from over 21 minutes to just over 2 minutes. The tests used a pair of 128GB AMD Strix Halo mini PCs connected via four USB4/Thunderbolt host channel adapters (HCAs), with performance surpassing typical Ethernet or soft-RoCE configurations. The developer emphasized that this is research code, experimental in nature, with no support or warranty, and that it likely contains false assumptions or sharp edges.

Why It Matters

This development matters because it points to a potential path toward high-performance, low-cost AI training and inference at home without expensive enterprise networking equipment. If further refined, such technology could democratize access to advanced AI workloads, letting researchers and hobbyists run large-scale distributed AI tasks without enterprise-grade infrastructure. It also shows how consumer hardware might be used for tasks traditionally reserved for data centers.

Background

Recent years have seen increasing interest in high-speed interconnects like InfiniBand for AI workloads, but these have remained largely inaccessible outside enterprise environments. The developer’s work builds on existing research into RDMA over Ethernet and Thunderbolt, pushing these concepts into consumer hardware territory. Prior efforts have focused on software-defined networking and RDMA over standard Ethernet or specialized hardware; this project extends that to USB4/Thunderbolt, which is common in modern mini PCs and laptops.

While experimental, this effort aligns with broader trends toward democratizing AI infrastructure, making high-speed interconnects feasible for smaller-scale setups. The developer’s tests suggest that with further refinement, consumer hardware could support workloads previously limited to data centers, including tensor parallelism and large-scale model training.

“This is research code, experimental in nature, and it loads experimental kernel modules on machines I was willing to crash repeatedly. No warranty, no support promise, not production software.”

— the developer behind the project

“We built experimental RDMA-over-USB4 for 128GB Strix Halo mini PCs. It lets two consumer boxes talk fast enough to run tensor-parallel inference and FSDP workloads across both machines.”

— the developer

What Remains Unclear

It remains unclear how stable, scalable, and compatible this setup will be in broader use. The project is experimental, and performance may vary across different hardware configurations. Additionally, the software is not supported for production environments, and potential hardware limitations or driver issues could impact real-world deployment.

What’s Next

Further development will likely focus on refining the kernel modules for stability and compatibility, expanding testing across different hardware, and exploring integration with existing AI frameworks. Future milestones may include open-source releases, community testing, and potential commercialization of the technology.

Key Questions

Can this setup be used for production AI training?

No, this is experimental research code not intended for production. Stability and support are not guaranteed.

What hardware is needed to replicate this setup?

At minimum, a pair of AMD mini PCs with USB4/Thunderbolt ports and compatible host channel adapters (HCAs) are required, along with the custom kernel module.
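As a rough way to check whether a Linux machine exposes any InfiniBand-style devices, the sketch below lists entries under `/sys/class/infiniband`, the sysfs directory where the kernel's RDMA subsystem normally publishes registered devices. Whether this particular module registers its devices there, and under what names, is an assumption; the helper name `list_rdma_devices` is hypothetical.

```python
from pathlib import Path


def list_rdma_devices(sysfs_root: str = "/sys/class/infiniband") -> list[str]:
    """Return the names of RDMA devices found under the given sysfs directory.

    Returns an empty list if the directory does not exist, for example when no
    RDMA driver is loaded. The default path is the standard Linux location;
    that the experimental module shows up there is an assumption.
    """
    root = Path(sysfs_root)
    if not root.is_dir():
        return []
    return sorted(entry.name for entry in root.iterdir())


if __name__ == "__main__":
    devices = list_rdma_devices()
    if devices:
        print("RDMA devices:", ", ".join(devices))
    else:
        print("No RDMA devices found")
```

An empty result only means nothing is registered at that path; it does not say whether the hardware itself is capable.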

Will this work with other hardware or operating systems?

The current development is specific to Linux and AMD mini PCs with USB4/Thunderbolt. Compatibility with other hardware or OSes remains untested.

How does performance compare to traditional networking options?

Preliminary tests show approximately 95 Gb/s bidirectional throughput with ~7 µs latency, outperforming typical Ethernet and soft-RoCE configurations in the same setup.

Source: Hacker News
