
RoCE Networking-Based LLM Post-Training Solution with Intel Enterprise AI Foundation for OpenShift


Introduction

OpenAI’s GPT-3 marked the dawn of a new era, proving that AI could profoundly transform human life. Built on the Transformer architecture introduced in the seminal paper, 'Attention Is All You Need,' GPT-3 demonstrated the unprecedented potential of scaling laws and illuminated the path for generative AI (GenAI) and artificial general intelligence (AGI).

This field is still rapidly evolving, with research papers emerging almost weekly that introduce innovative approaches such as Mixture of Experts (MoE), GRPO-based reinforcement learning, and Multi-head Latent Attention (MLA). Industry adoption follows swiftly, as seen in the proliferation of both open-source models (e.g., LLaMA 4, DeepSeek-R1, Qwen3) and proprietary systems (e.g., GPT-4.1, Gemini 2.5, Grok-3, Claude 4).

However, pre-training these models on homogenized public Internet data with massive computing resources has hit a performance bottleneck, causing them to exhibit strikingly similar behaviors and limiting further breakthroughs. See Will LLMs Scaling Hit the Wall.

Having worked in the AI field for years, we boldly envision that the next wave of innovation will leverage private, domain-specific data for post-training and fine-tuning of foundation models. By injecting industry-specific knowledge through cutting-edge supervised fine-tuning (SFT) and reinforcement learning (RL) algorithms, these models can achieve enhanced reasoning performance. In addition, Mixture of Experts (MoE) technology will augment transformer models, enabling them to specialize in narrow domains (e.g., healthcare diagnostics or legal contract analysis). Combined with widely adopted knowledge distillation techniques, this approach will yield efficient, compact enterprise models capable of fast inference even in resource-constrained environments. Microsoft's Phi-3 small language models and DeepSeek's distilled reasoning models support this vision.

To fulfill this vision, this paper introduces an efficient, affordable, scalable, and production-grade enterprise training solution based on Intel AI hardware and software technologies, seamlessly integrated with the Red Hat AI platform.

Distributed Training & AI Network

To efficiently post-train a Large Language Model (LLM), just as with pre-training, we must balance computation, communication, and memory through distributed parallelism algorithms.

As training clusters rapidly scale up to accommodate growing model sizes, various parallelism strategies—such as data parallelism (DP), tensor parallelism (TP), pipeline parallelism (PP), expert parallelism (EP), and context parallelism (CP)—have been developed, alongside optimizations like DeepSpeed ZeRO and PyTorch FSDP. These techniques significantly improve training efficiency by maximizing the utilization of expensive hardware resources.
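For illustration, here is a minimal sketch of one such strategy: data parallelism with parameter sharding via PyTorch FSDP. The toy model, its dimensions, the dummy loss, and the launch method (torchrun with an NCCL backend) are assumptions made for demonstration, not details of the solution described on this page.

```python
# Minimal FSDP sketch: shard a toy transformer block across ranks.
# Assumption: one process per accelerator, launched via torchrun, CUDA + NCCL stack.
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    # The collective backend rides on the underlying AI network (e.g., RoCE).
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

    # Placeholder model standing in for a real LLM layer stack.
    model = nn.TransformerEncoderLayer(d_model=1024, nhead=16).cuda()
    model = FSDP(model)  # shards parameters, gradients, and optimizer state

    optim = torch.optim.AdamW(model.parameters(), lr=1e-5)
    batch = torch.randn(8, 128, 1024, device="cuda")  # placeholder fine-tuning batch

    out = model(batch)
    loss = out.float().pow(2).mean()  # dummy loss for illustration only
    loss.backward()                   # gradients are reduce-scattered across ranks
    optim.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```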

Nowadays, these distributed training technologies are widely used not only for pre-training but also for post-training tasks, such as fine-tuning models on specialized data and reinforcement learning (RL) to enhance reasoning performance.

All of these parallelism strategies rely on collective communication primitives, which are supported by the underlying AI network. Thus, a reliable, low-latency, and high-throughput network—scaling both intra-node (scale-up) and inter-node (scale-out)—is critical to the overall post-training process.
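As a concrete illustration of such a primitive, the sketch below performs an all-reduce across ranks with torch.distributed; the backend choice, tensor contents, and torchrun launch are assumptions for demonstration.

```python
# Minimal all-reduce sketch: the collective behind data-parallel gradient
# synchronization. Assumption: launched via torchrun; defaults to the CPU
# "gloo" backend so it runs anywhere, override BACKEND=nccl on GPU nodes.
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend=os.environ.get("BACKEND", "gloo"))
    rank = dist.get_rank()

    # Each rank contributes its local "gradient"; all-reduce sums them so
    # every rank ends up holding the same aggregated tensor.
    grad = torch.ones(4) * (rank + 1)
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)
    print(f"rank {rank}: {grad.tolist()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with, for example, `torchrun --nproc_per_node=4 allreduce_demo.py`, every rank prints the same summed tensor, which is exactly the property gradient synchronization depends on.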

The major AI network technologies include RoCE (RDMA over Converged Ethernet) and InfiniBand.

RoCE leverages the existing Ethernet fabric and switches as the physical layer, with RoCEv2 carrying RDMA traffic over standard UDP/IP (the network and transport layers). This cost-effective, reliable solution is well suited to post-training in the enterprise AI space.
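To show how a training job is typically pointed at a RoCE fabric, the sketch below sets common NCCL environment variables before initializing the process group. The variable names are standard NCCL settings, but the device and interface names are placeholders for a specific cluster, and Gaudi/HCCL-based stacks expose their own analogous knobs; treat this as a hedged illustration rather than the exact configuration used in this solution.

```python
# Hedged sketch: steer the collective-communication library onto a RoCE fabric.
# Assumptions: NCCL-based PyTorch backend, torchrun launch; "mlx5_0" and "eth0"
# are placeholder device/interface names for your cluster.
import os
import torch.distributed as dist

# Select the RDMA-capable NIC and the RoCEv2 GID index (commonly 3, but
# cluster-specific) so collectives use RDMA over UDP/IP instead of TCP sockets.
os.environ.setdefault("NCCL_IB_HCA", "mlx5_0")       # placeholder RDMA device
os.environ.setdefault("NCCL_IB_GID_INDEX", "3")      # RoCEv2 GID entry (verify per cluster)
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")  # placeholder bootstrap interface

dist.init_process_group(backend="nccl")
print(f"rank {dist.get_rank()} of {dist.get_world_size()} initialized over the RoCE fabric")
dist.destroy_process_group()
```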

In addition, Meta also uses RoCE to train its LLaMA models.

RoCE Network Infrastructure Detail

LLM fine-tuning example

Conclusion
