RoCE Networking‐Based LLM Post‐Training Solution with Intel Enterprise AI Foundation for OpenShift
OpenAI’s GPT-3 marked the dawn of a new era, proving that AI could profoundly transform human life. Built on the Transformer architecture introduced in the seminal paper, 'Attention Is All You Need,' GPT-3 demonstrated the unprecedented potential of scaling laws and illuminated the path for generative AI (GenAI) and artificial general intelligence (AGI).
This field is still rapidly evolving, with research papers emerging almost weekly that introduce innovative approaches such as Mixture of Experts (MoE), GRPO-based reinforcement learning (RL), and Multi-head Latent Attention (MLA). Industry adoption follows swiftly, as seen in the proliferation of both open-source models (e.g., LLaMA 4, DeepSeek-R1, Qwen3) and proprietary systems (e.g., GPT-4.1, Gemini 2.5, Grok-3, Claude 4).
However, pre-training these models on homogenized public Internet data with massive computing resources has hit a performance bottleneck, causing them to exhibit strikingly similar behaviors and limiting further breakthroughs. See Will LLMs Scaling Hit the Wall.
Having worked in the AI field for years, we boldly envision that the next wave of innovation will leverage private, domain-specific data for post-training and fine-tuning of foundation models. By injecting industry-specific knowledge through cutting-edge supervised fine-tuning (SFT) and reinforcement learning (RL) algorithms, these models can achieve enhanced reasoning performance. In addition, Mixture of Experts (MoE) technology will augment transformer models, enabling them to specialize in narrow domains (e.g., healthcare diagnostics or legal contract analysis). Combined with widely adopted knowledge distillation techniques, this approach will yield efficient, compact enterprise models capable of fast inference even in resource-constrained environments. Microsoft's Phi-3 small language models and DeepSeek's distilled reasoning models support this vision.
To fulfill this vision, this paper introduces an efficient, affordable, scalable, and production-grade enterprise training solution based on Intel AI hardware and software technologies, seamlessly integrated with the Red Hat AI platform.
To efficiently post-train a large language model (LLM), just as with pre-training, we must balance computation, communication, and memory through distributed parallelism algorithms.
As training clusters rapidly scale up to accommodate growing model sizes, various parallelism strategies, such as data parallelism (DP), tensor parallelism (TP), pipeline parallelism (PP), expert parallelism (EP), and context parallelism (CP), have been developed, alongside optimizations like DeepSpeed ZeRO and PyTorch FSDP. These techniques significantly improve training efficiency by maximizing the utilization of expensive hardware resources.
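As a concrete illustration of one of these techniques, the minimal sketch below wraps a toy model with PyTorch FSDP so that its parameters, gradients, and optimizer state are sharded across ranks. This is a generic PyTorch example, not code from this solution: the model, data, and hyperparameters are placeholders, it assumes a torchrun launch on CUDA devices, and a Gaudi cluster would use the HCCL backend and HPU devices instead.

```python
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    # One process per accelerator; torchrun supplies RANK, WORLD_SIZE, and LOCAL_RANK.
    dist.init_process_group(backend="nccl")  # Gaudi clusters would use the "hccl" backend
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    torch.cuda.set_device(local_rank)

    # Placeholder model standing in for a stack of transformer blocks.
    model = torch.nn.Sequential(
        torch.nn.Linear(4096, 4096),
        torch.nn.GELU(),
        torch.nn.Linear(4096, 4096),
    ).cuda()

    # FSDP shards parameters, gradients, and optimizer state across ranks,
    # trading extra collective communication for a smaller per-device memory footprint.
    model = FSDP(model)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):
        batch = torch.randn(8, 4096, device="cuda")
        loss = model(batch).pow(2).mean()  # dummy loss on synthetic data
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launching this with, for example, `torchrun --nproc_per_node=8 fsdp_demo.py` would start one such process per accelerator in a node, with FSDP coordinating the sharding and gathering of weights behind the scenes.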
Nowadays, these distributed training technologies are widely used not only for pre-training but also for post-training tasks, such as fine-tuning models on specialized data and reinforcement learning (RL) to enhance reasoning performance.
All of these parallelism strategies rely on collective communication operations, which are carried by the underlying AI network. Thus, a reliable, low-latency, and high-throughput network that scales both intra-node (scale-up) and inter-node (scale-out) is critical to the overall post-training process.
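To make the dependency on collectives concrete, the short sketch below runs the all-reduce that data parallelism uses to synchronize gradients across ranks. It is a generic torch.distributed example using the CPU-friendly gloo backend; a real cluster would use NCCL or, on Gaudi, HCCL, with the collective traffic riding on the AI network described here.

```python
# Run with: torchrun --nproc_per_node=4 allreduce_demo.py
import torch
import torch.distributed as dist

dist.init_process_group(backend="gloo")  # "nccl"/"hccl" on GPU/Gaudi clusters
rank = dist.get_rank()
world_size = dist.get_world_size()

# Each rank holds its own "gradient"; after the all-reduce every rank sees the sum.
grad = torch.full((4,), float(rank))
dist.all_reduce(grad, op=dist.ReduceOp.SUM)
grad /= world_size  # average, as a data-parallel gradient synchronization step would

print(f"rank {rank}: averaged gradient = {grad.tolist()}")
dist.destroy_process_group()
```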
The major AI network technologies include RoCE (RDMA over Converged Ethernet) and InfiniBand.
RoCE leverages the existing Ethernet fabric and switches as the physical layer, with RoCEv2 carrying RDMA traffic over standard UDP/IP (the network and transport layers). This cost-effective, reliable approach is well suited to post-training in the enterprise AI space.
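As a rough illustration, the snippet below shows environment variables commonly used to steer NCCL collective traffic onto a RoCEv2 fabric. The interface name, RDMA device name, and GID index are placeholders that depend on the actual NICs and switch configuration, and Gaudi clusters use HCCL with its own settings, so treat this only as a sketch of the general idea.

```python
import os

# Set before the first collective communicator is created by the training job.
os.environ["NCCL_SOCKET_IFNAME"] = "ens2f0"  # Ethernet interface used for bootstrap traffic (placeholder)
os.environ["NCCL_IB_HCA"] = "mlx5_0"         # RDMA device backing the RoCE NIC (placeholder)
os.environ["NCCL_IB_GID_INDEX"] = "3"        # GID index that typically maps to RoCEv2 (UDP/IP)
os.environ["NCCL_IB_DISABLE"] = "0"          # keep the RDMA transport enabled
```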
The RoCE Networking-Based LLM Post-Training Solution with Intel Enterprise AI Foundation for OpenShift delivers an end-to-end solution for LLM post-training workloads in the enterprise AI area. By harnessing Intel's advanced hardware, such as Gaudi accelerators and their built-in RoCE engines, it provides a cost-efficient, scalable, and high-performance solution.
The seamless integration with Red Hat OpenShift and OpenShift AI provides a production-grade platform for deploying and managing AI workloads, offering enterprises a robust, flexible, and user-friendly environment.
The Intel Technology Enabling for OpenShift project provides Intel Data Center hardware feature-provisioning technologies for the Red Hat OpenShift Container Platform (RHOCP). The technology to deploy and manage Intel Enterprise AI end-to-end (E2E) solutions, along with the related reference workloads for these features, is also included in the project.
Fast GPU Provisioning technology enables GPU provisioning in less than one second, with no reboots, using pre-built driver containers. The feature eliminates any dependency on machine configuration changes, which trigger a reboot, an expensive operation. Instead, the required operations are performed at runtime, leading to a simplified and accelerated deployment process.
When containers need access to device files, they usually have to run as root (UID/GID 0/0). When the device plugins make the device files available to workload containers, those files are owned by root, so the containers would need to run as root as well. This is not good security practice, so it is always preferable to run containers rootless. Below is a short tutorial on how to run the Intel device plugins so that workload containers can run rootless. By default, this behavior is not enabled.
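To illustrate the end state the tutorial aims for, here is a hedged sketch, using the Kubernetes Python client, of a workload pod that requests a device resource while running as a non-root user. The image name, UID/GID, and the gpu.intel.com/i915 resource name are assumptions based on the upstream Intel GPU device plugin, not values taken from this tutorial.

```python
from kubernetes import client

# Hypothetical rootless workload pod that consumes a device exposed by an Intel device plugin.
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="rootless-device-demo"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="trainer",
                image="example.com/llm-trainer:latest",  # placeholder image
                security_context=client.V1SecurityContext(
                    run_as_non_root=True,  # refuse to start if the image would run as UID 0
                    run_as_user=1000,      # arbitrary non-root UID for illustration
                    run_as_group=1000,
                ),
                resources=client.V1ResourceRequirements(
                    # Resource name assumed from the Intel GPU device plugin (gpu.intel.com/i915).
                    limits={"gpu.intel.com/i915": "1"}
                ),
            )
        ],
    ),
)
```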