Telescent and MIT CSAIL Collaborate to Accelerate Machine Learning Workflows
IRVINE, Calif. – Telescent Inc., a leading manufacturer of automated fiber patch-panels and cross-connects for networks and data centers, announces today that results of the company’s collaboration with the Massachusetts Institute of Technology Computer Science & Artificial Intelligence Laboratory (MIT CSAIL) will be showcased in an invited presentation at the Networked Systems Design and Implementation (NSDI) conference, taking place April 17-19, 2023 in Boston, MA. The collaboration aims to accelerate training time for machine learning workflows.
The collaboration between Telescent and MIT CSAIL focused on improving the training time for ML workflows by optimizing the communication between workers in the GPU cluster through programmable network connections. The work accelerated workflows by 3.4x.
The NSDI conference focuses on the design principles, implementation, and practical evaluation of networked and distributed systems. The goal of the conference is to bring together researchers from across the networking and systems community to foster a broad approach to addressing overlapping research challenges.
Today’s machine learning (ML) training systems are deployed on top of traditional datacenter fabrics with electrical packet switches arranged in a multi-tier topology. The performance and efficiency of this architecture face severe limitations because of localized network bandwidth bottlenecks. The Telescent programmable patch panel can provision and deliver network connections with essentially unlimited network bandwidth (i.e., thousands of terabits per second) within a massive graphics processing unit (GPU) cluster while consuming minimal energy. The collaboration between Telescent and MIT CSAIL focused on improving the training time for machine learning workflows by optimizing the communication between workers in the GPU cluster through programmable network connections. The collaboration accelerated workflows by 3.4 times, a significant performance improvement that overcomes limitations of current GPU clusters in ML training applications.
According to Manya Ghobadi, Associate Professor at MIT CSAIL and program co-chair of NSDI, large-scale ML clusters require enormous computational resources and consume a significant amount of energy. As a prime example, training a ChatGPT model with 65 billion parameters requires 1 million GPU hours and costs over $2.4 million. In January 2023 alone, ChatGPT served 600 million live inference queries and used as much electricity as 175,000 people. “This trend is not sustainable,” said Ghobadi.
To address this challenge, the MIT CSAIL researchers proposed TopoOpt, a reconfigurable optical datacenter for deep neural network (DNN) training that leverages the performance and scalability of the Telescent programmable patch panel. TopoOpt is the first ML-centric network architecture that co-optimizes the distributed training process across three dimensions (computation, communication, and network topology) to significantly improve performance. The team at MIT CSAIL integrated TopoOpt with Nvidia’s NCCL library and built a fully functional prototype of TopoOpt with the Telescent robotic patch panel and remote direct memory access (RDMA) forwarding at 100 Gbps. According to Prof. Ghobadi, “This is the only-known testbed that allows topology and parallelization co-optimization for ML workloads … our experiments showed that TopoOpt improves the training time of real-world DNNs by a factor of 3.4.”
“Large-scale deep neural networks are reshaping our daily life and how we interact with the world,” adds Weiyang “Frank” Wang, a third-year Ph.D. student in the Networks and Mobile Systems group at MIT CSAIL, advised by Manya Ghobadi. “TopoOpt is our latest attempt to speed up the training process of these large models through innovations in the fundamental infrastructures people use for these processes. Inspired by Telescent’s recent inventions on reconfigurable optical patch panels, we dive deep into the world of reconfigurable topology specifically for DNN training. Using reconfigurable network topology brings a new dimension for optimizing large DNN training workloads.”
More information about the NSDI conference, taking place April 17-19, 2023 in Boston, MA, is available on the conference website. Information about research at MIT CSAIL, where computing empowers people and enhances all human experiences, is available on the MIT CSAIL website. To learn how network operators, data center providers and carriers can automate key network functions such as cross connections, visit Telescent’s website at www.telescent.com.
With a large-scale, low-loss, robotic patch-panel solution, Telescent Inc. brings automation to the fiber layer of optical networks. Telescent’s unique all-fiber design ensures optimal performance and reliability while enabling software control for remote reconfigurations and diagnostics. Automation of the fiber layer can reduce operating expenses and errors while creating new service opportunities for multi-tenant data center operators, telecom service providers, and hyperscale cloud data center operators. To learn more about Telescent, including any recent product updates, please visit www.telescent.com.