Don’t Let Storage Become the Key Bottleneck in Model Training

It’s been said that technology companies are either scrambling for GPUs or on the path to acquiring them. In April, Tesla CEO Elon Musk purchased 10,000 GPUs and stated that the company would continue to buy large quantities from NVIDIA. On the enterprise side, IT teams are also pushing hard to keep GPUs constantly utilized to maximize return on investment. Yet some companies find that as the number of GPUs grows, GPU idle time grows worse as well.

If history has taught us anything about high-performance computing (HPC), it is that storage and networking should not be sacrificed in order to focus on computation. If storage cannot deliver data to the computing units efficiently, even the world’s largest fleet of GPUs will not achieve optimal efficiency.

According to Mike Matchett, an analyst at Small World Big Data, smaller models can run entirely in memory (RAM), allowing the focus to stay on computation. However, larger models like ChatGPT, with billions of parameters, cannot be held in memory because of the cost.

“You can’t fit billions of parameters in memory, so storage becomes even more important,” Matchett says. Unfortunately, data storage is often overlooked during the planning process.

In general, regardless of the use case, there are four common elements in the model training process:

1. Model Training
2. Inference Application
3. Data Storage
4. Accelerated Computing

When creating and deploying models, most teams prioritize a quick proof-of-concept (POC) or a test environment that gets model training started; data storage needs are rarely given top consideration.

However, the challenge lies in the fact that training or inference deployment can last for months or even years. Many companies rapidly scale up their model sizes during this time, and the infrastructure must expand to accommodate the growing models and datasets.

Research from Google on millions of ML training workloads reveals that, on average, 30% of training time is spent in the input data pipeline. While past research has focused on optimizing GPUs to speed up training, many challenges remain in optimizing the data pipeline itself. Once you have substantial computational power, the real bottleneck becomes how quickly you can feed data into that computation to get results.
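To make this concrete, here is a minimal sketch of an input pipeline that overlaps storage reads with GPU compute, written with TensorFlow’s tf.data API (the framework Google’s research analyzes). The file path, record schema, and image size are illustrative assumptions, not details from the research itself.

```python
import tensorflow as tf

def parse_example(record):
    # Hypothetical record schema: a JPEG-encoded image plus an integer label.
    features = tf.io.parse_single_example(
        record,
        {
            "image": tf.io.FixedLenFeature([], tf.string),
            "label": tf.io.FixedLenFeature([], tf.int64),
        },
    )
    image = tf.image.resize(tf.io.decode_jpeg(features["image"], channels=3), [224, 224])
    return image, features["label"]

# List shards, read several of them in parallel, decode off the critical
# path, and prefetch so storage I/O overlaps with GPU compute.
files = tf.data.Dataset.list_files("/mnt/training-data/*.tfrecord")
dataset = (
    files.interleave(tf.data.TFRecordDataset, num_parallel_calls=tf.data.AUTOTUNE)
    .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(256)
    .prefetch(tf.data.AUTOTUNE)
)
```

Even with a pipeline like this, prefetching only hides storage latency if the underlying system can sustain the read rate, which is where the requirements below come in.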

Specifically, data storage and management require planning for data growth so that you can keep extracting value from your data as you progress. This is especially true when you venture into more advanced use cases such as deep learning and neural networks, which place higher demands on storage capacity, performance, and scalability.

In particular:

Scalability
Machine learning requires handling vast amounts of data, and as the volume of data grows, model accuracy typically improves. This means that businesses must collect and store more data every day. When storage cannot scale, data-intensive workloads create bottlenecks, limiting performance and leaving costly GPUs idle.

Flexibility
Flexible support for multiple protocols (including NFS, SMB, HTTP, FTP, HDFS, and S3) is necessary to meet the needs of different systems, rather than being limited to a single type of environment.
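As a small illustration of why protocol flexibility matters, the sketch below reads the same object through a local NFS mount and through S3 using the fsspec library (the S3 backend requires the s3fs package). The mount point and bucket name are hypothetical.

```python
import fsspec

# The same dataset exposed two ways: an NFS mount on the host and an S3
# bucket (names are hypothetical).
paths = [
    "/mnt/datasets/train/part-0000.parquet",            # NFS mount
    "s3://training-datasets/train/part-0000.parquet",   # S3 object
]

# fsspec resolves the protocol from the URL, so the consuming code stays
# the same no matter which protocol the storage platform serves.
for path in paths:
    with fsspec.open(path, "rb") as f:
        print(path, f.read(4))   # e.g. the Parquet magic bytes b"PAR1"
```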

Latency
I/O latency is critical for building and using models as data is read and reread multiple times. Reducing I/O latency can shorten the training time of models by days or months. Faster model development directly translates to greater business advantages.
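Because the same data is reread every epoch, caching after the first pass is a common way to take repeated reads off the storage path. Below is a minimal tf.data sketch, assuming the shard fits on a hypothetical local NVMe device; paths and epoch count are placeholders.

```python
import tensorflow as tf

num_epochs = 10

# Epoch 1 reads the shard from shared storage; every later epoch reads the
# local cache file instead, cutting repeated I/O latency out of training.
dataset = (
    tf.data.TFRecordDataset("/mnt/training-data/shard-0000.tfrecord")
    .cache("/local-nvme/shard-0000.cache")   # hypothetical fast local path
    .repeat(num_epochs)
    .prefetch(tf.data.AUTOTUNE)
)
```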

Throughput
The throughput of storage systems is crucial for efficient model training. Training processes consume large amounts of data, typically measured in terabytes per hour.
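A quick back-of-the-envelope calculation shows how fast those rates add up. All numbers below are illustrative assumptions, not benchmarks:

```python
# Rough sizing: sustained read rate needed to keep one GPU server fed.
gpus = 8
samples_per_sec_per_gpu = 500          # assumed per-GPU training throughput
bytes_per_sample = 600 * 1024          # assumed ~600 KiB per sample

required_bytes_per_sec = gpus * samples_per_sec_per_gpu * bytes_per_sample
print(f"{required_bytes_per_sec / 1e9:.1f} GB/s "
      f"(~{required_bytes_per_sec * 3600 / 1e12:.1f} TB/hour)")
# -> 2.5 GB/s (~8.8 TB/hour) of sustained reads for this one server
```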

Parallel Access
To achieve high throughput, training workloads split work into many parallel tasks. This often means that machine learning algorithms access the same files from multiple processes, potentially on multiple physical servers, simultaneously. The storage system must handle these concurrent demands without compromising performance.
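The sketch below shows what that access pattern looks like from the application side, using PyTorch’s DistributedSampler. The dataset class, file path, and record layout are hypothetical stand-ins for whatever lives on shared storage.

```python
import torch
from torch.utils.data import Dataset, DataLoader, DistributedSampler

class SharedStorageDataset(Dataset):
    """Fixed-size records in one large file on shared storage (hypothetical)."""

    def __init__(self, path, num_records, record_size):
        self.path, self.num_records, self.record_size = path, num_records, record_size

    def __len__(self):
        return self.num_records

    def __getitem__(self, idx):
        # Every worker process opens the same file; the storage system sees
        # many concurrent readers hitting one shared dataset.
        with open(self.path, "rb") as f:
            f.seek(idx * self.record_size)
            return torch.frombuffer(bytearray(f.read(self.record_size)), dtype=torch.uint8)

dataset = SharedStorageDataset("/mnt/shared/train.bin", num_records=1_000_000, record_size=4096)

# num_replicas/rank normally come from torch.distributed; they are fixed here
# so the sketch runs standalone. Each of the 4 ranks samples a disjoint quarter.
sampler = DistributedSampler(dataset, num_replicas=4, rank=0)
loader = DataLoader(dataset, batch_size=256, sampler=sampler, num_workers=8)
```

With 8 loader workers per rank across 4 ranks, the storage system in this sketch would see 32 processes reading the same file concurrently.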

With its outstanding low latency, high throughput, and large-scale parallel I/O, Dell PowerScale is an ideal storage complement to GPU-accelerated computing. PowerScale effectively reduces the time required to train and test analytics models against multi-terabyte datasets. PowerScale all-flash storage increases bandwidth by 18 times, eliminating I/O bottlenecks, and all-flash nodes can be added to existing Isilon clusters to accelerate and unlock the value of large amounts of unstructured data.

Moreover, PowerScale’s multi-protocol access capabilities provide unlimited flexibility for running workloads, allowing data to be stored using one protocol and accessed using another. Specifically, the powerful features, flexibility, scalability, and enterprise-grade functionality of the PowerScale platform help address the following challenges:

- Accelerate innovation by up to 2.7 times by shortening the model training cycle.

- Eliminate I/O bottlenecks to provide faster model training and validation, improved model accuracy, greater data science productivity, and maximized return on computing investments through enterprise-grade features, high performance, concurrency, and scalability.

- Enhance model accuracy with deeper, higher-resolution datasets by leveraging up to 119 PB of effective storage capacity in a single cluster.

- Achieve deployment at scale by starting small and scaling compute and storage independently, while delivering robust data protection and security options.

- Improve data science productivity with in-place analytics and pre-validated solutions for faster, low-risk deployments.

- Leverage proven designs based on best-of-breed technologies, including NVIDIA GPU acceleration and reference architectures with NVIDIA DGX systems. PowerScale’s high performance and concurrency meet storage performance requirements at every stage of machine learning, from data acquisition and preparation through model training and inference. With the OneFS operating system, all nodes operate seamlessly within the same OneFS-driven cluster, with enterprise-grade features such as performance management, data management, security, and data protection, enabling businesses to complete model training and validation faster.

