Modernizing storage for the age of AI • The Register

How Dell is boosting storage performance for AI with PowerScale

Sponsored Feature You might have already analyzed use cases for artificial intelligence (AI) within your business and identified potential efficiencies, revenue opportunities and more.

Now comes the hard part: building an infrastructure that supports your mission. Computing capacity is a crucial part of that portfolio, but companies often overlook another equally important ingredient: storage.

Investing heavily in the latest GPUs or cloud capabilities to give yourself an edge in the training and inference of AI models is important, but it will all be for nought if you can't feed the beast with the data it needs to deliver results. That's where scale-out storage technology comes in - to help provide organizations with the answers to the infrastructure questions which this new world of AI is asking.

"Data is a differentiator for companies involved in AI," says Tom Wilson, a product manager at Dell Technologies focusing heavily on AI workloads, who analogizes data as the fuel, compute as the engine and storage as the fuel tank.

"Having a modernized platform that provides the security, storage efficiency, performance, and scale that companies need to use that data in AI workflows is one of our key pillars for PowerScale."

The benefits of scale-out storage

Wilson is a veteran evangelist for the technology underpinning PowerScale, the Dell file storage solution which has been upgraded to deliver an AI-optimized storage infrastructure with the launch of two new flash arrays, the PowerScale F210 and F710. Leveraging the latest PowerEdge hardware and OneFS software, PowerScale is showcased as a key component of an 'AI-ready data platform' designed to provide the performance, scale and security needed to help customers build AI-enabled workloads wherever their data lives: on premises, in the cloud or in cloud-adjacent environments.

Dell was one of the first companies to support the NVIDIA GPUDirect protocol, which accelerates AI workloads by enabling storage systems to send and receive data rapidly without involving the host processor. Wilson recalls that customers were grappling with rising storage volumes driven by unstructured data even before GPUs and cloud computing had taken AI mainstream, but the surge in demand for AI and generative AI (GenAI) enabled applications has put even more strain on existing storage infrastructure.

"One of the things that we wanted to help solve was how do you manage massive amounts of data predictably, all around the world," Wilson says. "That's what led us to create a scale-out file system."

Traditional scale-up storage can struggle to handle the vast volumes of data needed to feed AI models for a couple of reasons. Firstly, it expands by adding more drives to a single system with its own dedicated head unit. The obvious downside to this approach is limited capacity, as the chassis will eventually run out of space.

The less obvious drawback is limited performance. The head unit, which organizes the storage, will come under an increasing load as the storage volume rises and more disks are added, explains Wilson.

The performance that you get with the first few dozen terabytes in a scale-up system might be great for your needs, but as you add more storage capacity the performance doesn't increase. At some point, says Wilson, the storage workflows might outgrow the throughput that a scale-up system can provide.

Conversely, scale-out storage uses clustered storage nodes, each of which has its own computing and storage capacity. Adding another node to the system boosts the computing capacity of the entire cluster. "So when you add capacity, you aren't just scaling up by adding drives; you're adding performance," he adds.
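Wilson's contrast can be illustrated with a toy model. The sketch below is a back-of-envelope calculation, not a description of PowerScale internals: the function names and every throughput figure are hypothetical, and the scale-out curve assumes ideal linear scaling with no cluster overhead.

```python
def scale_up_throughput(drives, per_drive_gb_s, head_limit_gb_s):
    """Scale-up: drives add capacity, but the single head unit caps throughput."""
    return min(drives * per_drive_gb_s, head_limit_gb_s)


def scale_out_throughput(nodes, per_node_gb_s):
    """Scale-out (idealized): each node brings its own compute and throughput."""
    return nodes * per_node_gb_s


if __name__ == "__main__":
    # Hypothetical figures chosen only to show the shape of the two curves.
    for units in (2, 4, 8, 16):
        up = scale_up_throughput(units, per_drive_gb_s=2.0, head_limit_gb_s=12.0)
        out = scale_out_throughput(units, per_node_gb_s=5.0)
        print(f"{units:>2} units: scale-up {up:5.1f} GB/s, scale-out {out:5.1f} GB/s")
```

In this model the scale-up figure stops growing once the head unit saturates, while the scale-out figure keeps rising with each added node. Real clusters will fall short of the ideal line because of network and coordination costs.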

Inside the PowerScale architecture

PowerScale's next-generation nodes, the F210 and F710, improve on the previous generation of all-flash nodes, leveraging the latest-generation PowerEdge platform to deliver faster computing capabilities in the form of fourth-generation Intel Xeon Sapphire Rapids CPUs. They also feature improved memory speed thanks to the latest DDR5 DRAM options. A faster PCIe Gen 5 bus offers up to quadruple the throughput compared to the PCIe Gen 3 used in previous nodes.
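The "up to quadruple" figure follows from the published per-lane transfer rates of the two PCIe generations. The short calculation below is a rough sketch: it accounts only for line encoding and ignores packet and protocol overhead, and the lane count in the usage line is an assumption, since the article does not say how the nodes' links are configured.

```python
# Raw per-lane transfer rates in GT/s for each PCIe generation.
TRANSFER_RATE_GT_S = {3: 8.0, 5: 32.0}

# Both Gen 3 and Gen 5 use 128b/130b line encoding.
ENCODING_EFFICIENCY = 128 / 130


def lane_bandwidth_gb_s(gen):
    """Approximate one-direction bandwidth per lane in GB/s (encoding only)."""
    return TRANSFER_RATE_GT_S[gen] * ENCODING_EFFICIENCY / 8  # bits to bytes


def link_bandwidth_gb_s(gen, lanes):
    """Approximate one-direction bandwidth of a link with the given lane count."""
    return lane_bandwidth_gb_s(gen) * lanes


if __name__ == "__main__":
    ratio = lane_bandwidth_gb_s(5) / lane_bandwidth_gb_s(3)
    print(f"Gen 3 per lane: {lane_bandwidth_gb_s(3):.3f} GB/s")
    print(f"Gen 5 per lane: {lane_bandwidth_gb_s(5):.3f} GB/s")
    print(f"Gen 5 / Gen 3 ratio: {ratio:.1f}")
    # Hypothetical x4 link, a common width for NVMe drives.
    print(f"Gen 5 x4 link: {link_bandwidth_gb_s(5, 4):.2f} GB/s")
```

The ratio is the bus-level ceiling only. Delivered storage throughput also depends on the drives, CPUs, network and software, which is why the claim is worded as "up to".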

These hardware improvements are especially relevant for AI applications, explains Wilson. For example, the mix of PCIe and SSD interface improvements helps to double the streaming read and write throughput - key performance metrics that affect phases of the AI pipeline such as model training and checkpointing.
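To see why streaming write throughput matters for checkpointing, consider a simplified model in which training pauses while each checkpoint is written. Everything in this sketch is an assumption made for illustration: the function names, the synchronous-checkpoint model, and the sizes and rates, none of which come from Dell.

```python
def checkpoint_seconds(checkpoint_gb, write_gb_per_s):
    """Time to write one checkpoint at a sustained streaming write rate."""
    if write_gb_per_s <= 0:
        raise ValueError("write throughput must be positive")
    return checkpoint_gb / write_gb_per_s


def stall_fraction(checkpoint_s, training_interval_s):
    """Share of wall-clock time spent waiting on checkpoints,
    assuming training is paused for the whole write."""
    return checkpoint_s / (training_interval_s + checkpoint_s)


if __name__ == "__main__":
    # Hypothetical workload: a 500 GB checkpoint after every 30 minutes of training.
    for rate in (5.0, 10.0):
        t = checkpoint_seconds(500, rate)
        share = stall_fraction(t, 30 * 60)
        print(f"{rate:4.1f} GB/s: {t:5.1f} s per checkpoint, {share:.1%} of time stalled")
```

Under these assumptions, doubling write throughput halves the time accelerators sit idle per checkpoint. Frameworks that checkpoint asynchronously would see a smaller effect.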

The 1U-format systems also offer greater node density, adding capacity to accommodate the vast volumes of data that AI requires. The F710 has room for 10 drives compared to the F600's eight, while the F210 doubles capacity with the introduction of the 15TB drive.

The systems also feature a Smart Flow chassis - a piece of IP from Dell's PowerEdge hardware - that pushes air through the system more efficiently. This helps maintain system reliability while reducing the power used for cooling, explains Wilson - an important consideration in datacenters facing big electricity bills and total cost of ownership challenges in powering the server, storage and network equipment required to get and keep AI workloads running. It also contributes to a key efficiency gain for the new units: the F710 offers up to 90 percent higher performance per watt compared to the previous generation of the product.

How advanced software complements the hardware

Dell has also updated PowerScale's OneFS operating system to take full advantage of the hardware enhancements.

Features like thread optimization help to bolster AI performance. For example, Dell reports up to a 2.6 times improvement in throughput in the F710 compared to the F600 when handling the kind of high-concurrency, latency-sensitive workloads that underpin many AI training and inferencing applications.

"The performance improvements of all-flash NVMe drives means that we don't necessarily need the same level of caching that we used previously," says Wilson. "OneFS optimizes communications to those NVMe drives, using techniques like read locking. We also write directly from the journal to the drives."

OneFS 9.6 also added another important capability for AI workloads - the ability to handle AI training and inferencing tasks with hybrid cloud capability. APEX File Storage for AWS was launched with OneFS 9.6, while more recently OneFS 9.8 introduced APEX File Storage for Azure as well - allowing organizations even greater flexibility and choice, says Dell. By running OneFS in the cloud, customers can move a subset of the data they need off-premises. They might choose to handle data preparation and cleansing on-premises, for example, and then move the prepped data into the cloud to take advantage of computing capabilities that they don't have on-site.

The key benefit of running PowerScale in a cloud environment is that customers can take their security model along with them, explains Wilson. They move the data they need using native replication in OneFS, making data available with the same security policies, permissions, and identity management parameters in the cloud as they already have on-premises. They don't have to refactor their workflows, which means they can quickly move to the next part of the AI pipeline without skipping a beat, while staying compliant with their data privacy and protection policies.

A comprehensive AI infrastructure

Dell says PowerScale storage can be optimized for efficiency, performance and cost depending on the specific AI workflow it is destined to support, whether that is model retention, data preparation, or large-scale model training or tuning. The new units were already producing useful results in field tests with Dell customers by the time they were released for general availability. Alan Davidson, CIO at Broadcom, said that the systems had helped significantly bump up performance in its electronic design automation (EDA) operations.

"Collaborating with Dell means faster innovation for my business. The new Dell PowerScale F710 has exceeded our expectations with more than 25 percent performance improvements in our EDA workloads while delivering improved datacenter sustainability," he told Dell.

These systems further build out a portfolio that can serve complex AI infrastructure, enhanced by partnerships including that between Dell and NVIDIA. The F710, the first Ethernet-based storage appliance certified for NVIDIA DGX SuperPOD, is a key part of the Dell AI Factory that the company announced with NVIDIA in March. It's an end-to-end validated combination of Dell infrastructure and NVIDIA GPUs and software that supports the entire generative AI life cycle.

"Nobody is better at building end-to-end systems for the enterprise than Dell," said NVIDIA CEO Jensen Huang at the company's GTC 2024 AI developer conference.

This combined hardware and software portfolio ties into a range of documentation and architectural guidance from Dell.

"Not only do we have best of breed infrastructure, but we also have the expertise, whether it's on the services side or in terms of best practices documentation and validated designs and reference architectures," Wilson says. "We have the complete stack to help customers simplify their AI journeys."

As they rush to adopt AI, organizations are struggling to manage their infrastructure. Because AI projects are so data intensive, the chances are good that at least part of a company's AI pipeline will involve on-premises storage.

Getting the storage part of the infrastructure portfolio right can eliminate bottlenecks further along in the process as development teams, software engineers, data scientists and others begin to deal with the large volumes and high bandwidth requirements necessary to feed these AI workloads. In this data-laden future, optimized scale-out storage infrastructure increasingly looks like the right approach.

No organization can afford to rest on its laurels when it comes to ensuring the business has the efficient, high-performance infrastructure it needs to build and launch new AI-enabled applications and services. Continuous optimization and upgrades are the norm in IT - and in many cases have been rendered more critical by the recent surge in demand for AI. Dell is expected to keep up its own momentum and announce even more enhancements to its AI-optimized portfolio at Dell Technologies World 2024 to support customers in this AI era.

Sponsored by Dell.
