Table of Contents
- Summary
- Data Storage for the AI Era
- About the Pavilion Hyperparallel Flash Array
- Takeaways
- About GigaOm
- Copyright
1. Summary
Every organization, regardless of size, is challenged to analyze data quickly. Big data, analytics, IoT, HPC, machine learning (ML), and deep learning (DL) are examples of technologies and trends in which data is not only the most important asset but also a resource upon which to build. Enterprises need to gain insights faster to exploit new opportunities and stay ahead of the competition, always ready for market changes that could hint at the next wave to ride. They must also improve efficiency by streamlining complex and repetitive processes. Today, big data analytics and, even more, strategic machine learning projects can make the difference between being a leader and a follower. And this is only the first step: some organizations are already looking at deep learning to solve even more complex problems, which requires more data and compute resources to be fully effective.
Technology is not the problem. Big data applications and frameworks are mature, while tools for ML, such as TensorFlow, have tremendous support from the developer community and are becoming more powerful, feature-rich, and user-friendly. On the hardware side, fast CPUs, coprocessors such as GPUs and FPGAs, and flash memory are becoming more affordable. High-speed networking, such as 100Gb/s Ethernet and InfiniBand, moves data faster than ever.
The real challenge for IT is to build sustainable infrastructures that can serve today's needs and also be ready for the future. The transition from big data analytics to ML and DL can be achieved with more parallelism and well-defined workflows that share the same data sets and compute resources. Even though better technology is available now, storage is still the hardest part to get right. Flash memory and new interfaces make high performance possible, but not all commercial products take full advantage of them. In fact, traditional storage vendors are usually conservative and slow to adopt new technology.
Modern number-crunching infrastructures (e.g., big data or ML) are all scale-out. They are big, complex, and ever-evolving. Traditional storage systems simply do not have the right characteristics to cope with this kind of challenge. At the same time, direct-attached storage (DAS) is fast but inefficient, lacks data services, and is complex to manage at scale. The focus should not be on total cost of acquisition (TCA) but on total cost of ownership (TCO) and on delivering actionable insights and ML value that yield a strong return on investment (ROI).
Critical pain points of traditional storage approaches to ML are:
- Lack of efficiency: Traditional storage arrays are being adapted to, not designed around, new technology. NVMe is a vivid example, with old arrays retrofitted to take advantage of the new characteristics available in NVMe devices. This also makes it harder, if not impossible, to efficiently mix different types of media (e.g., TLC, QLC, or Intel Optane™) in the same array.
- Mediocre real-world performance: Adaptation brings constraints that limit performance under the heavy load of many parallel activities. Big data and ML require the best latency, parallelism, and throughput from the storage layer.
- Insufficient orchestration: To manage large compute clusters for big data and artificial intelligence (AI), API-driven automation is key. The array must provide resources quickly (fast provisioning) without giving up advanced data services, as sketched below.
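To illustrate what API-driven provisioning looks like in practice, the minimal Python sketch below creates and exports volumes through a hypothetical RESTful management endpoint. The host name, paths, payload fields, and token handling are illustrative assumptions, not any specific vendor's API.

```python
# Minimal sketch of API-driven volume provisioning for cluster automation.
# The endpoint, payload fields, and authentication are hypothetical; a real
# array's RESTful API will differ in detail.
import requests

ARRAY_API = "https://array.example.com/api/v1"   # hypothetical management endpoint
HEADERS = {"Authorization": "Bearer <token>", "Content-Type": "application/json"}

def provision_volume(name: str, size_gib: int, qos_iops: int) -> str:
    """Create a volume with a QoS limit and return its ID."""
    payload = {"name": name, "size_gib": size_gib, "qos": {"max_iops": qos_iops}}
    resp = requests.post(f"{ARRAY_API}/volumes", json=payload, headers=HEADERS, timeout=30)
    resp.raise_for_status()
    return resp.json()["id"]

def export_volume(volume_id: str, host_nqn: str) -> None:
    """Expose the volume to a compute node, identified by its NVMe host NQN."""
    payload = {"volume_id": volume_id, "host_nqn": host_nqn}
    resp = requests.post(f"{ARRAY_API}/exports", json=payload, headers=HEADERS, timeout=30)
    resp.raise_for_status()

if __name__ == "__main__":
    # Provision scratch volumes for a small training cluster in one pass.
    for i in range(4):
        vol_id = provision_volume(f"ml-scratch-{i}", size_gib=1024, qos_iops=200_000)
        export_volume(vol_id, host_nqn=f"nqn.2014-08.org.nvmexpress:uuid:node-{i}")
```

In practice, calls like these would be wrapped in an orchestration tool so that storage is provisioned alongside compute rather than through manual steps.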
This is why data storage remains the hardest part of the infrastructure. We need new, innovative solutions that are designed as follows:
- Around new technology: Traditional storage system paradigms no longer work. New NVMe-oF-based architectures unlock the full potential of flash memory.
- For scale-out systems, but with enterprise functionality: Compute nodes in scale-out architectures usually rely on local storage. A shared storage system that combines operational efficiency, data protection, and data services with better performance than local storage, at a reasonable cost, is crucial to boosting overall cluster efficiency and improving TCO.
- To be flexible and ready for changing needs: Such systems are always ready to respond to new business needs and technology improvements, and adaptable to evolving workflows and workloads. They are easy to use and manage, compatible with common orchestration and configuration tools (such as Ansible, Chef, Puppet, Terraform, vCenter, and others) and open standards (such as Redfish, Swordfish, OpenStack, RESTful APIs, and others); a brief integration sketch follows this list.
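As a hedged illustration of the kind of open-standard integration described above, the sketch below walks a Redfish/Swordfish-style management service to enumerate volumes so that orchestration tools can consume the inventory. The `/redfish/v1/StorageServices` layout, host name, and credentials are assumptions for illustration; actual resource paths vary by Swordfish version and implementation.

```python
# Minimal sketch of discovering storage resources through a Redfish/Swordfish-style
# service so orchestration tools (Ansible modules, Terraform providers, etc.) can
# consume them. Paths and credentials are placeholders.
import requests

BASE = "https://array.example.com"   # hypothetical management address
AUTH = ("admin", "password")         # placeholder credentials

def get(path: str) -> dict:
    """Fetch a Redfish/Swordfish resource as JSON."""
    # Self-signed certificates are common on management networks; verify properly in production.
    resp = requests.get(f"{BASE}{path}", auth=AUTH, verify=False, timeout=30)
    resp.raise_for_status()
    return resp.json()

def list_volumes() -> list[dict]:
    """Walk the service root to enumerate storage services and their volumes."""
    volumes = []
    root = get("/redfish/v1/")
    services = get(root["StorageServices"]["@odata.id"])
    for svc_ref in services.get("Members", []):
        svc = get(svc_ref["@odata.id"])
        vol_collection = get(svc["Volumes"]["@odata.id"])
        for vol_ref in vol_collection.get("Members", []):
            vol = get(vol_ref["@odata.id"])
            volumes.append({"name": vol.get("Name"), "capacity_bytes": vol.get("CapacityBytes")})
    return volumes

if __name__ == "__main__":
    for v in list_volumes():
        print(f'{v["name"]}: {v["capacity_bytes"]} bytes')
```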
Pavilion’s Hyperparallel Flash Array (HFA) aims to address these needs. It is the industry’s first HFA, combining the benefits and performance of DAS NVMe-based flash memory with the flexibility, availability, and data services of storage area networks. The HFA yields a quicker ROI than local storage and improves TCO compared to traditional enterprise storage arrays.