Table of Contents
- Summary
- The Current Landscape for Streaming Data and Message Queuing
- Evaluating Potential Solutions
- Conclusion and takeaways
- Appendix – Deploying and Executing the Benchmark
- About William McKnight
- About Jake Dolezal
- About GigaOm
- Copyright
1. Summary
Historical paradigms around data are being shifted due to exponential data growth, hybrid cloud architectures, real-time data needs, massive computing at scale, globalization, 24×7 uptime, and the Internet of Things. The list could go on. Organizations are either seizing opportunities, scrambling to keep up, or coping with the changes in their industry verticals due to technological shifts. Many want or need to adopt these newer technologies, but sometimes struggle with how to evaluate or select them based on their current and future needs.
The needs and uses of data have evolved to a new level. In today’s climate, businesses must be able to ingest, analyze, and react to data immediately. With artificial intelligence and machine learning progressing by leaps and bounds, we see increasing numbers of technologies emerging as data-driven, autonomous, decision-making applications—where data is produced, consumed, analyzed, and reacted to (or not) in real-time. In this way, the technology becomes sentient of what’s going on inside and around it—making pragmatic, tactical decisions on its own. We see this being played out in the highly publicized development of self-driving cars, but beyond transportation, we see it in telephony, health, security and law enforcement, finance, manufacturing, and in most sectors of every industry.
Prior to this evolution, the analytical data was derived by humans long after the event that produced or created the data had passed. We labored under the assumption that by analyzing the past, we could set the course for the future—in our businesses, in our systems, in our world. Now, however, technology has emerged that allows us to capture and analyze what is going on right now. Much data value is time-sensitive. If the essence of the data is not captured within a certain window of time, its value is lost and the decision or action that needs to take place as a result never occurs.
This category of data is known by several names: streaming, messaging, live feeds, real-time, event-driven, and so on. This type of data needs special attention, because delayed processing can and will negatively affect its value—a sudden price change, a critical threshold met, an anomaly detected, a sensor reading changing rapidly, an outlier in a log file—all can be of immense value to a decision maker, but only if he or she is alerted in time to affect the outcome.
Much has already been said and written about artificial intelligence and autonomous decision making. The focus of this paper is the gap between the production and consumption of real-time data and how these autonomous, sentient systems can be fed at scale and with reliable performance. There are a number of big data technologies designed to handle large volumes of time-sensitive, streaming data.
We will introduce and demonstrate a method for an organization to assess and benchmark—for their own current and future uses and workloads—the technologies currently available. We will begin by reviewing the landscape of streaming data and message queueing technology. They are alike in purpose—process massive amounts of streaming data generated from social media, logging systems, clickstreams, Internet-of-Things devices, and so forth. However, they also have a few distinctions, strengths, and weaknesses.