Data processing is at the heart of nearly every business application. No matter your industry or the purpose of the application, eventually you’re going to process some data. But not all data is created equal. Nor are all sources of data. Dealing with data that arrives at regular intervals is a lot different from processing data streaming directly to your server.
Streaming data processing changes your architecture requirements, as well as the way you think about processing time. In this post, we're going to dive into streaming data processing: what it is, how you should think about it, and how to make the most of it in your application.
What Is Streaming Data?
We define streaming data in contrast to the other way we process data: data at rest. Data at rest is how you normally think of data: sitting on a server somewhere, saved to a hard disk or stored in a database. You've processed data this way dozens of times. You read it from the disk, load it into memory, and take some action based on what you find. Then you discard the data from memory and move on to the next piece of the puzzle.
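That workflow is so familiar it barely needs spelling out, but here's a minimal sketch for contrast with what comes next. The file name and the revenue calculation are hypothetical placeholders:

```python
import csv

def process_file(path):
    # Read it from the disk and load it into memory...
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    # ...take some action based on what you find...
    total = sum(float(row["amount"]) for row in rows)
    print(f"Processed {len(rows)} rows totaling {total:.2f}")
    # ...then the data is discarded from memory when rows goes out of scope.

process_file("orders.csv")
```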
Streaming data operates differently. Instead of your application loading the data, the data is sent to your application, usually over a network. It arrives in a constant stream (hence the name!) of events and data packets. If you don't do something with the data, it will eventually either overwhelm your server's communication buffers or quietly disappear into the ether.
For mission-critical data, that means you need to process the stream at least as quickly as the data comes to your server. Unlike when processing data at rest, one of the most important considerations for processing streaming data is latency. Streaming data most often comes from places like server logs, physical device sensors, or user input mechanisms.
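To make the contrast concrete, here's a minimal sketch of a consumer reading newline-delimited events off a TCP socket. The host, port, and handler are hypothetical stand-ins for whatever your stream source actually is:

```python
import socket

def handle_event(raw: bytes):
    # Placeholder for real per-event work.
    print("event:", raw.decode("utf-8", errors="replace"))

def consume_stream(host: str = "localhost", port: int = 9999):
    with socket.create_connection((host, port)) as conn:
        buffer = b""
        while True:
            chunk = conn.recv(4096)   # data arrives whether you're ready or not
            if not chunk:
                break                 # the sender closed the stream
            buffer += chunk
            # Handle every complete, newline-delimited event in the buffer.
            while b"\n" in buffer:
                line, buffer = buffer.split(b"\n", 1)
                handle_event(line)

consume_stream()
```

Notice there's no "read it later" option here: the loop has to keep pace with the sender, or the buffer grows.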
How Can I Handle Streaming Data?
As noted above, latency is one of the most important considerations when architecting a system to process streaming data. Latency is the sum of all the delays built into your system. If your server's processing latency for a single datum in a streaming service is longer than the time it takes the service to generate another one, your server will eventually overload. You'll be forced to discard data, or your server will crash. Fortunately, with clever design it's possible to architect a service that's tolerant of these kinds of issues and avoids them.
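A back-of-the-envelope check makes the overload condition concrete. All the numbers below are made up for illustration:

```python
# Hypothetical rates: does processing keep pace with arrival?
arrival_rate = 200.0          # events arriving per second
per_event_latency = 0.006     # seconds your server spends on one event

service_rate = 1 / per_event_latency        # ~167 events/second processed
backlog_growth = arrival_rate - service_rate

if backlog_growth > 0:
    print(f"Overloaded: backlog grows by about {backlog_growth:.0f} events/sec")
else:
    print("Keeping up: processing is at least as fast as arrival")
```

In this sketch the server falls about 33 events further behind every second; the only open questions are when the buffers fill and which data gets dropped.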
Step One: Know Your Maximum and Average Load
Many software architects have determined the average load they expect their service to handle and happily shipped that service to production. The service works fine, for a while. Then, one day, for whatever reason, user demand spikes. The service, which might have been perfectly happy processing 100 events per second, suddenly needs to process 1,000. In just a few moments, the server processing streaming data fails, and the streaming data is lost.
This is why it’s important to understand your expected average and maximum loads for your service. Computing power is not infinite, nor are budgets. It’s unlikely that you can design a service that will scale to any arbitrary number of users. But you can design your service to operate within expected limits, and you can design fault tolerance for the times it will fail. As such, it pays to understand the 3 V’s of streaming data: Volume, Variety, and Velocity.
Step Two: Determine Whether to Persist Data as Early as Possible
Streaming data is, by its nature, ephemeral. It appears, we process it, and then it goes away. Most services that process streaming data take in data from multiple inputs. That means you need to identify both the source and, eventually, the destination of the data as quickly as possible. Some data you'll want to persist quickly to something more permanent than a stream-processing server. Some data you may only want to aggregate, or not persist at all; instead, you'll simply monitor that data for issues and only store it if you find a problem.
Due to the rapid-fire nature of streaming data processing, most servers will discard streaming data as quickly as possible. As such, if there’s a problem processing that data, you may find out only after that data is no longer present on the server. Whatever the problem was, you may have trouble recreating the issue, because the datum that generated the error doesn’t exist anymore. What’s more, if you truly need the data, you’ve lost it, with no way to recover.
Instead, the best practice is to persist any data that might need to be saved as early in the process as possible. That way, if there’s a processing error later, you’ll have the relevant data to look back at. You’ll be able to replay the data flowing through the system to identify the source of the data and re-ingest it into your service once you’ve fixed the bug.
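One common shape for this is an append-only, write-ahead-style log: persist the raw event before you touch it, then replay the log after you've fixed the bug. A minimal sketch, with a hypothetical log path and event format:

```python
import json
import time

LOG_PATH = "raw_events.log"   # hypothetical durable, append-only log

def ingest(event: dict):
    record = {"received_at": time.time(), "payload": event}
    with open(LOG_PATH, "a") as log:
        log.write(json.dumps(record) + "\n")   # make it durable first...
        log.flush()
    process(event)                              # ...then process it

def process(event: dict):
    # Placeholder for real processing, which might fail.
    print("processing:", event)

def replay():
    # Re-ingest every persisted event, e.g. after fixing a bug.
    with open(LOG_PATH) as log:
        for line in log:
            process(json.loads(line)["payload"])

ingest({"user": "alice", "action": "login"})
```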
Step Three: Don’t Put All Your Eggs in One Basket
One of the surest ways to stunt your streaming data service's ability to scale is to run all of your data processing on the same server that ingests messages. This carries two risks. The first is that incoming messages could overwhelm the server, causing your processors to fail in the middle of handling existing data. The second is that the processing work could tie up all the CPU time on the server, leading it to silently discard incoming data before your service can even begin to process it.
Instead, use your processing server to do some basic categorization. Then send the data to be asynchronously processed by a dedicated server using a system like Redis or Amazon SQS. Once your service dispatches a message, it doesn’t need to worry about it anymore. And you don’t have to worry that a hanging database connection will stop your streaming data server in its tracks.
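Here's one way that hand-off might look, using a Redis list as the queue. The queue name, payload shape, and categorization rule are assumptions for illustration; Amazon SQS or any other broker follows the same pattern:

```python
import json
import redis   # requires the redis-py package

r = redis.Redis(host="localhost", port=6379)
QUEUE = "events:pending"   # hypothetical queue name

def ingest(raw_event: dict):
    # The ingest server does basic categorization, enqueues, and moves on.
    category = "error" if raw_event.get("level") == "ERROR" else "normal"
    r.rpush(QUEUE, json.dumps({"category": category, "event": raw_event}))

def worker():
    # A dedicated processing server pops messages at its own pace.
    while True:
        _, payload = r.blpop(QUEUE)   # blocks until a message arrives
        message = json.loads(payload)
        print("processing", message["category"], "event:", message["event"])
```

The key property is that a slow worker never blocks ingestion: messages simply accumulate in the queue, where they're safe, instead of in your ingest server's buffers, where they aren't.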
Step Four: Monitor Your System
No matter how hard you plan your streaming data service, you’re going to get things wrong. That’s the reality of building complicated software systems. Nobody gets it right the first time. Once you deploy your service, your job is only just beginning.
Instead of designing and coding features, now your job is to monitor your new system. You need to ensure that it's running within the expected latency levels and that your usage patterns conform to what your research told you. You'll likely find that some of your research was spot on. You're also going to find that some of it was way off. By monitoring the system, you'll know where you need to make upgrades and where you can scale back computational resources.
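Even a little instrumentation goes a long way here. Below is a minimal sketch that times each event against a latency budget; the 50 ms budget and the logger name are hypothetical values you'd replace with numbers from your own research:

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("stream-monitor")
LATENCY_BUDGET = 0.050   # seconds; hypothetical, set from your load research

def timed_process(event, process):
    start = time.perf_counter()
    process(event)
    elapsed = time.perf_counter() - start
    if elapsed > LATENCY_BUDGET:
        log.warning("event took %.1f ms (budget %.0f ms)",
                    elapsed * 1000, LATENCY_BUDGET * 1000)
    return elapsed
```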
Scalyr Can Help With Streaming Data Systems
As you can see, processing streaming data is a tall order. Getting it right isn't something you do quickly, and even after trial and error, it's not a sure bet. This is why customers turn to Scalyr's Event Data Cloud, which connects to any number of streaming data sources. We're experts at building streaming data processing systems, and our expertise means that your users can access data up to ten times faster than with systems built in-house. What's more, because we operate our systems at scale, Event Data Cloud often ends up costing a fraction of what building and hosting services yourself would. Our experts would be more than happy to talk about your needs. Find out how Event Data Cloud can make processing your streaming data faster and cheaper today.
Eric Boersma wrote this post. Eric is a software developer and development manager who’s done everything from IT security in pharmaceuticals to writing intelligence software for the US government to building international development teams for non-profits. He loves to talk about the things he’s learned along the way, and he enjoys listening to and learning from others as well.