Most of today's technology infrastructure is not designed for real time. The average system is geared toward batch processing. We store data in a central location, and when we want a piece of information we have to find it, retrieve it and process it, either automatically with pipelines and scripts or manually through the hard work of a data scientist.
That’s the way most systems work.
That is not the way the human mind works.
Human memory behaves more like flash memory. We have domain-specific knowledge available in our brains from previous learning. We leverage this knowledge to react and respond much more quickly every time we acquire new evidence from our surroundings.
Our intelligence is distributed, optimised for fast access.
Real-time is a step toward building machines that respond to problems the way people do.
At this moment, it’s clear that infrastructures for handling real-time data processing are slowly emerging from a disparate set of programs and tools: Apache Storm, Spark Streaming, Google Cloud Dataflow and Amazon Kinesis, to name a few.
What is not clear, however, is what an architecture for real-time data analytics (RTDA) will look like.
In this article, I’ll try to sketch out a generic architecture to enable real-time data processing, as well as draw a roadmap of what needs to happen in order to drive the cultural change from batch to real-time analytics.
Build around the user
One of the biggest pushes to embrace real-time analytics is the drive to better serve the end user. In a world where customers have a limited attention span and little patience, acting fast and effectively is key to success.
To make the proposed architecture easier to understand, we will use a concrete example from a real-world scenario.
Let’s assume we are offering content to the user. You can imagine this as anything the user consumes throughout the experience: a movie, a blog post, music, anything.
Now, you probably already have a recommendation system in place. Chances are high that it has been implemented with some form of collaborative filtering.
That’s a classical batch approach to the problem of recommending content. Every n hours, you recompute similarities based on all user behaviour. Depending on the type of content, n can vary between 24 and 72 hours.
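To make the batch recomputation concrete, here is a minimal sketch of an item-to-item collaborative filtering job. It assumes interactions arrive as plain (user, item) pairs and uses cosine similarity over the sets of users who consumed each item; the function name and the sample log are hypothetical, and a production job would run on a distributed framework rather than in memory.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def item_similarities(interactions):
    """Batch job: recompute item-item cosine similarity from all (user, item) pairs."""
    users_by_item = defaultdict(set)
    for user, item in interactions:
        users_by_item[item].add(user)

    sims = defaultdict(dict)
    for a, b in combinations(sorted(users_by_item), 2):
        common = len(users_by_item[a] & users_by_item[b])
        if common:
            # Cosine similarity between binary "consumed by" vectors.
            score = common / sqrt(len(users_by_item[a]) * len(users_by_item[b]))
            sims[a][b] = score
            sims[b][a] = score
    return sims


# Hypothetical interaction log, reprocessed in full on every run.
log = [
    ("ann", "m1"), ("ann", "m2"),
    ("bob", "m1"), ("bob", "m2"), ("bob", "m3"),
    ("cat", "m3"),
]
sims = item_similarities(log)
```

Note that nothing a user does between two runs affects the similarities, which is exactly the limitation discussed next.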
That works, or at least works okay-ish. Long tail and cold start are the common pitfalls of that approach. From a user perspective, there are two problems: the lack of personalization, and the assumption that the same user behaves the same way in two different sessions. The former has recently been tackled by rearranging the recommended content based on the user profile, usually by introducing the user vector in the same space as the embeddings generated from the content (if the last sentence sounds obscure, don’t worry, there will be an article on that at some point).
The second problem is a lot harder to solve.
Isn’t it true that when you are at the gym you listen to a completely different type of music than when you are at home getting ready to go to sleep?
Aren’t people who are passionate about action movies allowed to indulge in a romantic comedy from time to time?
Those are just a few examples of session-specific behaviour.
If you spend some time analysing your users’ sessions and clustering them, you will notice that the assumption that the user ‘always wants the same thing’ is mostly wrong.
What you want to do is react within the session and move from batch-based recommendation to real-time recommendation.
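One possible way to react within the session, sketched under simple assumptions, is to keep the precomputed batch scores and blend them with a signal from the current session. The example below assumes every item has a single genre and that the session is summarised as the list of genres consumed so far; the `rerank` function, the `weight` parameter and the sample data are all hypothetical.

```python
from collections import Counter


def rerank(batch_scores, item_genre, session_genres, weight=0.5):
    """Blend precomputed batch scores with the genre mix seen in the current session."""
    if not session_genres:
        # No session signal yet: fall back to the batch ranking.
        return sorted(batch_scores, key=batch_scores.get, reverse=True)

    counts = Counter(session_genres)
    total = sum(counts.values())

    def blended(item):
        # Share of the session spent on this item's genre (0 if never seen).
        session_affinity = counts[item_genre[item]] / total
        return (1 - weight) * batch_scores[item] + weight * session_affinity

    return sorted(batch_scores, key=blended, reverse=True)


batch_scores = {"action_1": 0.9, "action_2": 0.8, "romcom_1": 0.6}
item_genre = {"action_1": "action", "action_2": "action", "romcom_1": "romcom"}

cold = rerank(batch_scores, item_genre, [])
warm = rerank(batch_scores, item_genre, ["romcom", "romcom"])
```

With an empty session the action titles stay on top; after two romantic comedies in the same session, `romcom_1` moves to first place even though the batch model still ranks it last.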
In the rest of the article we will go through the architecture needed to develop this and many other real-time features.
The Journey of Data
In the real world, data comes from a heterogeneous ecosystem. Backend services and clients emit data, systems emit data, and often there is inbound data from the outside world. Data is collected by agents. You can consider an agent to be anything that can fulfil this task: from the SDK that instruments data collection in your iOS app, to an inbound upload to an SFTP server from an external provider. Even a human entering data manually is an agent in this context.
The agent also takes care of some early structuring of the data. It is usually here that a schema is applied to the raw data.
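As an illustration of this early structuring step, the sketch below turns a loosely typed payload into a typed event and rejects anything that does not fit. The `PlayEvent` schema, its field names and the `ts` key are assumptions made for the example, not part of any particular SDK.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class PlayEvent:
    """Hypothetical schema for a 'user played some content' event."""
    user_id: str
    content_id: str
    timestamp: datetime


def structure(raw):
    """Agent-side step: turn a raw payload into a typed event or raise ValueError."""
    try:
        return PlayEvent(
            user_id=str(raw["user_id"]),
            content_id=str(raw["content_id"]),
            # Assumes the raw payload carries a Unix timestamp in seconds.
            timestamp=datetime.fromtimestamp(float(raw["ts"]), tz=timezone.utc),
        )
    except (KeyError, TypeError, ValueError) as exc:
        raise ValueError(f"payload does not match PlayEvent schema: {raw!r}") from exc


event = structure({"user_id": 42, "content_id": "m1", "ts": "0"})
```

Applying the schema this close to the source means downstream consumers can rely on field names and types instead of re-validating every message.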
Now the data is ready to be delivered through a messaging system. Kafka and Google Cloud Pub/Sub are among the best-known event delivery systems, both based on the publisher-subscriber design pattern. These systems organise messages into topics, and subscribers register their intention to consume a specific subset of those topics.
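The publisher-subscriber pattern itself fits in a few lines. The toy in-memory `Broker` below is a hypothetical illustration of topics and subscriptions only; it is not the Kafka or Pub/Sub client API and leaves out persistence, partitioning and delivery guarantees.

```python
from collections import defaultdict


class Broker:
    """Toy in-memory broker illustrating topics and subscriptions."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        # A subscriber registers interest in one topic at a time.
        self._subscribers[topic].append(handler)

    def publish(self, topic, message):
        # The publisher only knows the topic, never the subscribers.
        for handler in self._subscribers.get(topic, ()):
            handler(message)


broker = Broker()
recommender_inbox = []
billing_inbox = []

broker.subscribe("plays", recommender_inbox.append)
broker.subscribe("payments", billing_inbox.append)

broker.publish("plays", {"user_id": "42", "content_id": "m1"})
broker.publish("searches", {"user_id": "42", "query": "romcom"})
```

The recommender receives only the `plays` message, and the `searches` message is silently dropped because nobody subscribed to that topic. Publishers and consumers stay decoupled, which is what lets new real-time features be added without touching the agents.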
Stakeholder-driven design
The idea of the Kappa Architecture was first proposed by Jay Kreps. He recognized how difficult it is to implement the Lambda Architecture, which requires maintaining the same job in two different systems (batch and real time). He proposed having only one system in which both batch and real-time data processing can be performed, so that only one framework has to be maintained.
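A rough way to picture the single-system idea: all events live in a replayable log, and reprocessing means running a new version of the same streaming job over the log from the beginning, rather than maintaining a separate batch job. The sketch below is an in-memory illustration under that assumption; `EventLog`, `plays_v1`, `plays_v2` and the 30-second rule are hypothetical.

```python
class EventLog:
    """Append-only log standing in for a replayable stream."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)

    def read_from(self, offset=0):
        return iter(self._records[offset:])


def plays_v1(records):
    """First version of the job: count plays per content item."""
    counts = {}
    for record in records:
        counts[record["content_id"]] = counts.get(record["content_id"], 0) + 1
    return counts


def plays_v2(records):
    """Revised job: ignore plays shorter than 30 seconds."""
    return plays_v1(r for r in records if r["seconds"] >= 30)


log = EventLog()
for content_id, seconds in [("m1", 120), ("m1", 5), ("m2", 45)]:
    log.append({"content_id": content_id, "seconds": seconds})

view_v1 = plays_v1(log.read_from(0))
# Logic changed: replay the same log from offset 0 with the new code.
view_v2 = plays_v2(log.read_from(0))
```

Both views come from the same code path and the same log; only the job version differs, so there is no second batch implementation to keep in sync.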
Data Layer – Storage
Data Layer – Query
The phases of RealTime
Valuation and Deployment
Real Time Scoring
My advice is to use a batch processing framework like MapReduce if you aren’t latency sensitive, and use a stream processing framework if you are, but don’t try to do both at the same time unless you absolutely must.
As information technology systems become less monolithic and more distributed, real-time data analytics becomes less exotic and more commonplace. With infrastructure costs falling and the slow platformisation of data science and machine learning techniques, real-time analytics will soon become a commodity that companies leverage as a competitive advantage. A new branch of data science will gain more and more importance: Decision Science. The creation of analytics and the consumption of analytics are two different things. You need processes for translating analytics into good decisions. The challenge will be less about transforming the technology and more about transforming the people and processes around data-driven decision making.