Creating your own analytics platform within Liferay: A distributed commit log

In the last entry I gave a quick overview of the proposed solution to the "problem" of building an analytics platform within Liferay. In this entry I will go deeper into the log data structure, introduce the Apache Kafka project and analyse how we can connect Liferay and Kafka to each other.

As a quick reminder, I previously said that a log data structure is a perfect fit when you have a data workflow problem.


A log is a very simple data structure (possibly one of the simplest ones): an ordered, append-only sequence of records. For those of you who are familiar with database internals, log data structures have been widely used to implement ACID support in relational databases, and their usage has evolved over time: logs are now also used to implement replication between databases (you can take a look at the many implementations available out there).
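To make the idea concrete, here is a toy sketch of an append-only log in Java. This is purely an illustration of the concept (real implementations persist records to disk and handle truncation, checksums and so on):

```java
import java.util.ArrayList;
import java.util.List;

/**
 * A toy append-only log: records can only be added at the end and
 * every record is identified by its offset (its position in the log).
 */
public class AppendOnlyLog<T> {

    private final List<T> records = new ArrayList<>();

    // Appending returns the offset assigned to the new record.
    public synchronized long append(T record) {
        records.add(record);
        return records.size() - 1;
    }

    // Reads are by offset; existing records are never modified.
    public synchronized T read(long offset) {
        return records.get((int) offset);
    }
}
```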

Ordering and data distribution become even more important when we move into the distributed systems world; take a look at protocols like ZAB (the protocol used by ZooKeeper), Raft (a consensus algorithm designed to be easy to understand) or Viewstamped Replication. Sadly, distributed systems theory is beyond the scope of this blog post :)

Let's move on to some more practical details and see how we can model all the different streams of information we already have.

Apache Kafka

Kafka is a distributed, partitioned, replicated commit log service. It provides the functionality of a messaging system, but with a unique design.
  • Kafka keeps feeds of messages in categories called topics
  • Processes that publish messages to a Kafka topic are called producers
  • Processes that subscribe to topics and process the feed of published messages are called consumers

It is not the goal of this post to cover Kafka's internals so, in case you are interested, good documentation is available on the project's web page.
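Still, to get a feel for the producer side, here is a minimal sketch using the Kafka Java client. The broker address and the "ratings" topic name are just placeholders for this series of posts:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class RatingsProducer {

    public static void main(String[] args) {
        Properties props = new Properties();

        // Address of one of the Kafka brokers.
        props.put("bootstrap.servers", "localhost:9092");

        // How keys and values are turned into bytes on the wire.
        props.put("key.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish a single record to the "ratings" topic.
            producer.send(new ProducerRecord<>("ratings", "entry-12345", "0.8"));
        }
    }
}
```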

Connecting Kafka and Liferay

While building the first prototype of the communication channel between the two systems I had a few goals in mind:

  • Easy to deploy and configure
  • Transparent for the regular user so learning a new API is not mandatory
  • Allow advanced usage of the Kafka API

I've built a small OSGi plugin which bridges our Liferay portal installation with a Kafka broker through the Message Bus API. A general overview of how this integration works is shown in the next figure.

The data flow depicted in the previous picture is extremely simple:

  1. The Kafka bridge registers a new destination within the Message Bus. At the time of this writing this destination is called "kafka_destination" and cannot be changed.
  2. If you want to send a message to the Kafka broker, you just need to publish a message to that destination.
  3. That message needs to declare:
  • The name of the Kafka topic we want to publish to
  • The payload with the contents of the message we want to store

You can find all the source code of the Kafka bridge at my GitHub repo.
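To give a feel for it, here is a minimal sketch of what publishing through the bridge looks like from calling code. The "kafka_destination" name comes from the bridge itself, but the helper class and the "topic" message attribute used to carry the topic name are assumptions for illustration; check the bridge source for the exact contract:

```java
import com.liferay.portal.kernel.messaging.Message;
import com.liferay.portal.kernel.messaging.MessageBusUtil;

public class KafkaBridgeClient {

    public static void publish(String topic, String payload) {
        Message message = new Message();

        // Hypothetical attribute: the bridge is assumed to read the
        // target Kafka topic name from here.
        message.put("topic", topic);

        // The payload carries the record we want to store in Kafka.
        message.setPayload(payload);

        // "kafka_destination" is the destination registered by the bridge.
        MessageBusUtil.sendMessage("kafka_destination", message);
    }
}
```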

A real example: publishing ratings

Let's write a small example where we publish all the blog ratings into our Kafka broker.

 
Every time we create or update a rating we can publish a new message:
 
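A minimal sketch of such a publisher, written as a Liferay model listener on RatingsEntry. The listener class, the "ratings" topic name and the "topic" message attribute are assumptions for illustration, and package names may differ across Liferay versions:

```java
import com.liferay.portal.kernel.messaging.Message;
import com.liferay.portal.kernel.messaging.MessageBusUtil;
import com.liferay.portal.kernel.model.BaseModelListener;
import com.liferay.portal.kernel.model.ModelListener;
import com.liferay.ratings.kernel.model.RatingsEntry;

import org.osgi.service.component.annotations.Component;

/**
 * Publishes every created or updated rating to the Kafka bridge
 * destination through the Message Bus.
 */
@Component(immediate = true, service = ModelListener.class)
public class RatingsEntryKafkaListener extends BaseModelListener<RatingsEntry> {

    @Override
    public void onAfterCreate(RatingsEntry ratingsEntry) {
        publish(ratingsEntry);
    }

    @Override
    public void onAfterUpdate(RatingsEntry ratingsEntry) {
        publish(ratingsEntry);
    }

    protected void publish(RatingsEntry ratingsEntry) {
        Message message = new Message();

        // Hypothetical attribute: the bridge is assumed to read the
        // target Kafka topic name from here.
        message.put("topic", "ratings");

        // A simple textual payload with the rated entity and its score.
        message.setPayload(
            ratingsEntry.getClassPK() + ":" + ratingsEntry.getScore());

        MessageBusUtil.sendMessage("kafka_destination", message);
    }
}
```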
 
 
As you can see in the previous source code, there is no new API to learn: you can publish your messages to the Kafka broker using just our Message Bus API. In order to test the example you just need to create a Kafka topic with the name used in the previous snippet and, for now, use the command-line Kafka client included in the Kafka installation.
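For example, with the scripts shipped in the Kafka distribution (exact flags depend on your Kafka version; newer releases replace --zookeeper with --bootstrap-server localhost:9092):

```
# Create the topic used in the snippets above.
bin/kafka-topics.sh --create --zookeeper localhost:2181 \
  --replication-factor 1 --partitions 1 --topic ratings

# Watch the messages arriving from Liferay.
bin/kafka-console-consumer.sh --zookeeper localhost:2181 \
  --topic ratings --from-beginning
```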
 
At this point we have a general overview of the system we want to build and of how we can interconnect two of its main components (Liferay and Kafka). In the upcoming entries we will build a more complex example where we will put in place the last piece of our infrastructure: the analytics side.
 
We will analyse some more advanced usages of Kafka, and we will introduce Spark as the foundation framework for building our analytics processes. Real-time and batch processing, machine learning algorithms and graph processing will be some of our future topics.