Creating your own analytics platform within Liferay

Yesterday I was speaking at the Liferay North America Symposium here in Boston about how you can get more value out of all the data you already own (even if you are not aware you own it). It was the first time I have spoken at the North America event, so it was really exciting for me (on top of that, they put my talk in the big room :) To be honest, I am not sure how the talk went... I tried to keep all the gory details hidden (at least as much as I could), but I am not sure I succeeded. The good part is that I felt pretty comfortable during the talk :)
 
Coming back to the topic of the talk, I mainly went through some of the most popular storage and computational models available in the Big Data arena nowadays, and right after that I proposed a reference architecture based on Open Source building blocks. Throughout this and a few upcoming blog posts I would like to share with you some of the technical details I deliberately omitted during my talk, and build a simple but powerful analytics platform on top of Liferay.
 

Reference architecture

In this first blog post I would like to take a quick tour over the main components of my proposed solution in order to offer a general overview of what I am trying to accomplish. A really general overview of the final solution would be something like this:

Basic reference architecture

As you can see in the previous image, this is a really simple architecture, but over the coming blog posts we will discover how it can turn into a powerful and useful system. I'm basically trying to build a completely decoupled system where the source of the information has nothing to do with the consumer of it.

This decoupling allows us to split our efforts: for example, we could have one team in charge of generating the user tracking data (maybe on the client side) while another team reads this data from the event system and does some processing on the stream of information.
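
To give an idea of what that shared agreement between the two teams could look like, here is a minimal, purely illustrative event class. The class name and fields are my own assumptions for this sketch (they are not part of Liferay or of the final design); the only point is that producer and consumer need to agree on nothing more than something like this:

```java
import java.io.Serializable;

// Hypothetical event "contract" shared by the producing and the consuming side.
public class UserTrackingEvent implements Serializable {

    private static final long serialVersionUID = 1L;

    private final long userId;      // who triggered the event
    private final String eventType; // e.g. "PAGE_VIEW", "RATING"
    private final String entityId;  // the entity or URL the event refers to
    private final long timestamp;   // when it happened (epoch millis)

    public UserTrackingEvent(long userId, String eventType, String entityId, long timestamp) {
        this.userId = userId;
        this.eventType = eventType;
        this.entityId = entityId;
        this.timestamp = timestamp;
    }

    public long getUserId() { return userId; }
    public String getEventType() { return eventType; }
    public String getEntityId() { return entityId; }
    public long getTimestamp() { return timestamp; }
}
```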

We have three main pieces within the system we are trying to build:

The first one is the sources of information. This is an area where Liferay really shines, because you already have tons of different data sources with really useful info: ratings on different entities such as blog posts or message board entries, how different entities are interrelated, all the information you have stored in the database, search indexes, system events (such as transaction information, render times, ...), browsing info (this is something we've done for the Content Targeting project), and many more I'm sure I am missing at this very moment.
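
Just to make one of those sources a bit more tangible, here is a rough sketch of how browsing info could be captured with a plain servlet filter that records page views and hands them to a small publisher abstraction. All the class names here are hypothetical, and in a real Liferay installation this would typically be wired in through the portal's own extension mechanisms rather than as a standalone filter:

```java
import java.io.IOException;

import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;

public class PageViewTrackingFilter implements Filter {

    // Hypothetical collaborator that pushes events to the event system;
    // a Kafka-backed implementation is sketched later in the post.
    private final EventPublisher eventPublisher = new LoggingEventPublisher();

    @Override
    public void init(FilterConfig filterConfig) {
    }

    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
        throws IOException, ServletException {

        if (request instanceof HttpServletRequest) {
            HttpServletRequest httpRequest = (HttpServletRequest)request;

            // Capture the browsing information for this request
            eventPublisher.publish(
                "PAGE_VIEW", httpRequest.getRequestURI(), System.currentTimeMillis());
        }

        chain.doFilter(request, response);
    }

    @Override
    public void destroy() {
    }

    // Minimal publisher abstraction so the filter stays decoupled from the
    // concrete event system (it could just as well carry the event class above)
    interface EventPublisher {
        void publish(String eventType, String entityId, long timestamp);
    }

    static class LoggingEventPublisher implements EventPublisher {
        @Override
        public void publish(String eventType, String entityId, long timestamp) {
            System.out.println(eventType + " " + entityId + " " + timestamp);
        }
    }
}
```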

The second main piece is the Event System. I am calling it an Event System because I think most of you will be pretty familiar with that terminology, but I'm basically referring to a log. Personally, I think a log-like data structure is the best solution when you have to solve a problem of data flow between different systems.

A log is just a data structure where the only way to add new information is to append it at the end, and all the records you insert are ordered by time.

Log data structure preview
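
To make those semantics concrete, here is a toy, in-memory sketch of an append-only log. It is obviously not a replacement for a real distributed log, just an illustration of the "append at the end, read in order" idea:

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration of the log data structure: records can only be appended
// at the end, each append gets a monotonically increasing offset, and the
// order of the records is the order in which they arrived.
public class AppendOnlyLog<T> {

    private final List<T> records = new ArrayList<T>();

    // Appending is the only write operation; it returns the record's offset
    public synchronized long append(T record) {
        records.add(record);

        return records.size() - 1;
    }

    // Readers pull records sequentially, starting from the offset they choose
    public synchronized List<T> readFrom(long offset) {
        int start = (int)Math.min(offset, (long)records.size());

        return new ArrayList<T>(records.subList(start, records.size()));
    }
}
```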

We will go deeper into this data structure in upcoming entries and we will see how the Apache Kafka project satisfies all the requirements we have. Of course, we will also see how we can connect Liferay with the Apache Kafka distributed log.
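
As a small teaser, and assuming a topic called "user-tracking" and a plain-text payload (both arbitrary choices for this example, not part of the final design), appending an event to Kafka from the Java side could look roughly like this with the standard Kafka producer client:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TrackingEventProducer {

    public static void main(String[] args) {
        Properties props = new Properties();

        // Address of the Kafka brokers; adjust to your own cluster
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");

        Producer<String, String> producer = new KafkaProducer<String, String>(props);

        // Each event is appended to the "user-tracking" topic; using the user
        // id as the key would send all of a user's events to the same partition
        producer.send(new ProducerRecord<String, String>(
            "user-tracking", "10101", "PAGE_VIEW /web/guest/home 1446120000000"));

        producer.close();
    }
}
```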

Last but not least, we have the third main piece of our new infrastructure: the computational and analytics side. Once we have all the information we need stored within the log, we may need to "move" some of this data into an HDFS cluster so we can do some data crunching, or we may want to do some "real time" analysis on a stream, or maybe we just want to write some machine learning algorithm to create a useful mathematical model. Don't worry, we will go deeper into all of this in the future.
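
To close this overview, here is a sketch of what the simplest possible consumer on the analytics side could look like: it reads the same hypothetical "user-tracking" topic with the standard Kafka consumer client, assumes the plain-text payload format from the producer sketch above, and keeps an in-memory count of page views per URL. A real deployment would of course persist or stream those results somewhere else:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class PageViewCounter {

    public static void main(String[] args) {
        Properties props = new Properties();

        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "analytics");
        props.put("key.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<String, String>(props);

        consumer.subscribe(Collections.singletonList("user-tracking"));

        // Running count of page views per URL, kept in memory for simplicity
        Map<String, Long> pageViews = new HashMap<String, Long>();

        try {
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(1000));

                for (ConsumerRecord<String, String> record : records) {
                    // Payload format assumed from the producer sketch:
                    // "EVENT_TYPE uri timestamp"
                    String[] fields = record.value().split(" ");

                    if ((fields.length == 3) && "PAGE_VIEW".equals(fields[0])) {
                        String uri = fields[1];
                        Long count = pageViews.get(uri);

                        pageViews.put(uri, (count == null) ? 1L : count + 1);
                    }
                }
            }
        }
        finally {
            consumer.close();
        }
    }
}
```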

I know I haven't included too many technical details in this entry, but I will in future posts, I promise.

Comments
Thank you for sharing your implementation. I was actually thinking of designing something decoupled from Liferay that would collect fine-grained statistics on users' activity and content, using Piwik with some relevant database queries. Your approach looks more general and goes far beyond what I had in mind. I am looking forward to your next article.