In previous entries I described how to connect Liferay and Apache Kafka, and I showed a couple of examples of how this integration can be done (you can find the first blog entry here and the second one here).
At this point we are able to collect as much information as we need and store it in a reliable, scalable platform, but we are still missing the most important part of the picture: what are we going to do with all this information? We are not getting any value from it yet, so let's try to fix that!
We are going to build a message classifier for our message boards, so we can group the new topics created in our system into clusters. A general overview of the problem we are trying to solve is depicted in the picture below.
1. Data collection
The first step of the process (as we already covered in one of the previous entries) is to collect the information we need; in this case, message board entries. We can create a new model listener (be careful with model listeners ...) and publish a message to our Kafka installation every time a new entry is created.
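A minimal sketch of such a listener could look like the following (this is not the post's exact code: the topic name, broker address and exact Liferay package paths are assumptions, and package names vary between Liferay versions):

```scala
// Hypothetical sketch of a Liferay model listener that publishes every new
// message board entry to a Kafka topic. Broker address and topic name
// ("mb-messages") are illustrative assumptions.
import java.util.Properties

import com.liferay.portal.kernel.model.BaseModelListener
import com.liferay.message.boards.model.MBMessage
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

class MBMessageListener extends BaseModelListener[MBMessage] {

  private val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092")
  props.put("key.serializer",
    "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer",
    "org.apache.kafka.common.serialization.StringSerializer")

  private val producer = new KafkaProducer[String, String](props)

  // Invoked by Liferay after a new message board entry is persisted
  override def onAfterCreate(message: MBMessage): Unit =
    producer.send(new ProducerRecord("mb-messages",
      message.getMessageId.toString, message.getBody))
}
```

In an OSGi-based Liferay the listener would also need to be registered as a `ModelListener` service (for example with an `@Component` annotation) so the portal picks it up.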
2. Collecting the input for the model creation process
Every time a new MBMessage is created we send it to our Kafka broker, so this information is already persisted in durable storage. Now we need to collect the input to build our classifier and, in order to do that, we are going to use Spark Streaming to read all the messages published to our Kafka topic and store them in a different place.
It is important to note that there is no golden rule to determine the type and quantity of information you need to build your model. For example, we could rebuild our model every other day using all the messages stored during the last two weeks. Depending on the quality and amount of information you have, your model's results will vary widely.
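The collection job can be sketched roughly as follows (this is not the post's exact snippet: the topic name, broker address and output path are assumptions, and it targets the Spark 1.x `spark-streaming-kafka` API):

```scala
// Hedged sketch: read message bodies from the Kafka topic with Spark
// Streaming and append each batch to the filesystem as text files.
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object CollectMessages {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("collect-mb-messages")
      .setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))

    val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
    val stream = KafkaUtils
      .createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, Set("mb-messages"))

    // Keep only the message value (the entry body) and persist every batch;
    // the training job later reads the whole directory as its input.
    stream.map(_._2).saveAsTextFiles("/tmp/mb-messages/batch")

    ssc.start()
    ssc.awaitTermination()
  }
}
```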
As shown in the previous snippet, we are just reading from a Kafka topic and storing everything we read in the filesystem. We could store these messages in an HDFS cluster with no effort at all but, for writing and testing purposes, I am just using my computer's filesystem. This data will be the input used to build our clustering model.
3. Creating our model
Once we have collected the input, we are ready to build our clustering model. In order to create it, we are going to use the K-means algorithm (through the Spark MLlib framework).
Clustering is an unsupervised learning problem where we aim to group subsets of entities based on some notion of similarity. You can find more info here.
The input to our model's training is all the information we collected in the previous step, and we will store the output of the training process (the model), again, in our filesystem.
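A hedged sketch of the training job is shown below (the input/output paths and the `numClusters` / `numIterations` values are illustrative assumptions, not necessarily the post's values):

```scala
// Sketch of the batch training job: load the collected bodies, turn them
// into feature vectors and train a K-means model with Spark MLlib.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans

object TrainModel {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("train-kmeans").setMaster("local[2]"))

    // All the text files written by the collection job
    val bodies  = sc.textFile("/tmp/mb-messages/*")
    val vectors = bodies.map(Utils.featurize).cache()

    val numClusters   = 10  // assumption: number of topic groups
    val numIterations = 20  // assumption: K-means iterations
    val model = KMeans.train(vectors, numClusters, numIterations)

    // Persist the trained model so the classification job can load it
    model.save(sc, "/tmp/mb-model")
  }
}
```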
Looking at the previous code, we can see the following line:
val vectors = bodies.map(Utils.featurize).cache()
The featurize function is defined in the Utils.scala file as follows:
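A reconstruction of that helper could look like this (the feature-space size of 1000 and the whitespace tokenization are assumptions on my part):

```scala
// Utils.scala — turn a message body into a term-frequency vector using
// MLlib's HashingTF (the "hashing trick": each term is hashed to an index
// in a fixed-size feature space, so no dictionary needs to be maintained).
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector

object Utils {
  private val numFeatures = 1000          // assumed feature-space size
  private val tf = new HashingTF(numFeatures)

  // Lower-case the body, split it into terms and hash them into a
  // sparse vector of term frequencies.
  def featurize(s: String): Vector =
    tf.transform(s.toLowerCase.split("\\s+").toSeq)
}
```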
As you can see, we are using org.apache.spark.mllib.feature.HashingTF, which maps a sequence of terms to their term frequencies using the hashing trick.
The model's effectiveness will tell us how frequently we should update it, and whether the data we selected as input for the training process was the right one. But this is something that depends heavily on your own data.
4. Clustering messages
We are just missing the last step: classifying new incoming messages. We have collected the data and trained the model; now, every time a new message board entry is created, we want to determine its group (i.e., we want to "predict" which cluster the new message belongs to).
Again, we are using Spark Streaming to read, in (near) real time, the messages published to the Kafka topic and apply our previously created model to them.
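The classification job can be sketched like this (again not the post's exact snippet: paths, topic and broker names are assumptions, and it relies on the `Utils.featurize` helper described above):

```scala
// Hedged sketch: stream new message bodies from Kafka, featurize each one
// and ask the trained K-means model which cluster it belongs to.
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.mllib.clustering.KMeansModel

object ClassifyMessages {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("classify-mb-messages")
      .setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Load the model persisted by the training job
    val model = KMeansModel.load(ssc.sparkContext, "/tmp/mb-model")

    val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
    val stream = KafkaUtils
      .createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, Set("mb-messages"))

    stream.map(_._2).foreachRDD { rdd =>
      rdd.foreach { body =>
        // "Predict" the cluster of every new message
        val cluster = model.predict(Utils.featurize(body))
        println(s"cluster $cluster: ${body.take(60)}")
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```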
As you can see, the code is fairly simple: we just need to "stream" the messages from the Kafka topic and apply the model to every new message.
5. Wrapping up
We've created a simple but powerful classification system with very little code, thanks to the power of open source. We could apply this approach to many other areas of our system, like, for example, clustering users based on the way they browse it.
Of course, clustering is not the only thing you can do (and K-means is not the best algorithm either): linear regression, collaborative filtering, dimensionality reduction, feature extraction, etc. Upcoming entries will cover some of those topics (possibly we will create a recommender system for our portal in the next one).
You can find all the source code here.
Note that this code is not production-ready; it has been written for teaching purposes.