Big Data Processing at Easemob – Big Data Processing Platform
Easemob's plug-in and SDK for mobile text and voice messaging have recently become extremely popular in China and are used by over 13,000 mobile apps with millions of registered users in total. One of the driving design principles of Easemob is that better-quality social activity means a better app with a more engaging user experience. With this in mind, Easemob decided to invest in Big Data technology for app developers, in the form of an interactive web portal that presents business analytics related to their plug-ins.
Figure 1 Easemob Social Activity Big Data Analysis Application
Most data on mobile apps that use Easemob’s plug-in is consumed in real time. Easemob recognized that delivering results to app developers closer to real time could help generate more engaging social activities by allowing more dynamic interaction with users. The finer-grained detail of a real-time system also provides much more information than an aggregate delayed by multiple hours. Hence, the primary goal of Easemob's big data analysis system was to provide analytic results to app developers with minutes or even seconds of latency.
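To make the difference between fine-grained and multi-hour aggregates concrete, here is a minimal sketch in plain Python. It is not Easemob's implementation and does not use any Spark API; the event shape (a timestamp in seconds plus an app identifier), the sample values and the function name count_per_window are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical event shape for this sketch: (epoch_seconds, app_id).
def count_per_window(events, window_seconds):
    """Count events per (window start, app) using fixed, non-overlapping windows."""
    counts = Counter()
    for timestamp, app_id in events:
        # Round the timestamp down to the start of its window.
        window_start = timestamp - (timestamp % window_seconds)
        counts[(window_start, app_id)] += 1
    return counts


# Made-up sample events, not real Easemob data.
events = [(0, "app-a"), (30, "app-a"), (61, "app-a"), (65, "app-b"), (3700, "app-a")]

per_minute = count_per_window(events, 60)    # one bucket per minute
per_hour = count_per_window(events, 3600)    # one bucket per hour

for key in sorted(per_minute):
    print("minute window", key, per_minute[key])
for key in sorted(per_hour):
    print("hour window", key, per_hour[key])
```

With the same input, the per-minute counts show when activity for each app rose or fell inside the hour, whereas the hourly counts collapse that detail into a single number per app.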
To achieve this target, we did not have many options. The original idea was to use Apache Hadoop. A Hadoop MapReduce/Hive-based solution can certainly handle the scale Easemob requires, with high availability and accuracy. However, its batch-processing logic, built on the MapReduce framework, means it probably would not deliver results in real time. Easemob would therefore not be able to offer its app developers a specific SLA if that approach were implemented.
After some careful research, we arrived at a solution that relies on Spark integrated with Cassandra to deliver results. Easemob was already using Cassandra for its messaging system, and this NoSQL big table database has proven able to scale and to meet the requirements of Easemob's app clients. In summary, we chose Spark and Cassandra for the new big data processing system because of the high performance Spark can provide and the high availability and scalability of Cassandra. Some details about Spark and Cassandra are discussed later in this article.
Spark is an open source cluster computing system developed in the UC Berkeley AMPLab. Benchmarks run by Berkeley compare the performance of the Spark/Shark combination with that of the Hadoop/Hive combination.
Figure 2 Benchmark of Spark/Shark and Hadoop/Hive
It is not surprising to see Spark beat Hadoop easily in these benchmarks. Spark provides primitives for in-memory cluster computing, whereas MapReduce processing can only occur on data stored either in a file system (unstructured) or in a physical database (structured). MapReduce therefore cannot avoid the I/O bottleneck between the individual jobs of an iterative workflow: each job writes its output to storage, and the next job has to read it back. Spark can avoid this by keeping the working data in memory across iterations.
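The I/O argument can be illustrated with a toy model in plain Python. This is only a sketch under simplifying assumptions: it uses neither Spark nor Hadoop, and CountingStore, step, run_job_per_iteration and run_in_memory are hypothetical names invented here. It simply counts how often an iterative workflow touches storage in each style.

```python
# Toy model only: counts storage accesses, it does not measure real performance.

class CountingStore:
    """Stands in for a file system or database and counts reads and writes."""

    def __init__(self, records):
        self._records = list(records)
        self.reads = 0
        self.writes = 0

    def read(self):
        self.reads += 1
        return list(self._records)

    def write(self, records):
        self.writes += 1
        self._records = list(records)


def step(values):
    """One iteration of some refinement: halve every value."""
    return [v / 2 for v in values]


def run_job_per_iteration(store, iterations):
    """Batch style: every iteration is a separate job that reads and writes storage."""
    for _ in range(iterations):
        store.write(step(store.read()))
    return store.read()


def run_in_memory(store, iterations):
    """In-memory style: load once, iterate on the cached data, write once."""
    cached = store.read()
    for _ in range(iterations):
        cached = step(cached)
    store.write(cached)
    return cached


disk_style = CountingStore([8.0, 16.0])
memory_style = CountingStore([8.0, 16.0])
result_a = run_job_per_iteration(disk_style, iterations=3)
result_b = run_in_memory(memory_style, iterations=3)

print("same result:", result_a == result_b)
print("job per iteration:", disk_style.reads, "reads,", disk_style.writes, "writes")
print("in memory:", memory_style.reads, "reads,", memory_style.writes, "writes")
```

Both styles compute the same values, but the number of storage accesses in the job-per-iteration style grows with the number of iterations, while the in-memory style stays at one read and one write.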
Apart from its lightning speed, Spark provides a few interesting features that Easemob’s big data processing system will need sooner or later. Shark is to Spark what Hive is to Hadoop. It builds directly on the Apache Hive codebase but has a “Spark” heart: underneath, it uses the Spark execution engine for query processing, which is why it runs much faster than Hive. Spark also directly supports real-time processing through its streaming framework, on which Easemob’s social activity big data engine will be built. Spark also comes with a tightly integrated machine learning library (MLlib). Finally, Spark supports interactive data mining, a feature that is absolutely vital to the future success of Easemob’s social activity big data analysis system.
As for Cassandra, as mentioned earlier in this article, Easemob selected it for its messaging system. Today at Easemob, Cassandra handles more than 100 million messages a day across multiple data centers, and the big table database has proven able to scale. Figure 3 below shows what the big data processing platform for Easemob’s future social activity analysis engine looks like.
Figure 3 - Platform for Easemob Social Activity Big Data Analysis
There is one thing I particularly like about Spark: it makes use of the Scala language, which allows distributed datasets to be manipulated like local collections. Scala combines object-oriented and functional programming and has a very strong static type system. It overcomes many shortcomings of the Java language, yet it still compiles to Java bytecode and therefore runs on the Java Virtual Machine. Interoperability between the two languages is well maintained, and Scala is integrated seamlessly into Spark.
In the big data world, Spark is a relatively new technology; however, it is already used in production by many companies, such as Yahoo, Sohu and eBay. The full list can be viewed on the Apache website.
About the author
Zhi Huang is the Head of the Big Data Analytics Department at easemob.com. He can be reached at email@example.com.
Today, in China, our mobile text and voice messaging communication service serves more than 13,000 mobile apps and hundreds of millions of users. We are currently expanding our business globally. If you are an app developer, you are more than welcome to use our free service at any time.