St Andrews HCI Research Group


Jun 2013

Information Retrieval and Real-time Analysis with Storm

Information retrieval document analysis tasks, such as those conducted by search engines are ideal Big Data tasks, as they are often embarrassingly parallelisable using techniques such as MapReduce. However, while a large portion of the document sets such as the Web are static in nature, social media sources such as Twitter and Facebook generate document streams comprised of hundreds of millions of posts each day. The easy availability, high volume and tendency of such streams to reflect real-world events make them ideal sources of information to drive applications such as event detection or real-time search. However, the scale of these data streams mean that distributed parallel processing is needed, while traditional distributed processing paradigms such as MapReduce are unsuited to streaming data due to their batch-orientated nature. In this presentation, we will provide a brief overview of Big Data analysis within IR, before moving onto the challenges that real-time streaming data streams pose, discussing the new generation of distributed steam processing platforms currently under development and illustrating how one specific use-case, namely real-time search can be accomplished using one such platform.