MG-FSM: Large-Scale Frequent Sequence Mining
MG-FSM is a scalable, distributed (i.e., shared nothing) algorithm for frequent sequence mining (FSM) on MapReduce. The algorithm can handle so-called "gap constraints", which can be used to limit the output to a controlled set of frequent sequences.
At a high-level, MG-FSM carefully partitions and rewrites the set of input sequences in such a way that each partition independently and in parallel. We achieve this by extending the notion of item-based partitioning, which underlies a number of frequent pattern mining algorithms to gap-contrained frequent sequence mining. Our experiments, in which we mined more than 1 billion sequences, suggest that MG-FSM is multiple orders of magnitude faster than baseline algorithms for general gap-constrained FSM and is competitive to state-of-the-art algorithms for n-gram mining.
For more information, please refer to our related publication.
Source code
The source code is hosted on MG-FSM GitHub repository.
Publications
- Iris Miliaraki, Klaus Berberich, Rainer Gemulla, and Spyros Zoupanos.
Mind the Gap: Large-Scale Frequent Sequence Mining.
To appear in the ACM SIGMOD International Conference on Management of Data (SIGMOD 2013), New York City, New York, USA, 2013. [pdf, bib]