Abstract
In real-world practice, software systems are often built without any explicit upfront model. This can seriously hinder the almost inevitable future evolution of the software, since at best the only documentation available is in the form of source code comments. To address this problem, research has focused on the automatic inference of models by applying machine learning algorithms to execution logs. However, the logs generated by a real software system may be very large, and the inference algorithm can exceed the processing capacity of a single computer.
Introduction
Software behavior models play an important role throughout the life cycle of software systems. Through models, software engineers can gain a deep understanding of how a system behaves without dealing with the intricacies of the implementation. Although good software engineering practice suggests that models should be developed upfront, before deriving an implementation, in reality models often do not exist or are inconsistent with the implementation. Indeed, building a proper model is costly and hard, and requires both mathematical skill and ingenuity. Moreover, even when models are developed, they are often not updated as the implementation changes, so the models and the implementation progressively diverge. Model inference is a promising approach to this problem: it uses machine learning to infer software behavior models automatically from execution logs [1–3]. Many model inference algorithms have been proposed in recent research [4–6]. To infer accurate models, the logs should contain as much detailed information as possible. However, a more detailed log also makes the model inference task harder, and the logs generated by real systems are usually very large. For example, Prospex [7] infers state machines from network logs for vulnerability analysis of network applications. In practice, passively collected network logs can be enormous, while models need to be inferred quickly to ensure the timeliness of subsequent analyses.