Abstract
In real-life classification problems, prior information about the problem and expert knowledge about the domain are often used to obtain reliable and consistent solutions. This is especially true in fields where the data is ambiguous, such as text, where the same words can appear in seemingly similar texts yet carry different meanings. A promising avenue for text classification is machine learning, which has been shown to perform well in a variety of applications, including query classification and sentiment analysis. Many of the proposed approaches rely on the bag-of-words representation, which discards information about the structure of the text. In this paper, we propose a Customised Grammar Framework for text classification, which exploits domain-related information and a new way of representing text as a series of syntactic categories forming syntactic patterns. The framework employs a formal grammar approach to transform the text into the syntactic pattern representation. We applied the framework to the query classification problem, and our results show that our approach outperforms previous ones in terms of classification performance.
Introduction
In many real-world classification problems, some prior information about the structure of the problem is known in advance, such as the relations between some attributes or the patterns that are likely to appear in certain instances. Moreover, the features extracted from many real-world problems are not completely independent, and the meaning of each feature may be influenced by other attributes and/or by the position of the attribute in the instance. For example, in signal processing, the same set of signal features may have different meanings (and thus belong to different classes) depending on the sequence in which these features appear in the signal. Another example is text classification: in addition to the words in the text, the syntax plays an important role in defining the meaning of the text. Text classification is an important task in Natural Language Processing with many applications, such as web search (e.g. Hernández, Gupta, Rosso, & Rocha, 2012; Højgaard, Sejr, & Cheong, 2016; Shi, Yao, Tian, & Jiang, 2016; Wu, Zhang, Zhao, & Liu, 2010), question–answering (e.g. Hardy & Cheah, 2013; Li, Su, Chen, & Yuan, 2017; Zhang & Lee, 2003) and sentiment analysis (e.g. Altrabsheh, Cocea, & Fallahkhair, 2014; Glorot, Bordes, & Bengio, 2011; Taboada, Brooke, Tofiloski, Voll, & Stede, 2011; Yang et al., 2017). However, traditional text classifiers often rely on many human-designed features, such as dictionaries, knowledge bases and special tree kernels, rather than on the relations between entities and the types of those entities and relations, which carry much more information for representing the texts (Wang, Song, Li, Zhang, & Han, 2016). The selection of distinctive features is therefore essential for text classification (Uysal, 2016; Uysal & Gunal, 2012).
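The role of word order can be illustrated with a minimal sketch. It is not the framework proposed in this paper: the lexicon, the category labels (QW, V, DET, N) and the function names are hypothetical and chosen only for illustration. Two queries containing the same words are indistinguishable as bags of words, but differ once each word is mapped to a syntactic category and the order is kept.

```python
from collections import Counter

# Hypothetical toy lexicon mapping words to coarse syntactic categories.
# The labels are illustrative assumptions, not the grammar terms of the paper.
TOY_LEXICON = {
    "who": "QW",        # question word
    "is": "V",          # verb
    "the": "DET",       # determiner
    "president": "N",   # noun
}


def bag_of_words(text):
    """Count word occurrences, ignoring word order."""
    return Counter(text.lower().split())


def syntactic_pattern(text):
    """Map each word to its category, preserving word order."""
    return [TOY_LEXICON.get(token, "UNK") for token in text.lower().split()]


q1 = "who is the president"
q2 = "the president is who"

# Same words, so the bag-of-words representations are identical.
same_bag = bag_of_words(q1) == bag_of_words(q2)

# Different order, so the category sequences differ.
same_pattern = syntactic_pattern(q1) == syntactic_pattern(q2)

print(same_bag, same_pattern)
```

In this sketch the first comparison holds and the second does not, which is the kind of structural information that an order-free representation cannot capture.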