Abstract
Text classification is one of the most important and typical tasks in Natural Language Processing (NLP) and underlies many applications. Recently, deep learning approaches have shown their advantages in solving the text classification problem, and the Convolutional Neural Network (CNN) is one of the most successful models in the field. In this paper, we propose a novel deep learning approach for categorizing text documents using a scope-based convolutional neural network. Unlike a window-based CNN, a scope does not require the words that construct a local feature to be contiguous, so it can represent deeper local information of text data. We propose a large-scale scope-based convolutional neural network (LSS-CNN), which is based on scope convolution, aggregation optimization, and a max pooling operation. With these techniques, we can gradually extract the most valuable local information of the text document. This paper also discusses how to calculate the scope-based information effectively and how to train in parallel on large-scale datasets. Extensive experiments have been conducted on real datasets to compare our model with several state-of-the-art approaches. The experimental results show that LSS-CNN achieves both effectiveness and good scalability on big text data.
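To make the contrast between contiguous windows and non-contiguous scopes concrete, the following is a minimal sketch and not the LSS-CNN implementation. It assumes, purely for illustration, that a scope feature is an ordered combination of k words drawn from within a span of consecutive positions. The function names `window_features` and `scope_features` and the parameters `k` and `span` are hypothetical and do not come from the paper.

```python
from itertools import combinations


def window_features(tokens, k):
    """Contiguous k-word windows, as a window-based filter would see them."""
    return [tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]


def scope_features(tokens, k, span):
    """Ordered k-word combinations whose words lie within `span` positions.

    Illustrative assumption only: the first word is fixed at `start`, and the
    remaining k - 1 words are chosen from the next `span - 1` positions, so
    the selected words need not be adjacent.
    """
    features = []
    for start in range(len(tokens) - k + 1):
        following = range(start + 1, min(start + span, len(tokens)))
        for combo in combinations(following, k - 1):
            features.append((tokens[start],) + tuple(tokens[j] for j in combo))
    return features


tokens = ["not", "very", "good", "movie"]
contiguous = window_features(tokens, k=2)
scoped = scope_features(tokens, k=2, span=3)

# The pair ("not", "good") skips over "very", so only the scoped
# enumeration can produce it.
print(("not", "good") in contiguous, ("not", "good") in scoped)
```

Under this assumption, every contiguous window is also a scope feature, while the scope enumeration additionally yields pairs such as ("not", "good") that a fixed window of the same size cannot capture.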
Introduction
The task of text classification (a.k.a. text tagging, text filtering, or text categorization) is the process of categorizing a text document into one or more predefined categories based on its content. Concretely, the goal is to build a classifier that takes a text document as input and automatically assigns relevant labels according to its content. These text documents can be emails, comments, or movie reviews. Accordingly, the labels can be spam/non-spam, positive/negative/neutral, or review scores. Text classification plays an important role in Natural Language Processing (NLP) and is widely adopted in many applications. For example, most news services today need to automatically organize a large volume of new articles every single day [1]. All modern mail services provide a function that automatically determines whether a message is junk mail [2]. Other applications include sentiment analysis [3], topic modelling [4], language translation [5], and intent detection [6]. Text classification is a challenging problem. The sparsity and high dimensionality of text, together with the presence of irrelevant or noisy features, make it non-trivial to develop a good classifier for large-scale text data. Because of its importance and difficulty, many methods have been proposed to solve the problem, ranging from traditional classification methods based on hand-crafted feature engineering [7]–[10] to emerging deep learning methods [1], [11]–[15].