Abstract
1- Introduction
2- Background
3- Methodology
4- Keyword extraction
5- Textual summarization
6- Toolkits and online resources
7- Conclusion
References
Abstract
With the advent of Web 2.0, many online platforms such as social networks, blogs and magazines result in massive textual data production. This textual data carries information that can be used for the betterment of humanity, so there is a dire need to extract the potential information it holds. This study presents an overview of approaches that can be applied to extract the valuable information nuggets residing within text and later present them in a brief, clear and concise way. In this regard, two major tasks are reviewed: automatic keyword extraction and text summarization. To compile the literature, scientific articles were collected from major digital repositories of computing research. In light of the acquired literature, the survey covers approaches from early work all the way to recent advancements based on machine learning. The survey finds that annotated benchmark datasets are not available for various textual data generators such as Twitter and social forums; this scarcity of datasets has resulted in relatively little progress in many domains. Moreover, applications of deep learning techniques to automatic keyword extraction remain relatively unaddressed, so the impact of various deep architectures stands as an open research direction. For text summarization, deep learning techniques have been applied since the advent of word vectors and currently govern the state of the art for abstractive summarization. One of the major remaining challenges in both tasks is semantics-aware evaluation of the generated results.
1- Introduction
With the advent of the World Wide Web (WWW) and later Web 2.0, a wide variety of platforms now generate enormous amounts of data. Social networking websites such as Facebook and Twitter produce terabytes of data, and question-answering websites such as Quora and StackOverflow contribute abundantly alongside various sharing portals. It is expected that, by 2020, the total data generated would be around forty-four zettabytes (Waterford Technologies, 2017; Marr, 2019). As humans communicate through various data forms, including images, videos, sound and textual streams on sites across the internet, this data carries huge value. Among these data types, this study focuses primarily on textual data. Many question-answering systems, news-wire agencies, blogging websites, search engines, digital libraries and e-commerce websites share most of their data in the form of text. The potential information hidden in these bulks of text can be used to perform a variety of tasks. For example, in the e-commerce domain, retailers and product manufacturers can better understand customer preferences by analyzing customer reviews; such analysis can help optimize products and their respective features. Similarly, in question-answering search engines, textual analysis can help in identifying keywords and generating short summaries.
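To make the keyword-identification scenario above concrete, a minimal sketch of one classical approach, TF-IDF term ranking, is shown below. This is an illustrative example and not a method proposed in this survey; the toy review corpus and function names are assumptions for demonstration only.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split text into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())

def tfidf_keywords(docs, doc_index, top_k=3):
    """Rank terms of one document by TF-IDF against a small corpus.

    TF is the term's relative frequency within the chosen document;
    IDF is log(N / df), which down-weights terms that appear in
    many documents (e.g. stopwords like "the").
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    tf = Counter(tokenized[doc_index])
    total = len(tokenized[doc_index])
    scores = {
        term: (count / total) * math.log(n_docs / df[term])
        for term, count in tf.items()
    }
    return [t for t, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]]

# Hypothetical customer reviews standing in for e-commerce data.
reviews = [
    "battery life is great but the camera is blurry",
    "camera quality is great and the screen is sharp",
    "battery drains fast and the charger overheats",
]
print(tfidf_keywords(reviews, 2))
```

Terms unique to the third review (e.g. "drains", "charger") score highest, while corpus-wide words such as "the" receive zero weight; later sections of the survey cover more sophisticated extraction approaches.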