Context-sensitive normalization of social media text in Bahasa Indonesia

Article title: Context-sensitive normalization of social media text in bahasa Indonesia based on neural word embeddings
Journal/Conference: Procedia Computer Science
Related fields of study: Computer Engineering, Information Technology Engineering
Related specializations: Algorithm and Computation Engineering, Internet and Wide Area Networks, Software Engineering, Computer Programming
Keywords: Social Media, Bahasa Indonesia, Word2Vec, Normalization, Word Embeddings, Deep Learning
Article type: Research Article
DOI: https://doi.org/10.1016/j.procs.2018.10.510
Affiliation: Department of Information Systems, Faculty of Information and Communication Technology, Institut Teknologi Sepuluh Nopember, Surabaya 60111, Indonesia
Number of pages (English article): 13
Publisher: Elsevier
Presentation type: Conference
Indexing: ISI
Publication year: 2018
Impact factor: 1.013 (2017)
H-index: 34 (2019)
SJR: 0.258 (2017)
ISSN: 1877-0509
English article format: PDF
Translation status: Not translated
Price of English article: Free
Base article: No
Product code: E11185
Table of Contents (English)

Abstract

1- Introduction

2- Related Works

3- Neural Word Embeddings

4- Normalization System Architecture

5- Performance of the System

6- Conclusion and Future Work

References

Excerpt from the Article (English)

Abstract

We present our work on the normalization of social media texts in Bahasa Indonesia. To capture the contextual meaning of tokens, we create neural word embeddings using word2vec, trained on over a million social media messages representing a mix of domains and degrees of linguistic deviation from standard Bahasa Indonesia. For each token to be normalized, the embeddings are used to generate candidates from vocabulary words. To select from among these candidates, we use a score combining their contextual similarity to the token, as gauged by their proximity in the embedding vector space, with their orthographic similarity, measured using the Levenshtein and Jaro-Winkler distances. For the normalization of individual words, we observe that detecting whether a token actually represents an incorrectly spelled word is at least as important as finding the correct normalization. Nevertheless, in the task of normalizing entire messages, the system achieves a highest accuracy of 79.59%, suggesting that our approach is promising and worthy of further exploration. We also discuss some observations on the use of neural word embeddings in the processing of informal Bahasa Indonesia texts, especially from social media.
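The candidate generation and scoring described above lend themselves to a short sketch. The following is a minimal illustration, not the authors' implementation: it assumes a trained gensim (4.x) Word2Vec model and a set of standard Indonesian words, takes candidates from the token's nearest neighbours in the embedding space, and ranks them by a weighted combination of cosine similarity and Levenshtein-based orthographic similarity. The weight `alpha`, the `lexicon` set, and the function names are hypothetical, and the Jaro-Winkler component used in the paper is omitted for brevity.

```python
# Minimal sketch, assuming a trained gensim Word2Vec model (`model`) and a
# set of standard Indonesian words (`lexicon`). The combination formula and
# the weight `alpha` are illustrative, not the paper's exact scoring.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def orthographic_similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1]; 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def normalize_token(token, model, lexicon, topn=50, alpha=0.5):
    """Return the best-scoring normalization candidate for `token`, or None.

    Candidates are lexicon words among the token's nearest neighbours in
    the embedding space; each is scored by a weighted sum of its cosine
    similarity to the token and its orthographic similarity.
    """
    if token not in model.wv:
        return None
    best, best_score = None, -1.0
    for word, cos_sim in model.wv.most_similar(token, topn=topn):
        if word not in lexicon:   # only standard words may substitute
            continue
        score = alpha * cos_sim + (1 - alpha) * orthographic_similarity(token, word)
        if score > best_score:
            best, best_score = word, score
    return best
```

In this sketch an out-of-vocabulary token simply returns None; the paper's observation that detecting whether a token really is a misspelled word matters at least as much as choosing its replacement would correspond to an additional detection step before this candidate search.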

Introduction

Texts in social media offer a rich source of information and insight into current events and people's opinions. However, they are known not to follow conventional language rules, owing to their brevity, often imposed by a length constraint, and to users' need to convey information and emotion as quickly and expressively as possible. The deviations occur not only in grammatical structure but, notably, also in the way various words are spelled. As the variations from the standard language often depend on the domain, NLP models used for processing social media texts are usually trained for a specific domain and application, and perform poorly outside their original purpose. This limits the applicability of language resources developed for social media texts. It also limits the incorporation of NLP tools trained on the standard language variety. One method that allows more generalized processing is to normalize social media texts into a standard form of the language, which is the aim of this paper.

Specifically, we attempt to normalize tokens found in social media texts in Bahasa Indonesia to a standard word with the same semantic meaning. Since the understanding of human readers is often determined by the context words surrounding the token, we use vector representations resulting from neural word embeddings, which take words in the proximity of the token into consideration. The embeddings are trained on over a million tweets from Indonesian accounts written primarily in Bahasa Indonesia, representing a mix of domains and degrees of language deviation. A standard word can only be selected as a substitute if its embedding representation has a high similarity to the token. A substitute for the token is then selected from among these contextually similar words based on its orthographic similarity to the token. From our experiments, we find that embeddings trained with the CBOW objective give better performance for word normalization than those trained with Skip-Gram. This, in combination with the other elements of our method, gives a best accuracy of 79.56% in the normalization of entire messages. However, in the task of normalizing individual tokens, it is very important to first ensure, at least to some degree, that the tokens are actually incorrectly spelled words rather than proper nouns or words from another language.

This paper is organized as follows. After this introduction, related work is briefly discussed, followed by a short exposition on neural word embeddings. We then describe the architecture of our normalization system, with more in-depth discussion of the data and the learning of the word embeddings, the creation of the Indonesian lexicon, and the scoring used to select from the normalization candidates. This is followed by a discussion of the performance of the system, both in normalizing individual words and in normalizing entire messages. Finally, we draw some conclusions and suggest avenues for future work.
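As a concrete illustration of the CBOW-versus-Skip-Gram choice mentioned above, the following sketch shows how such embeddings might be trained with gensim's Word2Vec (4.x API) on pre-tokenized messages. The hyperparameter values and the toy corpus are placeholders, not the paper's reported settings or data; the `sg` flag selects between the two objectives.

```python
# Hypothetical training sketch (not the authors' code): learn word
# embeddings from tokenized social media messages with gensim's Word2Vec.
from gensim.models import Word2Vec

# In practice this would be the corpus of over a million tokenized tweets;
# a tiny stand-in is shown so the sketch runs as-is.
tweets = [
    ["slmt", "pagi", "semua"],               # informal "selamat pagi semua"
    ["selamat", "pagi", "teman", "teman"],
]

model = Word2Vec(
    sentences=tweets,
    vector_size=100,   # dimensionality of the embedding space
    window=5,          # context words considered around each token
    min_count=1,       # raised in practice to prune rare tokens
    sg=0,              # 0 = CBOW (found better here), 1 = Skip-Gram
    workers=4,
)
model.save("id_social_media.w2v")
```

Because CBOW predicts a word from the average of its context, it tends to smooth over the noisy spellings common in social media text, which is consistent with the paper's finding that CBOW-trained embeddings outperform Skip-Gram for this normalization task.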