Abstract
We present our work on the normalization of social media texts in Bahasa Indonesia. To capture the contextual meaning of tokens, we create neural word embeddings using word2vec, trained on over a million social media messages representing a mix of domains and degrees of linguistic deviation from standard Bahasa Indonesia. For each token to be normalized, the embeddings are used to generate candidates from vocabulary words. To select from among these candidates, we use a score that combines their contextual similarity to the token, gauged by their proximity in the embedding vector space, with their orthographic similarity, measured using the Levenshtein and Jaro-Winkler distances. For the normalization of individual words, we observe that detecting whether a token actually represents an incorrectly spelled word is at least as important as finding the correct normalization. However, in the task of normalizing entire messages, the system achieves a highest accuracy of 79.59%, suggesting that our approach is quite promising and worthy of further exploration. Furthermore, we also discuss some observations we made on the use of neural word embeddings in the processing of informal Bahasa Indonesia texts, especially from social media.
Introduction
Texts in social media offer a rich source of information and insight into current events and people’s opinions. However, they are known not to follow the conventional rules of the language, owing to their brevity, often imposed by a constraint on their length, and the users’ need to convey information and emotion as quickly and expressively as possible. The deviations occur not only in grammatical structure but, notably, also in the way words are spelled. As the variations from the standard language often depend on the domain, NLP models used for processing social media texts are usually trained for a specific domain and application, and perform poorly outside their original purpose. This limits the applicability of language resources developed for social media texts. It also limits the incorporation of NLP tools trained on the standard language variety.

One method that allows more generalized processing is to normalize social media texts into a standard form of the language, which is the aim of this paper. Specifically, we attempt to normalize tokens found in social media texts in Bahasa Indonesia into standard words with the same semantic meaning. Since human readers’ understanding of a token is often determined by the context words surrounding it, we use vector representations resulting from neural word embeddings, which take words in the proximity of the token into consideration. The embeddings are trained on over a million tweets from Indonesian accounts written primarily in Bahasa Indonesia, representing a mix of domains and degrees of language deviation. A standard word is considered as a substitute only if its embedding representation has a high similarity to that of the token. The substitute is then selected from among these contextually similar words based on its orthographic similarity to the token; a short illustrative sketch of this selection step is given at the end of this introduction.

From our experiments, we find that embeddings trained with the CBOW objective give better performance for word normalization than those trained with Skip-Gram. This, in combination with the other elements of our method, gives a best accuracy of 79.56% in the normalization of entire messages. In the task of normalizing individual tokens, however, it is very important to first ensure, at least to some degree, that the tokens are actually incorrectly spelled words rather than proper nouns or words of another language.

This paper is organized as follows. After this introduction, some related works are briefly discussed, followed by a short exposition on neural word embeddings. We then describe the architecture of our normalization system, with a more in-depth discussion of the data and the learning of the word embeddings, the creation of the Indonesian lexicon, and the scoring used to select among the normalization candidates. This is followed by a discussion of the performance of the system, both in normalizing individual words and in normalizing entire messages. Finally, we draw some conclusions from this work and suggest some avenues for future work.
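To make the candidate selection outlined above concrete, the following Python sketch generates candidates from the word2vec neighbourhood of a noisy token and re-ranks them by a weighted combination of contextual and orthographic similarity. It is an illustration only, not our actual implementation: the gensim training parameters, the weighting factor alpha, and the use of plain Levenshtein distance (Jaro-Winkler similarity would be mixed in analogously) are assumptions made for brevity.

    # Illustrative sketch only; parameters and weighting are assumptions,
    # not the configuration used in the experiments reported below.
    from gensim.models import Word2Vec

    def levenshtein(a: str, b: str) -> int:
        """Plain dynamic-programming edit distance."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution
            prev = curr
        return prev[-1]

    def orth_sim(a: str, b: str) -> float:
        """Edit distance rescaled to [0, 1]; Jaro-Winkler could be added here."""
        return 1.0 - levenshtein(a, b) / max(len(a), len(b), 1)

    def normalize_token(token, model, lexicon, topn=50, alpha=0.5):
        """Return the best in-lexicon substitute for a noisy token, or the
        token itself if it is already standard or unknown to the embeddings."""
        if token in lexicon or token not in model.wv:
            return token
        best, best_score = token, -1.0
        for cand, ctx_sim in model.wv.most_similar(token, topn=topn):
            if cand not in lexicon:
                continue                  # only standard words may substitute
            score = alpha * ctx_sim + (1 - alpha) * orth_sim(token, cand)
            if score > best_score:
                best, best_score = cand, score
        return best

    # Usage: train CBOW embeddings (sg=0) on pre-tokenized tweets, then normalize.
    # tweets = [["gk", "tau", "jalan"], ...]        # hypothetical tokenized data
    # model = Word2Vec(tweets, vector_size=100, window=5, min_count=5, sg=0)
    # normalize_token("gk", model, indonesian_lexicon)

The weighting between contextual and orthographic similarity, fixed here as a single parameter alpha for simplicity, is discussed further in the section on the normalization system architecture.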