N-Gram System Based on a Spellchecking Service

Article title: Dynamic N-Gram System Based on an Online Croatian Spellchecking Service
Journal/Conference: IEEE Access
Related fields of study: Computer Engineering
Related specializations: Computer Systems Architecture
Keywords: Croatian language, Heaps' law, language modeling, lexical n-gram, n-gram system comparison
Article type: Research Article
Digital identifier (DOI): https://doi.org/10.1109/ACCESS.2019.2947898
University: Faculty of Electrical Engineering and Computing, University of Zagreb, Zagreb 10000, Croatia
Pages in the English article: 8
Publisher: IEEE
Presentation type: Journal
Article category: ISI
Publication year: 2019
Impact factor: 4.641 (2018)
H-index: 56 (2019)
SJR: 0.609 (2018)
ISSN: 2169-3536
Quartile: Q2 (2018)
English article format: PDF
Translation status: Not translated
Price of the English article: Free
Is this a base article: No
Does this article have a conceptual model: No
Does this article have a questionnaire: No
Does this article have variables: No
Product code: E13871
References: In-text citations and a reference list at the end of the article
Table of contents (English)

Abstract

I. Introduction

II. Conventionally Created Croatian Corpora

III. About the Spellchecker

IV. Croatian N-Gram System Characteristics

V. Heaps’ Law Applied to Croatian N-Grams

Authors

Figures

References

Excerpt from the article (English)

Abstract

As an infrastructure able to accelerate the development of natural language processing applications, large-scale lexical n-gram databases are at present important data systems. However, deriving such systems for world minority languages as it was done in the Google n-gram project leads to many obstacles. This paper presents an innovative approach to large-scale n-gram system creation applied to the Croatian language. Instead of using the Web as the world’s largest text repository, our process of n-gram collection relies on the Croatian online academic spellchecker Hascheck, a language service publicly available since 1993 and popular worldwide. Our n-gram filtering is based on dictionary criteria, contrary to the publicly available Google n-gram systems in which cutoff criteria were applied. After 12 years of collecting, the size of the Croatian n-gram system reached the size of the largest Google Version 1 n-gram systems. Due to reliance on a service in constant use, the Croatian n-gram system is a dynamic one. System dynamics allowed modeling of n-gram count behavior through Heaps’ law, which led to interesting results. Like many minority languages, the Croatian language suffers from a lack of sophisticated language processing systems in many application areas. The importance of a rich lexical n-gram infrastructure for rapid breakthroughs in new application areas is also exemplified in the paper.
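The abstract contrasts two ways of filtering n-grams (dictionary criteria versus count cutoffs) and mentions modeling n-gram counts with Heaps' law. The sketch below is only an illustration of those ideas, not the Hascheck pipeline: the function names, the toy word list, and the sample tokens are all hypothetical, and Heaps' law is assumed in its common form V = K * N^beta (V distinct items after N observed items), which may differ from the parametrization used in the paper.

```python
import math
from collections import Counter


def count_ngrams(tokens, n):
    """Count every contiguous n-token sequence in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def cutoff_filter(counts, min_count):
    """Cutoff criterion: keep n-grams observed at least min_count times."""
    return {gram: c for gram, c in counts.items() if c >= min_count}


def dictionary_filter(counts, dictionary):
    """Dictionary criterion: keep n-grams whose tokens are all known words,
    regardless of how rarely the n-gram itself was observed."""
    return {gram: c for gram, c in counts.items()
            if all(token in dictionary for token in gram)}


def fit_heaps(samples):
    """Fit V = K * N**beta to (N, V) pairs by least squares in log-log space.

    Assumes at least two samples with distinct, positive N values.
    Returns the pair (K, beta).
    """
    xs = [math.log(n) for n, _ in samples]
    ys = [math.log(v) for _, v in samples]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    k = math.exp(mean_y - beta * mean_x)
    return k, beta


# Hypothetical toy data: "xyzq" stands in for a token missing from the dictionary.
tokens = "dobar dan dobar dan dobra vecer xyzq".split()
dictionary = {"dobar", "dan", "dobra", "vecer"}
bigrams = count_ngrams(tokens, 2)
by_cutoff = cutoff_filter(bigrams, min_count=2)
by_dictionary = dictionary_filter(bigrams, dictionary)
```

In this toy case the cutoff discards every bigram seen only once, while the dictionary criterion keeps rare bigrams as long as each token is a known word. That difference is one plausible reason a dictionary-filtered collection can grow large even for a language with comparatively little text.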

I. Introduction

Lexical n-grams are nowadays an important data infrastructure in many areas of natural language processing (NLP), machine learning, text analytics, and data mining [1]. Many technologies take advantage of large-scale language models based on huge n-gram systems derived from gigantic corpora. “More words and less linguistic annotation” is a trend well expressed in [2]. The trend is strictly followed in the research presented here. Besides English [3], structured big data are the privilege of a dozen languages most advanced in NLP, those treated in the Google n-gram project [4]–[6]. Abundant linguistic data collection is a prerequisite for large-scale language modeling, but in many cases, it is hardly a feasible step in the machine processing of minority languages such as Croatian, which belongs to the subfamily of South Slavic languages and has approximately 4.5 million users, or less than 0.1% of the world’s population. It is clear that an enormous English or Chinese text corpus cannot be comparable in size with a Croatian one due to differences in the numbers of language users. However, statistical machine translation or speech recognition asks for language models of comparable size in order to produce the desired effectiveness. This means the n-gram system, from which language models are derived, in a minority language must be enriched to approximately the size of n-gram systems for world major languages.