Abstract
As an infrastructure able to accelerate the development of natural language processing applications, large-scale lexical n-gram databases are at present important data systems. However, deriving such systems for minority languages, as was done for major languages in the Google n-gram project, faces many obstacles. This paper presents an innovative approach to large-scale n-gram system creation, applied to the Croatian language. Instead of using the Web as the world's largest text repository, our n-gram collection relies on the Croatian online academic spellchecker Hascheck, a language service publicly available since 1993 and popular worldwide. Our n-gram filtering is based on dictionary criteria, in contrast to the publicly available Google n-gram systems, in which cutoff criteria were applied. After 12 years of collection, the Croatian n-gram system has grown to match the largest Google Version 1 n-gram systems in size. Because it relies on a service in constant use, the Croatian n-gram system is a dynamic one. This dynamics allowed the n-gram count behavior to be modeled through Heaps' law, which led to interesting results. Like many minority languages, Croatian suffers from a lack of sophisticated language processing systems in many application areas. The paper also exemplifies the importance of a rich lexical n-gram infrastructure for rapid breakthroughs in new application areas.
Introduction
Lexical n-grams are nowadays an important data infrastructure in many areas of natural language processing (NLP), machine learning, text analytics, and data mining [1]. Many technologies take advantage of large-scale language models based on huge n-gram systems derived from gigantic corpora. ''More words and less linguistic annotation'' is a trend well expressed in [2], and it is strictly followed in the research presented here. Besides English [3], structured big data are the privilege of the dozen or so languages most advanced in NLP, namely those treated in the Google n-gram project [4]–[6]. Abundant linguistic data collection is a prerequisite for large-scale language modeling, but it is hardly a feasible step in the machine processing of many minority languages such as Croatian, which belongs to the subfamily of South Slavic languages and has approximately 4.5 million users, or less than 0.1% of the world's population. Clearly, an enormous English or Chinese text corpus cannot be matched in size by a Croatian one, owing to the difference in the numbers of language users. However, statistical machine translation and speech recognition require language models of comparable size to achieve the desired effectiveness. This means that a minority-language n-gram system, from which language models are derived, must be enriched to approximately the size of the n-gram systems for major world languages.
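To make the central notion concrete: a lexical n-gram is a contiguous sequence of n word tokens, and an n-gram system is essentially a table of such sequences with their occurrence counts. The following minimal sketch (illustrative only, not the pipeline described in this paper) shows how n-gram counts are accumulated from tokenized text:

```python
from collections import Counter

def word_ngrams(tokens, n):
    """Return all contiguous word n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Toy corpus; a real n-gram system aggregates counts over a gigantic corpus.
text = "more words and less linguistic annotation more words more data"
tokens = text.split()

# Count unigrams through trigrams, as large-scale n-gram systems do
# (the Google n-gram systems go up to 5-grams).
counts = {n: Counter(word_ngrams(tokens, n)) for n in (1, 2, 3)}

print(counts[2][("more", "words")])  # the bigram "more words" occurs twice
```

A large-scale system differs from this sketch mainly in scale and filtering: counts are aggregated over billions of tokens, and rare or invalid entries are pruned, whether by frequency cutoffs (as in the Google systems) or by dictionary criteria (as in the approach presented here).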