Abstract
1. Introduction
2. Background and related work
3. Basic concepts and prerequisites
4. Proposed Method
5. Experimental study
6. Conclusion and Future Work
References
Abstract
Financial statements are analytical reports that financial institutions publish periodically to explain their performance from different perspectives. Because these reports are the fundamental source for decision-making by many stakeholders, creditors, investors, and even auditors, some institutions may manipulate them to mislead people and commit fraud. Fraud detection in financial statements aims to discover the anomalies caused by these distortions and to separate fraud-prone reports from non-fraudulent ones. Although binary classification is one of the most popular data mining approaches in this area, it requires a standard labeled dataset, which is often unavailable in the real world because fraudulent samples are rare. This paper proposes a novel approach based on generative adversarial networks (GANs) and ensemble models that not only resolves the lack of fraudulent samples but also handles the high dimensionality of the feature space. A new dataset is also constructed by collecting the annual financial statements of ten Iranian banks and extracting three types of features suggested in this study. Experimental results on this dataset demonstrate that the proposed method performs well in generating synthetic fraud-prone samples. Moreover, in accurately distinguishing fraud-prone samples, it attains performance comparable to supervised models and better than unsupervised ones.
Introduction
Today, the steady growth of fraud in business, especially in financial services, has become a serious and costly problem. The scientific literature offers no single definition of fraud. One of the clearest is the definition provided by the Association of Certified Fraud Examiners (ACFE) in 2008: individuals and organizations commit fraud when they take illegal actions, such as deception or betrayal of trust, to obtain money, property, or individual or collective benefits (Hashim et al., 2020; Sadgali et al., 2019; Syahria, 2019). The American Institute of Certified Public Accountants (AICPA) has likewise applied the term to any type of fraudulent act, including minor employee theft, unproductive performance, embezzlement, misappropriation of assets, and fraudulent financial reporting (Hashim et al., 2020).
As the above definitions show, fraud takes many forms; this study focuses on fraud in financial statements. Financial statements are reports that detail an organization's business activities and financial performance from various perspectives (Ashtiani and Raahemi, 2021; Jan, 2018). Their most important contents include expenses, incomes, received or granted loans, profits, and losses (Ashtiani and Raahemi, 2021). The sheer volume of figures in these reports gives profit seekers an opportunity to cheat. Among the most common forms of financial statement fraud are premature revenue recognition, spurious entries of income or profit, overstatement of assets, understatement of expenses, and concealment or false disclosure of expenses (Craja et al., 2020; Gray and Debreceny, 2014). According to the ACFE ranking, financial statement fraud is the third most prevalent type of occupational fraud, after corruption and embezzlement (Hashim et al., 2020; Petković et al., 2021; Syahria, 2019). However, it ranks first in terms of the financial cost and the losses it causes (Omidi et al., 2019). Hence, early detection of this type of fraud can prevent its exorbitant financial consequences.
Conclusion and Future Work
In this paper, a new approach has been proposed to detect fraud in bank financial statements. The basic idea is to adopt generative adversarial networks instead of over-sampling, under-sampling, or one-class classification techniques, so that the approach applies to real-world scenarios in which the data are highly imbalanced, with few or no fraudulent samples. The second idea is to tackle the high dimensionality of the feature space by leveraging an ensemble of supervised and unsupervised models. In particular, the proposed approach uses a generative adversarial model called MO-GAAL to fabricate a set of fraud-prone samples that behave unconventionally on the one hand and are difficult to distinguish from fraud-free samples on the other. Samples are then classified by an ensemble model, XGBOD, in which a collection of unsupervised models first estimates outlier scores for each sample, and these scores then form a new feature vector that is classified by a supervised model, XGBoost. In summary, the main advantage of this work is its ability to train an efficient decision-making model even in the absence of actual fraudulent samples.
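The two-stage classification idea summarized above (unsupervised outlier scores feeding a supervised classifier) can be illustrated with a minimal, standard-library sketch. It is not the authors' implementation, and several parts are assumptions made for illustration only: the two toy detectors (`knn_score`, `centroid_score`) stand in for the collection of unsupervised models, a single-threshold decision stump (`fit_stump`) stands in for XGBoost, and a handful of hand-made points stand in for the fraud-prone samples that MO-GAAL would generate. All function names are hypothetical, and appending the scores to the original features (rather than replacing them) is also an assumption.

```python
import math
import random


def knn_score(X, k=3):
    """Toy detector 1: distance to the k-th nearest neighbour."""
    scores = []
    for i, p in enumerate(X):
        dists = sorted(math.dist(p, q) for j, q in enumerate(X) if j != i)
        scores.append(dists[k - 1])
    return scores


def centroid_score(X):
    """Toy detector 2: distance to the mean of all samples."""
    dim = len(X[0])
    centre = [sum(p[d] for p in X) / len(X) for d in range(dim)]
    return [math.dist(p, centre) for p in X]


def augment(X, detectors):
    """Append one outlier score per detector to every sample."""
    score_columns = [detector(X) for detector in detectors]
    return [list(p) + [col[i] for col in score_columns] for i, p in enumerate(X)]


def fit_stump(X, y):
    """Stand-in for XGBoost: the single-feature threshold rule with fewest errors."""
    best = None  # (errors, feature index, threshold)
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            errors = sum((row[f] > t) != bool(label) for row, label in zip(X, y))
            if best is None or errors < best[0]:
                best = (errors, f, t)
    return best[1], best[2]


def predict(stump, X):
    """Label a sample 1 (fraud-prone) when its chosen feature exceeds the threshold."""
    f, t = stump
    return [int(row[f] > t) for row in X]


random.seed(0)
# Toy data: 30 "fraud-free" points clustered around the origin.
normal = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(30)]
# Hand-made points standing in for generated fraud-prone samples (assumption).
synthetic = [[6.0, 6.0], [-6.0, 5.5], [5.5, -6.0], [-6.0, -6.0], [7.0, 0.0]]
X = normal + synthetic
y = [0] * len(normal) + [1] * len(synthetic)

# Stage 1: unsupervised outlier scores become extra features.
X_aug = augment(X, [knn_score, centroid_score])
# Stage 2: a supervised model is trained on the augmented feature vectors.
stump = fit_stump(X_aug, y)
y_pred = predict(stump, X_aug)
```

In this toy setting neither raw coordinate separates the two groups with a single threshold, because the stand-in fraud-prone points surround the cluster, whereas the appended outlier scores do. That is the intuition behind letting unsupervised scores build the representation that the supervised model learns from.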