Abstract
1 Introduction
2 Background
3 Expected scalarised returns
4 Stochastic dominance for ESR
5 Solution sets for ESR
6 Multi-objective tabular distributional reinforcement learning
7 Experiments
8 Related work
9 Conclusion and future work
Appendix
Declaration
References
Abstract
In many real-world scenarios, the utility of a user is derived from a single execution of a policy. In this case, to apply multi-objective reinforcement learning, the expected utility of the returns must be optimised. Various scenarios exist where a user’s preferences over objectives (also known as the utility function) are unknown or difficult to specify. In such scenarios, a set of optimal policies must be learned. However, settings where the expected utility must be maximised have been largely overlooked by the multi-objective reinforcement learning community and, as a consequence, a set of optimal solutions has yet to be defined. In this work, we propose first-order stochastic dominance as a criterion to build solution sets to maximise expected utility. We also define a new dominance criterion, known as expected scalarised returns (ESR) dominance, that extends first-order stochastic dominance to allow a set of optimal policies to be learned in practice. Additionally, we define a new solution concept called the ESR set, which is a set of policies that are ESR dominant. Finally, we present a new multi-objective tabular distributional reinforcement learning (MOTDRL) algorithm to learn the ESR set in multi-objective multi-armed bandit settings.
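Stated precisely for scalar returns (a minimal sketch; the paper's extension to vector-valued returns, ESR dominance, is defined in Sect. 4), first-order stochastic dominance is exactly the criterion that guarantees higher expected utility for every monotonically increasing utility function u:

X \succeq_{\mathrm{FSD}} Y
  \iff F_X(v) \le F_Y(v) \ \text{for all } v
  \iff \mathbb{E}[u(X)] \ge \mathbb{E}[u(Y)] \ \text{for all monotonically increasing } u,

where F_X and F_Y denote the cumulative distribution functions of the returns X and Y. This equivalence is what makes stochastic dominance a natural building block for solution sets when the utility function is unknown but expected utility must be maximised.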
1 Introduction
When making decisions in the real world, decision makers must make trade-offs between multiple, often conflicting, objectives [44]. In many real-world settings, a policy is only executed once. For example, consider a municipality that receives the majority of its electricity from local solar farms. To deal with the intermittency of the solar farms, the municipality wants to build a new electricity generation facility and is considering two choices: building a natural gas facility or adding a lithium-ion battery storage facility to the solar farms. Moreover, the municipality wants to minimise CO2 emissions while ensuring energy demand can continuously be met. Given that a new energy generation facility will only be constructed once, the full distribution over the potential outcomes for CO2 emissions and the capacity to meet electricity demand must be considered to make an optimal decision. However, the current state-of-the-art multi-objective reinforcement learning (MORL) literature focuses almost exclusively on learning policies that are optimal over multiple executions. Given that such problems are salient, to fully utilise MORL in the real world we must develop algorithms that compute a policy, or set of policies, that are optimal given the single-execution nature of the problem.
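To make the single-execution point concrete, the following minimal Python sketch (with hypothetical numbers and a hypothetical utility function, not taken from the paper) contrasts the expected utility of a single execution (the ESR criterion) with the utility of the expected returns (the SER criterion) for the two facility options:

# Hypothetical numbers and utility function, for illustration only.
# Each outcome: (probability, CO2 cost, fraction of electricity demand met).
gas     = [(1.00, -10.0, 1.0)]                    # reliable, but always emits CO2
battery = [(0.85,   0.0, 1.0), (0.15, 0.0, 0.3)]  # clean, but occasionally falls short

def utility(co2, demand):
    # Hypothetical nonlinear utility: a severe shortfall is heavily penalised.
    return co2 - (100.0 if demand < 0.5 else 0.0)

def esr(outcomes):
    # ESR: expected utility of a single execution -- apply u to each outcome, then average.
    return sum(p * utility(c, d) for p, c, d in outcomes)

def ser(outcomes):
    # SER: utility of the expected returns -- average the return vector first, then apply u.
    exp_co2 = sum(p * c for p, c, d in outcomes)
    exp_dem = sum(p * d for p, c, d in outcomes)
    return utility(exp_co2, exp_dem)

for name, outcomes in [("gas", gas), ("battery", battery)]:
    print(name, "ESR:", esr(outcomes), "SER:", ser(outcomes))

Under these illustrative numbers the two criteria rank the options differently, which is exactly why the single-execution setting requires reasoning over the full return distribution rather than expected values alone.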
9 Conclusion and future work
MORL has been highlighted as one of several key challenges that need to be addressed in order for RL to be commonly deployed in real-world systems [12]. In order to apply RL to the real world, the MORL community must consider the ESR criterion. However, the ESR criterion has largely been ignored by the MORL community, with the exception of the works of Roijers et al. [33, 36], Hayes et al. [15, 16] and Vamplew et al. [43]. The works of Hayes et al. [15, 16] and Roijers et al. [33] present single-policy algorithms that are suitable for learning policies under the ESR criterion; however, prior to this work, the requirements necessary to compute policies under the ESR criterion had not been formally defined. In Sect. 3, we outline, through examples and definitions, the necessary requirements to optimise under the ESR criterion. The formal definitions outlined in Sect. 3 ensure that an optimal policy can be learned under the ESR criterion when the utility function of the user is known. However, in the real world, a user's preferences over objectives (or utility function) may be unknown at the time of learning [36].
Prior to this paper, a suitable solution set for the unknown utility function scenario under the ESR criterion had not been defined. This long-standing research gap has restricted the applicability of MORL to real-world scenarios under the ESR criterion. In Sects. 4 and 5, we define the solution sets required for multi-policy algorithms to learn a set of optimal policies under the ESR criterion when the utility function of a user is unknown. In Sect. 6, we present a novel multi-policy algorithm, known as multi-objective tabular distributional reinforcement learning (MOTDRL), that can learn the ESR set in a MOMAB setting when the utility function of a user is unknown at the time of learning. In Sect. 7, we evaluate MOTDRL in two MOMAB settings and show that it can learn the ESR set in both. This work aims to answer some of the existing research questions regarding the ESR criterion. Moreover, we aim to highlight the importance of the ESR criterion when applying MORL to real-world scenarios. In order to successfully apply MORL to the real world, we must develop new single-policy and multi-policy algorithms that can learn solutions for nonlinear utility functions in a variety of scenarios.
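As a rough illustration of what learning such a set involves (a simplified sketch under stated assumptions, not the MOTDRL algorithm of Sect. 6), one can compare the empirical return distributions of bandit arms with a first-order-stochastic-dominance-style test over a grid of threshold vectors and keep only the arms that no other arm dominates; the paper's exact ESR-dominance test is the one defined in Sect. 4.

# Illustrative sketch only: dominance-based pruning of bandit arms from sampled
# vector-valued returns. Assumes a user-chosen finite grid of threshold vectors.
import numpy as np

def survival(samples, v):
    # Empirical P(Z >= v componentwise), from observed return vectors (shape: n x d).
    return np.mean(np.all(samples >= v, axis=1))

def dominates(samples_a, samples_b, grid):
    # Arm A dominates arm B if P(Z_A >= v) >= P(Z_B >= v) for every threshold v,
    # with strict inequality for at least one v.
    diffs = [survival(samples_a, v) - survival(samples_b, v) for v in grid]
    return all(d >= 0 for d in diffs) and any(d > 0 for d in diffs)

def non_dominated_arms(arm_samples, grid):
    # Keep every arm that no other arm dominates (an ESR-set-style candidate set).
    keep = []
    for i, s_i in enumerate(arm_samples):
        if not any(dominates(s_j, s_i, grid) for j, s_j in enumerate(arm_samples) if j != i):
            keep.append(i)
    return keep

How the return distributions are estimated, and how the dominance test is applied during learning, is precisely what MOTDRL specifies in the MOMAB setting.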