Abstract
I. INTRODUCTION
II. DISTRIBUTED TRAINING ALGORITHMS
III. EXPERIMENT
IV. RESULTS AND DISCUSSION
V. CONCLUSION
REFERENCES
Abstract
Recently, deep learning research has demonstrated that the ability to train big models improves performance substantially. In this work, we consider the problem of training a deep neural network with millions of parameters using multiple CPU cores. On a single machine with a modern CPU platform, training on the Dogs vs. Cats benchmark dataset can take several hours; however, distributing the training across numerous machines has been shown to reduce this time dramatically. This study presents the current state of the art in modern distributed training frameworks, covering the many methods and strategies used to distribute training. We concentrate on synchronous versions of distributed Stochastic Gradient Descent, different All-Reduce gradient aggregation algorithms, and best practices for achieving higher throughput and reduced latency, such as gradient compression and large batch sizes. We show that, using the same approaches, we can train a smaller deep network for an image classification problem in a shorter time. Although we focus on and report the effectiveness of these approaches when they are used to train convolutional neural networks, the underlying methods can be used to train any gradient-based machine learning algorithm.
Introduction
Recently, in a wide range of applications, including speech recognition, computer vision, text processing, and natural language processing, deep learning has outperformed classical Machine Learning models in building models that address complicated problems. Despite significant progress in customizing neural network designs, one major drawback remains: training big NNs is memory- and time-intensive. Training NNs in a distributed manner is one answer to this problem. The purpose of distributed deep learning systems (DDLS) is to scale out the training of big models by combining the resources of several separate computers. As a result, several of the DDLS presented in the literature use various ways to implement distributed model training [1]. Training times have increased substantially as models and datasets have become more sophisticated, sometimes reaching weeks or even months on a single GPU. To address this issue, two techniques proposed by many researchers for scaling out big deep learning workloads are model parallelism and data parallelism. Model parallelism seeks to map the stages of model execution onto cluster hardware, whereas data-parallel methods treat collaborative model training as a concurrency/synchronization challenge [1].

The main idea behind data parallelism is to increase the overall sample throughput by replicating the model on several computers and performing backpropagation in parallel, so that information about the loss function is acquired more quickly. It works as follows. Each cluster node begins by downloading the current model. Then, each node runs backpropagation on its assigned partition of the data. Finally, the individual results are combined and merged to create a new model [2], as sketched below.
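As a minimal illustration of this synchronous data-parallel loop, the following sketch simulates several workers inside a single Python process; the linear model, squared-error loss, and NumPy-based averaging (playing the role of an All-Reduce) are illustrative assumptions rather than the actual framework evaluated in this paper.

import numpy as np

def worker_gradient(w, X_shard, y_shard):
    # Local gradient of a mean squared-error loss on one worker's data shard.
    preds = X_shard @ w
    return 2.0 * X_shard.T @ (preds - y_shard) / len(y_shard)

def synchronous_step(w, shards, lr=0.05):
    # One synchronous data-parallel SGD step: every worker computes a gradient
    # on its shard, the gradients are averaged (the effect of an All-Reduce),
    # and all replicas apply the same update so the model stays identical.
    grads = [worker_gradient(w, X, y) for X, y in shards]
    avg_grad = np.mean(grads, axis=0)
    return w - lr * avg_grad

# Toy example: 4 simulated workers, each holding its own partition of the data.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
true_w = np.arange(1.0, 6.0)
y = X @ true_w + 0.1 * rng.normal(size=400)

shards = [(X[i::4], y[i::4]) for i in range(4)]   # data-parallel partitioning
w = np.zeros(5)
for _ in range(200):
    w = synchronous_step(w, shards)
print(np.round(w, 2))   # approaches true_w

In a real multi-machine setting, the np.mean call would be replaced by a collective communication primitive (e.g., an All-Reduce over the cluster), but the logical structure of the update is the same.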
CONCLUSION
Data parallelism techniques using asynchronous algorithms have been widely employed to speed up the training of deep learning models. To increase data throughput while preserving computational efficiency in each worker, scale-up techniques rely on tight hardware integration. Increasing the batch size, on the other hand, may result in a loss of test accuracy, which can be mitigated by a number of recent techniques, such as increasing the learning rate over the course of training and using a learning-rate warm-up technique; a minimal sketch of such a warm-up schedule is given below.
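The following is a minimal sketch of one common formulation of learning-rate warm-up (a linear ramp over a fixed number of steps); the base rate and warm-up length used here are illustrative assumptions, not values taken from our experiments.

def warmup_lr(step, base_lr=0.1, warmup_steps=500):
    # Linear warm-up: ramp the learning rate from near zero to base_lr over
    # the first warmup_steps updates, then hold it constant (a decay schedule
    # could follow in a full training recipe).
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr

# Example: early updates use a very small rate; from step 500 on, 0.1 is used.
print(warmup_lr(0), warmup_lr(250), warmup_lr(1000))
# 0.0002 0.0502 0.1

Such a ramp is intended to keep the early, large-batch updates stable before the full learning rate is applied.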