Hadoop MapReduce Performance on SSDs for Analyzing Social Networks

Article title: Hadoop MapReduce Performance on SSDs for Analyzing Social Networks
Journal/Conference: Big Data Research
Related fields of study: Information Technology Engineering, Computer Engineering
Related specializations: Internet and Wide-Area Networks, Software Engineering, Computer Networks, Computer Systems Architecture
Keywords: MapReduce, Hadoop, Solid state disks, Magnetic disks, Social networks, Big data
Article type: Research Article
DOI: https://doi.org/10.1016/j.bdr.2017.06.001
Affiliation: University of Thessaly, Department of Electrical & Computer Engineering, Volos, Greece
Length of the English article: 26 pages
Publisher: Elsevier
Presentation venue: Journal
Indexing: ISI
Publication year: 2018
Impact factor: 7.184 (2017)
H-index: 12 (2019)
SJR: 0.757 (2017)
ISSN: 2214-5796
Quartile: Q1 (2017)
English article format: PDF
Translation status: not yet translated
Price of the English article: free
Base article (suitable as a research baseline): No
Product code: E11089
Table of Contents (English)

Abstract

1- Introduction

2- Related work

3- Hadoop structure

4- Investigated algorithms

5- Experimental environment and results

6- Conclusions

References

Excerpt from the Article (English)

Abstract

The advent of Solid State Drives (SSDs) has stimulated a great deal of research aimed at investigating and, to the extent possible, exploiting the potential of the new drive. The focus of this work is the relative performance and benefits of SSDs versus hard disk drives (HDDs) when they are used as the underlying storage for Hadoop's MapReduce. In particular, we depart from all earlier relevant works in that we do not use their workloads, but instead examine MapReduce tasks and data suited to the analysis of complex networks, which exhibit different execution patterns. Despite the plethora of algorithms and implementations for complex network analysis, we carefully selected our “benchmarking methods” so that they include methods performing both local and network-wide operations in a complex network, and so that they are generic enough to serve as primitives for more sophisticated network processing applications. We evaluated the performance of SSDs and HDDs by executing these algorithms on real social network data while excluding the effects of network bandwidth, which can severely bias the results. The obtained results partly confirm earlier studies showing that SSDs are beneficial to Hadoop. However, we also provide solid evidence that the processing pattern of the running application plays a significant role; future studies should therefore not blindly add SSDs to Hadoop, but should build components that assess the type of processing pattern of the application and then direct the data to the appropriate storage medium.

Introduction

A complex network is a graph with topological features such as scale-free properties and the existence of communities, hubs, and so on, used to model real systems, for example technological networks (the Web, the Internet, power grids, online social networks), biological networks (gene, protein), and social networks [23]. The analysis of online social networks (OSNs) such as Facebook, Twitter, and Instagram has received significant attention because all of these networks store and process colossal volumes of data, mainly in the form of pair-wise interactions. This gives birth to networks, i.e., graphs that record persons’ interactions, whose analysis and mining offers both operational and business advantages to the OSN owner. Modern OSNs comprise millions of nodes and even billions of edges; therefore, any analysis algorithm that relies on a single (centralized) machine, exploiting solely that machine’s main memory and/or its disk, is eventually doomed to fail due to lack of resources.

Thus, the digitization of the aforementioned relationships produces a vast amount of collected data, i.e., big data [9], requiring extreme processing power that only distributed computing can offer. However, developing a distributed solution is a challenging task, because it must sometimes deal with inherently sequential processes. Distributed analysis algorithms that can run only on a small cluster of machines are still insufficient, since modern OSNs are maintained by Internet giants such as Google, LinkedIn, and Facebook, who own huge datacenters and operate clusters of several thousand machines. These clusters are usually programmed with data-parallel frameworks of the MapReduce type [4], a big data analytics platform. The Hadoop [29] middleware was designed to solve problems where the “same, repeated processing” had to be applied to peta-scale volumes of data.
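As a brief illustration (not taken from the paper), the map/shuffle/reduce pattern applied to a local network primitive such as node-degree counting can be sketched in plain Python; the edge list and all names are hypothetical:

```python
from itertools import groupby
from operator import itemgetter

# Toy undirected edge list for a tiny social graph (illustrative only).
EDGES = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]

def map_phase(edges):
    # Map: emit (node, 1) for each endpoint of every edge.
    for u, v in edges:
        yield (u, 1)
        yield (v, 1)

def shuffle(pairs):
    # Shuffle/sort: group intermediate pairs by key, as Hadoop does
    # between the map and reduce phases.
    pairs = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(pairs, key=itemgetter(0)):
        yield key, [count for _, count in group]

def reduce_phase(grouped):
    # Reduce: sum the counts per node to obtain its degree.
    for node, counts in grouped:
        yield node, sum(counts)

degrees = dict(reduce_phase(shuffle(map_phase(EDGES))))
print(degrees)  # {'a': 2, 'b': 2, 'c': 3, 'd': 1}
```

In a real Hadoop job the map and reduce phases run on different cluster nodes and the shuffle moves data across the network; the control flow, however, follows this same shape.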
Hadoop’s initial design was based on the characteristics of magnetic disks, enforcing sequential read and write operations and introducing its own distributed file system (HDFS, the Hadoop Distributed File System) with large block sizes. Recently, with the advent of faster Solid State Drives (SSDs), research has emerged to test and possibly exploit the potential of the new, technologically advanced drive [11, 12, 21, 33]. The lack of seek overhead gives SSDs a significant advantage over Hard Disk Drives (HDDs) for workloads whose processing requires random rather than sequential access.
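The sequential-versus-random distinction can be made concrete with a small, self-contained micro-benchmark; this is an illustrative sketch, not part of the paper, and the scratch file name, block size, and block count are arbitrary choices (absolute timings depend entirely on the drive and OS cache):

```python
import os
import random
import time

# Create a 1 MiB scratch file of random bytes to read back.
PATH = "scratch.bin"      # hypothetical temporary file name
BLOCK = 4096              # 4 KiB per read
BLOCKS = 256

with open(PATH, "wb") as f:
    f.write(os.urandom(BLOCK * BLOCKS))

def timed_reads(offsets):
    # Read one BLOCK at each offset, in the given order, and time it.
    start = time.perf_counter()
    with open(PATH, "rb") as f:
        for off in offsets:
            f.seek(off)
            f.read(BLOCK)
    return time.perf_counter() - start

sequential = [i * BLOCK for i in range(BLOCKS)]
shuffled = sequential[:]
random.shuffle(shuffled)  # same blocks, random visiting order

t_seq = timed_reads(sequential)
t_rand = timed_reads(shuffled)
print(f"sequential: {t_seq:.6f}s  random: {t_rand:.6f}s")
os.remove(PATH)
```

On an HDD the random order pays a seek penalty per block; on an SSD the two orders tend to be much closer, which is the effect the paper's workloads probe at MapReduce scale.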