In recent years, scientific computing has increasingly moved to cloud computing, due to its flexibility in managing computing resources. In this paper, we focus on genomic data processing, which is rapidly gaining momentum in research and medical activities. The main characteristic of these data sets is that not only is the number of available genome files becoming extremely large, but each individual data set is also significantly large, in the order of tens of GB. Hence, a wide diffusion of cloud-based genomic data processing will have a significant impact on network resources, since each processing request will require the transfer of tens of GBs into computing nodes. To face this issue, in this paper we propose a resource discovery framework which provides decision agents with the information needed to select the most suitable computing nodes. We have implemented this resource discovery function in a distributed fashion and extensively tested it in a lab testbed consisting of about 70 nodes. We found that the overhead of the proposed solution is negligible in comparison with the amount of transferred data.
In recent years, the implementation of scientific computing platforms has evolved from local cluster-based computing, to distributed grid computing and, more recently, to cloud computing infrastructure . This evolution is due to the improved flexibility in managing computing resources that the latest paradigm offers over the previous ones. However, scientific computing is very different from typical cloud-based applications, which range from hosting multimedia servers to deploying storage facilities. The main differences include the volume of data managed by scientific applications and/or their CPU requirements, which can be some orders of magnitude larger than in other types of applications. In addition, scientific computing applications are highly heterogeneous. For example, they may consist of platforms used to store and process the output of high-energy physics experiments, which need to be co-located with experimental facilities in order to collect the huge amount of data in a timely manner; climate change simulations, which need high-performance computing architectures; or genomic data set processing, which may require more modest amounts of computing resources for a single execution, but with a rapidly increasing number of executions and input files easily reaching tens of GBs.
In this paper, we focus on genomic data processing, which is rapidly gaining momentum in research and medical activities, due to the reduction in DNA sequencing costs . The main characteristic of these data sets is that not only is the number of available genome files becoming extremely large, but each data set is also significantly large, in the order of tens of GB. This problem is referred to as the Big² data problem, and its importance increases over time. In fact, it is expected that in the next few years all newborns will be sequenced and medical science will build upon genome-processing outcomes. Clearly, it is not realistic to assume that each hospital will be able to acquire a large computing facility (private cloud) to cope with the internal processing demand. Thus, the use of public cloud-based processing services will be the obvious solution . Hence, it is evident that the suitable management of genomic data will be essential not only for storage services, but also for their processing and transfer. In fact, a wide diffusion of cloud-based genomic data processing will have a significant impact on network resources, since each processing request will require the transfer of many GBs of data into computing nodes in the data centers of cloud providers.
In this paper we propose a discovery protocol for cloud resources , which provides decision agents with a set of context information. This information may consist of the position and availability of processing resources (CPU, RAM, storage), input data to be processed, auxiliary files (e.g. annotations ), and the image files of the virtual machines (VMs) hosting the genomic software packages used to process genomic data. In fact, the number of types of genomic processing (implemented through combinations of different programs, referred to as genomic pipelines) that can be executed over a single genome is quite large; to give an idea, the interested reader can find a non-exhaustive list in Table 1 of . Thus, it is not realistic to expect that each data center of a cloud provider can simultaneously host all the VMs needed for any possible request.
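To make the kind of context information discussed above concrete, the following minimal sketch shows how a computing node's report to a decision agent could be structured. All names and fields here are illustrative assumptions for exposition, not the actual message format of the proposed protocol.

```python
from dataclasses import dataclass, field

@dataclass
class NodeContext:
    """Hypothetical context report of one candidate computing node."""
    node_id: str
    free_cpus: int           # available CPU cores
    free_ram_gb: float       # available RAM (GB)
    free_storage_gb: float   # available storage (GB)
    cached_vm_images: set = field(default_factory=set)  # pipeline VM images already present
    cached_datasets: set = field(default_factory=set)   # input/auxiliary files already present

    def missing_transfer_gb(self, required_files: dict) -> float:
        """Total GB that would have to be transferred to this node:
        only files not already cached locally contribute."""
        cached = self.cached_datasets | self.cached_vm_images
        return sum(size_gb for name, size_gb in required_files.items()
                   if name not in cached)
```

A node that already caches the required VM image and input genome would thus report only the residual transfer volume (e.g. auxiliary annotation files), which is exactly the information a decision agent needs to weigh candidates.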
Our discovery protocol leverages the functions offered by the Next Steps in Signaling (NSIS) framework . In more detail, we use the functions offered by the recently defined off-path extension to the NSIS protocol suite, initially presented in  and further refined in . This solution allows disseminating signaling over network areas of nearly arbitrary shape. By leveraging the interception capabilities of the signaling transport layer of NSIS, this dissemination protocol is highly efficient and makes it possible to find resources that are close to a given network path. In the considered scenario, the path under consideration could be, for instance, the one connecting the repository storing the needed VM images and the data center storing the input data. Since the computing clusters in data centers able to host VMs cannot lie on the IP path connecting two servers (only routers are located on path), a signaling protocol with off-path discovery capabilities is used to discover data centers with both sufficient computing capabilities and a position suitable to minimize the overall network traffic exchanged. When this information is reported back to a decision agent, the latter can execute an optimization algorithm to select a data center.
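The final decision step can be sketched as follows: once off-path discovery has returned a set of candidate data centers, the agent picks the one minimizing the data volume that must be moved. This is a simplified illustration under assumed cost terms (VM image size and input data size only), not the optimization algorithm detailed later in the paper.

```python
def select_datacenter(candidates, vm_image_gb, input_data_gb):
    """Pick the candidate data center with the lowest transfer cost.

    candidates: list of dicts, each with boolean flags 'has_vm_image'
    and 'has_input_data' indicating what is already stored locally.
    """
    def transfer_cost(dc):
        cost = 0.0
        if not dc["has_vm_image"]:
            cost += vm_image_gb    # VM image must be fetched from the repository
        if not dc["has_input_data"]:
            cost += input_data_gb  # genome input files must be moved in
        return cost

    return min(candidates, key=transfer_cost)
```

With input genomes in the tens of GB and VM images typically much smaller, such a cost model naturally favors the data center already holding the input data, which matches the intuition behind discovering nodes close to the repository-to-data path.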
Our solution can be classified as a peer-to-peer (P2P) resource discovery solution , with the distinctive feature that, being based on NSIS, it can be coupled with specific network paths. Classic P2P-based approaches  offer little or no support for proximity-based discovery, as stated in , trying merely to limit the network overhead. Hierarchical solutions enable the efficient discovery of the resources belonging to one super-peer, but when more than one super-peer is involved, the problem is not trivial. Other recently proposed solutions make use of a proper ontology for service discovery , without considering path proximity issues. In , the authors propose an abstraction layer to discover the most appropriate infrastructure resources, which is then used in a constraint-based approach applied to a multi-provider cloud environment. However, proximity issues are loosely handled, by simply referring to location requirements. Other solutions have been specifically designed for grid computing platforms, such as . In , routing hops are considered, but they are relevant to the grid overlay. In , again the effort is to minimize the amount of messages on the grid overlay, and not to find nodes close to specific locations or paths.
We have implemented the proposed discovery solution in a lab testbed consisting of about 70 nodes, and have executed extensive tests. The obtained results prove that the network overhead of the proposed solution is negligible when compared with the size of data files to be exchanged.
The paper is organized as follows. Section II presents some background on NSIS and the reference scenario, which is represented by the project ARES . Section III provides algorithmic and protocol details of our solution, and Section IV presents the results of our lab experiments. Finally, we draw our concluding remarks in Section V.
II. BACKGROUND AND REFERENCE SCENARIO
A. Background on NSIS protocols and off-path extension
The NSIS protocol suite divides the signaling functions into two layers . The upper layer, called NSIS Signaling Layer Protocol (NSLP), implements the application signaling logic.