PERFORMING INITIATIVE DATA PREFETCHING IN DISTRIBUTED FILE SYSTEMS FOR CLOUD COMPUTING

ABSTRACT:
This paper presents an initiative data prefetching scheme on the storage servers in distributed file systems for cloud computing. In this prefetching technique, the client machines are not substantially involved in the process of data prefetching; instead, the storage servers directly prefetch the data after analyzing the history of disk I/O access events, and then proactively send the prefetched data to the relevant client machines. To put this technique to work, the information about client nodes is piggybacked onto the real client I/O requests and then forwarded to the relevant storage server. Next, two prediction algorithms are proposed to forecast future block access operations and thereby determine what data should be fetched on storage servers in advance.
Finally, the prefetched data can be pushed to the relevant client machine from the storage server. Through a series of evaluation experiments with a collection of application benchmarks, we have demonstrated that the presented initiative prefetching technique helps distributed file systems in cloud environments achieve better I/O performance. In particular, configuration-limited client machines in the cloud are not responsible for predicting I/O access operations, which contributes to better system performance on them.
 
INTRODUCTION
The assimilation of distributed computing for search engines, multimedia websites, and data-intensive applications has brought about the generation of data at unprecedented speed. For instance, the amount of data created, replicated, and consumed in the United States may double every three years through the end of this decade. In general, the file system deployed in a distributed computing environment is called a distributed file system, and it typically serves as a backend storage system that provides I/O services for various sorts of data-intensive applications in cloud computing environments. In fact, the distributed file system employs multiple distributed I/O devices by striping file data across the I/O nodes, and uses high aggregate bandwidth to meet the growing I/O requirements of distributed and parallel scientific applications.
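The striping mentioned above can be illustrated with a minimal sketch. It assumes plain round-robin placement; the class name StripeLayout, the stripe size, and the node count are hypothetical and chosen only for illustration, since the text does not specify a layout policy.

```java
/** Illustrative round-robin striping; not taken from any particular file system. */
final class StripeLayout {
    private final long stripeSize; // bytes per stripe unit (assumed value)
    private final int nodeCount;   // number of I/O nodes (assumed value)

    StripeLayout(long stripeSize, int nodeCount) {
        if (stripeSize <= 0 || nodeCount <= 0) {
            throw new IllegalArgumentException("stripeSize and nodeCount must be positive");
        }
        this.stripeSize = stripeSize;
        this.nodeCount = nodeCount;
    }

    /** Index of the I/O node that stores the byte at fileOffset. */
    int nodeFor(long fileOffset) {
        long stripeIndex = fileOffset / stripeSize;
        return (int) (stripeIndex % nodeCount);
    }

    /** Offset of that byte inside the node-local object. */
    long localOffset(long fileOffset) {
        long stripeIndex = fileOffset / stripeSize;
        long row = stripeIndex / nodeCount; // full rounds completed before this stripe
        return row * stripeSize + fileOffset % stripeSize;
    }
}
```

Because consecutive stripe units land on different nodes, a large sequential read can be served by several I/O nodes at once, which is where the aggregate bandwidth comes from.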
However, because distributed file systems scale both numerically and geographically, the network delay is becoming the dominant factor in remote file system access [26], [34]. With regard to this issue, numerous data prefetching mechanisms have been proposed to hide the latency in distributed file systems caused by network communication and disk operations. In these conventional prefetching mechanisms, the client file system (which is a part of the file system and runs on the client machine) is supposed to predict future access by analyzing the history of past I/O accesses without any application intervention. After that, the client file system may send relevant I/O requests to storage servers to read the relevant data in. Consequently, applications that have intensive read workloads automatically gain not only better use of available bandwidth, but also fewer file operations, because prefetching batches I/O requests.
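As a rough illustration of such a conventional client-side scheme, the sketch below detects a run of consecutive block reads and then proposes the next few blocks to request. The class name ClientReadAhead, the run threshold of two, and the window parameter are assumptions made for this example and do not describe any specific file system.

```java
/** Hypothetical client-side read-ahead: the client itself predicts and requests. */
final class ClientReadAhead {
    private final int window;      // blocks to prefetch per trigger (assumed)
    private long lastBlock = -1;   // -1 means no read has been seen yet
    private int sequentialRun = 0; // consecutive reads that followed the previous block

    ClientReadAhead(int window) {
        this.window = window;
    }

    /** Records a read of blockNo and returns the blocks to prefetch (possibly none). */
    long[] onRead(long blockNo) {
        boolean followsPrevious = lastBlock >= 0 && blockNo == lastBlock + 1;
        sequentialRun = followsPrevious ? sequentialRun + 1 : 0;
        lastBlock = blockNo;
        if (sequentialRun < 2) {
            return new long[0]; // not enough evidence of a sequential pattern
        }
        long[] next = new long[window];
        for (int i = 0; i < window; i++) {
            next[i] = blockNo + 1 + i;
        }
        return next;
    }
}
```

Note that both the bookkeeping and the extra requests run on the client machine, which is the cost the server-side scheme in this paper aims to remove.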
On the other hand, mobile devices generally have limited processing power, battery life and storage, whereas cloud computing offers an illusion of infinite computing resources. The mobile cloud computing research field emerged to combine mobile devices and cloud computing into a new infrastructure [45]. Namely, mobile cloud computing provides mobile applications with data storage and processing services in clouds, obviating the need for a powerful hardware configuration on the device, because all resource-intensive computing can be completed in the cloud. Thus, conventional prefetching schemes are not the best-suited optimization strategies for distributed file systems to boost I/O performance in mobile clouds, since these schemes require the client file systems running on client machines to proactively issue prefetching requests after analyzing the access events they have recorded, which inevitably places extra load on the client nodes.
Furthermore, because disk I/O events alone can reveal the behavior of disk tracks, which offers critical information for I/O optimization, certain prefetching techniques have been proposed in succession to read the data on the disk in advance after analyzing disk I/O traces. However, this kind of prefetching only works for local file systems, and the prefetched data is cached on the local machine to fulfill the application’s I/O requests passively. In brief, although block access history reveals the behavior of disk tracks, there are no prefetching schemes on storage servers in a distributed file system for yielding better system performance. The reason is the difficulty of modeling the block access history to generate block access patterns and of deciding the destination client machine to which the prefetched data should be pushed from storage servers.
 
LITERATURE SURVEY
PARTIAL REPLICATION OF METADATA TO ACHIEVE HIGH METADATA AVAILABILITY IN PARALLEL FILE SYSTEMS
AUTHOR: J. Liao, Y. Ishikawa
PUBLISH: In Proceedings of the 41st International Conference on Parallel Processing (ICPP ’12), pp. 168–177, 2012.
EXPLANATION:
This paper presents PARTE, a prototype parallel file system with active/standby configured metadata servers (MDSs). PARTE replicates and distributes a part of files’ metadata to the corresponding metadata stripes on the storage servers (OSTs) with a per-file granularity; meanwhile, the client file system (client) keeps certain sent metadata requests. If the active MDS has crashed for some reason, these client backup requests will be replayed by the standby MDS to restore the lost metadata. In case one or more backup requests are lost due to network problems or dead clients, the latest metadata saved in the associated metadata stripes will be used to construct consistent and up-to-date metadata on the standby MDS. Moreover, the clients and OSTs can work in both normal mode and recovery mode in the PARTE file system. This differs from conventional parallel file systems with active/standby configured MDSs, which hang all I/O requests and metadata requests during restoration of the lost metadata. In the PARTE file system, previously connected clients can continue to perform I/O operations and relevant metadata operations, because OSTs work as temporary MDSs during that period by using the replicated metadata in the relevant metadata stripes. Through examination of experimental results, we show the feasibility of the main ideas presented in this paper for providing high availability metadata service with only a slight overhead effect on I/O performance. Furthermore, since previously connected clients are never left hanging during metadata recovery, in contrast to conventional systems, a better overall I/O data throughput can be achieved with PARTE.

EVALUATING PERFORMANCE AND ENERGY IN FILE SYSTEM SERVER WORKLOADS
AUTHOR: P. Sehgal, V. Tarasov, E. Zadok
PUBLISH: In Proceedings of the 8th USENIX Conference on File and Storage Technologies (FAST ’10), pp. 253–266, 2010.
EXPLANATION:
Recently, power has emerged as a critical factor in designing components of storage systems, especially for power-hungry data centers. While there is some research into power-aware storage stack components, there are no systematic studies evaluating each component’s impact separately. This paper evaluates the file system’s impact on energy consumption and performance. We studied several popular Linux file systems, with various mount and format options, using the FileBench workload generator to emulate four server workloads: Web, database, mail, and file server. In case of a server node consisting of a single disk, CPU power generally exceeds disk-power consumption. However, file system design, implementation, and available features have a significant effect on CPU/disk utilization, and hence on performance and power. We discovered that default file system options are often suboptimal, and even poor. We show that a careful matching of expected workloads to file system types and options can improve power-performance efficiency by a factor ranging from 1.05 to 9.4 times.

FLEXIBLE, WIDE-AREA STORAGE FOR DISTRIBUTED SYSTEMS WITH WHEELFS
AUTHOR: J. Stribling, Y. Sovran, I. Zhang, R. Morris, et al.
PUBLISH: In Proceedings of the 6th USENIX Symposium on Networked Systems Design and Implementation (NSDI ’09), USENIX Association, pp. 43–58, 2009.
EXPLANATION:
WheelFS is a wide-area distributed storage system intended to help multi-site applications share data and gain fault tolerance. WheelFS takes the form of a distributed file system with a familiar POSIX interface. Its design allows applications to adjust the tradeoff between prompt visibility of updates from other sites and the ability for sites to operate independently despite failures and long delays. WheelFS allows these adjustments via semantic cues, which provide application control over consistency, failure handling, and file and replica placement. WheelFS is implemented as a user-level file system and is deployed on PlanetLab and Emulab. Three applications (a distributed Web cache, an email service and large file distribution) demonstrate that WheelFS’s file system interface simplifies construction of distributed applications by allowing reuse of existing software. These applications would perform poorly with the strict semantics implied by a traditional file system interface, but by providing cues to WheelFS they are able to achieve good performance. Measurements show that applications built on WheelFS deliver comparable performance to services such as CoralCDN and BitTorrent that use specialized wide-area storage systems.

 
SYSTEM ANALYSIS
EXISTING SYSTEM:
The file system deployed in a distributed computing environment is called a distributed file system, and it typically serves as a backend storage system that provides I/O services for various sorts of data-intensive applications in cloud computing environments. In fact, the distributed file system employs multiple distributed I/O devices by striping file data across the I/O nodes, and uses high aggregate bandwidth to meet the growing I/O requirements of distributed and parallel scientific applications. The Sysbench benchmark is used to create OLTP workloads, since it is able to create OLTP workloads similar to those that exist in real systems. All the configured client file systems executed the same script, and each of them ran several threads that issue OLTP requests. Because Sysbench requires MySQL installed as a backend for OLTP workloads, we assigned the mysqld process to 16 cores of the storage servers. As a consequence, it is possible to measure the response time to client requests while handling the generated workloads.
DISADVANTAGES:

  • Network delay becomes the dominant factor in remote file system access as the system scales numerically and geographically
  • Mobile devices generally have limited processing power, battery life and storage

PROPOSED SYSTEM: 
Certain prefetching techniques have been proposed in succession to read the data on the disk in advance after analyzing disk I/O traces, but this kind of prefetching only works for local file systems, and the prefetched data is cached on the local machine to fulfill the application’s I/O requests passively. In brief, although block access history reveals the behavior of disk tracks, there are no prefetching schemes on storage servers in a distributed file system for yielding better system performance. The reason is the difficulty of modeling the block access history to generate block access patterns and of deciding the destination client machine to which the prefetched data should be pushed from storage servers. To yield attractive I/O performance in the distributed file system deployed in a mobile cloud environment or a cloud environment that has many resource-limited client machines, this paper presents an initiative data prefetching mechanism. The proposed mechanism first analyzes disk I/O tracks to predict future disk I/O access so that the storage servers can fetch data in advance, and then forwards the prefetched data to the relevant client file systems for potential future use.
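One way to picture how the storage server can know the destination client is sketched below: each request carries a client identifier, and the server logs block accesses per client. The names IoRequest, clientId, and BlockAccessLog are hypothetical, and the per-client map is an assumption of this sketch rather than the data structure used in the paper.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** A client read request; clientId stands for the piggybacked client information (hypothetical field). */
final class IoRequest {
    final String clientId;
    final long blockNo;

    IoRequest(String clientId, long blockNo) {
        this.clientId = clientId;
        this.blockNo = blockNo;
    }
}

/** Server-side log of block accesses, kept per client so a prediction has a push destination. */
final class BlockAccessLog {
    private final Map<String, List<Long>> historyByClient = new HashMap<String, List<Long>>();

    /** Called on the storage server for every real I/O request it serves. */
    void record(IoRequest request) {
        List<Long> history = historyByClient.get(request.clientId);
        if (history == null) {
            history = new ArrayList<Long>();
            historyByClient.put(request.clientId, history);
        }
        history.add(request.blockNo);
    }

    /** Read-only view of the blocks one client has accessed, oldest first. */
    List<Long> historyOf(String clientId) {
        List<Long> history = historyByClient.get(clientId);
        return history == null
                ? Collections.<Long>emptyList()
                : Collections.unmodifiableList(history);
    }
}
```

With the history keyed by client, any block the server decides to prefetch is already associated with the machine it should be pushed to.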
This paper makes the following two contributions:
1) Chaotic time series prediction and linear regression prediction to forecast disk I/O access. We have modeled the disk I/O access operations and classified them into two kinds of access patterns, i.e. the random access pattern and the sequential access pattern. In order to predict the future I/O access that belongs to each access pattern as accurately as possible (note that the future I/O access indicates what data will be requested in the near future), two prediction algorithms have been proposed: the chaotic time series prediction algorithm and the linear regression prediction algorithm, respectively.
2) Initiative data prefetching on storage servers. Prefetching proceeds without any intervention from client file systems, except that they piggyback their information onto the relevant I/O requests to the storage servers. The storage servers log disk I/O access and classify access patterns after modeling disk I/O events. Next, by properly using the two proposed prediction algorithms, the storage servers predict the future disk I/O access to guide data prefetching. Finally, the storage servers proactively forward the prefetched data to the relevant client file systems to satisfy the application’s future requests.
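To give a feel for the linear regression prediction named in the first contribution, the sketch below fits an ordinary least-squares line through (access index, block number) pairs and extrapolates one step ahead. This is a generic textbook fit offered under stated assumptions, not the paper's exact formulation; the class name LinearBlockPredictor and the choice of the access index as the regressor are hypothetical. The chaotic time series algorithm is not sketched here.

```java
/** Illustrative least-squares predictor for a sequential-looking block access history. */
final class LinearBlockPredictor {
    private LinearBlockPredictor() {
    }

    /** Fits block = slope * index + intercept over the history and returns the value at the next index. */
    static long predictNext(long[] blocks) {
        int n = blocks.length;
        if (n < 2) {
            throw new IllegalArgumentException("need at least two samples");
        }
        double meanX = (n - 1) / 2.0; // mean of the indices 0..n-1
        double meanY = 0;
        for (long b : blocks) {
            meanY += b;
        }
        meanY /= n;
        double sxy = 0;
        double sxx = 0;
        for (int i = 0; i < n; i++) {
            sxy += (i - meanX) * (blocks[i] - meanY);
            sxx += (i - meanX) * (i - meanX);
        }
        double slope = sxy / sxx; // sxx > 0 because n >= 2
        double intercept = meanY - slope * meanX;
        return Math.round(slope * n + intercept);
    }
}
```

For a history with a constant stride, such as blocks 100, 108, 116, 124, the fitted line has slope 8 and the predicted next block is 132, which is the block a storage server would fetch in advance under this sketch.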

ADVANTAGES:

  • Applications with intensive read workloads automatically make better use of available bandwidth
  • Fewer file operations, because prefetching batches I/O requests
  • Cloud computing offers an illusion of infinite computing resources

HARDWARE & SOFTWARE REQUIREMENTS:

HARDWARE REQUIREMENT:

  • Processor       –    Pentium IV
  • Speed       –    1 GHz
  • RAM       –    256 MB (min)
  • Hard Disk      –   20 GB
  • Floppy Drive       –    1.44 MB
  • Key Board      –    Standard Windows Keyboard
  • Mouse       –    Two or Three Button Mouse
  • Monitor      –    SVGA

SOFTWARE REQUIREMENTS:

JAVA

  • Operating System        :           Windows XP or Win7
  • Front End       :           JAVA JDK 1.7
  • Script :           JavaScript
  • Document :           MS-Office 2007
