Big Data on OpenStack


OpenStack provides a management framework, or suite of tools, for handling management tasks over various kinds of physical resources in data centers, such as compute, networking, storage, and VM images. It aims not only to improve the utilization of physical resources but also to make resource management and provisioning easier and more convenient. It can work on top of many popular virtualization hypervisors, such as KVM and Xen, which provide the virtualization environment on a single host. Simply put, OpenStack is a centralized resource management tool for data centers.

MapReduce/Hadoop is a framework for efficiently processing huge volumes of unstructured data in data centers. In MapReduce, a request/job is split into multiple tasks that are processed on distributed hosts, so that a shorter completion time can be achieved. An effective way to process big data in data centers is therefore to deploy MapReduce as an application layer on top of OpenStack. By exploiting the resource management capabilities that OpenStack provides, one can process huge data sets efficiently while keeping the utilization of physical resources high.
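The split-process-combine flow described above can be sketched in a few lines. This is a minimal in-process illustration of the map, shuffle, and reduce phases of a word count, the canonical MapReduce example; the function names are illustrative and are not the Hadoop API, where mappers and reducers would run on distributed hosts.

```python
from collections import defaultdict

def map_phase(documents):
    """Mapper: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle/sort: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reducer: combine the grouped values for each key (here, sum counts)."""
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data on openstack", "openstack manages data centers"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["openstack"])  # 2
```

In a real Hadoop deployment each phase runs in parallel across the cluster, which is where the shorter completion time comes from; the logic per record is the same.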

However, unlike general database tools such as MySQL or Oracle SQL, MapReduce is just a processing framework rather than a data repository: it hardly provides the query, or random (ad hoc) query, functions that are common operations in today's databases. Therefore, in order to give users and developers a more convenient way to retrieve the information they want from huge data sets, an SQL-compatible tool is needed to simplify query operations. Hive is such a tool. It works on top of MapReduce and is responsible for translating SQL query commands into jobs that MapReduce then processes; the results are returned to the user once the job completes. Dremel, a similar tool developed by Google, is an alternative choice. It provides more powerful random query capacity: with a carefully designed data storage structure and parallel processing, it can return results within a few seconds even for PB-scale data.
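To make the SQL-to-MapReduce translation concrete, here is a sketch of how a query like `SELECT dept, COUNT(*) FROM employees GROUP BY dept` maps onto the two phases: the GROUP BY column becomes the map key and the COUNT(*) becomes the reduce aggregation. The table and rows are made up for illustration; Hive itself compiles such queries into one or more real Hadoop job stages rather than in-process functions.

```python
from collections import defaultdict

# Hypothetical rows that a table scan over "employees" would produce.
employees = [
    {"name": "ann", "dept": "eng"},
    {"name": "bob", "dept": "eng"},
    {"name": "cai", "dept": "ops"},
]

def mapper(row):
    # The GROUP BY column (dept) becomes the map output key.
    yield (row["dept"], 1)

def reducer(key, values):
    # COUNT(*) becomes a sum over each key's grouped values.
    return (key, sum(values))

groups = defaultdict(list)
for row in employees:
    for key, value in mapper(row):
        groups[key].append(value)

result = dict(reducer(k, vs) for k, vs in groups.items())
print(result)  # {'eng': 2, 'ops': 1}
```

Other SQL clauses follow the same pattern: a WHERE filter runs inside the mapper, and aggregates like SUM or AVG change only the reducer.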

The tools described above are mainly aimed at developers. User-side tools/interfaces can be built simply as software with a GUI, so that users can easily query and access the information they want from huge data sets with common SQL commands, while the query tasks are automatically translated into MapReduce processing jobs.

