Big Data on OpenStack
OpenStack provides a management framework, a suite of tools for handling management tasks on various kinds of physical resources (such as compute, networking, storage, and VM images) in data centers. It aims not only to improve the utilization of physical resources but also to make resource management and provisioning easier and more convenient. It can work on top of popular virtualization hypervisors such as KVM and Xen, which provide the virtualization environment on a single host. Simply put, OpenStack is a centralized resource management tool for data centers.
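To make this concrete, here is a minimal sketch of provisioning a compute node through the OpenStack APIs using the openstack4j Java SDK; the endpoint, credentials, project name, and the flavor/image IDs are placeholders, not values from any particular deployment.

    import org.openstack4j.api.Builders;
    import org.openstack4j.api.OSClient.OSClientV3;
    import org.openstack4j.model.common.Identifier;
    import org.openstack4j.model.compute.ServerCreate;
    import org.openstack4j.openstack.OSFactory;

    public class BootWorkerNode {
      public static void main(String[] args) {
        // Authenticate against Keystone v3; endpoint, credentials and
        // project name are placeholders for a real deployment.
        OSClientV3 os = OSFactory.builderV3()
            .endpoint("http://controller:5000/v3")
            .credentials("admin", "secret", Identifier.byName("Default"))
            .scopeToProject(Identifier.byName("bigdata"),
                            Identifier.byName("Default"))
            .authenticate();

        // Ask the compute service to boot one VM from an image/flavor
        // pair; the IDs below are hypothetical and would in practice be
        // looked up via os.compute().flavors() and the image service.
        ServerCreate sc = Builders.server()
            .name("hadoop-worker-1")
            .flavor("m1.medium-flavor-id")
            .image("ubuntu-image-id")
            .build();
        os.compute().servers().boot(sc);
      }
    }

The same handful of calls, repeated per node, is how an application layer such as a Hadoop cluster can be stood up on demand.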
MapReduce/Hadoop is a framework that provides the capacity to efficiently process huge volumes of unstructured data in data centers. In MapReduce, a request/job is split into multiple tasks that are processed on distributed hosts, so that a shorter completion time can be achieved; the canonical example is sketched below. An effective way to process big data in data centers is therefore to deploy MapReduce as an application layer on top of OpenStack. By exploiting the resource management capacity provided by OpenStack, one can process huge datasets efficiently while keeping the utilization of the physical resources high.
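The classic WordCount job from the Hadoop MapReduce documentation illustrates this decomposition: map tasks tokenize their input splits in parallel on distributed hosts, and reduce tasks sum the per-word counts.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
      // Each map task sees one split of the input and emits (word, 1).
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
          }
        }
      }
      // Each reduce task receives all counts for a word and sums them.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();
        public void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) sum += val.get();
          result.set(sum);
          context.write(key, result);
        }
      }
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

The framework handles scheduling the map and reduce tasks across hosts; the developer only supplies the two functions.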
However, unlike general database tools such as MySQL or Oracle, MapReduce is just a processing framework rather than a data repository: it hardly provides the query or random-query functions that are very common operations in ordinary databases. Therefore, to give users and developers a more convenient way to extract the information they want from huge datasets, a SQL-compatible tool is needed to simplify query operations. Hive is such a tool. It works on top of MapReduce and is responsible for translating SQL query commands into jobs that MapReduce then processes; the results are returned to the user once the job completes. Dremel, a similar tool developed by Google, is an alternative choice with more powerful random-query capacity: thanks to its carefully designed data storage structure and parallel processing, it can return results over petabyte-scale data within a few seconds.
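As a sketch of how a developer would issue such a query, the following connects to HiveServer2 over JDBC and runs a SQL statement that Hive compiles into MapReduce jobs behind the scenes; the server address, user, and the access_logs table are hypothetical placeholders.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQueryExample {
      public static void main(String[] args) throws Exception {
        // HiveServer2 speaks the standard JDBC protocol.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hive-server:10000/default", "user", "");
             Statement stmt = conn.createStatement();
             // Hive translates this aggregation into MapReduce jobs.
             ResultSet rs = stmt.executeQuery(
                 "SELECT page, COUNT(*) AS hits FROM access_logs "
                     + "GROUP BY page ORDER BY hits DESC LIMIT 10")) {
          while (rs.next()) {
            System.out.println(rs.getString("page") + "\t"
                + rs.getLong("hits"));
          }
        }
      }
    }

From the caller's point of view this is an ordinary database query; the MapReduce execution underneath is entirely hidden.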
The tools described above are mainly aimed at developers. A user-side tool or interface can be provided by simply building a software tool with a GUI on top of them. Users can then easily extract the information they want from huge datasets with common SQL commands, and the query tasks are automatically translated into MapReduce processing jobs, as in the sketch below.
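A minimal sketch of such a front-end, reusing the same hypothetical HiveServer2 endpoint and access_logs table as above: a Swing window takes a SQL string from the user and hands it to Hive, which performs the translation into MapReduce jobs.

    import java.awt.BorderLayout;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.ResultSetMetaData;
    import java.sql.SQLException;
    import java.sql.Statement;
    import javax.swing.*;

    public class SimpleQueryGui {
      public static void main(String[] args) {
        SwingUtilities.invokeLater(SimpleQueryGui::createAndShow);
      }

      private static void createAndShow() {
        JFrame frame = new JFrame("Big Data Query");
        JTextArea input =
            new JTextArea("SELECT * FROM access_logs LIMIT 10", 4, 60);
        JTextArea output = new JTextArea(16, 60);
        JButton run = new JButton("Run query");
        // For brevity the query runs on the Swing event thread; a real
        // tool would use a background worker, since a MapReduce job can
        // take minutes to complete.
        run.addActionListener(e -> {
          output.setText("");
          try (Connection conn = DriverManager.getConnection(
                   "jdbc:hive2://hive-server:10000/default", "user", "");
               Statement stmt = conn.createStatement();
               ResultSet rs = stmt.executeQuery(input.getText())) {
            ResultSetMetaData md = rs.getMetaData();
            while (rs.next()) {
              StringBuilder row = new StringBuilder();
              for (int i = 1; i <= md.getColumnCount(); i++) {
                row.append(rs.getString(i)).append('\t');
              }
              output.append(row.append('\n').toString());
            }
          } catch (SQLException ex) {
            output.setText("Query failed: " + ex.getMessage());
          }
        });
        frame.add(new JScrollPane(input), BorderLayout.NORTH);
        frame.add(new JScrollPane(output), BorderLayout.CENTER);
        frame.add(run, BorderLayout.SOUTH);
        frame.pack();
        frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
        frame.setVisible(true);
      }
    }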