Hadoop

2013. 1. 13. 16:46

http://hadoop.apache.org/

Apache Hadoop is an open-source software framework that supports data-intensive distributed applications, licensed under the Apache v2 license. It supports the running of applications on large clusters of commodity hardware. The Hadoop framework transparently provides both reliability and data motion to applications. Hadoop implements a computational paradigm named MapReduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both map/reduce and the distributed file system are designed so that node failures are automatically handled by the framework. It enables applications to work with thousands of computation-independent computers and petabytes of data. Hadoop was derived from Google's MapReduce and Google File System (GFS) papers.
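
To make the MapReduce idea above concrete, here is a minimal in-memory sketch in Python. It is not Hadoop code: the function names and the `failing_attempts` parameter are hypothetical, and everything runs in one process. It only illustrates how a job splits into fragments, how intermediate pairs are grouped by key, and why a fragment can be safely re-executed after a failure (the map step is a pure function of its input).

```python
from collections import defaultdict


def map_fragment(fragment):
    """Map step: emit a (word, 1) pair for every word in one input fragment."""
    return [(word.lower(), 1) for word in fragment.split()]


def shuffle(pairs):
    """Group intermediate values by key, between the map and reduce steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce step: collapse all values for one key into a single total."""
    return key, sum(values)


def run_job(fragments, failing_attempts=()):
    """Run map over every fragment, then shuffle and reduce.

    failing_attempts is a hypothetical collection of
    (fragment_index, attempt_number) pairs used only to simulate failures.
    """
    intermediate = []
    for index, fragment in enumerate(fragments):
        attempt = 0
        # A simulated failed attempt is discarded and the fragment is retried.
        while (index, attempt) in failing_attempts:
            attempt += 1
        intermediate.extend(map_fragment(fragment))
    grouped = shuffle(intermediate)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


fragments = ["big data big cluster", "data node data"]
result = run_job(fragments, failing_attempts={(1, 0)})
print(result)
```

Because each fragment is processed independently, the result is the same whether or not an attempt had to be repeated, which is the property that lets the framework reschedule work on any node.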

The entire Apache Hadoop "platform" is now commonly considered to consist of the Hadoop kernel, MapReduce and Hadoop Distributed File System (HDFS), as well as a number of related projects – including Apache Hive, Apache HBase, and others.

Hadoop is written in the Java programming language and is a top-level Apache project being built and used by a global community of contributors. Hadoop and its related projects (Hive, HBase, ZooKeeper, and so on) have many contributors from across the ecosystem. Though Java code is most common, any programming language can be used with "streaming" to implement the "map" and "reduce" parts of the system.
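
As a rough illustration of the "streaming" style mentioned above, the sketch below writes the map and reduce parts as plain Python functions over lines of text. In Hadoop Streaming the mapper and reducer are separate executables that read standard input and write standard output, with the text before the first tab treated as the key by default. The function names and sample lines here are hypothetical, and the sort step merely stands in for the framework's shuffle.

```python
from itertools import groupby


def line_key(line):
    """Return the key of a 'key<TAB>value' line (text before the first tab)."""
    return line.split("\t", 1)[0]


def mapper(lines):
    """Emit one 'word<TAB>1' line for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts for each key; the input must already be sorted by key."""
    for key, group in groupby(sorted_lines, key=line_key):
        total = sum(int(line.rstrip("\n").split("\t", 1)[1]) for line in group)
        yield f"{key}\t{total}"


# Local simulation: sorting by key stands in for the shuffle between stages.
lines = ["hadoop stores data", "hadoop processes data"]
shuffled = sorted(mapper(lines), key=line_key)
for out in reducer(shuffled):
    print(out)
```

In an actual streaming job, each function would live in its own script that iterates over `sys.stdin` and prints its output lines; the same logic could be written in any language that can read and write text streams.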

Flow Chart

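Hadoop's core purpose is to store and analyze big data. These storage and analysis functions, however, are generally not provided in real time. In other words, even though Hadoop can carry out an enormous amount of computation quickly, it is not used like an RDB, which returns results for incoming data in real time.

The basic flow for analysis follows the order "collect → store (NoSQL) → analyze → process → analyze → store (RDBMS) → derive results".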

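Collection: Flume
Storage: NoSQL (Cassandra, HBase, MongoDB, etc.)
Analysis tools: MapReduce (Pig, Hive)
Analysis results: R (provides a UI such as graphs so that ordinary clients can easily understand the results), Sqoop (NoSQL → RDBMS)
Management/monitoring: ZooKeeper, Chukwa, Hue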

Hadoop and its Related Projects

In the figure below, the light-green parts make up Hadoop itself, and the gray parts are the related projects.

Ambari™	A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health (for example, heatmaps) and the ability to view MapReduce, Pig and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner.
Avro™	A data serialization system.
Cassandra™	A scalable multi-master database with no single points of failure.
Chukwa™	A data collection system for managing large distributed systems.
HBase™	A scalable, distributed database that supports structured data storage for large tables.
Hive™	A data warehouse infrastructure that provides data summarization and ad hoc querying.
Mahout™	A scalable machine learning and data mining library.
Pig™	A high-level data-flow language and execution framework for parallel computation.
ZooKeeper™	A high-performance coordination service for distributed applications.
Nutch™	An effort to build an open source search engine based on Lucene and Hadoop, also created by Doug Cutting.

Architecture

Hadoop consists of the Hadoop kernel, MapReduce, and the Hadoop Distributed File System (HDFS).

Hadoop Distributed File System (HDFS): file storage technology (HTML, images, video, PDF, logs, etc.)
- NameNode: stores the metadata (redundancy is a known problem)
- DataNodes: store files split into 64 MB blocks; by default, 3 copies of each block are kept
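
The block and copy figures above translate into simple arithmetic. The sketch below assumes the 64 MB block size and 3 copies stated in the notes (both are configurable in a real cluster); the function name and the 200 MB example file are hypothetical.

```python
import math

BLOCK_SIZE_MB = 64  # block size from the notes above
REPLICATION = 3     # default number of copies from the notes above


def hdfs_footprint(file_size_mb):
    """Return (blocks, block replicas, raw MB stored) for a single file."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    replicas = blocks * REPLICATION
    # The final block only occupies its actual size, so raw usage is
    # the file size times the number of copies, not blocks * 64 MB.
    raw_mb = file_size_mb * REPLICATION
    return blocks, replicas, raw_mb


blocks, replicas, raw_mb = hdfs_footprint(200)
print(f"{blocks} blocks, {replicas} replicas, {raw_mb} MB of raw storage")
```

A 200 MB file therefore becomes 4 blocks (three full blocks plus one 8 MB block), held as 12 block replicas spread across the DataNodes.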