
Big Data + Open Source + Inexpensive High-Powered Parallel Processing + Grid Technologies = Driving the Best in Analytics to the Next Level for the Best in Business Performance

Aggregation of Big Data From All Sources

The Nutch project, started in 2003 and led by Doug Cutting, set out to index the web so that it could be searched using multi-machine processing; it included a distributed file system and a MapReduce facility. Google published its framework papers on the Google File System (GFS) in 2003 and on MapReduce in 2004. GFS, with fault-tolerant processing of large amounts of data, can generate and also store tremendous amounts of data, while MapReduce is a programming model with the capacity to support parallel processing of huge amounts of data.

 

Today there is the Apache Hadoop project, which offers open-source software that is not only reliable, scalable, and available but also enables distributed computing by leveraging cheap commodity hardware networked together in the same location. Such a cluster can scale out and read 2 TB of data in 3.5 minutes.

Huge Forms of Information Are Becoming Accessible - Awesome!!!

Several forces are combining to make analytics attractive to business leaders as they make the critical decisions that drive their organizations to the next level: the availability of high-powered parallel processing, Hadoop clusters, and grid technologies to capture, process, analyze, and visualize massive amounts of data that grow daily at an exponential rate; access to scalable data storage solutions such as cloud platforms; increased processing of data streams on the go; and the advent of faster computing technologies with the intensive capacity to crunch the data, extract robust information, and present that information visually so that it makes sense.

Value Proposition - Big Data

Hadoop Ecosystem

Different Forms of Structured and Unstructured Data 

Types of Big and Small Data - Unstructured and Structured Data

Nature of Big Data: What Comes to Mind When We Talk About Big Data 

Beyond Relational Database Management Systems (RDBMS): Welcome to the Hadoop Ecosystem, Where Rapid Development of Additional Products Is Exciting to Data and Analytics Enthusiasts

 

Hadoop Distributed File System (HDFS)

Initially, the best known products within the Hadoop Ecosystem were the Hadoop Distributed File System (HDFS) and MapReduce.

The Hadoop Distributed File System (HDFS), based on the Google File System (GFS), can run on top of a typical native filesystem. It splits files into blocks that are stored on datanodes but managed by a namenode, and it allows for streaming data access patterns that entail sequential reads and appends.
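
To make that concrete, here is a minimal sketch of the HDFS Java API, assuming a cluster whose namenode is configured as fs.defaultFS; the file path used is hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS (the namenode address) from core-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // The namenode tracks this file's metadata; its blocks are stored on datanodes.
        Path path = new Path("/data/events.log"); // hypothetical path
        try (FSDataOutputStream out = fs.create(path)) {
            out.writeBytes("event-1\n");
        }

        // Streaming access pattern: append new data rather than rewriting in place.
        try (FSDataOutputStream out = fs.append(path)) {
            out.writeBytes("event-2\n");
        }
        fs.close();
    }
}
```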

 

 

MapReduce

MapReduce, a Google-originated technology, pairs map and reduce functionality and remains the workhorse that data and analytics enthusiasts really like because of its capacity to process large amounts of data in parallel. Tasks can be localized to the nodes that hold the data and scheduled to run at a specified time.
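
As an illustration of the map and reduce split, here is the classic word-count sketch against the standard org.apache.hadoop.mapreduce API; the whitespace tokenizing is just one simple choice:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // Map phase: runs in parallel across input splits, emitting (word, 1) pairs.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    ctx.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: receives every count emitted for one word and sums them.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            ctx.write(word, new IntWritable(sum));
        }
    }
}
```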

 

YARN

YARN came in as an improvement to MapReduce, splitting the monolithic JobTracker into two daemons (a cluster-wide ResourceManager and a per-application ApplicationMaster) and also elevating the efficiency of memory management.
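
A hedged driver sketch, reusing the hypothetical WordCount classes above: setting mapreduce.framework.name to yarn submits the job to the ResourceManager, which launches a per-application ApplicationMaster to track the tasks. The input and output paths are illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Run on YARN: the ResourceManager allocates containers and a
        // per-application ApplicationMaster shepherds this job's tasks.
        conf.set("mapreduce.framework.name", "yarn");

        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenMapper.class);
        job.setReducerClass(WordCount.SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/input"));   // hypothetical HDFS paths
        FileOutputFormat.setOutputPath(job, new Path("/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```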

 

Hadoop Column Database, or HBase

Today there is also the Hadoop column database, or HBase: an open-source NoSQL database, written in Java, and distributed so that it can handle very large tables. It focuses on batch and random reads even though its query model is limited, it is easily scalable, and it will run on a cluster of "cheap" hardware.
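
A minimal sketch of random writes and reads through the HBase Java client; the table, column family, and qualifier names here are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("metrics"))) { // hypothetical table

            // Random write: one row keyed by "server-01", one cell in family "cf".
            Put put = new Put(Bytes.toBytes("server-01"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("cpu"), Bytes.toBytes("0.75"));
            table.put(put);

            // Random read of the same row; access is keyed, not ad-hoc SQL.
            Result result = table.get(new Get(Bytes.toBytes("server-01")));
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("cpu"))));
        }
    }
}
```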

 

Apache Pig

When writing raw MapReduce code becomes unwieldy, Apache Pig can be used to drive data-flow scripting in multi-step jobs that translate readily into MapReduce tasks.
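
A sketch of such a multi-step flow using Pig's embedded PigServer API from Java; the input path and schema are hypothetical, and each registered statement is one step that Pig compiles into MapReduce:

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigFlowExample {
    public static void main(String[] args) throws Exception {
        // Embedded Pig; each statement below is one step of the data flow.
        PigServer pig = new PigServer(ExecType.MAPREDUCE);
        pig.registerQuery("logs = LOAD '/data/logs' USING PigStorage('\\t') "
                + "AS (user:chararray, bytes:long);"); // hypothetical input and schema
        pig.registerQuery("by_user = GROUP logs BY user;");
        pig.registerQuery("totals = FOREACH by_user GENERATE group, SUM(logs.bytes);");
        pig.store("totals", "/data/bytes_per_user"); // executed as MapReduce jobs
    }
}
```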

 

Hive

There is also Hive, which remains the data warehousing infrastructure, with the ability to tap into HDFS and HBase. With Hive query language (HQL) statements translated into MapReduce jobs, Hive executes the jobs on the Hadoop cluster.
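
Since HiveServer2 speaks JDBC, HQL can be issued from plain Java; the host, table, and column names below are hypothetical, and the query is compiled into MapReduce jobs behind the scenes:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver"); // HiveServer2 JDBC driver
        // Hypothetical HiveServer2 host and table.
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "user", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT page, COUNT(*) AS views FROM pageviews GROUP BY page")) {
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("views"));
            }
        }
    }
}
```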

 

Sqoop

Sqoop helps with bulk imports of data from relational databases into Hadoop, and with exports back out.

 

Avro

Avro is the data serialization system: records are described by JSON schemas and stored in a compact binary format.
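
A minimal sketch of writing one record to an Avro container file with the generic API; the schema and field names are made up for illustration:

```java
import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumWriter;

public class AvroWriteExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical schema: Avro records are described by JSON schemas like this.
        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
                + "{\"name\":\"name\",\"type\":\"string\"},"
                + "{\"name\":\"clicks\",\"type\":\"long\"}]}");

        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "ada");
        user.put("clicks", 42L);

        // Serialize to a compact, schema-tagged container file.
        DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<>(schema);
        try (DataFileWriter<GenericRecord> fileWriter = new DataFileWriter<>(datumWriter)) {
            fileWriter.create(schema, new File("users.avro"));
            fileWriter.append(user);
        }
    }
}
```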

 

Oozie

Oozie handles workflow scheduling and management.
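
As a sketch of submitting a workflow through the Oozie Java client: the server URL and application path are hypothetical, and the workflow.xml is assumed to already sit in HDFS at that path:

```java
import java.util.Properties;
import org.apache.oozie.client.OozieClient;

public class OozieSubmitExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical Oozie server URL.
        OozieClient oozie = new OozieClient("http://localhost:11000/oozie");

        Properties props = oozie.createConfiguration();
        // Hypothetical HDFS directory containing the workflow.xml definition.
        props.setProperty(OozieClient.APP_PATH,
                "hdfs://namenode:8020/user/analyst/wordcount-wf");
        props.setProperty("queueName", "default");

        String jobId = oozie.run(props); // submit and start the workflow
        System.out.println("Workflow status: " + oozie.getJobInfo(jobId).getStatus());
    }
}
```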

 

ZooKeeper

ZooKeeper is the "high-performance coordination" service for distributed apps, enabling them to track servers. ZooKeeper is a very good addition to HBase, especially for region assignments.
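
A minimal sketch of that server-tracking pattern with the ZooKeeper Java client; the ensemble address and znode paths are hypothetical, and the parent znode /servers is assumed to already exist:

```java
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperExample {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // Hypothetical ensemble address; the watcher fires once the session is up.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> connected.countDown());
        connected.await();

        // An ephemeral znode vanishes when this client's session dies --
        // which is exactly how peers notice a server has gone away.
        zk.create("/servers/worker-01", "10.0.0.5".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        byte[] data = zk.getData("/servers/worker-01", false, null);
        System.out.println(new String(data));
        zk.close();
    }
}
```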

 

Chukwa

Chukwa handles large-scale collection of log data while providing analysis and monitoring of the collected data.

 

Given the fault tolerance within the Hadoop ecosystem, there are capabilities to self-heal, self-start, and self-manage in dealing with not only the data but also the computation loads. The environment replicates data blocks to multiple nodes, restarts a task on another node if it is taking too long, and does not lose data just because some nodes fail!
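
These behaviors are tunable; a hedged sketch of the usual knobs (the values shown are the common defaults, not a recommendation):

```java
import org.apache.hadoop.conf.Configuration;

public class FaultToleranceConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Each HDFS block is copied to this many datanodes, so losing a node loses no data.
        conf.setInt("dfs.replication", 3);
        // Speculative execution: re-launch straggler tasks on another node
        // and keep whichever attempt finishes first.
        conf.setBoolean("mapreduce.map.speculative", true);
        conf.setBoolean("mapreduce.reduce.speculative", true);
        System.out.println("replication = " + conf.get("dfs.replication"));
    }
}
```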

The big users of Hadoop are easily identified as the producers of big data, and they include Yahoo, Facebook, Netflix, LinkedIn, CNET, and the NY Times.

 

Mahout

Within the analytics domain, there are machine learning libraries for Hadoop, including Mahout. Mahout allows for genetic algorithms, clustering, pattern recognition, and collaborative filtering.

 

Does this mean that, with the Hadoop ecosystem, there is no need for Relational Database Management Systems (RDBMS) anymore? The answer is a resounding no!
