With the features and capabilities discussed here, the limitations of distributed data
processing with relational databases are no longer a real barrier. The new generation
architecture has created a scalable and extensible data processing environment for web
applications and has been widely adopted by companies that run web platforms. Over
the last decade, many of these technologies have been contributed back to the open
source community for further development by innovators across the world (refer to the
Apache Software Foundation's page of committers across projects). The new generation
data processing platforms, including Hadoop, Hive, HBase, Cassandra, MongoDB, Neo4J,
DynamoDB, and more, are all products of these efforts and are discussed in this chapter.
Technology development in this direction is a continuum; by the time we finish this
book, there will be newer developments, which can be found on the website of this book.
Big data processing requirements
What is unique about big data processing? What makes it different, or what mandates
new thinking? To understand this better, let us look at the underlying requirements. We
can classify big data requirements based on the characteristics of big data:
Volume
- The size of the data to be processed is large; it needs to be broken into manageable chunks.
- Data needs to be processed in parallel across multiple systems.
- Data needs to be processed across several program modules simultaneously.
- Due to the volumes involved, data needs to be processed once and processed to completion.
- Processing needs to be resumable from any point of failure, since the data is far too large to restart the process from the beginning (a minimal sketch follows this list).
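These volume requirements translate naturally into a chunk-and-checkpoint design. What follows is a minimal sketch, not a production implementation: it assumes a line-oriented input file named input.dat, a placeholder process_chunk worker, and a hypothetical checkpoint file recording completed chunk ids, so that a restart skips finished work instead of beginning again.

```python
import os
from multiprocessing import Pool

CHECKPOINT = "completed_chunks.txt"   # hypothetical checkpoint file
CHUNK_SIZE = 100_000                  # lines per chunk; tune to the data

def load_completed():
    """Return the set of chunk ids already processed to completion."""
    if not os.path.exists(CHECKPOINT):
        return set()
    with open(CHECKPOINT) as f:
        return {int(line) for line in f if line.strip()}

def read_chunks(path):
    """Break a large line-oriented file into manageable chunks."""
    chunk, chunk_id = [], 0
    with open(path) as f:
        for line in f:
            chunk.append(line)
            if len(chunk) == CHUNK_SIZE:
                yield chunk_id, chunk
                chunk, chunk_id = [], chunk_id + 1
    if chunk:
        yield chunk_id, chunk

def process_chunk(args):
    """Placeholder worker: counts records per chunk (stand-in for real logic)."""
    chunk_id, lines = args
    return chunk_id, len(lines)

if __name__ == "__main__":
    done = load_completed()
    todo = (c for c in read_chunks("input.dat") if c[0] not in done)
    with Pool() as pool, open(CHECKPOINT, "a") as ckpt:
        # Chunks run in parallel across worker processes; each completed
        # chunk is checkpointed so a failed run resumes where it stopped.
        for chunk_id, count in pool.imap_unordered(process_chunk, todo):
            ckpt.write(f"{chunk_id}\n")
            ckpt.flush()
```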
Velocity
- Data needs to be processed at streaming speeds during data collection.
- Data needs to be processed from multiple acquisition points (see the sketch after this list).
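To make the velocity requirements concrete, the sketch below merges records from several simulated acquisition points into one shared queue and handles each record as it arrives, rather than batching to disk first. The feed names and the handle_event function are hypothetical stand-ins for real sources and real stream logic.

```python
import queue
import threading
import time

events = queue.Queue()  # shared buffer fed by all acquisition points

def acquisition_point(name, interval):
    """Simulated sensor/feed pushing records as they are collected."""
    for i in range(5):
        events.put((name, i))
        time.sleep(interval)

def handle_event(source, payload):
    """Placeholder for real stream processing (filter, enrich, route)."""
    print(f"processed {payload} from {source}")

# Start several acquisition points feeding the same stream.
sources = [
    threading.Thread(target=acquisition_point, args=(f"feed-{n}", 0.05 * (n + 1)))
    for n in range(3)
]
for t in sources:
    t.start()

# Consume at streaming speed: each record is handled on arrival.
while any(t.is_alive() for t in sources) or not events.empty():
    try:
        source, payload = events.get(timeout=0.2)
        handle_event(source, payload)
    except queue.Empty:
        pass
```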
Variety
- Data of different formats needs to be processed.
- Data of different types needs to be processed.
- Data of different structures needs to be processed.
- Data from different regions needs to be processed (a format-dispatch sketch follows this list).
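One common way to cope with variety is to normalize each incoming format into a single record shape before downstream processing. This is a minimal sketch under that assumption; the sample inputs and the common dict shape are illustrative only, and real pipelines would typically rely on schema registries or serialization frameworks.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

def from_json(raw):
    return json.loads(raw)

def from_csv(raw):
    # First row is treated as the header.
    return list(csv.DictReader(io.StringIO(raw)))[0]

def from_xml(raw):
    return {child.tag: child.text for child in ET.fromstring(raw)}

PARSERS = {"json": from_json, "csv": from_csv, "xml": from_xml}

def normalize(fmt, raw):
    """Dispatch on format so downstream code sees one record shape."""
    return PARSERS[fmt](raw)

samples = [
    ("json", '{"id": "1", "region": "EU"}'),
    ("csv", "id,region\n2,US\n"),
    ("xml", "<rec><id>3</id><region>APAC</region></rec>"),
]
for fmt, raw in samples:
    print(normalize(fmt, raw))
```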
Complexity
- Processing big data requires combining many algorithms to handle the data quickly and efficiently.
- Several types of data need multi-pass processing, and scalability is extremely important (a two-pass example follows this list).
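A simple concrete case of multi-pass processing, sketched under the assumption of a numeric stream: normalizing values to a 0-1 range needs one pass to learn the global minimum and maximum and a second pass to rescale, since neither step can be folded into a single streaming pass. Each pass reads the data sequentially without holding it all in memory, which is why the scalability of each pass matters.

```python
def pass_one(records):
    """First pass: learn global statistics without holding all data in memory."""
    lo, hi = float("inf"), float("-inf")
    for x in records():
        lo, hi = min(lo, x), max(hi, x)
    return lo, hi

def pass_two(records, lo, hi):
    """Second pass: rescale every value using the statistics from pass one."""
    span = (hi - lo) or 1.0
    for x in records():
        yield (x - lo) / span

# records is a callable so the (possibly huge) source can be re-read per pass.
records = lambda: iter([12.0, 3.5, 7.25, 20.0, 0.5])
lo, hi = pass_one(records)
print(list(pass_two(records, lo, hi)))
```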