With the features and capabilities discussed here, the limitations of distributed data processing with relational databases are no longer a real barrier. The new generation architecture has created a scalable and extensible data processing environment for web applications and has been widely adopted by companies that run web platforms. Over the last decade, many of these technologies have been contributed back to the open source community for further development by innovators across the world (see the Apache Software Foundation pages for the committers across projects). The new generation of data processing platforms, including Hadoop, Hive, HBase, Cassandra, MongoDB, Neo4J, DynamoDB, and others, are all products of these efforts and are discussed in this chapter.
Technology continues to develop in this direction (by the time this book is finished, there will be newer developments, which can be found on the book's website).


                 Big data processing requirements

What is unique about big data processing? What makes it different, or mandates new thinking? To understand this better, let us look at the underlying requirements. We can classify big data requirements based on the characteristics of the data:

Volume
• The size of the data to be processed is large; it needs to be broken into manageable chunks
• Data needs to be processed in parallel across multiple systems
• Data needs to be processed across several program modules simultaneously
• Data needs to be processed once and processed to completion, due to the volumes involved
• Data needs to be processed from any point of failure, since it is too large to restart the process from the beginning (a minimal sketch of these volume requirements follows this list)
Velocity
• Data needs to be processed at streaming speeds during data collection
• Data needs to be processed from multiple acquisition points
Variety
• Data of different formats needs to be processed
• Data of different types needs to be processed
• Data of different structures needs to be processed
• Data from different regions needs to be processed
Complexity
• Big data processing needs to use many algorithms to process data quickly and efficiently
• Several types of data need multi-pass processing, and scalability is extremely important
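
To make the volume requirements concrete, here is a minimal sketch in Python (not taken from any of the platforms discussed in this chapter) of the split-process-checkpoint pattern: a large file is broken into manageable chunks, the chunks are processed in parallel, and completed chunks are recorded so that a failed run can resume from the point of failure rather than restarting from the beginning. The file names and the per-chunk computation are hypothetical placeholders.

import os
import multiprocessing as mp

CHECKPOINT = "done_chunks.txt"  # hypothetical file recording finished chunk indexes

def split_into_chunks(path, chunk_size=64 * 1024 * 1024):
    """Yield (index, byte_offset, length) for fixed-size chunks of a large file."""
    size = os.path.getsize(path)
    for index, offset in enumerate(range(0, size, chunk_size)):
        yield index, offset, min(chunk_size, size - offset)

def process_chunk(args):
    """Read one chunk and run a per-chunk job (here just a byte count)."""
    path, index, offset, length = args
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read(length)
    return index, len(data)  # placeholder for real per-chunk processing

def run(path):
    done = set()
    if os.path.exists(CHECKPOINT):  # resume: skip chunks that already completed
        with open(CHECKPOINT) as ck:
            done = {int(line) for line in ck if line.strip()}
    todo = [(path, i, off, ln)
            for i, off, ln in split_into_chunks(path) if i not in done]
    # Process chunks in parallel across worker processes; checkpoint each
    # completed chunk in the parent so a crash never loses finished work.
    with mp.Pool() as pool, open(CHECKPOINT, "a") as ck:
        for index, result in pool.imap_unordered(process_chunk, todo):
            ck.write(f"{index}\n")
            ck.flush()
            print(f"chunk {index}: {result} bytes processed")

if __name__ == "__main__":
    run("big_input.dat")  # hypothetical large input file

The platforms discussed in this chapter apply the same pattern at cluster scale: the file system splits data into blocks, tasks run in parallel across nodes, and failed tasks are re-executed from their inputs rather than restarting the whole job.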