compute on database platforms and we need to execute streaming analytics in memory
as data streams. The challenge here is that we will collect several terabytes of data from source-generated files but need to provide new 100-200 GB extracts for analytics, while we will still have access to operational data for running analytics and exploration.
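To make the extract requirement concrete, the following is a minimal PySpark sketch of how a multi-terabyte set of raw files could be reduced to a much smaller extract for analytics. The paths, column names, and filter condition are hypothetical and not CERN's actual schema.

# Minimal sketch: build a small analytics extract from a large raw dataset.
# All paths and column names below are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("extract-builder").getOrCreate()

# Multi-terabyte input of source-generated files stored as Parquet on HDFS.
raw = spark.read.parquet("hdfs:///data/raw/run_2018/")

# Keep only the rows and columns the analysts need, shrinking terabytes
# down to an extract on the order of 100-200 GB.
extract = (raw
           .where(F.col("detector_id") == 42)
           .select("event_id", "timestamp", "energy"))

extract.write.mode("overwrite").parquet("hdfs:///data/extracts/run_2018_detector42/")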
To process the data, the new platforms added included Apache Hadoop, Apache Kafka, Apache Spark, Apache Flume, Apache Impala, Oracle, and a NoSQL database. This data processing architecture will be integrated with the existing ecosystem of Oracle databases, SAS, and analytics systems. The Apache stack selected is shown in the picture below.
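Before turning to the cluster configuration, the following is a minimal sketch of how two of these components, Kafka and Spark, could be wired together for in-memory streaming analytics. It assumes the spark-sql-kafka connector is available on the Spark classpath; the broker address, topic name, and output paths are hypothetical.

# Minimal sketch: consume a Kafka topic with Spark Structured Streaming
# and continuously land the results on HDFS. Names below are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("streaming-ingest").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "monitoring-events")
          .load())

# Kafka delivers the payload as binary; cast it to a string for downstream parsing.
parsed = events.selectExpr("CAST(value AS STRING) AS payload", "timestamp")

# Append the stream to HDFS; the checkpoint directory tracks streaming progress.
query = (parsed.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/streaming/monitoring/")
         .option("checkpointLocation", "hdfs:///checkpoints/monitoring/")
         .start())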
The Hadoop configuration implemented at CERN includes the following:
Bare-metal Hadoop/YARN clusters
Five clusters
110+ nodes
14+ PB storage
20+ TB memory
3100+ cores
HDDs and SSDs
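As an illustration of how an application might request resources from clusters of this shape, the following is a minimal PySpark sketch that targets YARN; the executor sizing values are illustrative and not CERN's actual settings.

# Minimal sketch: point a Spark application at the Hadoop/YARN clusters.
# Executor sizing below is illustrative only.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("yarn-example")
         .master("yarn")                             # schedule onto the YARN cluster
         .config("spark.executor.memory", "8g")      # memory per executor
         .config("spark.executor.cores", "4")        # cores per executor
         .config("spark.executor.instances", "50")   # executors requested
         .getOrCreate())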
Access to data is provided through Active Directory, and native security rules are enforced for each layer of access, from the Grid to Hadoop. The rules provide encryption, decryption, hierarchies, and granularity of access. The authorization policy is implemented in these rules, and authentication is implemented with Active Directory.
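As a rough illustration of how such layered rules could be applied on Hadoop once Kerberos/Active Directory authentication is in place, the following is a minimal Python sketch that wraps the standard kinit and HDFS command-line tools. The keytab, principal, key name, paths, and group are hypothetical; this is not CERN's actual policy code.

# Minimal sketch: apply per-layer access rules on HDFS after authenticating.
# All principals, paths, and key names below are hypothetical.
import subprocess

def run(cmd):
    # Run a shell command and fail loudly on a non-zero exit code.
    subprocess.run(cmd, check=True)

# Authenticate against the Kerberos realm backed by Active Directory.
run(["kinit", "-kt", "/etc/security/keytabs/analyst.keytab", "analyst@EXAMPLE.ORG"])

# Encrypt one layer of data at rest by placing it in an HDFS encryption zone.
run(["hdfs", "crypto", "-createZone", "-keyName", "layer1_key", "-path", "/data/layer1"])

# Grant the analysts group read-only access to that layer, and nothing more.
run(["hdfs", "dfs", "-setfacl", "-m", "group:analysts:r-x", "/data/layer1"])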
The end-user analysts and physicists at CERN use Jupyter notebooks with a PySpark implementation to work on all of the data. The notebooks use Impala, Pig, and Python, and several innovations have been added by the CERN team to adapt the Apache stack to their specific requirements. We will discuss these innovations in the next segment.
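Before turning to those innovations, the following is a minimal sketch of the kind of interactive analysis an analyst might run from a Jupyter notebook with PySpark; the extract path and column names are hypothetical.

# Minimal sketch: interactive aggregation from a Jupyter notebook with PySpark.
# The extract path and column names below are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("notebook-session").getOrCreate()

events = spark.read.parquet("hdfs:///data/extracts/run_2018_detector42/")

# Count events and compute mean energy per day, then pull the small result
# back into the notebook for plotting or further exploration in Python.
daily = (events
         .groupBy(F.to_date("timestamp").alias("day"))
         .agg(F.count("*").alias("events"), F.avg("energy").alias("mean_energy"))
         .orderBy("day"))

daily.show(10)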
Innovations: