XRootD filesystem interface project
The CERN team evaluated the Apache stack and identified a few gaps between their current technology and the new stack that was to augment it. The principal gap was that all physics files were written using the ROOT project, which is developed in C++, and those formats could not be loaded directly into Avro or Spark. The CERN team joined hands with the DIANA-HEP team to create the XRootD connector project, designed to load physics files into HDFS and Spark. Details of the project can be found at http://xrootd.org, and the GitHub page for the project is at https://github.com/cerndb/hadoop-xrootd. A brief sketch of how such files might be read into Spark follows.
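To make the idea concrete, the sketch below shows in broad strokes how a physics file served over the XRootD protocol might be loaded into a Spark DataFrame. This is a minimal illustration, not code from the CERN project: the server name, file path, and the use of the DIANA-HEP spark-root data source are assumptions, and both the Hadoop-XRootD connector and the spark-root jars are presumed to be on the Spark classpath.

# Minimal sketch (illustrative, not from the book): read a ROOT file served
# over XRootD into a Spark DataFrame. Assumes the Hadoop-XRootD connector and
# the DIANA-HEP spark-root data source are on the classpath; the server name
# and file path below are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("root-file-ingest")       # illustrative application name
    .getOrCreate()
)

# "root://..." URLs are served through the XRootD protocol; the connector
# exposes them to Hadoop and Spark as just another filesystem scheme.
events = (
    spark.read
    .format("org.dianahep.sparkroot")  # spark-root data source (DIANA-HEP)
    .load("root://eospublic.cern.ch//eos/example/physics_events.root")
)

events.printSchema()
print(events.count())

Once the file is exposed as a DataFrame, the usual Spark machinery (SQL queries, aggregations, writing out to HDFS in other formats) applies unchanged.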
XRootD Project
The XRootD project aims at giving high-performance, scalable, fault-tolerant access to data repositories of many kinds, with access delivered as file based. The project was conceived around a scalable architecture, a communication protocol, and a set of plug-ins and tools built on those. The freedom to configure XRootD and to make it scale (in size and in performance) allows the deployment of data access clusters of virtually any size, which can include sophisticated features such as authentication/authorization, integration with other systems, and data distribution. The XRootD software framework is a fully generic suite for fast, low-latency, and scalable data access, which can natively serve any kind of data organized as a hierarchical, filesystem-like namespace based on the concept of a directory.
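As an illustration of this file-based access model, the following minimal sketch uses the XRootD Python bindings to list a directory in such a namespace. The server URL and directory path are hypothetical placeholders, and the bindings (pyxrootd) are assumed to be installed separately.

# Hedged sketch (illustrative, not from the book): browse an XRootD-served
# namespace with the XRootD Python bindings. Server URL and path are
# hypothetical.
from XRootD import client
from XRootD.client.flags import DirListFlags

fs = client.FileSystem("root://eospublic.cern.ch")

# List a directory in the hierarchical, filesystem-like namespace,
# requesting stat information for each entry.
status, listing = fs.dirlist("/eos/opendata/example", DirListFlags.STAT)
if status.ok:
    for entry in listing:
        print(entry.name, entry.statinfo.size)
else:
    print("dirlist failed:", status.message)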
Service for web-based analysis (SWAN)
CERN has packaged and built a service layer for analysis based on the web browser. This service, called SWAN, combines Jupyter notebooks, Python, C++, ROOT, Java, Spark, and several other API interfaces. The package is available for download and use by any consumer who works with CERN. The SWAN service is available at https://swan.web.cern.ch.
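To give a flavor of how these pieces sit together, the sketch below shows the kind of cell a user might run in a SWAN-style notebook, where ROOT and Python are available side by side. It is an assumption-laden illustration: the file URL, the tree name ("Events"), the branch name ("pt"), and the cut value are all hypothetical.

# Hedged sketch (illustrative, not from the book): a notebook-style analysis
# cell using PyROOT's RDataFrame on a file served over XRootD. Tree name,
# branch name, and file URL are hypothetical.
import ROOT

rdf = ROOT.RDataFrame(
    "Events",
    "root://eospublic.cern.ch//eos/example/physics_events.root",
)

# Apply an illustrative selection and histogram one branch.
hist = rdf.Filter("pt > 20").Histo1D("pt")
print("entries passing the cut:", hist.GetEntries())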
There are several other innovations in place to manage the large files, streaming analytics, in-memory analytics, and Kerberos security plug-ins.