Institut für Prozessdatenverarbeitung und Elektronik (IPE)
Progress in virtually all areas of physics research relies on recording and analyzing enormous amounts of data. This is equally true for high-energy physics at the LHC, for planned future lepton and neutrino detectors, and for experiments at high-intensity light sources such as the EU-XFEL or PETRA III. Recent improvements in detector instrumentation provide researchers with unprecedented detail. At the same time, data rates far outpace improvements in the performance of storage systems. Online data reduction is therefore crucial for the next generation of detectors.
We aim to establish a closer integration of data acquisition workflows with cloud-enabled HPC centers. The goal of this work is to build infrastructure that pushes data from the detector directly into the local HPC data center and relies on HPC resources for data processing and reduction. Rapid advances in Ethernet technology provide sufficient readout bandwidth, but data distribution methods relying on RDMA technologies are required to use the network capacity efficiently. One of the challenges is to design an efficient protocol that facilitates communication between the DAQ hardware and the data processing cluster and simplifies the development of scalable data reduction modules. As a pilot project, we aim to enable the deployment of extremely complex machine learning models that can be executed across multiple nodes and accelerated using FPGAs, GPUs, and/or custom neuro-computers. We aim to enable real-time data reduction and classification of data streams with rates in the range of 10 to 20 GB/s per detector (multi-detector systems are envisaged).
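To put the quoted rates into perspective, the following back-of-the-envelope sketch estimates how many Ethernet links a single detector would need. The 100 Gbit/s link speed, the 90 % usable-bandwidth factor, and the function name are illustrative assumptions and are not taken from the project description; GB is read as 10^9 bytes.

```python
import math


def links_needed(rate_gbytes_per_s, link_gbits_per_s=100.0, efficiency=0.9):
    """Estimate the number of links needed to carry a detector stream.

    rate_gbytes_per_s -- detector data rate in GB/s (10^9 bytes per second)
    link_gbits_per_s  -- nominal link speed in Gbit/s (assumed: 100 GbE)
    efficiency        -- assumed fraction of the nominal speed usable for payload
    """
    required_gbits = rate_gbytes_per_s * 8  # bytes -> bits
    usable_gbits = link_gbits_per_s * efficiency
    return math.ceil(required_gbits / usable_gbits)


if __name__ == "__main__":
    for rate in (10, 20):
        print(f"{rate} GB/s -> {links_needed(rate)} x 100 GbE link(s)")
```

Under these assumptions, 10 GB/s (80 Gbit/s) fits on one 100 GbE link and 20 GB/s (160 Gbit/s) needs two, which is why the bottleneck is expected in the software stack and data distribution rather than in the raw link bandwidth.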
The student is expected to perform a subset of the following tasks:
Benchmark high-speed communication protocols, e.g. UDP, SCTP, QUIC. Research available high-throughput alternatives to the standard Linux network stack, e.g. DPDK or libvma.
The latest Mellanox adapters allow part of the packet processing to be offloaded to hardware. Investigate the provided features and assess whether they can be used to further increase network throughput.
Evaluate available RDMA extensions, e.g. RoCE or iWARP, for delivering data directly to compute accelerators such as FPGAs or GPUs.
Design an application-layer protocol integrating Ethernet-connected detectors with data-processing clusters. The protocol should include a control channel for setting and reading detector parameters (registers) and a high-speed data streaming channel.
Evaluate different methods to scale the data flow across multiple cluster nodes. Assess scalability potential, fault tolerance, cost, and simplicity of implementation.
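As an illustration of the protocol design task above (a control channel for register access plus a data streaming channel), the following minimal sketch shows one possible message framing. Everything in it is hypothetical: the magic value, the 16-byte header layout, the message types, and the function names are assumptions made for this example, not a specification of the protocol to be developed.

```python
import struct
from enum import IntEnum


class MsgType(IntEnum):
    """Hypothetical message types shared by the control and data channels."""
    REG_READ = 1   # control: request the value of a register
    REG_WRITE = 2  # control: set a register to a value
    REG_REPLY = 3  # control: reply carrying a register value
    DATA = 4       # streaming: a chunk of detector data


MAGIC = b"DAQ0"  # assumed protocol identifier
VERSION = 1      # assumed protocol version

# Assumed header, network byte order, 16 bytes in total:
# magic (4s), version (B), type (B), flags (H), sequence (I), payload length (I)
HEADER = struct.Struct("!4sBBHII")
REG_WRITE_BODY = struct.Struct("!II")  # register address, register value
REG_READ_BODY = struct.Struct("!I")    # register address


def pack_message(msg_type, seq, payload=b"", flags=0):
    """Prepend the header to a payload."""
    return HEADER.pack(MAGIC, VERSION, msg_type, flags, seq, len(payload)) + payload


def unpack_message(buf):
    """Parse one message and return (type, sequence, flags, payload)."""
    magic, version, msg_type, flags, seq, length = HEADER.unpack_from(buf)
    if magic != MAGIC or version != VERSION:
        raise ValueError("unknown magic or protocol version")
    payload = bytes(buf[HEADER.size:HEADER.size + length])
    if len(payload) != length:
        raise ValueError("truncated payload")
    return MsgType(msg_type), seq, flags, payload


def reg_write(seq, addr, value):
    """Build a control message that writes `value` to register `addr`."""
    return pack_message(MsgType.REG_WRITE, seq, REG_WRITE_BODY.pack(addr, value))


def reg_read(seq, addr):
    """Build a control message that requests register `addr`."""
    return pack_message(MsgType.REG_READ, seq, REG_READ_BODY.pack(addr))


if __name__ == "__main__":
    msg = reg_write(seq=7, addr=0x10, value=0xDEADBEEF)
    msg_type, seq, flags, payload = unpack_message(msg)
    addr, value = REG_WRITE_BODY.unpack(payload)
    print(msg_type.name, seq, hex(addr), hex(value))
```

A fixed-size binary header with an explicit sequence number is chosen here because it is simple to parse in FPGA logic and lets the receiver detect lost or reordered packets on unreliable transports such as UDP; a real design would additionally have to address fragmentation, acknowledgements, and flow control.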