The local monitoring system of the CICC
The Central Information and Computing Complex (CICC) of the Joint Institute for Nuclear Research (JINR), located in the Laboratory of Information Technologies (LIT), consists of a large computing cluster (computing slots and interactive machines, data storage systems and a number of control and special-purpose servers). The complex is part of the WLCG (Worldwide LHC Computing Grid) infrastructure. Its resources were used during the construction phase of the LHC experiments, and now, in the running phase, they are used for mass Monte Carlo event production, physics analysis and storage of large-volume data replicas. The CICC LAN is based on aggregated Gigabit Ethernet links (trunks) and on HP ProCurve and Cisco Catalyst switches and routers.
Overseeing and maintaining an infrastructure of this complexity requires centralized local monitoring that watches all resources around the clock, gives timely warning of failures and allows a comprehensive analysis of the complex. High-quality operation of such a system is an important basis for global grid monitoring: it ensures correct operation of the site on top of a controlled infrastructure and provides relevant information about the site to higher levels of monitoring. The data provided by the service is of great importance both for network administrators, who are responsible for the equipment and channels, and for developers and users of grid services.
The web interface of the system is accessible at http://litmon.jinr.ru. A search form on the home page lets you find an object by its network name or address.
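The lookup by name or address can be illustrated with a minimal Python sketch. It is not the actual implementation behind the web form: the inventory records, the field names and the function `find_object` are all hypothetical, and the hosts use reserved documentation names and addresses.

```python
import ipaddress

# Hypothetical inventory of monitored objects (illustrative values only).
INVENTORY = [
    {"name": "wn001.example.org", "address": "192.0.2.10", "kind": "worker node"},
    {"name": "se01.example.org", "address": "192.0.2.20", "kind": "storage server"},
]

def find_object(query, inventory=INVENTORY):
    """Return the first object whose address or name matches the query, else None."""
    query = query.strip()
    try:
        # If the query parses as an IP address, compare addresses.
        addr = ipaddress.ip_address(query)
    except ValueError:
        addr = None  # Not an address: treat the query as a network name.

    for obj in inventory:
        if addr is not None:
            if ipaddress.ip_address(obj["address"]) == addr:
                return obj
        elif obj["name"].lower() == query.lower():
            # Host names are compared case-insensitively.
            return obj
    return None
```

Parsing the query with `ipaddress.ip_address` first avoids ambiguity between names and addresses and normalizes different spellings of the same address.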
In case of emergency, the system sends alerts by email or SMS to the persons responsible for the affected services. Data obtained from the monitoring system has repeatedly helped to identify, locate and troubleshoot problems with CICC services, as well as to optimize individual elements of the complex.
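The alert routing described above can be sketched in Python as follows. This only illustrates the idea of mapping a failing service to its responsible persons and their channels; the contact table, the service names and the function `route_alert` are assumptions, not part of the CICC monitoring system, and actual delivery (SMTP, SMS gateway) is deliberately left out.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Contact:
    name: str
    email: Optional[str] = None
    phone: Optional[str] = None

# Hypothetical table: service name -> persons responsible for it.
RESPONSIBLE = {
    "storage": [Contact("admin-a", email="admin-a@example.org", phone="+10000000001")],
    "network": [Contact("admin-b", email="admin-b@example.org")],
}

def route_alert(service, message, responsible=RESPONSIBLE):
    """Build (channel, address, text) deliveries for everyone responsible for a service."""
    text = f"[{service}] {message}"
    deliveries = []
    for contact in responsible.get(service, []):
        if contact.email:
            deliveries.append(("email", contact.email, text))
        if contact.phone:
            deliveries.append(("sms", contact.phone, text))
    return deliveries
```

For example, `route_alert("storage", "disk pool offline")` yields one email and one SMS delivery for the single storage contact, while an unknown service yields an empty list.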