Scalability Issues

Chris Dunlap edited this page Aug 25, 2023 · 7 revisions

ConMan will need to support increased scalability for upcoming clusters. Current scalability is limited by the following:

  • Use of poll() in the mux_io() event loop (see Reactor pattern). This should be replaced with epoll(), but since that is Linux-specific, investigate event-notification libraries such as libevent, libuv, libev, and libeio.

  • Use of a singly-linked list to manage console objects. While this was acceptable with select() and later poll(), it will limit any performance gains from replacing poll(). This also negatively impacts performance at startup when checking for duplicate console names during object creation. The objects list should be replaced with a tree-based data structure providing efficient random access.

  • Use of a sorted singly-linked list to manage timers, resulting in O(n) insertion and deletion (although dispatch is only O(1)). This should be replaced with a heap data structure, which would provide O(log n) insertion, deletion, and dispatch. Another possibility is hashed timing wheels [Varghese and Lauck 1996], which can be as efficient as O(1) for insertion, deletion, and dispatch.

  • Use of Expect to support SSH connections. Each console connected in this manner requires an additional two processes: one for Expect and another for ssh. While previous testing on MCR showed acceptable performance for ~1280 consoles using Expect to drive a telnet process, increasing process counts are expected to impact scalability (or at the very least, clutter the ps listing). Investigate libssh and libssh2.

  • Inability to add/remove/edit consoles without restarting conmand (see #13). Starting the daemon causes a burst of network activity as connections are established. This is problematic when managing a large number of consoles since CPU cycles are wasted traversing the console list to process poll() events. Furthermore, requiring the daemon to be restarted in order to add/remove/edit consoles will likely result in console messages being dropped while connections are re-established.

  • IPv4 only. While the use of IPv4-only connections is not expected to directly impact performance, the lack of IPv6 support limits usability at sites requiring IPv6 addressing.

  • Outdated client/server protocol. The current protocol is largely unchanged since its inception. It is not particularly efficient. Queries for a list of matching consoles are limited by a maximum buffer size (currently 128KB; see MAX_SOCK_LINE). Furthermore, the protocol is not encrypted and not easily extended.

  • Single-threaded event loop. mux_io() is a single-threaded non-blocking I/O event loop. As such, it is largely unable to take advantage of increasing core/processor counts. However, consoles using FreeIPMI are able to benefit from increasing parallelism since libipmiconsole manages its own thread pool. Multi-threaded support may be necessary for increased scalability.

  • Centralized daemon. Each conmand process is a standalone server managing all client connections. A decentralized or hierarchical model may be necessary for increased scalability.

  • Fixed-size static console object buffer (currently 16KB; see OBJ_BUF_SIZE). As the number of consoles increases, it would be advantageous to use a dynamically-sized buffer to reduce overall memory usage, or at least to allow the buffer size to be specified in the configuration file.

  • Much of the event processing is based on the underlying object type and controlled by if or switch statements. A small performance gain might be achieved by switching to function pointers for common object operations. This would also make the code more understandable and maintainable.
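
As a rough illustration of the function-pointer idea in the last item, the following is a minimal sketch only. The struct names, the two object types, and the handler bodies are all hypothetical and are not taken from the ConMan source; the point is merely that the event loop calls through a per-type operations table instead of branching on the object type.

```c
#include <assert.h>
#include <stddef.h>

struct obj;

/* Hypothetical per-type operations table, shared by all objects of a type. */
struct obj_ops {
    const char *type_name;
    int (*on_readable)(struct obj *o);  /* returns bytes consumed */
    int (*on_writable)(struct obj *o);  /* returns bytes flushed */
};

/* Hypothetical object: the ops pointer is set once, at creation. */
struct obj {
    const struct obj_ops *ops;
    const char *name;
    int pending;                        /* bytes waiting to be written */
};

/* Placeholder handlers; real ones would perform the actual I/O. */
static int telnet_read(struct obj *o)  { o->pending += 2; return 2; }
static int serial_read(struct obj *o)  { o->pending += 1; return 1; }
static int common_write(struct obj *o)
{
    int n = o->pending;
    o->pending = 0;
    return n;
}

const struct obj_ops telnet_ops = { "telnet", telnet_read, common_write };
const struct obj_ops serial_ops = { "serial", serial_read, common_write };

/* The event loop calls these; there is no per-type branching here. */
int dispatch_readable(struct obj *o)
{
    assert(o != NULL && o->ops != NULL);
    return o->ops->on_readable(o);
}

int dispatch_writable(struct obj *o)
{
    assert(o != NULL && o->ops != NULL);
    return o->ops->on_writable(o);
}
```

With this layout, adding a new console type would mean adding one more table rather than touching every if or switch in the event loop.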
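
For the timer item above, a binary min-heap keyed on expiry time could look roughly like the sketch below. It assumes a fixed capacity and integer timestamps to stay short; the names (timer_heap_push and so on) are illustrative and do not correspond to existing ConMan functions. Deleting an arbitrary timer is omitted; it would additionally require each timer to track its own slot index.

```c
#include <assert.h>
#include <stddef.h>

#define TIMER_HEAP_MAX 1024      /* fixed capacity keeps the sketch short */

struct timer {
    long expires;                /* absolute expiry time, arbitrary units */
    int id;                      /* caller-chosen identifier */
};

struct timer_heap {
    struct timer slot[TIMER_HEAP_MAX];
    size_t count;
};

static void swap_slots(struct timer *a, struct timer *b)
{
    struct timer tmp = *a;
    *a = *b;
    *b = tmp;
}

/* Insert: append at the end, then sift up. O(log n). Returns -1 if full. */
int timer_heap_push(struct timer_heap *h, long expires, int id)
{
    size_t i;

    assert(h != NULL);
    if (h->count == TIMER_HEAP_MAX)
        return -1;
    i = h->count++;
    h->slot[i].expires = expires;
    h->slot[i].id = id;
    while (i > 0) {
        size_t parent = (i - 1) / 2;
        if (h->slot[parent].expires <= h->slot[i].expires)
            break;
        swap_slots(&h->slot[parent], &h->slot[i]);
        i = parent;
    }
    return 0;
}

/* Earliest timer without removing it. O(1). NULL if the heap is empty. */
const struct timer *timer_heap_peek(const struct timer_heap *h)
{
    return h->count ? &h->slot[0] : NULL;
}

/* Remove the earliest timer: move the last slot to the root, sift down. */
int timer_heap_pop(struct timer_heap *h, struct timer *out)
{
    size_t i = 0;

    if (h->count == 0)
        return -1;
    *out = h->slot[0];
    h->slot[0] = h->slot[--h->count];
    for (;;) {
        size_t l = 2 * i + 1, r = l + 1, m = i;
        if (l < h->count && h->slot[l].expires < h->slot[m].expires)
            m = l;
        if (r < h->count && h->slot[r].expires < h->slot[m].expires)
            m = r;
        if (m == i)
            break;
        swap_slots(&h->slot[i], &h->slot[m]);
        i = m;
    }
    return 0;
}

/* Fire every timer due at or before `now`; returns how many fired. */
int timer_heap_dispatch(struct timer_heap *h, long now, void (*fire)(int id))
{
    struct timer t;
    int fired = 0;

    while (h->count > 0 && h->slot[0].expires <= now) {
        timer_heap_pop(h, &t);
        if (fire != NULL)
            fire(t.id);
        fired++;
    }
    return fired;
}
```

The event loop would use timer_heap_peek() to compute its next wait timeout and call timer_heap_dispatch() after each wakeup.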
