OpenStack uses the Eventlet library. Eventlet offers a ‘standard’ threading interface to the Python programmer while being implemented using asynchronous I/O underneath. The use of asynchronous I/O rather than standard Python threading or a multi-process approach is aimed at reducing the memory footprint of network servers and can help an application reach high levels of concurrency.
In this post the use of asynchronous I/O is discussed in the context of OpenStack in general and Swift in particular. In many ways the discussion is not specific to Eventlet and OpenStack; it may apply to other asynchronous I/O frameworks and applications.
The main claim here is that, like any technology, asynchronous I/O is ideal for certain applications but not suitable for others. If your application is memory bound, asynchronous I/O is probably going to help it scale. But if the application is CPU bound or I/O bound, it will not. Instead, it may reduce the application’s scalability, as discussed below.
Eventlet’s use of asynchronous I/O may help solve the C10K problem presented by Dan Kegel by reducing the memory footprint used to serve each request (see here for a well-presented counter view advocating the use of threading). At the same time, Eventlet offers a semi-standard Pythonic threading interface that hides the complexity under the hood, making asynchronous I/O programming bearable and accessible to all.
So one good reason to use Eventlet is to reduce the memory footprint. If an application is memory bound, a reduced memory footprint would help scale it. As a prime example, a web server may serve many open sessions. Each session has an associated state. Most sessions may be idle at any given time(1). Some of the sessions are active, processing outstanding requests. Processing an outstanding request may require a certain amount of I/O and CPU prior to sending a response. If at a certain concurrency level, the server is neither I/O bound nor CPU bound, increasing the level of concurrency will yield performance improvement – i.e. more outstanding requests could be processed at any given time.
Bottom line: if an application is memory bound, Eventlet can help scale it up. But what if the application is I/O bound or CPU bound? Will an asynchronous I/O scheme help? Can the use of Eventlet become a problem?
Is there a Right Concurrency Level for I/O bound Applications?
Let’s consider the Object Server of Swift as an example. It is likely that this server would be disk I/O bound. At a certain level of concurrency, the application will reach 100% I/O utilization. Increasing the concurrency level would not make our disks spin or seek faster. Instead, increasing the concurrency level further would thrash our caches and page tables and spread our resources thin, reducing performance rather than increasing it.
Let’s assume an I/O bound application that can fully utilize its I/O with just 100 threads. What happens if 10,000 threads open and read from 10,000 file descriptors of 10,000 files? Since we are I/O bound, the 10,000 read requests will be queued for execution. Eventlet’s deterministic scheduler will run any of the 10,000 threads as soon as a chunk is ready for it to read. So although most of the application threads are idle at any moment, by the time a thread is served, its related data will probably have been evicted from the cache. Moreover, the memory pages used by the thread may have been paged out, so extra CPU and I/O will then be spent to get the thread running. This means our overall performance could severely degrade.
A similar problem occurs in the file system driver in the kernel and in the underlying disk driver. Both perform poorly as we increase the application’s ‘indecisiveness’. File systems and disk drivers use their own caching algorithms, and perform best with somewhat predictable users. At high concurrency levels, an application cannot take advantage of such algorithms.
The best approach for an I/O bound application is to limit the concurrency of the served requests while queuing the remaining requests. The queue lets the application absorb bursts of demand while ensuring that the system is not thrashed by trying to serve everyone at the same time. The application serializes request handling with the help of the queue.
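A back-of-envelope calculation makes the cost of over-subscription concrete. By Little's law, the mean time a request spends in a saturated system equals the number of requests in flight divided by the throughput. The 200 reads per second used below is an assumed, illustrative figure for a saturated disk, not a measurement, and the function name is made up for this sketch.

```python
def mean_time_in_system(in_flight, throughput_per_second):
    """Little's law (L = lambda * W), solved for W, in seconds."""
    return in_flight / throughput_per_second

# Assumption: the disk saturates at 200 reads per second (illustrative only).
DISK_READS_PER_SECOND = 200

at_100 = mean_time_in_system(100, DISK_READS_PER_SECOND)        # 0.5 seconds
at_10000 = mean_time_in_system(10_000, DISK_READS_PER_SECOND)   # 50.0 seconds
```

Under this assumption the throughput is identical in both cases, but each request waits 100 times longer, and that is before any loss from evicted caches or paged-out memory.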
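The queue-plus-limit idea can be sketched as follows. This is a minimal illustration using standard library threads so that it is self-contained; it is not taken from Swift or Eventlet, and the names `serve`, `concurrency` and `backlog` are made up for the example. The handler is assumed not to raise.

```python
import queue
import threading

_DONE = object()  # sentinel telling a worker that no more work will arrive

def serve(requests, handler, concurrency=4, backlog=1000):
    """Handle requests with at most `concurrency` handlers running at once.

    Requests beyond that limit wait in a bounded queue; when the queue
    is full, the producer blocks instead of adding more load.
    Results are returned in completion order, not submission order.
    """
    pending = queue.Queue(maxsize=backlog)
    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            item = pending.get()
            if item is _DONE:
                break
            outcome = handler(item)  # assumption: handler does not raise
            with results_lock:
                results.append(outcome)

    workers = [threading.Thread(target=worker) for _ in range(concurrency)]
    for w in workers:
        w.start()
    for request in requests:
        pending.put(request)      # blocks while the backlog is full
    for _ in workers:
        pending.put(_DONE)        # one sentinel per worker
    for w in workers:
        w.join()
    return results
```

The `concurrency` argument is the knob that would need tuning per system; the `backlog` bounds how much burstiness is absorbed before callers are made to wait.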
Bottom line: An I/O bound application performs best when the right level of concurrency is used – no more, no less. Finding this ‘right level of concurrency’ is system and application dependent and may be left to later tuning. Yet the design should include a reasonable default. ‘The more the merrier’ approach does not apply here.
What about CPU bound applications?
A CPU bound application using Eventlet will always have multiple threads waiting to be executed. Eventlet’s deterministic scheduler will serve each of the threads ready for execution in turn and without preemption. As a result:
- Long requests are favored while short requests take longer to complete. A system serving many short requests may delay all processing of such requests until a longer-running request has finished its computation and yielded.
- No thread is guaranteed CPU within any amount of time. Hence timeouts cannot be enforced with any reasonable level of accuracy. As the concurrency level increases, the time to serve a given thread increases. As an example, consider the thread that waits on a socket for an incoming request (or listens for an incoming connection). This thread must read the request (or accept the connection) within a given amount of time to avoid losing data (the kernel queue is only so long). No guarantee can be made in a CPU bound system without preemption that the thread will indeed be served on time. As we increase the concurrency level, the problem becomes more severe.
Bottom line: A CPU bound application must either avoid asynchronous I/O or be structured such that any group of green threads processes requests of similar length, with any time-sensitive calculation placed in a separate thread/process so that it enjoys the benefits of preemption.
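Both effects can be reproduced with a toy model. The sketch below is not Eventlet's scheduler; it is a minimal round-robin, non-preemptive scheduler over a virtual clock, in which each task is a list of CPU bursts separated by yield points. All names and numbers are made up for illustration.

```python
from collections import deque

def run_cooperatively(tasks):
    """Run (name, bursts) tasks round-robin without preemption.

    Each burst is the CPU time a task uses before it next yields.
    Every task must have at least one burst.
    Returns {name: virtual time at which the task finished}.
    """
    clock = 0
    finished = {}
    ready = deque((name, deque(bursts)) for name, bursts in tasks)
    while ready:
        name, bursts = ready.popleft()
        clock += bursts.popleft()  # runs uninterrupted until it yields
        if bursts:
            ready.append((name, bursts))  # back of the line for its next turn
        else:
            finished[name] = clock
    return finished

long_request = ("long", [1000])
short_requests = [("short-1", [1]), ("short-2", [1]), ("short-3", [1])]

long_first = run_cooperatively([long_request] + short_requests)
short_first = run_cooperatively(short_requests + [long_request])
```

In this model the short requests finish at 1001 to 1003 time units when the long request happens to run first, against 1 to 3 when they run first, while the long request finishes at 1000 or 1003 either way. A thread that must accept a connection within a deadline is in the position of `short-1`: it cannot run until whoever holds the CPU yields.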
What about OpenStack Servers?
Each OpenStack server may have different bounds. Eventlet can help in memory bound OpenStack servers but may reduce the scale and performance of other OpenStack servers. Further, the use of Eventlet may be beneficial when running on one system and counterproductive on another.
A server using threading requires tuning to determine the right default concurrency level of the server. The concurrency level should also be controllable by the admin as different systems would gain from different concurrency levels of the same server. Note that such tuning is missing today from OpenStack.
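As a sketch of what an admin-controllable concurrency level could look like, the snippet below reads a limit from an INI-style configuration with a built-in default. The section name `server`, the option name `max_concurrency` and the default of 100 are hypothetical, chosen for this example only; they are not existing OpenStack options.

```python
import configparser

# Assumption: 100 is a placeholder default; the right value must be measured.
DEFAULT_MAX_CONCURRENCY = 100

def load_max_concurrency(config_text, section="server"):
    """Return the configured concurrency limit, or the default if unset."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    value = parser.getint(section, "max_concurrency",
                          fallback=DEFAULT_MAX_CONCURRENCY)
    if value < 1:
        raise ValueError("max_concurrency must be at least 1")
    return value

tuned = load_max_concurrency("[server]\nmax_concurrency = 32\n")
untuned = load_max_concurrency("")
```

Here `tuned` picks up the admin's value and `untuned` falls back to the default, so the same server binary can run with different limits on different systems.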
Q: What are the applicable default concurrency-levels for Nova, Swift and other OpenStack servers, such that they would become either CPU bound or I/O bound? Are OpenStack servers really memory bound? In other words, will using CPython threading really scale OpenStack down? Is the C10K problem applicable to OpenStack?
Multicore and Multiprocessing aspects
The standard CPython threading is implemented using the underlying Linux threading services such that every Python thread is a Linux thread. As a result, a single-process, multi-threaded Python executable can take advantage of multiple cores. That said, Python’s Global Interpreter Lock (GIL) limits the ability of the process to scale with the number of cores. Q: What is a realistic parallelism level one can expect with Python threading considering the GIL?
Eventlet implements green threads (i.e. user-land threads). Hence, an Eventlet-enabled executable runs on a single core and cannot take advantage of other cores. OpenStack is configurable to work with multiple workers, each forked as a separate process. Hence, when using Eventlet on an N core machine, running N workers of any executable would help fully utilize the node’s CPUs. Note that with CPython threading, fewer workers may suffice.
Should OpenStack use multi-function executables?
What if an executable performs several functions: for example, one function that is memory bound, one that is I/O bound and one that is CPU bound? The memory bound function scales with increasing concurrency levels. The I/O bound function performs best at a given level of concurrency. The CPU bound function may cause the other two functions to fail to meet timeouts or to respond to events on time. What a mess.
Bottom line: Separation of concerns is needed – avoid placing different functions with potentially different bounds in the same Asynchronous I/O domain. It seems like the existing coupling of every OpenStack request processor with a web front end may therefore be a source of trouble.
Asynchronous I/O can scale a memory bound application but is not helpful if the bound is elsewhere, specifically if the application is I/O bound or CPU bound. Further, separation of concerns needs to be maintained to ensure separate asynchronous I/O domains per function. I/O bound functions need to be limited to the right level of concurrency, which may differ from one function to another and from one system to another.
The next steps…
- Collect initial inputs from the OpenStack community
- Collect information about the real bounds of different OpenStack request processors and the default concurrency level that should be used for a CPU or I/O bound request processor.
- Add support to tune concurrency levels
(1) Thanks to Nadav Harel for pointing this out.