Parallel events (panic) with X

Unfortunately, the model I described some weeks ago, putting the input event delivery of the X server in a separate thread, wouldn’t be an advantage. I jumped the gun in thinking it would be feasible. Sorry :(

I started to implement all this, but grabbing all the global variables that both threads touch and locking them turned out to be a very tedious task. So I decided to stop going this way. It’s hard to program while thinking in parallel. It’s even harder to debug a program with several flows of execution. What’s more, the tools don’t help you (if you’re lucky, gdb will work).

But the main argument for dropping this model is that the “main” flow of execution (i.e. basically all the functions in {Swapped,}ProcVector) and the input delivery flow (ProcessInputEvents()) are very tightly coupled. Both deal a lot with clients, so we’d need to lock several globals and would spend a lot of time just managing the threads. It’s easy to see this in action: just put a breakpoint in TryClientEvents(). Every single request to deliver a given event to a given client goes through this function, and both the input flow and the main flow call it. So you will see this function being called a zillion times. The contention between the would-be processing thread and the main thread would be even greater if a client chooses to receive MotionNotify events.

So yeah, it’s far from clear how to put the processing of input events inside another thread.

== Next ==

In the next few days I’ll be traveling to CESol in Fortaleza, here in Brazil. I was invited to talk about my work in X land. Latin America has a lot of promising countries when it comes to FOSS development; however, for some reason no one there actively participates in and contributes to X development (why?). I’ll try to motivate people there somehow :)

Next week I’ll get the generation thread into good enough shape to eventually push it upstream. I’ll also try to write a good summary of all my work, given that GSoC is coming to an end.

Priorities and scheduling hints for X server threads

Input events routed through another thread/process can hurt latency because we can’t guarantee that it will get scheduled at the right moment. Although this is hard to observe with the current threaded X server implementation, we must design something to avoid it. One way to improve responsiveness is to give the input thread a high priority and also adjust its CPU scheduling policy. (Note that this will not avoid problems related to page faults, which usually happen in the X input flow.)

Linux uses a 1:1 thread model and the scheduler handles every thread as a process. For now I don’t care about other systems. Both the input generation and processing threads were designed to sleep after a relatively short CPU run, so we can give a high priority to threads that are trusted not to hog the CPU. And given that they are special time-critical tasks, I have no doubt about which policy to use: I set both input threads to the real-time FIFO policy with the maximum priority (sched_get_priority_max()).
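
A minimal sketch of the policy setup described above, assuming Linux. The helper name set_input_thread_realtime() is hypothetical and not taken from the X server tree, and the use of sched_setscheduler() is an assumption; the text only names sched_get_priority_max(). On Linux a pid of 0 applies to the calling thread, so each input thread would call this on itself:

```c
#include <errno.h>
#include <sched.h>
#include <string.h>

/* Hypothetical helper: switch the calling thread to SCHED_FIFO at the
 * maximum priority the system allows for that policy.
 * Returns 0 on success, or a positive errno value on failure
 * (typically EPERM when the process lacks the needed privileges). */
static int set_input_thread_realtime(void)
{
    struct sched_param param;
    int max_prio = sched_get_priority_max(SCHED_FIFO);

    if (max_prio == -1)
        return errno;

    memset(&param, 0, sizeof(param));
    param.sched_priority = max_prio;

    /* pid 0: on Linux this means "the calling thread". */
    if (sched_setscheduler(0, SCHED_FIFO, &param) == -1)
        return errno;

    return 0;
}
```

Setting a real-time policy normally needs root or CAP_SYS_NICE, so a caller should probably treat a failure as non-fatal and keep running with the default policy.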

I’m sure that someone will complain that this would slow down the main thread a bit when it runs together with both input threads. But in a GUI we’re talking about a better user experience. Latency variability must be avoided whenever possible in interactive situations. What the user sees is what matters. For non-interactive processes (server-type workloads) the situation is totally different.

Xorg’s philosophy is to be portable, so we have to take care when setting these kinds of parameters. It is a complex issue and different systems do it in wildly different ways. I used my Linux box (2.6.24) to design it all.

Improving input latency

GSoC summary #1 – July 29

The current implementation of the X Window System relies on a signal scheme to manage the input events coming from hardware devices. This scheme frequently gets blocked when a lot of I/O is occurring (for instance, when the process is swapping in/out). Getting blocked means, for instance, a jumping cursor on the screen, and in a GUI it is always desirable to prioritize system responsiveness for end users. The human/computer interface should be smooth, and this is the most user-visible aspect of a system.

Besides the need to improve system responsiveness, the current design of the event stream has some oddities, probably for historical reasons, such as the cursor update being done in user space, or the long path it takes to draw the cursor instead of just connecting the mouse hardware directly to the cursor position update in the kernel. Moreover, there is no fundamental reason for input drivers to depend on the DDX part of the X server. Therefore the input subsystem must be carefully redesigned to address these issues.

Our project tries to solve all these problems. In summary, the goal is to get a path from hardware input event to client delivery that cannot be blocked by rendering or I/O operations, meaning we always have very low latency on input events. Moreover, a redesign of this event stream could improve the overall X graphics stack, which must be considered as well.

So far three strategies were explored to achieve the goal:

1. put the X input generation stage in a separate thread

2. put the X input generation and processing stages in other threads

3. shortcut the kernel input layer with drm to decrease the cursor update latency

Basically, 1. and 2. try to solve the issue of blocked signals, and 3. would be a complete redesign of the input infrastructure. Strategy 3 would have an impact on 1. and 2., but these could still be implemented in parallel with it. The following sections detail each strategy.

== strategy #1 ==

Strategy 1 does not use a signal handler anymore to wake up the event generation code. It simply polls the devices’ file descriptors, and given that this code runs in a separate thread, this is a win for the CPUs.
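
Roughly, the polling side of such a thread could look like the sketch below. This is only an illustration under assumptions: wait_and_read() is a hypothetical name, the real code watches several devices rather than a single fd, and the actual X server code is organised differently.

```c
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical helper: sleep in poll() until fd has data (or the timeout
 * expires), then read whatever is available.
 * Returns the number of bytes read, 0 on timeout, -1 on error. */
static ssize_t wait_and_read(int fd, void *buf, size_t len, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int ready;

    /* No signal handler involved: the thread just blocks here.
     * Restart the wait if a signal interrupts it. */
    do {
        ready = poll(&pfd, 1, timeout_ms);
    } while (ready == -1 && errno == EINTR);

    if (ready <= 0)
        return ready;               /* 0 = timeout, -1 = error */

    if (!(pfd.revents & POLLIN))
        return -1;                  /* hang-up or error on the fd */

    return read(fd, buf, len);
}
```

An input thread would call this in a loop with a negative timeout (block forever) and hand each chunk it reads to the event generation code.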

With a separate thread taking care of only the input code, the cursor footprint was expected to always stay in resident memory while the mouse is still moving. Unfortunately this turned out not to be true. For some reason it still gets swapped out to disk. Maybe some scheduler adjustments would help here. A memory lock scheme was tried to keep the cursor footprint always in physical memory, without success.

This strategy is basically what we did in the first GSoC, and it is pretty much implemented. It would not take much trouble to push it to the upstream X server. The code is here:

== strategy #2 ==

This strategy can be thought of as an improvement on #1. It can be split into two implementation models:

Model one:

thread #1 deals with
– injection and processing of input events
thread #2 deals with
– requests from known clients
– new client that tries to connect

It would be very, very nice to make both threads totally independent. But we cannot. Event delivery depends on the window structure, and the first thread must always wake up the second. Also, the processing of events sometimes takes a while, and the injection of events gets stuck in this model. So we came up with another one:

Model two:

thread #1 deals with
– injection of input events from devices
thread #2 deals with
– processing of input events to clients
thread #3 deals with
– requests from known clients
– new client that tries to connect

With this model the first and second threads are not so tightly coupled, and given that we’re using non-blocking fds to wake up each thread (through a pipe), the CPU “enjoys” the effect of the threads. For instance, under heavy use of drawing primitives only thread #3 would wake up.
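
The pipe-based wakeup mentioned above can be sketched as follows. All three function names are hypothetical and the proof-of-concept may do this differently; the point is only to show why non-blocking fds matter here: the waker never stalls on a full pipe, and the woken thread can drain all pending tokens in one go.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Create a pipe with both ends non-blocking. Returns 0 or -1. */
static int make_wakeup_pipe(int fds[2])
{
    if (pipe(fds) == -1)
        return -1;

    for (int i = 0; i < 2; i++) {
        int flags = fcntl(fds[i], F_GETFL);
        if (flags == -1 ||
            fcntl(fds[i], F_SETFL, flags | O_NONBLOCK) == -1) {
            close(fds[0]);
            close(fds[1]);
            return -1;
        }
    }
    return 0;
}

/* Wake the thread sleeping on the read end. Returns 0 or -1. */
static int wake_thread(int write_fd)
{
    char token = 1;
    ssize_t n;

    do {
        n = write(write_fd, &token, 1);
    } while (n == -1 && errno == EINTR);

    /* A full pipe (EAGAIN) means a wakeup is already pending. */
    if (n == 1 || (n == -1 && errno == EAGAIN))
        return 0;
    return -1;
}

/* Consume all pending wakeup tokens; returns how many were read. */
static int drain_wakeups(int read_fd)
{
    char buf[64];
    int total = 0;

    for (;;) {
        ssize_t n = read(read_fd, buf, sizeof(buf));
        if (n > 0) {
            total += (int)n;
            continue;
        }
        if (n == -1 && errno == EINTR)
            continue;
        break;  /* pipe empty (EAGAIN), EOF, or a real error */
    }
    return total;
}
```

The sleeping thread would block in poll() or select() on the read end, call drain_wakeups() once it becomes readable, and then process whatever is waiting in the shared event queue.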

We have a proof-of-concept of this last model and it works-ish (we occasionally see segfaults, probably due to some critical regions we forgot to lock; for now the only mutex that exists is inside the server’s event queue).

It’s hard to imagine other threaded models, mainly because the way X deals with clients is tied into every piece of the server and it would require a lot of mutexes.

== strategy #3 ==

For sure this strategy is the most shocking one :) The idea is to connect the mouse hardware directly to the cursor position update function, all inside the kernel. We’d then rewrite the event stream from the pointer device into an absolute position. Transforming the relative mouse motion into an absolute screen position does not seem that complicated, but this strategy would also involve acceleration and cursor limits inside the kernel (the current implementation of acceleration deals with floats, so we would have to adapt it to live in the kernel).
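
To give an idea of the relative-to-absolute step, here is a small integer-only sketch in plain userspace C. It is not the proof-of-concept code: struct cursor_pos, clamp_long() and cursor_apply_rel() are hypothetical names, and acceleration is left out entirely; only the cursor limits are applied.

```c
/* Hypothetical cursor state: absolute position in pixels. */
struct cursor_pos {
    int x;
    int y;
};

/* Clamp v into the inclusive range [lo, hi]. */
static long clamp_long(long v, long lo, long hi)
{
    if (v < lo)
        return lo;
    if (v > hi)
        return hi;
    return v;
}

/* Apply a relative motion (dx, dy) to the absolute position and keep the
 * cursor inside a width x height screen. Integers only, no floats, so
 * the same arithmetic could in principle live in kernel context. */
static void cursor_apply_rel(struct cursor_pos *pos, int dx, int dy,
                             int width, int height)
{
    pos->x = (int)clamp_long((long)pos->x + dx, 0, (long)width - 1);
    pos->y = (int)clamp_long((long)pos->y + dy, 0, (long)height - 1);
}
```

A real version would also have to handle multiple screens and an integer (for example fixed-point) replacement for the float-based acceleration.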

It is a _very_ _large_ amount of coding. It would require changes to the X server, the DDX driver and its corresponding kernel DRM driver, the drm library and the kernel input drivers. A mini input driver *inside* drm is also needed. We would add the complexity of the connection between input device and output device to the kernel (in my proof-of-concept implementation evdev depends on drm. Yeah, really weird world). Moreover, we would somehow have to avoid having two different copies of the exact same code in different contexts in the case of sw cursors (think MPX). It’s a complete redesign. Things would have to go incrementally.

But why this strategy? Well, it would solve all the current issues with input latency. For instance, with the current design of kernel modesetting (which seems to be the future) the cursor jumps a lot, much more than with the current implementation. Try running an xrandr instance and moving the mouse with kernel modesetting: xrandr will do DDC communication, which will block X in the kernel. With the handling and update of the cursor inside the kernel, all of this would work fine (and my proof-of-concept already showed this).

Moreover, I believe the current implementation has survived until now for historical reasons. Ultrix systems placed the entire input subsystem in the kernel. What is the problem with doing this in Linux (and others) as well (besides the massive amount of coding)?

And non-DRI drivers? Should we forget them?


fakemouse — a driver that emulates a mouse

For my SoC project I need some mechanism to evaluate the improvement brought by the input thread inside X. So I wrote a simple kernel driver that emulates a mouse device moving and emitting events in a simple pattern. I don’t know if something like this already exists or if there are other ways to do it, but the fact is that this solution took me only a few hours from the moment I imagined it, through collecting some ideas on the Web, to implementing it.

Why emulate a device? I need to stress the X server always with the same routines, and things like XWarpPointer() and XTestFake*MotionEvent() are not close to real user usage because they do not pass through all the paths of the event creation stage inside X. So now I can run the fakemouse module together with some x11perf tests and collect the results, comparing X with and without the input thread. Cool :)

Those who are interested in the driver can do the following:
# wget
# tar xzvf fakemouse-0.01.tar.gz
# cd fakemouse-0.01
# make
# insmod fakemouse.ko
# echo 1 > /sys/module/fakemouse/parameters/mouseon

and be happy seeing what happens in the event node created by fakemouse (/dev/input/event*).
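
To look at what comes out of that event node from a program, something like the sketch below might do. It assumes the node speaks the standard Linux evdev protocol (struct input_event from <linux/input.h>) and that fakemouse emits EV_REL events, which is not stated above; read_rel_event() is a hypothetical name.

```c
#include <linux/input.h>
#include <unistd.h>

/* Read one evdev event from fd (an opened /dev/input/event* node).
 * Returns 1 and fills *dx / *dy for a relative motion event,
 * 0 for any other event type, and -1 on EOF, error or short read. */
static int read_rel_event(int fd, int *dx, int *dy)
{
    struct input_event ev;
    ssize_t n = read(fd, &ev, sizeof(ev));

    if (n != (ssize_t)sizeof(ev))
        return -1;

    *dx = 0;
    *dy = 0;

    if (ev.type != EV_REL)
        return 0;            /* e.g. EV_SYN, EV_KEY */

    if (ev.code == REL_X)
        *dx = ev.value;
    else if (ev.code == REL_Y)
        *dy = ev.value;

    return 1;
}
```

Opening the node with open() in read-only mode and calling this in a loop should print or count the pattern the driver generates; reading the node usually requires root.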

Google Summer of Code 2008

I’m very happy to say that I was selected again to work on Google Summer of Code with X.Org Foundation. Daniel will be my mentor again. Thanks Google. Thanks X.Org!

Last year we did some nice work separating the input event generation code of the X server into a different thread. We saw some performance improvement there, especially because the implementation no longer uses signals to wake up the server when some device emits an event. The reason is that when a process is in uninterruptible sleep (D state), signals are delayed and the mouse cursor lags.

The idea now is to continue the work and put the event processing stage in a separate thread as well. This will require locking a lot of structures and will be very challenging. I’ll be posting all my progress here.