Improving input latency

GSoC summary #1 – July 29

The current implementation of X Window System relies in a signal scheme to manage the input event coming from hardware devices. This scheme frequently get blocked when lot of IO is occurring (for instance, when the process is swapping in/out). Get blocked means for instance a jumping cursor on the screen and in GUI is always desirable to prioritize the system responsiveness for end users. The human/computer interface should be smooth and this is the most user visible aspect of a system.

Besides the need for improvement in system responsiveness, the current design of the event stream has some oddities, probably due historical reasons, such as the cursor update done in user-space or the huge path that takes to draw the cursor instead just connect the mouse hardware directly with the cursor position update in-kernel. Moreover there is no fundamental reason to input drivers be dependent of DDX part of the X server. Therefore a design of the input subsystem must be carefully redone to improve such issues.

Our project try to solve all this problems. In summary the goal is: to get a path from hardware input event to client delivery that cannot be blocked by rendering or IO operations, meaning we always have very low latency on input events. Moreover, a redesign of such event stream could improve the overall X graphics stack, which must be considered as well.

So far three strategies were explored to achieve the goal:

1. put X input generation stage in a separate thread

2. put X input generation and processing stages others threads

3. shortcut the kernel input layer with drm to decrease the cursor update latency

Basically 1. and 2. tries to solve the issue of blocking signals and 3. would be a completely redesign in input infrastructure. Anyway, the 3. strategy would impact in 1. and 2. but these could be implemented in parallel with the third strategy. The following sections details each strategy.

== strategy #1 ==

Strategy 1 does not uses a signal handler anymore to wake up the event generation code. It simply poll for device’s socket and giving that this code is under a separate thread this is a win for the CPUs.

With the separate thread taking care only the input code, it was expected that the cursor footprint always lived on resident memory when the mouse stills in movement. Unfortunately this was not true. For some reason it swaps back to disk. Maybe some scheduler adjusts would help here. A memory lock scheme was tried to do lock the cursor footprint always in physical memory without success.

This strategy is basically what we’ve been done is the first GSoC. This is pretty much implemented. It would not require much trouble to push it to X server from upstream. The code is here:

== strategy #2 ==

This strategy can be thought as an improvement of #1. It can be separated in two models of implementation:

Model one:

thread #1 deals with
– injection and processing of input events
thread #2 deals with
– requests from known clients
– new client that tries to connect

It would be very very nice to let both threads totally independents. But we cannot. The event delivery depends on window structure and the first thread must always wake up the second. Also, sometimes the processing of events take a while and the injection of events stays stucked in this model. So we came with this another:

Model two:

thread #1 deals with
– injection of input events from devices
thread #2 deals with
– processing of input events to clients
thread #3 deals with
– requests from known clients
– new client that tries to connect

With this model the first and the second thread become not so tied and given that we’re using non blocking fds to wake up each thread (through a pipe), CPU “enjoys” the effect of threads. For instance, under heavy drawing primitives only thread #3 would wake up.

We had a proof-of-concept of this last model and it workish (occasionally seeing some segfaults probably due of some critical regions we forgot to lock – now the only mutex that exists is inside the server queue of events).

It’s hard to imagine other threaded models mainly because the way X deals with clients are very tied in every piece of the server and it would require a lot of mutexes.

== strategy #3 ==

For sure this strategy is the most shocking one :) The idea is to connect the mouse hardware directly to the cursor position update function, all inside kernel. We’d then rewrite the event stream from the pointer device to an absolute position. Transform the relative mouse motion into an absolute screen position seems to be not that complicated, but this strategy would involve acceleration and cursor limits inside kernel as well (the current implementation of accel deals with floats, so we would have to adapt it to live in kernel).

It is a _very_ _large_ amount of codification. It would require changes to the X server, DDX driver and its corresponding kernel DRM drivers, drm library and kernel input drivers. A mini-input driver *inside* drm is also needed. We would add complexities of the connection between input device and output device to the kernel (in my proof-of-concept implementation evdev is dependent of drm. Yeah, really weird world). Moreover, we would have to avoid somehow two differents sets of the exact same code in different contexts in the case of sw cursors (think MPX). It’s a completely redesign. Things would have to go incrementally.

But why this strategy? Well, this would solve all the current issues with input latency. For instance with the current design of the kernel modesetting – which seems the future – the cursor is jumping a lot, much more than with current implementation. Try to call a xrandr instance and move the mouse with kernel modesetting. xrandr will do DDC communication which will blocked X in the kernel. So with the handling and update of the cursor inside the kernel all would work fine (and my proof-of-concept already showed this).

Moreover, I believe the current implementation remained until now due historical reasons. Ultrix systems placed the entire input subsystem in the kernel. What is the problem to do this in Linux (and others) as well (besides massive codification)?

and non-dri drivers? Should we forget them?


cursor handling and updates inside DRM

The current DRM kernel modesetting tree is already taking care to update the cursor registers and paint it to the screen. Very cool [0].

What I’ve done today is a shortcut between the kernel input layer and DRM to update the cursor directly on screen without the X server be notified always. Of course, a lot of issues raised up together. So let’s try to delegates the tasks again.

userspace app (X server):
– starts all this mechanism telling which is the device responsible for the cursor (input ddx drv)
– responsible for loading new cursor images and push to the DRM (video ddx drv)

kernel input layer (evdev driver):
– notify and send its relative coordinates events to DRM

– transform relative motion into absolute
– takes care the cursor limits
– responsible for the acceleration computation
– responsible for the input transformation as well?
– touch the gfx registers.

Seems that a reasonable amount of code in ddx input drv (mainly ReadInput) and dix (mainly GetPointerEvents) would be “swallowed” by the DRM. The “event generation stage” of the server would deal with the event itself + xkb + Xi things (which eventually could be done in a dedicated thread) and will let to DRM the responsibility of paint the cursor on screen.

The communication between kernel input drv can be directly, calling a DRM function; the DRM and userspace can communicate basically using ioctls. Complains?

This would only works with DRM supporting OSes. What about the others?

[0]  Not so much. Seems this method to update the cursor is sending _a lot of_ ioctls and sometimes doing cursor jumps. But I have to double check to see if the problem is for sure with context switches.

fakemouse — a driver that emulates a mouse

For my SoC project I need some mechanism to evaluate the improvement of the input thread inside X. So I wrote a simple kernel driver that emulates the mouse device moving and emitting bits of a simple pattern. I don’t know if something like this already exists or if there are other ways to do it, but the fact is that the solution I thought took me only few hours between the moment that I imagined, collected some ideas on the Web and implemented it.

Why emulate a device? I need stress the X server always with same routines and things like XWarpPointer() and XTestFake*MotionEvent() is not close to a real user usage because they do not pass through all the paths of the event creation stage inside X. So now I can run fakemouse module together with some x11perf test and collect the results comparing the X with and without input thread. Cool :)

For those who are interested in the driver can do the following:
# wget
# tar xzvf fakemouse-0.01.tar.gz
# cd fakemouse-0.01
# make
# insmod fakemouse.ko
# echo 1 > /sys/module/fakemouse/parameters/mouseon

and be happy seeing what happens in some event node create by fakemouse (/dev/input/event*).