Chromium Ozone-GBM explained

I’ve written an article about Ozone-GBM, the new graphics platform for Chromium. I think Ozone-GBM will play an important role in the Chromium and Linux graphics communities in general. I hope you enjoy the read :) Please share it.

Linux Graphics for Small Devices at FISL

Last week I was in Brazil at the 11th International Free Software Forum (FISL), talking about Linux Graphics for Small Devices*. I tried to cover a bit of everything I learned in the world I’ve been immersed in recently; I guess there isn’t much news for freedesktopers though. Anyway, everyone is very welcome to give any kind of feedback and comment on it. Just follow here.

*actually, two nights in Porto Alegre and two nights in Curitiba. It was great to see most of my friends!

adopt a child and make multi-card work on Linux

Previously, the message was for toolkits; now it targets upcoming developers… okay, if I wanted to be offensive I could say it targets vendor distributions that care about the desktop on Linux :)

I started hacking on X because the laboratory I was working in at my university was running an amazing project to deploy computer labs in all high schools of the state I was living in, in Brazil. It was a success, and all 2,100 schools used the multiseat computing model.

My work on this project began back in 2006 [0]. At that time I was trying to understand the state of multiple graphics cards on Linux, which is only part of the work needed for multiseat. The work proceeded, but I never managed to push the patches to mainline. Afterwards, now at Nokia, I picked this work up again, aiming at some clean-up of the X server code. It mostly went upstream (see VGA arbiter, libpciaccess and current xserver code). But the code is buggy and a lot of work is still needed to make it work properly.

It seems I have a son now, but he (or should it be she?) is a rebel baby and causes a lot of trouble. Besides, I’m mean and want to give him away!

I don’t care about multi-card development nowadays, and for some unknown reason no one else cares either. But people use it a lot: try to mix old graphics cards with new cards… boom! Try to use multi-card with decent hw acceleration… boom! Try to hotplug graphics devices… no way! Hotswitch… hardly! Perform close to a single-card system… only in your dreams! Some people have been kindly contributing patches for a while, and unfortunately our open-source community lacks the manpower to get them reviewed properly and eventually landed upstream. So here’s your big chance:


[0] BTW, I found the first patch I sent for X. It dates back to April 2006 and was against the GLX backend of Xgl. Very funny :)

multiseat – roadmap

This week our laboratory at the university released the MDM utility to ease the installation and configuration of a multiseat box. The idea is that the end user should no longer have to follow boring and difficult howtos to deploy it. Installing a distro package should be enough now. Try it, use it, report the bugs and send the patches! :)

But I would like to call attention to something here: we’re still relying on the ugly Xephyr solution to build multiseat on a simple PC (there are people selling this solution. Sigh). A lot of the cool stuff arriving in the Linux graphics stack is missing from this solution. So let’s try to trace a roadmap that we can follow in the short/medium term to build a good solution:

– Vga Arbiter
We should forget the Xephyr hack as fast as we can. Several instances of Xorg, one for each user session, is definitely what we want. Anyone who wants to use several graphics cards to deploy multiseat will probably face the problem of the legacy VGA addresses. VGA arbitration is the solution.

It seems Jesse will work towards this for 2.6.28. There’s also a “minor” problem here: the X server is still not POSTing secondary cards (after pci-rework).

– xrandr 1.3
To deploy a multiseat with one video card and multiple CRTCs, the randr extension is enough to cover the hotplug of output devices. For a multi-card configuration, xrandr must be GPU-aware. So we must do some work here as well to get automagic configuration of output devices.

– input hotplug
So far MDM is not using the latest input features of X to plug or re-plug a device into the machine. It uses its own ugly method to poll for devices. Some work is needed here.

– ConsoleKit integration
Device ownership (e.g. audio, pen drives, cameras, usb toys, output and input devices) can be a mess when multiple users share the same machine. Moreover, the security problems this implies could be harmful. ConsoleKit seems to solve all of this. It is able to keep track of all the users currently logged in.

Honestly, I never took a closer look at ConsoleKit. I’m not sure it’s ready to support multiseat. So we need to take care of this as well, eventually putting some hook inside MDM to work with it.

– DRI + modesetting
Support for DRI in multiple X servers running in parallel is not ready yet. Some redesign must be done to achieve this.

– tools for auto configuration
After all this work, some easy and very user-friendly tools to set up multiseat on the fly in the desktop environments would be awesome.

Improving input latency

GSoC summary #1 – July 29

The current implementation of the X Window System relies on a signal scheme to manage input events coming from hardware devices. This scheme frequently gets blocked when a lot of I/O is occurring (for instance, when the process is being swapped in/out). Getting blocked means, for instance, a jumping cursor on the screen, and in a GUI it is always desirable to prioritize system responsiveness for end users. The human/computer interface should be smooth, and this is the most user-visible aspect of a system.

Besides the need for better system responsiveness, the current design of the event stream has some oddities, probably for historical reasons, such as the cursor update being done in user space, or the long path taken to draw the cursor instead of connecting the mouse hardware directly to the in-kernel cursor position update. Moreover, there is no fundamental reason for input drivers to depend on the DDX part of the X server. The input subsystem therefore needs a careful redesign to address these issues.

Our project tries to solve all these problems. In summary, the goal is to get a path from hardware input event to client delivery that cannot be blocked by rendering or I/O operations, so that input events always have very low latency. Moreover, a redesign of the event stream could improve the overall X graphics stack, which must be considered as well.

So far three strategies were explored to achieve the goal:

1. put X input generation stage in a separate thread

2. put X input generation and processing stages in other threads

3. shortcut the kernel input layer with drm to decrease the cursor update latency

Basically, 1. and 2. try to solve the issue of blocking signals, while 3. would be a complete redesign of the input infrastructure. Strategy 3. would have an impact on 1. and 2., but these could be implemented in parallel with it. The following sections detail each strategy.

== strategy #1 ==

Strategy 1 no longer uses a signal handler to wake up the event generation code. It simply polls the device’s socket, and given that this code runs in a separate thread, this is a win for the CPUs.
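To make the idea concrete, here is a minimal sketch in Python (not the X server code, which is C). It assumes a pipe can stand in for the device socket, and the names input_thread, on_event and demo are made up for illustration. The thread just sleeps in poll() until the device becomes readable, with no signal handler involved.

```python
import os
import select
import threading


def input_thread(device_fd, stop_fd, on_event):
    """Sleep in poll() until the device fd is readable, then hand the bytes over."""
    poller = select.poll()
    poller.register(device_fd, select.POLLIN)
    poller.register(stop_fd, select.POLLIN)
    while True:
        for fd, _mask in poller.poll():
            if fd == stop_fd:
                return  # asked to shut down
            on_event(os.read(device_fd, 4096))


def demo():
    # Pipes stand in for the device socket (an assumption of this sketch).
    dev_r, dev_w = os.pipe()
    stop_r, stop_w = os.pipe()
    received = []
    got_one = threading.Event()

    def on_event(data):
        received.append(data)
        got_one.set()

    t = threading.Thread(target=input_thread, args=(dev_r, stop_r, on_event))
    t.start()
    os.write(dev_w, b"motion")   # the "hardware" produces an event
    got_one.wait(timeout=5)      # the input thread picked it up
    os.write(stop_w, b"x")       # tell the thread to exit
    t.join(timeout=5)
    for fd in (dev_r, dev_w, stop_r, stop_w):
        os.close(fd)
    return received


if __name__ == "__main__":
    print(demo())
```

The main thread never touches the device here, so whatever it is busy with does not delay the wake-up of the input thread.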

With the separate thread taking care only of the input code, we expected the cursor footprint to always stay in resident memory while the mouse is still moving. Unfortunately this was not true. For some reason it gets swapped out to disk. Maybe some scheduler adjustments would help here. We tried a memory lock scheme to keep the cursor footprint always in physical memory, without success.

This strategy is basically what we did in the first GSoC. It is pretty much implemented, and pushing it to the upstream X server would not be much trouble. The code is here:

== strategy #2 ==

This strategy can be thought of as an improvement on #1. It can be split into two implementation models:

Model one:

thread #1 deals with
– injection and processing of input events
thread #2 deals with
– requests from known clients
– new client that tries to connect

It would be very, very nice to make both threads totally independent. But we cannot. Event delivery depends on the window structure, so the first thread must always wake up the second. Also, processing events sometimes takes a while, and in this model the injection of events gets stuck in the meantime. So we came up with another one:

Model two:

thread #1 deals with
– injection of input events from devices
thread #2 deals with
– processing of input events to clients
thread #3 deals with
– requests from known clients
– new client that tries to connect

With this model the first and second threads are less tightly coupled, and given that we’re using non-blocking fds to wake up each thread (through a pipe), the CPU “enjoys” the effect of threads. For instance, under heavy drawing primitives only thread #3 would wake up.
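Model two can be sketched roughly like this in Python. It is only a toy under stated assumptions: EventQueue, wake, worker and demo are hypothetical names, the events are plain tuples, and the "client request" is just a string. What it does keep from the description above is one mutex inside the event queue and a non-blocking pipe per thread for wake-ups.

```python
import collections
import os
import select
import threading


class EventQueue:
    """FIFO protected by a single mutex, like the server event queue."""

    def __init__(self):
        self._lock = threading.Lock()
        self._items = collections.deque()

    def push(self, item):
        with self._lock:
            self._items.append(item)

    def drain(self):
        with self._lock:
            items = list(self._items)
            self._items.clear()
        return items


def wake(fd):
    """Write one byte to a non-blocking pipe to wake the thread sleeping on it."""
    try:
        os.write(fd, b"\0")
    except BlockingIOError:
        pass  # pipe is full, so a wake-up is already pending


def worker(wake_r, queue, handle, stop):
    """Sleep in select() until woken, then drain the queue and handle each item."""
    while True:
        select.select([wake_r], [], [])
        try:
            os.read(wake_r, 4096)  # swallow pending wake-up bytes
        except BlockingIOError:
            pass
        stopping = stop.is_set()   # check before draining so nothing is lost
        for item in queue.drain():
            handle(item)
        if stopping:
            return


def demo():
    ev_r, ev_w = os.pipe()   # wakes thread #2 (event processing)
    rq_r, rq_w = os.pipe()   # wakes thread #3 (client requests)
    for fd in (ev_r, ev_w, rq_r, rq_w):
        os.set_blocking(fd, False)
    events, requests = EventQueue(), EventQueue()
    delivered, served = [], []
    stop_ev, stop_rq = threading.Event(), threading.Event()

    def inject():            # thread #1: injection of input events
        for dx in (1, 2, 3):
            events.push(("motion", dx))
            wake(ev_w)
        stop_ev.set()
        wake(ev_w)

    threads = [
        threading.Thread(target=worker, args=(ev_r, events, delivered.append, stop_ev)),
        threading.Thread(target=worker, args=(rq_r, requests, served.append, stop_rq)),
        threading.Thread(target=inject),
    ]
    for t in threads:
        t.start()
    requests.push("PolyFillRectangle")   # a client request, only thread #3 wakes
    wake(rq_w)
    stop_rq.set()
    wake(rq_w)
    for t in threads:
        t.join(timeout=5)
    for fd in (ev_r, ev_w, rq_r, rq_w):
        os.close(fd)
    return delivered, served


if __name__ == "__main__":
    print(demo())
```

In the real server the hard part is everything this toy leaves out: event delivery needs the window structure, which the request thread is modifying at the same time.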

We had a proof-of-concept of this last model and it works-ish (we occasionally see some segfaults, probably due to critical regions we forgot to lock; right now the only mutex that exists is inside the server’s event queue).

It’s hard to imagine other threaded models, mainly because the way X deals with clients is tightly tied into every piece of the server, and it would require a lot of mutexes.

== strategy #3 ==

For sure this strategy is the most shocking one :) The idea is to connect the mouse hardware directly to the cursor position update function, all inside the kernel. We’d then rewrite the event stream from the pointer device as an absolute position. Transforming relative mouse motion into an absolute screen position doesn’t seem that complicated, but this strategy would also involve acceleration and cursor limits inside the kernel (the current acceleration implementation uses floats, so we would have to adapt it to live in the kernel).
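As a rough illustration of that transformation, here is a small Python sketch that uses only integer arithmetic, since floats are what would have to go. The acceleration rule (scale deltas above a threshold by num/den) and the names accelerate, clamp and move_cursor are assumptions made for this example, not the X server’s actual acceleration code.

```python
def accelerate(delta, threshold=4, num=2, den=1):
    """Integer-only acceleration: deltas larger than threshold are scaled by num/den.

    threshold, num and den are hypothetical tuning values.
    """
    if abs(delta) <= threshold:
        return delta
    sign = 1 if delta > 0 else -1
    return sign * (abs(delta) * num // den)


def clamp(value, low, high):
    """Keep value inside [low, high]."""
    return max(low, min(value, high))


def move_cursor(pos, rel, width, height):
    """Turn a relative motion (dx, dy) into an absolute position limited to the screen."""
    x = clamp(pos[0] + accelerate(rel[0]), 0, width - 1)
    y = clamp(pos[1] + accelerate(rel[1]), 0, height - 1)
    return (x, y)


if __name__ == "__main__":
    print(move_cursor((10, 10), (2, -3), 1024, 768))     # small motion, no acceleration
    print(move_cursor((1020, 5), (10, -10), 1024, 768))  # accelerated and clamped to the edges
```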

It is a _very_ _large_ amount of coding. It would require changes to the X server, the DDX driver and its corresponding kernel DRM driver, the drm library and the kernel input drivers. A mini input driver *inside* drm is also needed. We would add the complexity of the connection between input device and output device to the kernel (in my proof-of-concept implementation evdev depends on drm. Yeah, a really weird world). Moreover, we would somehow have to avoid two different copies of the exact same code in different contexts in the case of sw cursors (think MPX). It’s a complete redesign. Things would have to go incrementally.

But why this strategy? Well, this would solve all the current issues with input latency. For instance, with the current design of kernel modesetting, which seems to be the future, the cursor jumps a lot, much more than with the current implementation. Try running xrandr and moving the mouse under kernel modesetting: xrandr will do DDC communication, which blocks X in the kernel. With the handling and update of the cursor inside the kernel, all of this would work fine (and my proof-of-concept already showed this).

Moreover, I believe the current implementation has survived until now for historical reasons. Ultrix systems placed the entire input subsystem in the kernel. What is the problem with doing this in Linux (and others) as well (besides the massive amount of coding)?

And non-DRI drivers? Should we forget them?


fakemouse — a driver that emulates a mouse

For my SoC project I need some mechanism to evaluate the improvement brought by the input thread inside X. So I wrote a simple kernel driver that emulates a mouse device moving and emitting bits of a simple pattern. I don’t know if something like this already exists or if there are other ways to do it, but the fact is that the solution took me only a few hours from the moment I imagined it, through collecting some ideas on the Web, to implementing it.

Why emulate a device? I need to stress the X server always with the same routines, and things like XWarpPointer() and XTestFake*MotionEvent() are not close to real user usage because they do not pass through all the paths of the event creation stage inside X. So now I can run the fakemouse module together with some x11perf test and collect the results, comparing X with and without the input thread. Cool :)

Those interested in the driver can do the following:
# wget
# tar xzvf fakemouse-0.01.tar.gz
# cd fakemouse-0.01
# make
# insmod fakemouse.ko
# echo 1 > /sys/module/fakemouse/parameters/mouseon

and be happy seeing what happens in the event node created by fakemouse (/dev/input/event*).
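If you prefer to look at the events from a script, a minimal Python sketch could decode them like this. It assumes the usual struct input_event layout on a 64-bit kernel (a struct timeval followed by type, code and value) and that fakemouse emits relative motion events, which I would expect from a mouse but is an assumption here. parse_events is a made-up helper name.

```python
import struct

# struct input_event: struct timeval (two longs), __u16 type, __u16 code, __s32 value.
# Native layout; 24 bytes on a typical 64-bit system (assumption of this sketch).
EVENT = struct.Struct("llHHi")

EV_REL = 0x02  # relative axis event type
REL_X = 0x00
REL_Y = 0x01


def parse_events(buf):
    """Return a list of (axis, delta) for every relative X/Y motion found in buf."""
    motions = []
    for offset in range(0, len(buf) - EVENT.size + 1, EVENT.size):
        _sec, _usec, etype, code, value = EVENT.unpack_from(buf, offset)
        if etype == EV_REL and code in (REL_X, REL_Y):
            motions.append(("x" if code == REL_X else "y", value))
    return motions


if __name__ == "__main__":
    # Synthetic data: move 5 right, 3 up, then a sync event (type 0).
    sample = (EVENT.pack(0, 0, EV_REL, REL_X, 5)
              + EVENT.pack(0, 0, EV_REL, REL_Y, -3)
              + EVENT.pack(0, 0, 0, 0, 0))
    print(parse_events(sample))
```

To use it on the real thing, open the event node created by fakemouse in binary mode, read chunks of EVENT.size bytes and pass them to parse_events.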

Benchmarking it all

After a long journey I’m coming back to this… So I did a set of benchmarks to evaluate VGA arbitration versus RAC usage. My goal is to evaluate the performance difference in a multi-head/multi-card environment, i.e., to compare an Xorg using RAC with one using the arbitration.

The experiments consisted of two applications running at the same time in each Xorg server, one on each screen. This is interesting because it stresses the semaphore task of the arbiter inside the kernel, creating race conditions between the screens. The experiments were performed ten times and the average result was taken.

In the first experiment a common operation to fill solid rectangles (x11perf -rect500) was started on each screen simultaneously. The X server using RAC obtained 3395 rectangles per second on screen one and 3400 on screen two. OTOH, the VGA arbiter obtained 3385 and 3400 rectangles per second, respectively.

The second experiment showed a “close to real” usage of the VGA interface arbitration with the Kobo Deluxe game :) The X server using RAC shows an average of 162.86 FPS on screen one, against 163.91 FPS using the arbiter. On screen two, RAC shows 172.27 FPS and the VGA arbiter 172.96 FPS.

These two experiments lead to the conclusion that the performance overhead of the arbiter is comparable to that of RAC. Cool!
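To put numbers on “comparable”, here is a small Python sketch that computes the relative difference of the arbiter against RAC for the four averages quoted above. The figures are the ones from this post; the function and dictionary names are just for illustration. Every difference comes out below one percent, in either direction.

```python
# (RAC, arbiter) averages taken from the two experiments above.
RESULTS = {
    "x11perf -rect500, screen one (rect/s)": (3395.0, 3385.0),
    "x11perf -rect500, screen two (rect/s)": (3400.0, 3400.0),
    "Kobo Deluxe, screen one (FPS)": (162.86, 163.91),
    "Kobo Deluxe, screen two (FPS)": (172.27, 172.96),
}


def relative_change_percent(rac, arbiter):
    """Difference of the arbiter result relative to RAC, in percent."""
    return (arbiter - rac) / rac * 100.0


def summarize(results):
    """Map each benchmark name to the arbiter-vs-RAC change in percent."""
    return {name: relative_change_percent(rac, arb)
            for name, (rac, arb) in results.items()}


if __name__ == "__main__":
    for name, pct in summarize(RESULTS).items():
        print(f"{name}: {pct:+.2f}%")
```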

One thing we must keep in mind is that the arbitration also adds the ability to have several clients of the arbiter at the same time, for instance to deploy a multiseat by starting several instances of Xorg (which is my big goal).


21:53 < airlied> vignatti: you should also mention that the arbiter lets GPUs
completely opt out of VGA life if they can disable their VGA
decoding resources..
21:53 < airlied> vignatti: which means you end up with no arbitration for those
cards so no overhead.

Thanks for the reminder, airlied :)

I’ll write another post on how to give it all a try. For now I’m spending all my hacking time trying to solve other, not so related, things, such as why the Xorg using the pciaccess rework doesn’t work with multiple cards. So sad :(