Xorg input thread – summary or something #3

Not much, but here is the news:

  • For the final evaluation period of the Summer of Code, Daniel suggested that I start my own X server tree. So I’m maintaining this one with the latest bits of the X server input thread implementation, always trying to keep everything up to date with the upstream tree. Everyone is very welcome to test it and report the few (I expect) bugs to me.
  • The latest news on the thread implementation is that we unfortunately have to deal with critical sections inside the X event queue (mi/mieq.c), so a mutex is needed there. (Just to note: currently, mutexes are only implemented using pthreads (X11/Xthreads.h), so we also need to implement something that is pthread-independent.) There are probably other pieces of code that could lead to race conditions in unpredictable ways, though I haven’t noticed anything strange yet!
  • I finally figured out how to implement the page fault notifier without any patch to the kernel. read_cr2() can be called directly, since the page fault notifier runs with interrupts disabled. The implementation can be found here.
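
To make the event-queue point above concrete, here is a minimal sketch of a queue whose head and tail are guarded by a pthread mutex. It is not the code in mi/mieq.c: the names `event_queue`, `input_event`, `queue_push` and `queue_pop`, and the capacity `QUEUE_SIZE`, are all hypothetical and only illustrate where the critical sections would sit.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define QUEUE_SIZE 256  /* hypothetical capacity, not the real mieq size */

/* Hypothetical event record; the real server queues X events. */
typedef struct {
    int dx, dy;
} input_event;

typedef struct {
    input_event events[QUEUE_SIZE];
    size_t head;            /* next slot to read  */
    size_t tail;            /* next slot to write */
    pthread_mutex_t lock;   /* guards head, tail and events */
} event_queue;

void queue_init(event_queue *q)
{
    assert(q != NULL);
    q->head = q->tail = 0;
    pthread_mutex_init(&q->lock, NULL);
}

/* Called from the input thread. Returns 0 on success, -1 if the queue is full. */
int queue_push(event_queue *q, input_event ev)
{
    int ret = -1;
    pthread_mutex_lock(&q->lock);
    size_t next = (q->tail + 1) % QUEUE_SIZE;
    if (next != q->head) {
        q->events[q->tail] = ev;
        q->tail = next;
        ret = 0;
    }
    pthread_mutex_unlock(&q->lock);
    return ret;
}

/* Called from the main thread. Returns 0 on success, -1 if the queue is empty. */
int queue_pop(event_queue *q, input_event *out)
{
    int ret = -1;
    assert(out != NULL);
    pthread_mutex_lock(&q->lock);
    if (q->head != q->tail) {
        *out = q->events[q->head];
        q->head = (q->head + 1) % QUEUE_SIZE;
        ret = 0;
    }
    pthread_mutex_unlock(&q->lock);
    return ret;
}
```

Both the producer and the consumer take the same lock, so neither can observe a half-updated head or tail. A pthread-independent version would need a different primitive in place of `pthread_mutex_t`.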

mlock()’ing adventure

It’s been a great adventure to lock the X input thread in memory. I’m touching a lot of things that I’d never imagined before :)

To trace the pages that fault when I move the pointer device, I’m using my own ultra mega super kernel page fault notifier. It’s very simple but, as things are not always perfect, it needs a little patch in the kernel.

The page fault notifier does (almost) everything I need to trace exactly which piece of code inside X is causing the page faults. So, as I said here, I compare the notifier’s output with the symbol table and disassembly of the X binary. Believe me, it works!

So I spent a lot of time finding which address faulted, searching for it in the X code and locking it in memory. When the faulting address is a global variable, I move it to a separate ELF section called ‘input_data’. When the fault occurs in the text, I move the code to an ELF section called ‘input_code’. Then I lock these two sections in physical memory using mlock(). Unfortunately, the cursor still lags when the system is swapping to death (I used a simple memory hog to get into this state). I’ll show you why.
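
The section trick described above can be sketched roughly as follows. This assumes GCC and GNU ld: the `section` attribute places an object in a named section, and the linker defines `__start_<name>` and `__stop_<name>` symbols for any section whose name is a valid C identifier. The variable `cursor_x`, the function `move_cursor` and the helper names are hypothetical, and the linker symbols stand in for whatever mechanism the real patch uses to find the section bounds.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Place marked objects in the two sections named in the text. */
#define INPUT_DATA __attribute__((section("input_data")))
#define INPUT_CODE __attribute__((section("input_code")))

/* Hypothetical examples of a global and a function on the input path. */
INPUT_DATA int cursor_x = 0;
INPUT_CODE void move_cursor(int dx) { cursor_x += dx; }

/* GNU ld defines these for sections whose names are valid C identifiers. */
extern char __start_input_data[], __stop_input_data[];
extern char __start_input_code[], __stop_input_code[];

/* Lock [start, stop) after rounding start down to a page boundary. */
static int lock_range(const char *start, const char *stop)
{
    assert(stop >= start);
    uintptr_t page = (uintptr_t)sysconf(_SC_PAGESIZE);
    uintptr_t begin = (uintptr_t)start & ~(page - 1);
    return mlock((const void *)begin, (uintptr_t)stop - begin);
}

/* Returns 0 if both sections were locked, -1 otherwise (see errno). */
int lock_input_sections(void)
{
    if (lock_range(__start_input_data, __stop_input_data) != 0)
        return -1;
    return lock_range(__start_input_code, __stop_input_code);
}
```

Note that this only covers objects explicitly tagged with the two macros. Anything the tagged code touches indirectly (heap allocations, stack, shared libraries) stays outside both sections and can still be paged out.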

As expected, the page fault notifier still reports faults coming from the X process, but the addresses at which these faults occur are not shown in objdump’s output, which is why I haven’t locked them (duhh). Let me explain what I’m doing to test it.

I start X with the brand new input thread, run the memory hog and wait a few seconds until it has consumed a lot of memory. Then I stop the hog and register the notifier. From now on, every page fault is logged to /var/log/messages. Then I move the mouse. Attention: this is the exact moment when a non-locked X process will look for its pages and, as the pages are not in memory, generate page faults. When the input code (and data) is locked it prints this, and none of the addresses you see there belong to what objdump shows. So what should I lock?! I don’t know… It shouldn’t print anything if all the code and data were locked correctly (indeed, when I run the test using mlockall() it doesn’t print anything). Also, the same test without locking anything shows this.
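
One way to investigate addresses that objdump does not list is to check which mapping of the process they fall into, since the heap, the stack and shared libraries all live outside the Xorg binary itself. The sketch below is not part of the original experiment; `find_mapping` is a hypothetical helper that parses the standard Linux `/proc/<pid>/maps` format.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Find the mapping listed in maps_path (e.g. "/proc/1234/maps") that
 * contains addr and copy its pathname, or "[anonymous]" if it has none,
 * into name. Returns 0 if a mapping was found, -1 otherwise.
 */
int find_mapping(const char *maps_path, unsigned long addr,
                 char *name, size_t name_len)
{
    char line[512];
    int found = -1;

    assert(maps_path != NULL && name != NULL && name_len > 0);

    FILE *f = fopen(maps_path, "r");
    if (f == NULL)
        return -1;

    while (fgets(line, sizeof line, f) != NULL) {
        unsigned long start, end;
        char file[256] = "[anonymous]";

        /* Line format: start-end perms offset dev inode [pathname] */
        if (sscanf(line, "%lx-%lx %*s %*s %*s %*s %255[^\n]",
                   &start, &end, file) < 2)
            continue;

        if (addr >= start && addr < end) {
            snprintf(name, name_len, "%s", file);
            found = 0;
            break;
        }
    }
    fclose(f);
    return found;
}
```

Passing the maps file of the running X server together with an address from the notifier log would tell whether the fault came from a library, `[heap]`, `[stack]` or an anonymous mapping.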

So I’m not seeing any difference in the cursor’s movement with or without mlock’ing (but I do when I use mlockall(), and also when I use the input thread. Don’t get confused!)

Comments?

Page fault notifier

This week I tried to lock Xorg’s input code in physical memory using mlock(). To do this I traced the code minutely and locked all the text and data related to input. I had no success. The mouse still lags when the system is paging (you might remember that with mlockall() everything worked wonderfully *except* that it eats a lot of memory). So what might be happening is that something is not locked yet. To check this I searched for a user-space tool that traces page faults. I only found the ‘truss’ command on Solaris. Linux (my OS) doesn’t provide one (‘strace’ doesn’t do this).

So I surrendered to kernel-space tools and put some ‘print’ statements in the kernel code (before that I tried a little systemtap and kprobes without success). Then I wrote a kernel module [0] using the notifier scheme which already exists inside the kernel. The problem is that the page fault notifier doesn’t show the address at which the fault happened. So I made a patch to add this functionality [1].

‘objdump -t -d Xorg’ shows all the symbols and addresses I want. Now I must compare the module’s output with the dump and be happy :)
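
The comparison step could be automated along these lines: load the symbol table into an array sorted by address and binary-search it for each address the module prints. This is only a sketch; the `symbol` struct and `find_symbol` are hypothetical, and it assumes the symbols do not overlap.

```c
#include <assert.h>
#include <stddef.h>

/* One entry taken from the symbol table dump. */
typedef struct {
    unsigned long addr;   /* start address of the symbol */
    unsigned long size;   /* size of the symbol in bytes */
    const char *name;
} symbol;

/*
 * Binary search in a table sorted by addr. Returns the symbol whose
 * range [addr, addr + size) contains fault_addr, or NULL if none does.
 */
const symbol *find_symbol(const symbol *table, size_t count,
                          unsigned long fault_addr)
{
    size_t lo = 0, hi = count;

    assert(table != NULL || count == 0);

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (fault_addr < table[mid].addr)
            hi = mid;
        else if (fault_addr >= table[mid].addr + table[mid].size)
            lo = mid + 1;
        else
            return &table[mid];
    }
    return NULL;
}
```

A NULL result corresponds to the situation where a faulting address does not belong to anything in the dump.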

[0] http://web.inf.ufpr.br/vignatti/code/page_fault_notifier.c

[1] http://lkml.org/lkml/2007/7/27/8. Consider that this week was the first time I hacked on kernel code. So if something sounds weird…

Xorg input thread – summary or something #2

 

Last week I did some cool experiments to see the effects of the D state on the X server process when I start it with and without the input thread, always mlock’ing it.

First I set up GRUB to boot my machine with only 170 MB of physical memory. Then I put a ‘mlockall(MCL_CURRENT)’ call just before the call to Dispatch() in main.c and started the server. I ran a memory hog to eat all my physical memory and played with the mouse, which never lagged, with or without the separate thread. The big news came when I started an X client (gnome-session), which puts the X process into the D state. The X server without the separate thread lags the cursor because it’s using SIGIO to wake up the device. OTOH, X with the input thread keeps the movement smooth anyway.

Got it? In summary, I locked in memory all the functions used up to the point where Dispatch() is called. When the X server is in the D state, the cursor lags if clients connected to X are using blocks which aren’t locked. And if we’re using the input thread (and consequently not SIGIO), it doesn’t lag!

Here are the experiments. When I call mlockall(MCL_CURRENT) at that point before Dispatch(), X starts with:
– 7412 kB of resident memory, without the input thread.
– 7412 kB of resident memory, with the input thread, using the clone syscall to create the child process.
– 15 MB of resident memory, with the input thread, using pthread to create the child process (yeah, pthread really bloats it).

Of course, none of these three resident memory values ever decreases. OTOH, without mlock, with or without the thread, X starts with about 4080 kB of resident memory, which decreases to about 304 kB.
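
For anyone who wants to repeat these measurements, the sketch below shows the locking call and one way to read the resident size from inside a process on Linux. `lock_current_pages` and `status_kb` are hypothetical helper names; the fields `VmRSS` (resident set) and `VmLck` (locked memory) come from `/proc/self/status`.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* Lock every page currently mapped. Returns 0, or the errno value on failure. */
int lock_current_pages(void)
{
    return mlockall(MCL_CURRENT) == 0 ? 0 : errno;
}

/*
 * Read a "Key:   N kB" field such as "VmRSS" or "VmLck" from
 * /proc/self/status. Returns the value in kB, or -1 if it is not found.
 */
long status_kb(const char *key)
{
    char line[256];
    long value = -1;

    assert(key != NULL);
    size_t key_len = strlen(key);

    FILE *f = fopen("/proc/self/status", "r");
    if (f == NULL)
        return -1;

    while (fgets(line, sizeof line, f) != NULL) {
        if (strncmp(line, key, key_len) == 0 && line[key_len] == ':') {
            if (sscanf(line + key_len + 1, "%ld", &value) != 1)
                value = -1;
            break;
        }
    }
    fclose(f);
    return value;
}
```

Calling `status_kb("VmRSS")` before and after `lock_current_pages()` shows how much the resident set grows once everything is pinned. For an unprivileged process the lock can fail when it would exceed the RLIMIT_MEMLOCK limit.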

Now I’m trying to figure out exactly how to put all the data and functions inside a section of the ELF file. For this I’m using inline asm to get the start and the end of the section which is responsible for the mouse, and then locking it with mlock().

It’s very hard to ‘automatically’ examine all the data and text [1] that deal with mouse movement using -finstrument-functions, as Keith suggested (just to give an idea, we have about 240000 function calls before arriving at Dispatch()!). What remains is to try to examine the code ‘statically’, which IMHO is hard. Hard because even if we minutely trace the code, we’ll forget some global data and simple functions (like xfree, for instance). Well, my attempt to do this statically failed. So yesterday I spent some time trying to figure out a better way to deal with this issue.
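
For reference, this is roughly what the -finstrument-functions approach looks like. When a file is compiled with that GCC flag, the compiler inserts calls to `__cyg_profile_func_enter` and `__cyg_profile_func_exit` around every function. The table size `MAX_FUNCS` and the two accessor functions are hypothetical, and the sketch is not thread-safe.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_FUNCS 4096   /* hypothetical table size */

static void *seen[MAX_FUNCS];        /* distinct function addresses seen */
static size_t seen_count = 0;
static unsigned long total_calls = 0;

/* GCC calls this on entry to every function built with -finstrument-functions. */
__attribute__((no_instrument_function))
void __cyg_profile_func_enter(void *this_fn, void *call_site)
{
    (void)call_site;
    total_calls++;
    for (size_t i = 0; i < seen_count; i++)
        if (seen[i] == this_fn)
            return;                  /* already recorded */
    if (seen_count < MAX_FUNCS)
        seen[seen_count++] = this_fn;
}

/* Called on exit from every instrumented function; nothing to do here. */
__attribute__((no_instrument_function))
void __cyg_profile_func_exit(void *this_fn, void *call_site)
{
    (void)this_fn;
    (void)call_site;
}

__attribute__((no_instrument_function))
size_t distinct_functions(void)
{
    assert(seen_count <= MAX_FUNCS);
    return seen_count;
}

__attribute__((no_instrument_function))
unsigned long calls_recorded(void)
{
    return total_calls;
}
```

The hooks only see function entries, which is part of why this is not enough on its own: the global data those functions touch never shows up in the trace.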

I thought that a userspace tool that prints something when a page fault occurs would be good enough. Google tells me that ‘truss’ with the ‘-m fltpage’ argument does exactly what we need [2, 3]. But the problem is that there is no port of it for Linux, and strace has no option similar to truss’s ‘fltpage’. Then I dug a little more and found Ulrich Drepper’s pagein tool [4]. My simple tests here show that this tool does not print a page unless it is hit twice in memory (I have already mailed him to get more info about it).
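
Short of a real tracer, a process can at least count its own page faults from userspace with getrusage(). This does not give the faulting addresses, so it is not a replacement for truss, but it can confirm whether a code path faults at all. The helper names `current_faults` and `faults_from_touching` are hypothetical.

```c
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS under strict -std= modes */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

typedef struct {
    long minor;   /* faults served without disk I/O */
    long major;   /* faults that had to read from disk */
} fault_counts;

/* Snapshot of the fault counters of the calling process. */
fault_counts current_faults(void)
{
    struct rusage ru;
    fault_counts fc = {0, 0};
    if (getrusage(RUSAGE_SELF, &ru) == 0) {
        fc.minor = ru.ru_minflt;
        fc.major = ru.ru_majflt;
    }
    return fc;
}

/*
 * Map `pages` fresh anonymous pages, write one byte to each, and return
 * how many minor faults the counters reported meanwhile (-1 if mmap fails).
 */
long faults_from_touching(size_t pages)
{
    assert(pages > 0);
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    char *mem = mmap(NULL, pages * page, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        return -1;

    fault_counts before = current_faults();
    for (size_t i = 0; i < pages; i++)
        ((volatile char *)mem)[i * page] = 1;   /* first touch of each page */
    fault_counts after = current_faults();

    munmap(mem, pages * page);
    return after.minor - before.minor;
}
```

Wrapping the mouse-handling path between two `current_faults()` calls and comparing the `major` counters would show whether that path had to wait for the disk.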

So, do you guys understand where we are? I really want to avoid kernel traps to tell when a page fault occurs. Also, maybe someone here could point me to mailing lists or people who can give tips about this kind of problem. And please, post your comments.

Xorg input thread – summary or something

This mail that I sent to the xorg mailing list describes the current state of my project.

– cut here –

Hi guys.

As you might have noted here [1], my GSoC project is to create a separate mouse thread for the X server. Now I’m really stuck with it and I need some good ideas from you before going on to the next steps.

Today the cursor lags in two situations on Xorg:

1) a lot of rendering on the server (CPU usage)

This lags the cursor only if the rendering is done in sw. So, if we’re only worried about the hw cursor, then CPU is definitely not our problem. Should we care about the sw cursor for now? And the MPX case, which only does sw cursors?

Q: How to reproduce 1)? A: “x11perf -putimagexy500”

2) heavy memory loads

Under heavy memory usage we have two problems: the X server process going into uninterruptible sleep (‘D’ state) and some parts of the server getting paged out to disk (which leads to the first). These two problems happen when all the physical memory has been used up.

The good news: since my implementation does not use signals (SIGIO) in the input thread, the D state problem is the first one to go away. The bad news is that I didn’t notice any difference in cursor movement under heavy memory load :(

Also, contrary to what was expected, the input thread is being paged out to disk. I tried Jesse Barnes’s suggestion [2] to mlock the thread, with no real success. (With or without the input thread, when I mlock some mouse functions I get unbelievably smooth movement. But I know that this isn’t an elegant solution.)

Q: How to reproduce 2)? A: a malloc hog.
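
A malloc hog of the kind mentioned above can be as simple as the following sketch. The function name `hog` and the 1 MiB chunk size are assumptions, not the tool that was actually used. Each chunk is written to so that the kernel has to back it with real pages, and the chunks are chained together so they stay reachable and are deliberately never freed.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK (1024 * 1024)   /* allocate 1 MiB at a time */

static void *chain = NULL;    /* keeps every chunk reachable */

/*
 * Allocate and touch up to max_mib MiB of memory. Returns the number of
 * MiB actually obtained. The memory is intentionally leaked.
 */
size_t hog(size_t max_mib)
{
    size_t got = 0;

    assert(max_mib > 0);

    while (got < max_mib) {
        char *p = malloc(CHUNK);
        if (p == NULL)
            break;
        memset(p, 0xA5, CHUNK);   /* writing forces real pages to be allocated */
        *(void **)p = chain;      /* link the chunk into the chain */
        chain = p;
        got++;
    }
    return got;
}
```

Asking for more than the installed RAM pushes other processes, including the X server, out to swap. On Linux with overcommit enabled, malloc() rarely returns NULL, so the process is more likely to be stopped by the OOM killer than by a failed allocation.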

The small conclusion of 2): if the real goal of the input thread is to stop the cursor lag, then we must provide other ways to keep the cursor’s footprint in physical memory. (Should I consider Jesse’s suggestion to put all of this inside DRM? I really don’t know how difficult this would be. Jesse, please?)

Also, if we’re heading towards a solution for 2), will the real interest be in systems with little memory (embedded and so on)? Do people on these mobile systems have swap enabled all the time (the OLPC laptop doesn’t, right?)? This leads to another question: would it really be worthwhile to do the input thread with only tiny systems in mind?

So far, we don’t require any thread locking mechanism. (Yes, I have already tested it on an SMP machine.)

To end with a pessimistic quote from Jim Gettys [3]:
“And I don’t want all input events routed through a secondary input process, as that has bad effects on latency (we can’t guarantee that such a helper process gets scheduled at the right moment, and latency variability drives people nuts in interactive situations). So through such a module, the X server would call all the way down to the input device or socket (depending on input type), and not be subject to such variability.”

Well, you can see the latest patch here (it’s tiny! Go ahead and tell me what you think of it!):
http://web.inf.ufpr.br/vignatti/xorg/xorg-input-thread_03Jul.diff

I’d really appreciate any comments on this mail, please.

Thanks!

[1] http://lists.freedesktop.org/archives/xorg/2007-June/025610.html

[2] http://lists.freedesktop.org/archives/xorg/2007-June/025612.html

[3] http://lists.freedesktop.org/archives/xorg/2005-August/009626.html

Moving the mouse handling code into a separate thread

(In a puny attempt to report my SoC project progress to my mentor, I
decided to expand it and share my thoughts with you)

Today, we have two methods to register pointer devices on the Xorg
server: (1) under SIGIO and (2) putting their fds in the EnableDevices set.
There is also the silken mouse concept, which means updates are fired during
the SIGIO handler (in the case of the hw cursor).

We always try to prioritize silken, i.e. when the device emits a move
event, it is “painted” on the screen while the WaitFor loop
continues sleeping in select(). But the problem with SIGIO is that it
can block if the main thread is wedged doing kernel stuff (like
paging). It can’t interrupt a program in the D state.

So, the basic idea is to create a separate thread which takes care of the
mouse handling code without using SIGIO. I tried an approach and some
questions came up:

(1) Performance seems to be equal with (silken/hw cursor) or without the
input thread (tested with three video cards: ATI Rage XL, GeForce
FX 5500 and GeForce2 MX/MX 400). I tested with a gnome-session started
and ran ‘x11perf -putimagexy500’. The cursor never lagged in either
situation. At least no performance regressions :) Fine.

(2) But I think that (1) is not exactly the problem we’re trying
to solve. Daniel Stone once told me that a thread with a tiny footprint
that is kept in memory wouldn’t need to wait to be paged into
the active set all the time. Here is what Daniel wrote: “Currently it
works _almost_ like this, but SIGIO is in the same process, with a very
large memory footprint. So if any part of the X server is waiting to be
paged in to memory, then you’ll be completely blocked on disk I/O. This
is the problem we have today: under heavy disk and memory loads, we end
up blocked on I/O. OTOH, the input thread won’t get paged out, because
its active set will be extremely small”. But how do we keep this resident?
Is it inherent to the thread?

(3) Another thing that is bugging me is the lack of a
mechanism to do a real performance test. How can we know whether the thread
has improved overall performance or not? Maybe using the ‘time’ tool?
Maybe something with xtest? I don’t know.

(4) So far I haven’t faced any thread safety problems.
Yesterday, on IRC, Mercury and Clee told me to test the input
thread on SMP machines so that it really runs in parallel. I haven’t done it
yet. Any other tips here?
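
The basic idea described above, a separate thread that handles the mouse with a blocking read instead of SIGIO, can be sketched as follows. This is not the code in the patch: it uses pthread rather than the clone syscall, and the `motion` packet, the `input_thread` struct and the function names are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <unistd.h>

/* Hypothetical motion packet; real drivers read device-specific events. */
typedef struct {
    int dx, dy;
} motion;

typedef struct {
    int fd;                 /* device (or pipe) to read from */
    int x, y;               /* current cursor position */
    pthread_mutex_t lock;   /* guards x and y */
    pthread_t thread;
} input_thread;

static void *input_loop(void *arg)
{
    input_thread *it = arg;
    motion m;

    /* Blocking read: no SIGIO, so nothing here waits on the main thread. */
    while (read(it->fd, &m, sizeof m) == (ssize_t)sizeof m) {
        pthread_mutex_lock(&it->lock);
        it->x += m.dx;
        it->y += m.dy;
        pthread_mutex_unlock(&it->lock);
    }
    return NULL;   /* end of file, error or short read */
}

/* Start the thread reading from fd. Returns 0 on success, like pthread_create(). */
int input_thread_start(input_thread *it, int fd)
{
    assert(it != NULL);
    it->fd = fd;
    it->x = it->y = 0;
    pthread_mutex_init(&it->lock, NULL);
    return pthread_create(&it->thread, NULL, input_loop, it);
}

/* Wait for the thread to finish. Returns 0 on success. */
int input_thread_join(input_thread *it)
{
    assert(it != NULL);
    return pthread_join(it->thread, NULL);
}
```

Because the thread sleeps in read() rather than being woken by a signal delivered to the main thread, it can keep updating the position even while the main thread is stuck in the D state. The mutex is only needed where the main thread reads the same position.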

The input thread (using the clone syscall) is running on my Linux machine. The patch
applies to the latest git evdev and xserver. You can see it here:

http://web.inf.ufpr.br/vignatti/tmp/xorg-input-thread.diff

I’d really appreciate comments.