Page fault notifier

This week I tried to lock Xorg’s input code in physical memory using mlock(). To do this I traced the code minutely and locked all the text and data related to input. It didn’t work: the mouse still lags when the system is paging (you might remember that with mlockall() everything worked wonderfully, *except* that it eats a lot of memory). So probably something is still not locked. To confirm this I searched for a user-space tool that traces page faults. I only found the ‘truss’ command on Solaris. Linux (my OS) doesn’t provide one (‘strace’ doesn’t do this).

So I surrendered to kernel-space tools and put some ‘print’ calls in the kernel code (before that I tried a little systemtap and kprobes without success). Then I wrote a kernel module [0] using the notifier scheme that already exists inside the kernel. The problem is that the page fault notifier doesn’t report the address at which the fault happened. So I made a patch to add this functionality [1].

‘objdump -t -d Xorg’ shows all the symbols and addresses I want. Now I just have to compare the module’s output with the dump and be happy :)
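The comparison step can be sketched as a lookup from a faulting address to the symbol that contains it. This is only a minimal sketch: the struct, the function name `find_symbol` and the table entries are hypothetical, and a real table would be filled from the objdump output and sorted by address.

```c
#include <stddef.h>
#include <stdint.h>

/* One entry taken from a symbol dump: start address and name. */
struct symbol {
    uintptr_t addr;
    const char *name;
};

/* Return the symbol with the highest start address that is <= fault_addr,
 * or NULL if the address lies below the first symbol.
 * The table must be sorted by ascending address. */
static const struct symbol *
find_symbol(const struct symbol *table, size_t count, uintptr_t fault_addr)
{
    size_t lo = 0, hi = count;          /* search window is [lo, hi) */

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (table[mid].addr <= fault_addr)
            lo = mid + 1;               /* candidate is at mid or later */
        else
            hi = mid;                   /* candidate is before mid */
    }
    return lo == 0 ? NULL : &table[lo - 1];
}

/* Made-up entries for illustration only, not real Xorg symbols. */
static const struct symbol example_table[] = {
    { 0x1000, "func_a" },
    { 0x1400, "func_b" },
    { 0x2000, "func_c" },
};
```

The sketch ignores symbol sizes, so an address past the end of the last symbol is still attributed to it; the size column of ‘objdump -t’ could be used to reject such hits.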


About [1]: keep in mind that this week was the first time I hacked on kernel code. So if something looks weird…

Xorg input thread – summary or something #2


Last week I did some cool experiments to see how the D state affects the X server process when I start it with and without the input thread, always mlock’ing it.

First I set up grub to boot my machine with only 170 MB of physical memory. Then I put a ‘mlockall(MCL_CURRENT)’ just before the call to the Dispatch() function in main.c, and started the server. I ran a memory hog to eat all my physical memory and played with the mouse, which never lagged, with or without the separate thread. The big news came when I started an X client (gnome-session), which puts the X process into the D state. The X server without the separate thread lags the cursor because it’s using SIGIO to get woken up by the device. OTOH, X with the input thread keeps a smooth movement anyway.

Got it? In summary, I locked in memory everything up to the point where Dispatch is called, and when the X server is in the D state the cursor lags when some clients connected to X are using blocks which aren’t locked. And if we’re using the input thread (and consequently not SIGIO) it doesn’t lag!
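For reference, the locking call used in this experiment can be sketched as below. The wrapper name `lock_current_pages` is hypothetical; in the experiment the call sits just before Dispatch() in main.c.

```c
#include <stdio.h>
#include <sys/mman.h>

/* Lock every page that is mapped into the process right now.
 * Returns 0 on success, -1 on failure (errno is set by mlockall). */
static int lock_current_pages(void)
{
    if (mlockall(MCL_CURRENT) != 0) {
        /* Typical causes: RLIMIT_MEMLOCK too low, or missing privileges. */
        perror("mlockall(MCL_CURRENT)");
        return -1;
    }
    return 0;
}
```

MCL_CURRENT only covers mappings that exist at the time of the call. Memory mapped later, for example on behalf of a newly connected client, stays unlocked unless MCL_FUTURE is also passed, which fits the ‘blocks which aren’t locked’ observation above.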

Here are the numbers. When I call mlockall(MCL_CURRENT) at that point before Dispatch, X starts with:
– 7412 kB of resident memory, without the input thread.
– 7412 kB of resident memory, with the input thread, using the clone syscall to create the child process.
– 15 MB of resident memory, with the input thread, using pthread to create the child process (yeah, pthread really bloats it).

Of course none of these 3 values of resident memory ever decreases. OTOH, without mlock, and with or without the thread, X starts with about 4080 kB of resident memory, which decreases to about 304 kB.
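The post does not say how the resident memory was measured. One possible way to get the same kind of number from inside a Linux process is to read the VmRSS line of /proc/self/status, as in this small sketch (the function name `resident_kb` is hypothetical):

```c
#include <stdio.h>

/* Return the resident set size of the calling process in kB,
 * or -1 if /proc/self/status cannot be read or has no VmRSS line. */
static long resident_kb(void)
{
    FILE *f = fopen("/proc/self/status", "r");
    char line[256];
    long kb = -1;

    if (!f)
        return -1;
    while (fgets(line, sizeof line, f)) {
        /* The line looks like "VmRSS:      7412 kB". */
        if (sscanf(line, "VmRSS: %ld kB", &kb) == 1)
            break;
    }
    fclose(f);
    return kb;
}
```

Calling it right after the lock and again under memory pressure would show whether the value stays put, as it does in the mlockall runs above.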

Now I’m trying to figure out exactly how to put all the data and functions inside one section of the ELF file. For this I’m using inline asm to get the start and the end of the section which is responsible for the mouse, and then locking it with mlock().
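A rough sketch of the section idea follows. It is not the inline asm approach described above: it relies instead on the `__start_<name>` and `__stop_<name>` symbols that GNU ld provides for sections whose name is a valid C identifier. The section name `mouse`, the `mouse_state` array and both function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Provided by GNU ld for a section named "mouse" (assumed toolchain). */
extern char __start_mouse[];
extern char __stop_mouse[];

/* Hypothetical input state, placed in the "mouse" section. */
static int mouse_state[64] __attribute__((section("mouse"), used));

/* Round [start, end) outward to page boundaries, since locking
 * operates on whole pages. */
static void page_span(const void *start, const void *end, size_t page,
                      uintptr_t *base, size_t *len)
{
    uintptr_t mask = ~((uintptr_t)page - 1);
    uintptr_t lo = (uintptr_t)start & mask;
    uintptr_t hi = ((uintptr_t)end + page - 1) & mask;

    *base = lo;
    *len = (size_t)(hi - lo);
}

/* Lock the pages that hold the "mouse" section.
 * Returns 0 on success, -1 on failure (errno set by mlock). */
static int lock_mouse_section(void)
{
    uintptr_t base;
    size_t len;

    page_span(__start_mouse, __stop_mouse,
              (size_t)sysconf(_SC_PAGESIZE), &base, &len);
    return mlock((void *)base, len);
}
```

GCC does not accept functions and variables in the same named section, so text would need a second section with its own start and stop symbols. Anything the input path touches outside these sections is still unlocked.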

It’s very hard to ‘automatically’ examine all the data and text [1] that deal with mouse movement using -finstrument-functions, as Keith suggested (just to give an idea, before arriving at Dispatch() we have about 240000 function calls!). What remains is to examine the code ‘statically’, which IMHO is hard. Hard because even if we trace the code minutely, we’ll forget some global data and simple functions (like xfree, for instance). Well, my attempt to do this statically failed. So yesterday I spent some time trying to figure out a better way to deal with this issue.

I thought a userspace tool that prints something when a page fault occurs would be good enough. Google tells me that ‘truss’ with the ‘-m fltpage’ argument does exactly what we need [2, 3]. But the problem is that there is no port of it for Linux, and strace has no option similar to truss’s ‘fltpage’. Then I dug a little more and found Ulrich Drepper’s pagein tool [4]. My simple tests here show that this tool does not print a page that isn’t hit twice in memory (I already mailed him to get more info about it).
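As a partial user-space substitute, getrusage() reports how many page faults a process has taken, although it gives only counts and not the faulting addresses, so it cannot say which page was missing. A minimal sketch, with a hypothetical function name:

```c
#include <stdlib.h>
#include <sys/resource.h>
#include <unistd.h>

/* Touch `pages` freshly allocated pages and return how many page faults
 * (minor + major) the process took while doing so, or -1 on error. */
static long faults_while_touching(size_t pages)
{
    struct rusage before, after;
    long page = sysconf(_SC_PAGESIZE);
    volatile char *buf;
    size_t i;

    if (page <= 0)
        return -1;
    buf = malloc(pages * (size_t)page);
    if (!buf)
        return -1;
    if (getrusage(RUSAGE_SELF, &before) != 0) {
        free((void *)buf);
        return -1;
    }
    for (i = 0; i < pages; i++)
        buf[i * (size_t)page] = 1;      /* write one byte in each page */
    if (getrusage(RUSAGE_SELF, &after) != 0) {
        free((void *)buf);
        return -1;
    }
    free((void *)buf);
    return (after.ru_minflt - before.ru_minflt) +
           (after.ru_majflt - before.ru_majflt);
}
```

Sampling ru_majflt around the input path while the system is paging would at least show whether that path is faulting from disk at all.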

So, do you guys understand where we are? I really want to avoid kernel traps to tell when a page fault occurs. Also, maybe someone here could point me to mailing lists or people who could give tips about this kind of problem. And please, post your comments.

Xorg input thread – summary or something

This mail, which I sent to the xorg mailing list, describes the current state of my project.

– cut here –

Hi guys.

As you might have noted here [1], my GSoC project is to do a separate mouse thread for the X server. Now I’m really stuck with it and I need some good ideas from you before going on to the next steps.

Today the cursor lags in two situations on Xorg:

1) a lot of rendering on the server (CPU usage)

This lags the cursor only if the rendering is done in sw. So, if we’re only worried about the hw cursor then CPU is definitely not our problem. Should we take care of the sw cursor for now? And the MPX case, which only does sw cursor?

Q: How to reproduce 1)? A: “x11perf -putimagexy500”

2) heavy memory loads

Under heavy memory usage we’ve got two problems: the X server process going into uninterruptible sleep (‘D’ state) and some parts of the server getting paged out to disk (which leads to the first). These two problems happen when all the physical memory has been used up.

The good news: since my implementation approach doesn’t use signals (SIGIO) in the input thread, the D state problem is the first one to go away. The bad news is that I didn’t notice any performance difference in the cursor movement under heavy memory loads :(

Also, contrary to what was expected, the input thread is being paged out to disk. I tried Jesse Barnes’s suggestion [2] to mlock the thread with no real success (with or without the input thread, when I mlock some mouse functions I get an unbelievably smooth movement. But I know that this isn’t an elegant solution).

Q: How to reproduce 2)? A: a malloc hog.
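A malloc hog of this kind can be sketched as follows. It is an illustration, not the exact program used for these tests; the names `hog_memory`, `hog_release` and `hog_head` are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

static void *hog_head;  /* most recently allocated chunk, NULL if none */

/* Allocate and write to up to `total` bytes in `chunk`-sized pieces.
 * Every chunk is kept allocated and linked to the previous one.
 * Returns the number of bytes actually obtained. */
static size_t hog_memory(size_t total, size_t chunk)
{
    size_t got = 0;

    if (chunk < sizeof(void *))
        return 0;                       /* need room for the link pointer */
    while (total - got >= chunk) {
        void *p = malloc(chunk);
        if (!p)
            break;
        memset(p, 0xA5, chunk);         /* writing forces real pages in */
        memcpy(p, &hog_head, sizeof hog_head);  /* link to previous chunk */
        hog_head = p;
        got += chunk;
    }
    return got;
}

/* Give everything back. */
static void hog_release(void)
{
    while (hog_head) {
        void *prev;

        memcpy(&prev, hog_head, sizeof prev);
        free(hog_head);
        hog_head = prev;
    }
}
```

Asking for more than the physical memory of the machine, and then sleeping before releasing, pushes other processes out to swap.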

The small conclusion of 2): if the real goal of the input thread is to stop the cursor lag, then we must provide other ways to keep the cursor’s footprint in physical memory. (Should I consider Jesse’s suggestion to put this all inside DRM? I really don’t know how difficult this can be. Jesse, please?)

Also, if we’re going after a solution to 2), will the real interest be systems with little memory (embedded and so on)? Do people on these mobile systems keep swap active all the time (the OLPC laptop doesn’t, right?)? This leads to another question: would it really be advantageous to do the input thread with only tiny systems in mind?

So far, we don’t require any thread locking mechanism. (Yes, I already tested it on an SMP machine.)

To end with a pessimistic quote from Jim Gettys [3]:
“And I don’t want all input events routed through a secondary input process, as that has bad effects on latency (we can’t guarantee that such a helper process gets scheduled at the right moment, and latency variability drives people nuts in interactive situations). So through such a module, the X server would call all the way down to the input device or socket (depending on input type), and not be subject to such variability.”

Well, you can see the latest patch here (it’s tiny! Go ahead and tell me something about it!):

I’d really appreciate any comments on this mail, please.