Scrutinizing X memory, part 2: what’s taking all that memory?

So here are some statistics from a running Xorg process. All the information was fetched from /proc/`pidof Xorg`/{smaps, status}. I also used a script found on the Web to parse and organize this information; Mikhail Gusarov extended this script to produce a very useful output.
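The script itself isn't reproduced here, but the idea is simple enough to sketch. The following Python fragment (the function name and grouping are my own, not the actual script's) sums the Rss of every mapping in an smaps dump, keyed by the object backing it:

```python
from collections import defaultdict

def rss_per_object(smaps_text):
    """Sum the Rss: field of each mapping, keyed by its backing object."""
    totals = defaultdict(int)
    current = "[anon]"
    for line in smaps_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if not fields[0].endswith(":"):
            # Mapping header line: addr perms offset dev inode [pathname]
            current = fields[5] if len(fields) > 5 else "[anon]"
        elif fields[0] == "Rss:":
            totals[current] += int(fields[1])  # value is in kB
    return dict(totals)
```

Feeding it the contents of /proc/`pidof Xorg`/smaps yields a per-object breakdown like the charts below.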

Xorg per se

Running just a standalone `Xorg -retro`. On my system it shows:
VmRSS: 5440 kB
VmSize: 13620 kB

from those 5440 kB of RSS:
3404 kB (63 %) come from code
1628 kB (30 %) come from malloc/mmap in anonymous memory (heap)
228 kB (4 %) come from other data mapped in memory
180 kB (3 %) come from rodata

from those same 5440 kB of RSS:
1628 kB (30 %) come from malloc/mmap in anonymous memory (heap) somewhere*
1200 kB (22 %) come from Xorg
628 kB (12 %) come from libc
316 kB (6 %) come from libcrypto
164 kB (3 %) come from libint10
136 kB (2.5%) come from libXfont
128 kB come from libxaa
120 kB come from libpixman
116 kB come from nv_drv
112 kB come from ld
102 kB come from libglx
100 kB come from swrast_dri
88 kB come from libfb
60 kB come from libpthread
48 kB come from evdev
xxx kB come from other libraries**

* just looking into /proc/, there’s no way to determine whether the allocations came from the binary itself or from some DSO. I’ll definitely analyse this carefully in the near future using another approach.

** these numbers are missing the input hotplug layer, which most systems use today. In other data I collected, I’ve seen dbus + hal taking 268 kB against an amazing 64 kB for libudev.

These measurements are not perfect; they are a snapshot of memory right after the server has started. The footprint brought into memory at Xorg’s initialization time will differ a lot from the regular usage over the rest of Xorg’s life, which deals with clients and users interacting. For instance, libint10 maps 164 kB that will likely never be swapped back into memory again. Likewise, the heap portion will grow when clients start to allocate pixmaps on the server.

Even so, we can see some nice facts. From the first chart, we see that almost 2/3 of RSS is used by instructions. Is this normal behaviour for a graphics server? I don’t know. In the other chart, we see a huge footprint from libcrypto. Not counting shared mappings (e.g. used by openssl), that library uses 88 kB of RSS for private mappings alone – sigh. We could probably replace it with another SHA1 implementation (in fact, we already have others inside the server) or use our built-in one. We also have libpthread, used in GLX, which is built even on systems that don’t use it (e.g. Maemo on the N900). libXfont also surprised me, taking a considerable amount of memory. We’re probably able to tweak it a bit, though.

the code being started

Another way to analyse Xorg is to gather information as code and modules are started. So I first set a breakpoint in the InitOutput() function. Until InitOutput() is called:
VmRSS: 1728 kB
VmSize: 8788 kB
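While the server sits at the breakpoint, these figures can be read from another terminal out of /proc/&lt;pid&gt;/status. A tiny helper for that (the function name is my own) might look like:

```python
def vm_stats(status_text):
    """Pull VmRSS and VmSize (in kB) out of /proc/<pid>/status content."""
    stats = {}
    for line in status_text.splitlines():
        if line.startswith(("VmRSS:", "VmSize:")):
            key, value = line.split(":", 1)
            stats[key] = int(value.split()[0])  # drop the trailing "kB"
    return stats
```

E.g. `vm_stats(open("/proc/%s/status" % pid).read())` while Xorg is stopped.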

from 1728 kB in RSS:
1336 kB (77.3 %) come from code
132 kB (7.6 %) come from malloc/mmap in anonymous memory (heap)
144 kB (8.3 %) come from other data mapped in memory
116 kB (6.7 %) come from rodata

from 1728 kB in RSS:
436 kB (25.2 %) come from libc
328 kB (19 %) come from Xorg
316 kB (18.3 %) come from libcrypto

A breakpoint in InitOutput() catches the very first steps of Xorg initialization: command-line processing, the OS layer being started and other basic routines. At this point, naturally, not much code inside Xorg has executed yet, nor have any drivers been loaded. Therefore, almost half of the process’s memory usage (44 %) comes from the startup of basic libraries such as libc, libcrypto, etc.

The next chart, from a breakpoint at InitInput(), shows the moment when the output side is mostly done, i.e., the internal loader initialized, the configuration parsed and the output drivers already loaded. Until InitInput() is called:
VmRSS: 4436 kB
VmSize: 13724 kB

from 4436 kB in RSS:
3352 kB (75.6 %) come from code
676 kB (15.2 %) come from malloc/mmap in anonymous memory (heap)
228 kB (5.1 %) come from other data mapped in memory
180 kB (4 %) come from rodata

We see that the server’s RSS has jumped 2708 kB since the previous chart. In other words, 2708 kB, or 50 %, is used just for output initialization, and another 1004 kB (18.4 %) will be used for input initialization routines.

Well, I’m already happy with these preliminary statistics. I guess we already have work to do just from looking at them. Next, I plan to investigate X’s heap usage a bit further, and how efficiently X clients are using pixmaps.

As always, I appreciate any corrections, suggestions and improvements.

* this text was kindly reviewed by Mikhail Gusarov.

Scrutinizing X Memory, part 1: overview

This series of documents explores how memory is used by the Xorg server. It aims to eventually shrink the memory footprint of the server and its related components, like X clients, loaded modules and drivers. Embedded devices with constrained resources are the main focus here. All texts are mostly based on the x86 and ARM architectures, under Linux 2.6.33 with Xorg from upstream.


One way to analyse aspects of the memory usage of a given program is to scrutinize its object data. Object data contains executable code and static data. Both are of little interest from the process memory management point of view, given that their layout is determined by the compiler and does not change during process execution. However, we can deduce some nice information about the object. For instance, from the Xorg object we can get statistics about the code, identify its structure and point out architectural mistakes just by looking into it.

Besides the object itself, it is also important to see it executing and how dynamic allocations are performed on the stack and heap. So an analysis of the running object file is valuable as well.

X object file

Consider the following sections of Xorg:

.text: contains the instructions executed by the CPU and all constant data – literals. While the program is being executed, pages are loaded into physical memory carrying instructions and literals.

The amount of X code is huge, which translates into a huge .text segment. In my environment, .text is 1833738 bytes (1.74 MB) with the compiler’s third level of optimization (-O3). In a very gross view, removing code means fewer instructions to execute, consequently less text and a smaller memory footprint. For instance, a single inclusion of fprintf will cost ~40 bytes of text in your object. Of course it’s not straightforward to cut code all over the server, but for a given device/environment we can customize it, as already discussed.

Besides code elimination, building with the compiler’s size optimization (-Os) helps a lot as well: 260 kB of RSS saved here, optimizing only the X server. So we might consider this and also apply the same idea to DSOs. For instance, the pixman library mapped into the server shrinks 30 % when compiled with size optimization. Good job, compiler!

.data and .bss: static or global variables allocated at program startup.

If variables allocated at compile time are not initialized, then the BSS (Block Started by Symbol) grows; a bigger BSS also means more VM (Virtual Memory), but not necessarily more RSS. The VM size is quite meaningless when measuring real memory usage. So I wouldn’t bother analysing the BSS, given that the RSS occupied by X is what I really care about.

On the other hand, the .data section grows when permanent variables are initialized with some value. And when these variables are accessed, they directly increase physical memory usage. A good habit here is to declare variables const whenever possible, so that they go to the .text segment and the compiler might be able to perform optimizations.

X dynamic allocations (stack, heap and friends)

This is probably where there’s most room for optimization. The heap grows in response to the program’s needs: a program like “ls” will not make a lot of demands on the heap (one hopes), while the heap of a running Xorg can grow in a truly amazing way. It shouldn’t be hard to profile all allocations done inside the server. Valgrind’s massif, with a bunch of arguments, probably gives this to us.
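Massif writes its snapshots to a massif.out.&lt;pid&gt; file as plain key=value lines; a tiny sketch for pulling the peak heap size out of such a file (the function name is mine) could be:

```python
def peak_heap_bytes(massif_text):
    """Return the largest mem_heap_B value across all massif snapshots."""
    peak = 0
    for line in massif_text.splitlines():
        if line.startswith("mem_heap_B="):
            peak = max(peak, int(line.split("=", 1)[1]))
    return peak
```

After running e.g. `valgrind --tool=massif Xorg -retro`, feed it the contents of the resulting massif.out file.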

X clients can ask the server to allocate pixmaps in its own memory. This feature is one of the main reasons for the growing and shrinking of the server’s memory footprint. Because of that, it’s very common to see people confused, thinking there’s a leak in the server when it’s actually on the client side.

Besides heap allocations there’s also the stack, used to hold automatic variables and function data. I don’t think there’s much to track in stack memory, or many ways to save overall process memory there. But a good rule to remember is that allocation here is typically much faster than dynamic storage (heap or free store), because a stack allocation involves only a pointer increment rather than more complex management.

The ideas above are just an overview of where we can start working. I don’t believe there’s one unique, certain place where we can go and fix X memory usage. We should analyse the code and attack from all sides.

Next, I’ll analyse in depth each of the dynamic and static allocation mechanisms discussed in this document, starting with some statistics on where X sucks more… memory :-P I’d appreciate any kind of corrections/suggestions on these documents.

* this text was kindly reviewed by Ander Conselvan and Mikhail Gusarov.