Fixing a Kernel Memory Leak: Hunting Down Unmapped vDSO and VVAR Pages in the Linux x86 Architecture
Overview
When writing high-performance low-level systems code, every single byte matters.
Recently, I noticed a subtle memory leak during process teardown, flagged by tools like kmemleak. The culprit was a gap in how the Linux kernel unmaps the memory of the Virtual Dynamic Shared Object (vDSO) and its companion data page, VVAR, on x86.
My patch addressing this issue was recently accepted and merged into the upstream Linux kernel tracking tree (x86/urgent). In this post, I want to take you through what vDSO and VVAR are, the precise mechanics of the leak, and how the fix was implemented.
Understanding the Basics: What are vDSO and VVAR?
To appreciate the bug, we have to talk about system calls (syscalls). Normally, a transition from user space to kernel space goes through a trap instruction such as syscall. That transition carries a performance penalty: the CPU changes privilege level, register state is saved and restored, and on kernels with page-table isolation enabled the page tables are switched as well, which can cost TLB entries.
For frequent, read-only system calls like gettimeofday() or clock_gettime(), crossing into the kernel thousands of times per second adds up to significant overhead.
To solve this, Linux introduces two special constructs mapped directly into the address space of every user space process:
* vDSO (Virtual Dynamic Shared Object): A small shared library provided by the kernel that contains safe implementations of specific system calls. Instead of executing a trap instruction, the application calls a function inside the vDSO page directly in user space.
* VVAR (Virtual Variable Page): The vDSO code needs accurate data (like current clock cycles or monotonic time counters). This data resides in the kernel, but the kernel mirrors a read-only view of it directly into user space via the VVAR memory page. The vDSO functions read directly from the VVAR address space without requiring context elevation.
Because these pages look and behave like user space memory addresses to the binary, the kernel must map them when a process starts, and properly unmap and free tracking structures when the process exits or updates its mappings.
The Problem: The Unmapping Gap
The vDSO and VVAR pages are initialized in map_vdso().
This excerpt installs the vDSO text mapping, followed by the VVAR mappings:
    text_start = addr + __VDSO_PAGES * PAGE_SIZE;

    /*
     * MAYWRITE to allow gdb to COW and set breakpoints
     */
    vma = _install_special_mapping(mm,
                                   text_start,
                                   image->size,
                                   VM_READ|VM_EXEC|
                                   VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC|
                                   VM_SEALED_SYSMAP,
                                   &vdso_mapping);

    if (IS_ERR(vma)) {
        ret = PTR_ERR(vma);
        goto up_fail;
    }

    vma = vdso_install_vvar_mapping(mm, addr);
    if (IS_ERR(vma)) {
        ret = PTR_ERR(vma);
        do_munmap(mm, text_start, image->size, NULL);
        goto up_fail;
    }

    vma = _install_special_mapping(mm,
                                   VDSO_VCLOCK_PAGES_START(addr),
                                   VDSO_NR_VCLOCK_PAGES * PAGE_SIZE,
                                   VM_READ|VM_MAYREAD|VM_IO|VM_DONTDUMP|
                                   VM_PFNMAP|VM_SEALED_SYSMAP,
                                   &vvar_vclock_mapping);

    if (IS_ERR(vma)) {
        ret = PTR_ERR(vma);
        do_munmap(mm, text_start, image->size, NULL);
        do_munmap(mm, addr, image->size, NULL);
        goto up_fail;
    }
This portion of the code misbehaves when it unmaps the VVAR pages after an error. Look at this part:
...
    vma = _install_special_mapping(mm,
                                   VDSO_VCLOCK_PAGES_START(addr),
                                   VDSO_NR_VCLOCK_PAGES * PAGE_SIZE,
                                   VM_READ|VM_MAYREAD|VM_IO|VM_DONTDUMP|
                                   VM_PFNMAP|VM_SEALED_SYSMAP,
                                   &vvar_vclock_mapping);

    if (IS_ERR(vma)) {
        ret = PTR_ERR(vma);
        do_munmap(mm, text_start, image->size, NULL);
        do_munmap(mm, addr, image->size, NULL);
        goto up_fail;
    }
...
When installing this VMA fails, the code unmaps the pages that were already mapped:
...
        do_munmap(mm, text_start, image->size, NULL);
        do_munmap(mm, addr, image->size, NULL);
...
However, the mapping at addr (the vDSO VVAR mapping) is created by vdso_install_vvar_mapping(). Here is the code of that function:
struct vm_area_struct *vdso_install_vvar_mapping(struct mm_struct *mm, unsigned long addr)
{
    return _install_special_mapping(mm, addr, VDSO_NR_PAGES * PAGE_SIZE,
                                    VM_READ | VM_MAYREAD | VM_IO | VM_DONTDUMP |
                                    VM_MIXEDMAP | VM_SEALED_SYSMAP,
                                    &vdso_vvar_mapping);
}
You can see that the mapping is VDSO_NR_PAGES * PAGE_SIZE bytes long, but the error path unmaps it with a length of image->size, which is the size of the vDSO text image:
    do_munmap(mm, addr, image->size, NULL);
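So the unmap does not cover the range that was actually mapped. When image->size is smaller than VDSO_NR_PAGES * PAGE_SIZE, part of the VVAR mapping is left behind on this error path, and the kernel never frees the internal bookkeeping for it. That is the leak.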
The Solution: Fixing the Teardown Logic
The fix is a one-line change to the error path of map_vdso() in arch/x86/entry/vdso/vma.c. When installing the vvar_vclock mapping fails, the VVAR mapping at addr is now unmapped with the same length it was installed with, VDSO_NR_PAGES * PAGE_SIZE, instead of image->size.
Because the unmap now covers the whole range, the VMA for the VVAR pages is destroyed completely and the kernel releases the bookkeeping attached to it, so a failed map_vdso() call leaves nothing behind in the process address space.
You can view the full commit and line changes here: Linux Kernel Git Repository
The diff:
diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
index a6bfcc8243cd9..d903bce24f15d 100644
--- a/arch/x86/entry/vdso/vma.c
+++ b/arch/x86/entry/vdso/vma.c
@@ -178,7 +178,7 @@ static int map_vdso(const struct vdso_image *image, unsigned long addr)
     if (IS_ERR(vma)) {
         ret = PTR_ERR(vma);
         do_munmap(mm, text_start, image->size, NULL);
-        do_munmap(mm, addr, image->size, NULL);
+        do_munmap(mm, addr, VDSO_NR_PAGES * PAGE_SIZE, NULL);
         goto up_fail;
     }
Links
LKML thread
Linux Kernel Git Repository