All about virtual memory II – Operation

Following the series, in this post we will discuss how virtual memory works under the hood. We will take a general look at how virtual memory is implemented. This includes page table management, the address translation mechanism, and how page misses are handled. Finally, we will discuss how to speed up the translation process by explaining what a TLB is. Without further delay, let's start!

Introduction

Before we start, I would like to define some terminology.

  • Virtual Memory: The memory that a process “sees”. More precisely, all the memory addressable by a process.
  • Physical Memory: All the RAM installed in the system.
  • Linear Address Space: All addresses that can be formed on the system.
  • Linear Address: The address of any byte in the linear address space.
  • Virtual Address Space: The linear address space assigned to each program. For instance, on a 32-bit machine its range is [0, 2^32 - 1].
  • Virtual Address: An address used by a program.
  • Physical Address: An address that the processor issues on its bus. In the end, this type of address is used by the processor to interact with the other components.

Virtual Address Translation

As mentioned before, programs actually use virtual addresses, which the MMU translates into physical ones. For example, consider that our program contains the following instruction.


mov rcx, [rsp + 0x400] ; [rsp] = 0x0

  1. When the CPU fetches the instruction, it first translates the virtual address into a physical one.
  2. The CPU then reads RAM at that physical address and returns the requested data.

This is the main idea, explained as simply as possible for now. In the following sections this process will be refined a little, once page tables and TLBs are explained.

Page Tables

At this point we have a piece of hardware, the MMU, that translates a given virtual address into a physical one, but a table is needed to store the mapping between virtual and physical addresses. This table is called the page table, as we will see in a moment.

Example of page tables I

  VA      PA
  0x512   0x12
  0x786   0x3
  0x1024  0x2

The table has one row per mapping. The problem with this approach is that the total number of entries in the table for a 32-bit system rises to 2^30 (1 Gi entries), one entry per word. If we wanted to store an entry per byte instead of per word, the table would need 2^32 entries (4 Gi); on 64 bits it would be unmanageable. The advantage of this approach is fine-grained control.

A way to reduce the table size without drastically losing that fine-grained control is to group the entries into chunks called pages. Pages can have different sizes, but typically a page is a 4 KiB chunk. According to Intel's manual volume 3, pages can also be larger.

Example of page tables II

  VA            PA
  0x0 – 0xFFF   0x1000 – 0x1FFF

If we use 4 KiB pages with 64-bit (8-byte) words, each entry covers 512 words, so the total number of entries is now 2^30 / 2^9 = 2^21 (2 Mi entries). On the other hand, if we consider 64 KiB pages, each entry covers 8192 words, so the total number of entries is 2^30 / 2^13 = 2^17 (128 Ki entries). By grouping entries into pages, the page table size is reduced substantially. On the other hand, it is no longer possible to map a single word; the whole page is treated as an indivisible unit, which reduces flexibility. Note that to read a single word we now use an offset. For example, given the table above, to access word 4 we simply compute (0x0 => 0x1000) + 0x4 = 0x1004.
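
As a quick illustration, here is a minimal C sketch of the base-plus-offset computation, using the hypothetical mapping from the example table (virtual page 0x0 => physical page base 0x1000):

#include <stdio.h>
#include <stdint.h>

#define PAGE_SIZE 0x1000u /* 4 KiB pages, as in the example */

int main(void)
{
    /* Mapping taken from the table: virtual page 0x0 -> physical base 0x1000. */
    uint32_t page_base = 0x1000u;

    /* Accessing byte 4 of the page: keep the offset, swap the page base. */
    uint32_t va = 0x4u;
    uint32_t pa = page_base + (va & (PAGE_SIZE - 1u)); /* 0x1000 + 0x4 = 0x1004 */

    printf("VA 0x%x -> PA 0x%x\n", (unsigned)va, (unsigned)pa);
    return 0;
}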

On Linux x86_64 the default page size is 4 KiB. This size can be changed according to our system's needs; for example, in database environments it is typical to use huge pages (2 MiB). The following C code can be used to determine the word and page size on a Linux system.


#include <stdio.h>
#include <unistd.h>
#include <limits.h>

int main(void)
{
    /* sysconf() returns a long, hence %ld. */
    printf("Page size: %ld bytes\n", sysconf(_SC_PAGESIZE));
    /* _SC_LONG_BIT reports the width of a long in bits; divide by CHAR_BIT
       to express it in bytes like the page size. */
    printf("Word size: %ld bytes\n", sysconf(_SC_LONG_BIT) / CHAR_BIT);
    return 0;
}

Address Translation

At this point we have seen a table that aids in the mapping between virtual and physical addresses, but there are still some caveats to solve. The one we are going to discuss now is how many bits we need to store in the page table.

Imagine now we have the following system:

  • Size of virtual addresses: 32 bits.
  • Size of physical addresses: 28 bits.

With those sizes it is not possible to perform a direct translation, because a virtual address is wider than a physical one. To circumvent that, we can split a virtual address into two parts.

  • Offset: The N least significant bits. N is determined by the page size. For example, if our system has 4 KiB pages, the offset is composed of the 12 least significant bits (4 KiB = 2^12).
  • Frame Number: The remaining bits, i.e. the virtual address width minus the offset width. In our example, this field is 20 bits (32 - 12 = 20).

With this split, the address translation mechanism goes as follows (see the sketch after this list):

  1. The offset field remains unchanged. This means that if the virtual address has an offset of 0x400, the physical address has the same offset.
  2. The frame number field is mapped through the page table. Because 20 bits do not fit into the 16 bits of a physical frame number (28 - 12 = 16), a reduction takes place. Once done, both parts (the mapped frame number + the offset) are joined to obtain the physical address. If the page is not in RAM, there is no way to know which physical address belongs to the given virtual address; this causes a page fault.
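
As a rough sketch of this mechanism (with the sizes above: 20-bit page numbers, 16-bit frame numbers), assuming an invented PAGE_ON_DISK marker to signal a missing page:

#include <stdint.h>
#include <stdbool.h>

#define OFFSET_BITS 12                         /* 4 KiB pages                  */
#define OFFSET_MASK ((1u << OFFSET_BITS) - 1u)
#define PAGE_ON_DISK 0xFFFFu                   /* invented "not in RAM" marker */

/* Flat table: one 16-bit frame number per 20-bit virtual page number. */
static uint16_t page_table[1u << 20];

/* Returns true on success; false means the page is not in RAM, which in
   hardware would raise a page fault. */
bool translate(uint32_t va, uint32_t *pa)
{
    uint32_t vpn    = va >> OFFSET_BITS;  /* frame number field (20 bits) */
    uint32_t offset = va & OFFSET_MASK;   /* offset field (12 bits)       */

    uint16_t frame = page_table[vpn];
    if (frame == PAGE_ON_DISK)
        return false;

    /* The frame number replaces the page number; the offset is unchanged. */
    *pa = ((uint32_t)frame << OFFSET_BITS) | offset;
    return true;
}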

With this approach we have reduced the page table size significantly: instead of storing 2^30 entries (one per word), it is now necessary to store 2^20 (1 Mi ≪ 1 Gi). If there were only one page table, that would be perfect. The reality is that each process has its own page table, so on a machine with 100 running processes we would "only" need 100 MiB for the tables themselves!

Is it possible to reduce the size even more? The answer is yes: multi-level page tables come to the rescue.

Multi Level Page Tables

The idea is simple: instead of using one page table, we use two (in this example). In modern architectures like x86_64 it is possible to use four or even five levels.

The first level is a page table as described before, but its content is now a pointer to an entry in a second-level page table. The second level stores a pointer to the actual physical page where the data lives. Using the offset field, it is then possible to access the data. Let's take an example.

Suppose we have a system with a 4 KiB page size, 32-bit virtual addresses and 28-bit physical addresses, and the following 2-level page tables:

  L1 PT    L2 PT
  DISK     0x0233
  0x0003   0x0142
  0x0004   0x91d8
  0x0006   DISK
  0x00f6   0x23f6

To translate the virtual address 0x00402204 (a code sketch follows these steps):

  1. The offset field remains the same (0x204).
  2. The 20 remaining bits are split into two groups of 10 bits, one per level.
    1. The 10 most significant bits (0b0000000001) select the PTE of the first level (entry 1). That entry points to the second-level page table.
    2. The 10 least significant bits of the group (0b0000000010) select the PTE of the second level (entry 2), which contains the frame number 0x91d8.
  3. Finally, 0x00402204 => 0x91d8204
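
A minimal sketch of this two-level walk in C, under the same assumptions (4 KiB pages, two 10-bit indices); representing the first level as an array of pointers is an illustration choice, and entries that live on disk are ignored here:

#include <stdint.h>

#define OFFSET_BITS 12
#define INDEX_BITS  10
#define INDEX_MASK  ((1u << INDEX_BITS) - 1u)

/* l1[i] points to a second-level table; l2 entries hold frame numbers. */
uint32_t walk(const uint16_t *const *l1, uint32_t va)
{
    uint32_t offset = va & ((1u << OFFSET_BITS) - 1u);   /* 0x204           */
    uint32_t l1_idx = va >> (OFFSET_BITS + INDEX_BITS);  /* top 10 bits: 1  */
    uint32_t l2_idx = (va >> OFFSET_BITS) & INDEX_MASK;  /* next 10 bits: 2 */

    const uint16_t *l2 = l1[l1_idx];   /* first-level entry -> L2 table */
    uint16_t frame = l2[l2_idx];       /* second-level entry -> frame   */

    return ((uint32_t)frame << OFFSET_BITS) | offset;    /* 0x91d8204 for
                                                            va = 0x00402204 */
}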

In the previous example, assume that a program uses virtual addresses in the ranges [0x00000000 – 0x00010000] and [0xFFFF0000 – 0xFFFFFFFF]. What will the page tables' size be? The total is 4 KiB for the first-level page table plus 8 KiB for the second-level page tables.

These 8 KiB are obtained as follows:

  • In the first range, split the 20 most significant bits of each endpoint into two 10-bit groups (first-level index – second-level index). For 0x00000000 this gives (0b0000000000 – 0b0000000000) and for 0x00010000 it gives (0b0000000000 – 0b0000010000). The ten most significant bits, which select the entry in the first-level page table, are identical, so only one first-level entry is needed to cover the whole range.
  • In the second range, doing the same for 0xFFFF0000 gives (0b1111111111 – 0b1111110000) and for 0xFFFFFFFF gives (0b1111111111 – 0b1111111111). Again, the ten most significant bits select the same first-level entry.

This shows that only two entries are needed at the top level, so only two second-level tables must be allocated (2 × 4 KiB = 8 KiB).

As we said before, each process has its own page tables, so what is the minimum amount of memory that must be reserved in RAM for each process's page tables? The answer is 4 KiB + 4 KiB. The first 4 KiB corresponds to the first-level table, because without it no translation can be performed. The other 4 KiB corresponds to one second-level table, which is necessary because no translation can be completed with the first level alone. This achieves a reduction of roughly 99.8% in the space used (100 - 100 * 2^13 / 2^22).
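
As a quick sanity check of that figure, a tiny C program with the sizes from this example (a 4 MiB flat table versus the 4 KiB + 4 KiB minimum):

#include <stdio.h>

int main(void)
{
    /* Figures from the example: each table is 2^10 entries * 4 bytes = 4 KiB. */
    unsigned long flat    = 1ul << 22; /* single-level: 2^20 entries * 4 B = 4 MiB */
    unsigned long minimum = 1ul << 13; /* one L1 + one L2 table = 4 KiB + 4 KiB    */

    /* Prints: Reduction: 99.805% */
    printf("Reduction: %.3f%%\n", 100.0 - 100.0 * (double)minimum / (double)flat);
    return 0;
}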

Page Faults

As we said before, when a page is not present in RAM, the CPU generates a page fault exception. The mechanism of moving pages between RAM and disk is known as swapping. Its advantage is that the system won't crash if it runs out of memory; its drawback is that loading data from disk is extremely time consuming. When a page fault occurs, the exception is handled as follows (a C sketch appears after the list):

  1. The OS chooses a page to be swapped out (RAM -> disk).
    • If the content of the selected page has changed and its changes have not been committed yet (dirty bit = 1), the data must be saved to disk first.
    • If the content has not changed (dirty bit = 0), the page is swapped out directly.
  2. The OS reads the missing page from disk and loads it into RAM.
  3. The OS updates the page table with the new mapping.
  4. Finally, the OS returns control to the instruction that caused the page fault.
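
A schematic version of these steps in C; every helper below is a hypothetical placeholder for OS internals and does not correspond to any real kernel API:

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers standing in for OS internals; the names are invented
   for illustration only. */
extern uint32_t pick_victim_page(void);           /* page replacement policy */
extern bool     is_dirty(uint32_t frame);
extern void     write_to_disk(uint32_t frame);
extern void     read_from_disk(uint32_t vpn, uint32_t frame);
extern void     update_page_table(uint32_t vpn, uint32_t frame);

/* Schematic page-fault handler following the four steps above. */
void handle_page_fault(uint32_t faulting_vpn)
{
    /* 1. Choose a page to evict (RAM -> disk). */
    uint32_t frame = pick_victim_page();

    /* 1a. If it was modified (dirty bit = 1), save its content first. */
    if (is_dirty(frame))
        write_to_disk(frame);

    /* 2. Load the missing page from disk into the freed frame. */
    read_from_disk(faulting_vpn, frame);

    /* 3. Update the page table with the new mapping. */
    update_page_table(faulting_vpn, frame);

    /* 4. On return, the CPU re-executes the instruction that faulted. */
}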

Memory Protection

As mentioned before, each program has its own page tables, so in principle the mappings of different programs are independent and do not overlap. When two programs use the same library, it can be mapped to the same physical address in both, since there is no need to keep two copies of the same library in RAM. However, it is important to avoid mapping, for example, physical addresses belonging to sections (such as the stack) of process B into the page table of process A.

We have now learned how to translate between virtual and physical addresses and how to reduce the space required to store the page tables, but is it possible to perform the translation faster? In the following section we will discuss what a TLB is and the different cache types.

TLB

The TLB (Translation Lookaside Buffer) is a cache for the page table. It makes translation very fast because its access time is less than one clock cycle. Each entry in this cache translates addresses for an entire page. When using the TLB, one of two things happens:

  • If the TLB doesn't have the entry, it must be fetched from main memory, which results in an undesired slowdown.
  • As with regular caches, once the entry has been fetched, subsequent accesses are immediate.

Now, for this cache to be fast, it needs to be small. Just like regular caches, there can be a separation between instruction TLB (iTLB) and data TLB (dTLB). Currently, TLBs have around 64 entries, and depending on the level, they can be fully associative or n-way set associative.

Now, how is it possible for such a small cache (64 entries vs. 1M PTEs) to be so effective? This is due to the principle of locality: when translating an address in a specific page, it is highly likely that the next memory accesses will fall in the same page.

What can happen when accessing memory? (The sketch after the following list models these cases.)

  1. The page is in RAM.
    • The page table entry is in the TLB: In this case, the performance is excellent because the translation can be done in less than one cycle.
    • The page table entry is not in the TLB: In this case, the performance decreases significantly because it now takes between twenty and a thousand cycles to retrieve the page table entry from RAM and then access RAM.
  2. The page is not in RAM.
    • The page table entry is in the TLB: Although determining that the page table entry is not in RAM is very fast, the inconvenience here is bringing the page from disk to main memory.
    • The page table entry is not in the TLB: The process of determining that the page table entry is not in RAM is slower, and the inconvenience here still lies in bringing the page from disk to main memory.
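
As a toy model of these cases, here is a sketch in C of a fully associative TLB searched linearly (real hardware compares all tags in parallel); walk_page_table is a hypothetical stand-in for the slow path, and the replacement policy is reduced to overwriting slot 0 for brevity:

#include <stdint.h>
#include <stdbool.h>

#define TLB_ENTRIES 64 /* a typical first-level TLB size, as noted above */

struct tlb_entry {
    bool     valid;
    uint32_t tag;   /* virtual page number  */
    uint16_t frame; /* physical page number */
};

static struct tlb_entry tlb[TLB_ENTRIES];

/* Hypothetical page-table walk; stands in for the slow path (tens to
   hundreds of cycles, plus a possible page fault). */
extern uint16_t walk_page_table(uint32_t vpn);

uint32_t translate(uint32_t va)
{
    uint32_t vpn    = va >> 12;
    uint32_t offset = va & 0xFFFu;

    /* TLB hit: translation completes in less than a cycle in hardware. */
    for (int i = 0; i < TLB_ENTRIES; i++)
        if (tlb[i].valid && tlb[i].tag == vpn)
            return ((uint32_t)tlb[i].frame << 12) | offset;

    /* TLB miss: walk the page table in RAM, then fill the TLB. */
    uint16_t frame = walk_page_table(vpn);
    tlb[0] = (struct tlb_entry){ .valid = true, .tag = vpn, .frame = frame };
    return ((uint32_t)frame << 12) | offset;
}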

To improve TLB performance, the following approaches can be taken:

  • Increase page size: For example, if a page accommodates 4 KiB, with 64 entries, it covers 256 KiB. However, if the number of entries is reduced to 32 and the page size is increased to 2 MiB, it would cover 64 MiB. By using larger page sizes, the TLB can cover more memory with fewer entries, reducing TLB misses.
  • Add a higher-level TLB: Similar to cache memories, it is possible to have multiple levels of TLBs. For instance, processors like AMD Zen 2 include a first-level TLB with 64 fully associative entries and a second-level TLB with 512 set-associative entries with 8 ways for the iTLB. This hierarchical TLB structure helps improve TLB hit rates and translation performance.
  • Hardware page table walking on TLB misses: A hardware page-table walker handles TLB misses by going to main memory and filling the TLB directly, without involving the operating system. This reduces the overhead of TLB misses and improves overall translation performance.

Let’s see a translation example using the TLB:

We assume 32-bit virtual addresses and a 4 KiB page size. Therefore the offset is composed of the 12 least significant bits, and the page number consists of the 20 most significant bits. Physical addresses are 28 bits wide. Initially, the TLB is empty.

Initial TLB State

  Tag     Physical Page Number
  EMPTY   EMPTY
  EMPTY   EMPTY

The initial content of the page tables in RAM is the following:

Initial Page Tables State

  Entry   Content
  0       DISK
  1       0x0003
  2       0x0004
  3       0x0006
  4       0x0008
  5       0x0009
  6       0x00f6

Now let’s translate the virtual address 0x00003204 to its corresponding physical address.

  1. We check if there is an entry in the TLB with a tag matching the page number field of the virtual address.
    • If there is no entry, we access the page table, select the corresponding entry, and bring it into the TLB.
  2. The physical address will consist of the physical page number field from the TLB plus the offset field.

In this case the TLB is empty, so we access RAM and select entry 3 (page number 0x00003); its content (0x0006) is written into the first entry of the TLB.

After the translation, we can proceed to calculate the physical address:

  • TLB entry (1st entry): tag 0x00003, physical page number 0x0006
  • Offset: 0x204

Therefore, the physical address is 0x0006204.

Note: Once the translation is performed, the TLB will contain the entry for future use, improving performance for subsequent translations of the same page.

TLB

  Tag       Physical Page Number
  0x00003   0x0006
  EMPTY     EMPTY

Next, let’s translate the virtual address 0x00003208 to its corresponding physical address. When checking the TLB, there is an entry associated with this virtual address, so the translation is immediate (0x00003208 => 0x0006208).

The third address to translate is 0x00005120. We follow the same procedure as before until its entry is written into the TLB, which is updated as follows:

TLB

  Tag       Physical Page Number
  0x00003   0x0006
  0x00005   0x0009

Finally, the address will be translated (0x00005120 => 0x0009120).

Now, what happens if we want to translate the address 0x00004200? Since the TLB is full, we need to replace an entry. We follow a Least Recently Used (LRU) algorithm for replacement. In this example, we will replace the first entry since it was the one used least recently.

TLB

  Tag       Physical Page Number
  0x00004   0x0008
  0x00005   0x0009

Finally, the translation will be performed (0x00004200 => 0x0008200).

Lastly, what happens if we want to translate the address 0x00000860? First, a TLB entry must be updated (replacing 0x00005, now the least recently used). Then, since the page is on disk, a page fault is triggered, which brings the page from secondary storage into main memory. Once this exception has been handled, the page can be accessed.

TLB

  Tag       Physical Page Number
  0x00004   0x0008
  0x00000   DISK

Physical Caches

This type of cache is so named because it is accessed using a physical address. The process is as follows:

  1. The processor issues a virtual address (VA).
  2. This address goes through the TLB to be translated into the corresponding physical address (PA).
  3. Using this PA, the cache is checked. If there is a cache hit, the data is returned. If there is a cache miss, the data is fetched from main memory.

The problem with this type of cache is that it is necessary to look up the TLB every time there is a cache access. This additional step of TLB lookup can introduce overhead and potentially impact the overall performance of cache access.

Virtual Caches

This type of cache is so named because it is accessed using a virtual address. The process is as follows:

  1. The processor issues a virtual address (VA).
  2. With this VA, the cache is accessed. If there is a cache hit, the data is returned. If there is a cache miss, the VA is translated into its corresponding physical address (PA).
  3. Once the PA is obtained, main memory is accessed to fetch the data.

These caches are faster than physical caches, since the translation is only performed on a cache miss rather than on every access. The issue is that, being virtual, such a cache lacks the protection offered by the virtual memory mechanism: two programs cannot share a virtual cache because a VA from one program may collide with a VA from another.

VIPT (Virtual Indexed Physically Tagged) Caches

“Data in the cache is indexed by the virtual address, but tagged by the physical address.”

This describes a special type of cache that combines the advantages of both physical and virtual caches. The idea is to use the virtual address as an index to select an entry in the cache. Once the corresponding entry is selected, its physical tag is used to verify that the retrieved data is correct. This is possible because a translation is performed in parallel in the TLB; if the translated address matches the tag stored in the cache, the data can be returned.

  1. The offset field of a virtual address is used to index into the cache.
  2. The virtual page number field of the virtual address is used, in parallel, to index into the TLB.
  3. If the field in the cache corresponding to the physical tag is equal to the physical page number obtained from the TLB lookup, it results in a cache hit.

The only limitation is that only the bits from the offset field can be used to index into the cache, which limits the cache size. For example, if the offset field has 12 bits, only 4 KiB can be stored in the cache. If the cache is associative, the size can be increased with the number of ways.
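
To make the mechanism concrete, here is a toy direct-mapped VIPT lookup in C, sized so that the index and line offset fit inside the 12-bit page offset; tlb_lookup is a hypothetical stand-in for the TLB access that happens in parallel:

#include <stdint.h>
#include <stdbool.h>

/* Toy direct-mapped VIPT cache: 64 lines of 64 bytes = 4 KiB, so index and
   line offset (6 + 6 = 12 bits) never exceed the page offset. */
#define LINE_BITS  6
#define INDEX_BITS 6
#define NUM_LINES  (1u << INDEX_BITS)

struct cache_line {
    bool     valid;
    uint16_t ptag; /* physical page number stored as the tag */
    uint8_t  data[1u << LINE_BITS];
};

static struct cache_line cache[NUM_LINES];

/* Hypothetical TLB access, performed in parallel with the cache indexing. */
extern uint16_t tlb_lookup(uint32_t vpn);

bool vipt_hit(uint32_t va)
{
    /* 1. Index the cache with bits taken from the page offset (virtual index). */
    uint32_t index = (va >> LINE_BITS) & (NUM_LINES - 1u);

    /* 2. Translate the virtual page number in parallel. */
    uint16_t ppn = tlb_lookup(va >> 12);

    /* 3. Compare the physical tag; equality means a cache hit. */
    return cache[index].valid && cache[index].ptag == ppn;
}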

Summary

Throughout this post we have looked at how virtual memory works from a very general point of view, without focusing on any particular architecture. We have also covered some improvements over the basic scheme that make the whole process much more efficient.

In the following post we will focus on the x86_64 architecture by building a Linux kernel module that performs a page walk. See you soon!
