
Notes

*Most of these are works in progress



KP NOTES

== 02. Booting x86 ==
At power-on:
- CPU begins executing BIOS code
- Detect and initialize devices/memory
- Find bootable device
- Read boot sector into memory & start executing it
- Memory layout:
  - Device memory
  - BIOS ROM
  - Bootloader loaded into memory below device memory/ROM
At boot loader:
- CPU in 'Real' mode
- 20-bit memory addresses (up to 1 MB)
- Segmentation used, ie. '0x12:0x1234'
- CPU translates by computing segment * 16 + addr
Transition to 'Protected' mode:
- Need to set up the GDT
  - Defines memory areas, protections, tasks...
  - Memory segment descriptors define memory range + protections
Segment registers:
- Real mode: used for addressing
- Protected mode: used as GDT offset
  - GDT segment entry describes physical memory base & length
- Segmentation not used in most archs; they use paging instead
Protected mode:
- Bootloader reads the kernel into memory
- Jumps to the kernel's entry func
Entering kernel:
- Enables paging
  - Memory allocation and protection via virtual address spaces
- Initializes global kernel vars (BSS)
- Inits console
- Inits memory

== 03. Managing Physical Memory ==
UMA vs. NUMA:
- Uniform Memory Access (UMA):
  - All processors share the same latency to memory
  - Single memory for all cores
  - Doesn't scale well w/ #cores
- Non-Uniform Memory Access (NUMA):
  - Processors have their own local memory with fast access times
  - Processors can access other memories, but more slowly
  - Scales better than UMA
Physical memory in Linux:
- Logically divided into pages (page frames)
- Nodes:
  - NUMA abstraction of N banks
- Zones:
  - DMA: 0-16 MB, DMA
  - NORMAL: 16-896 MB, kernel
  - HIGHMEM: 896 MB-END, kmap
  - Tagged regions for each node
- Pages:
  - Physical page frame for each zone
  - 4KB units
  - page object structs in a memory_map array
Allocators in Linux:
- SLAB allocator:
  - Allows cache creation; caches store objects of the same size
  - Size can be smaller/bigger than a page
  - Copes with internal fragmentation, ie. an entire page used for 10B of data
- VMALLOC allocator:
  - Deals w/ non-physically-contiguous memory
- Page allocator:
  - Deals w/ contiguous areas of physical pages
- Memblock allocator
Memblock allocator:
- Early boot-time allocator
- Discarded after initializing the buddy allocator (Linux)
- Maintains two arrays: memory (all mem) and reserved (allocated mem)
- Allocates by finding regions in "memory && !reserved"
Buddy allocator:
- "Power of 2 allocator with free coalescing"
- Blocks arranged in 2^N (order) pages
- Allocations satisfied by exact N:
  - Split larger 2^(N+1) blocks into two 2^N blocks
  - The two resulting blocks are "buddies"
- Deallocations return the block to the allocator:
  - If the returned block's buddy is free, coalesce into a larger block
  - Keep coalescing until it's no longer possible
- Freelists per order (N) maintain free blocks of that size
Fragmentation:
- External: inability to service allocations due to too many small blocks
  - Addressed in buddy with free coalescing
- Internal: space wastage due to large blocks assigned to small allocations
  - Addressed w/ slab allocators
Memory errors:
- Out-of-bounds access: accesses outside the allocation
- Uninitialized reads: reads of uninitialized data might leak a previous allocation's data
- Use-after-free: access to data after free
- Memory leak...
Anti-memory errors:
- Page guarding:
  - Detects out-of-bounds access
  - Each allocation looks like: [Guard Page][Data Page][Guard Page]
  - Marks page as guard, allowing detection
- Page canaries:
  - Lazily detects out-of-bounds access
  - Each alloc looks like: [Canary Page][Data Page][Canary Page]
  - Canary pages contain specific data
  - Can lazily detect out-of-bounds writes by mismatch w/ canary data
- Page remapping:
  - ?
- Page poisoning:
  - Lazily detects write-after-free
  - When a page is freed, it is written with a known value
  - Write-after-free detected by mismatch
  - If the memory is reused, ie. written to normally, we can't detect it with page poisoning
Linux memory protection "sanitizers":
- kmemcheck (uninitialized reads):
  - Traps on every read to check for initialization
- kmemleak (memory leaks):
  - Periodic garbage collection
  - Reports objects not pointed to by any "likely" pointers
- kasan (out of bounds & use-after-free):
  - Combines the canary and poisoning ideas
  - Checks each access to see if the target is a live object
JOS boot memory allocator:
- Just maintains an index to the next free block
- Only used for a few long-lived structures, eg. the initial page dir
JOS page frame allocator:
- Simple freelist-based allocator
- Allocates a single page at a time
- Maintains free and allocated lists

== 05. Page Table Management ==
Address spaces:
- Provide isolation and protection
- Very flexible
- Come at the cost of address translation
x86_32 page table entry:
- Describes a page (4KB)
- Format: bits [31-12] phys. addr, [11-9] available, [8-0] flags
- Flags:
  - Global: prevents the TLB from flushing the line when cr3 is reset (eg. at context switch)
  - Dirty: page has been written to
  - Accessed: r/w executed on page
  - Cache Disabled: do not cache this page
  - Write Through: write-through caching enabled
  - User/Super: available to all, else only supervisor
  - Read/Write: R/W page, else read-only
  - Present: page is in physical memory (eg. not swapped)
- Page dir entry vs. page table entry:
  - Identical in form except:
    - PgDir: addr is a page table addr
    - PgDir: Global bit ignored
    - PgDir: no Dirty bit
    - PgTable: no Page-Size bit
      - In PgDir this indicates huge pages
x86_32 virtual -> physical translation:
- Read cr3: addr of the top-level page table (page directory)
- Use the high 10 bits of the VA as an offset into that table
- Assuming the entry exists, the addr stored in it is the addr of a page table
- The VA's next 10 bits (bits 21-12) are the offset into this table
- The entry contains the physical addr of the page; the low 12 bits of the VA are the offset inside the frame
x86_32 virtual memory misc:
- If the top-level (page dir) entry doesn't exist:
  - No virtual mapping exists, or
  - The entire 4MB it maps has been swapped
- Page allocation:
  - Allocate a new page, and set its phys. addr in the page table entry
- If the second-level (page table) entry doesn't exist:
  - No virtual mapping exists, or
  - The page (4KB) has been swapped
- Freeing a page:
  - Walk the page tables, free the physical page
  - Destroy (zero) the second-level page table entry
  - Could also destroy the top-level entry IF it maps no other pages
    - Linux doesn't bother with this
- This is all kinda slow:
  - x86_32 vaddr translation requires 3 memory accesses
  - Linux keeps special caches for page table pages
Translation Lookaside Buffer (TLB):
- Caches successful page table walks:
  - Lines are "vaddr x maps to phys addr y"
- Managed by the MMU
- Always looked at first:
  - On hit, great success
  - On miss, walk the page tables
- Often fully associative, small, fast
- When to flush?
  - When switching address spaces
  - When updating/deleting page-table entries
- Given more and more memory, the size of the TLB becomes a problem
Improvements on/to the TLB:
- Translation caches:
  - The TLB tag has all vaddr bits
  - A translation cache tag has part of the vaddr bits
  - Can then do a partial page walk
  - x86_32: TLB: 32-bit tag (index), TCache: 16-bit tag
- Caching page table pages:
  - Can store PTEs in large caches to speed up TLB misses
- Bigger pages:
  - The bigger the pages, the fewer memory accesses per page table walk
  - Ex. 4MB huge pages reduce the number of accesses by 1
  - Eliminates the bottom page table from having to be accessed
Huge pages:
- Large chunks of contiguous, aligned physical memory:
  - 4MB standard, can be bigger
- Performance benefit on TLB miss, as described above
- Drawbacks:
  - Potential for lots of wasted space
  - Complex to manage
Page tables and security:
- Prime targets for attackers include:
  - Kernel base addr (jump targets)
  - Physical memory map (Linux)
- Enter KASLR (Kernel Address Space Layout Randomization):
  - Randomizes different sections in the kernel addr space
  - Makes it harder to exploit:
    - Harder to leak pointers
    - Bruteforce leads to kernel crashes
  - Entropy:
    - Generally limited to simplify memory management
    - Linux kernel: 6 bits
    - Windows kernel: 13 bits
    - OS X kernel: 8 bits
  - Can only be randomized at boot (random addr to place the kernel at)

== 06. User-mode and Interrupts ==
NB. The operating system's task:
- "The safe and efficient multiplexing of system resources to competing applications"
- The CPU provides support for safety through privileged instructions
- The OS uses this support to manage user applications
x86 privilege separation:
- Uses "ring" terminology
- Ring 0 is the kernel, ring 3 is userland
- In general, only 0 and 3 are used
What's done in ring 0?
- GDT configuration (x86)
- Mediation of access to resources:
  - CPU via per-application quotas
  - Memory via manipulating page tables
  - Devices (control operation and expose interfaces)
x86 switching rings:
- Done via segmentation hardware
- Last 2 bits of the code segment indicate the ring
- Kernel/user use different segments
- Isolation enforced via address spaces
- IRET instruction switches to higher-numbered (less privileged) rings
- INT instruction switches to lower-numbered (more privileged) rings
Address spaces and user apps (processes):
- Different page tables create different address spaces
- Allows the OS to isolate processes from each other
- Most modern OSs have a 1-to-1 mapping of addr spaces to processes
Process management:
- Each process may hold a number of resources (files, memory, etc.)
- Privileges of processes may vary
Address Space Layout Randomization (ASLR):
- Knowing the location of sensitive information (code, data) makes an attacker's job easier
- Place it in random locations instead
- Entropy:
  - Linux mmap provides 28 bits
  - Varies greatly on Windows depending on features/data structures
Interrupts:
- An event that "interrupts" the execution flow of a program
- Can be external: keyboard taps, network packets
- Can be internal: divide by zero, page faults
- Can be sync/async
Interrupt types:
- Hardware generated:
  - A device wants the kernel to do something
  - Signals the CPU by setting a pin
  - Maskable or non-maskable:
    - Can I (the kernel) ignore it?
    - Thermal fault vs. NIC got a packet
- Software generated:
  - Software needs some service from the kernel
  - Eg. open a file, more memory
  - Done by executing the INT instruction
Interrupts: exceptions, traps, faults:
- CPU generated
- Result of unexpected behavior in a CPU system
- Eg. divide by zero, hardware errors, mem access w/o page tables
- Software faults mostly synchronous
- Hardware faults mostly asynchronous
Interrupt routine (what happens?):
- Safe transfer of execution to kernel mode
- Some user context saved (like eip)
- CPU elevates the privilege level and switches to the kernel stack
- Interrupt handling function called
- After handling, the CPU restores the saved context and drops the privilege level
Interrupt Descriptor Table (IDT):
- Interrupts may come from different sources and be of different types
- The CPU needs to know how to handle them
- At interrupt, the CPU looks up the interrupt in the IDT
- Max 256 entries
- First 32 entries reserved by Intel for exceptions (page fault etc.)
- Kernel code segment defined in the IDT, but the kernel stack is read from the TSS
Task State Segment (TSS):
- A GDT entry: holds the address of the interrupt stack
- In JOS, interrupt stack == top of the kernel stack
- The task register points to the GDT entry
- The LTR instruction loads the correct TSS
Returning from interrupts:
- The execution's context has been saved on the kernel stack
- IRET used to return execution to the pre-interrupt location:
  - Pops the right amount of context off the stack
  - Restores the stack (if needed)
  - Returns to the saved EIP (and drops privilege if needed)
Problems with interrupts:
- Livelock: execution may slow, or seem to freeze, in the presence of many interrupts (interrupt storm)
- Solutions:
  - Defer work
  - Enter polling mode under high interrupt load (ask for updates instead of being told)
Software interrupts (system calls):
- Kernel support for servicing user applications
- Generated by the INT instruction (0x80 in Linux)
- Sysenter/Sysexit, Syscall/Sysret
- The dispatch routine (syscall handler) is just an interrupt handler
- Kernel-user interaction per call dictated by calling convention:
  - Ex: INT in Linux reads arguments from %eax, %ecx, %edx, etc. in order
  - %eax contains the return value
Problems w/ INT:
- Modern processors are deeply pipelined
  - Cache misses can cause pipeline stalls
  - Pipeline stalls are expensive
  - The IDT entry may not be in the cache
  - Different permissions constrain instruction reordering
- Solution: Sysenter
  - Cache the IDT entry for a syscall in a special CPU reg
  - No more cache misses for the IDT
  - Intel added dedicated registers
  - Assumes that syscalls are frequent enough to be worth the extra regs
  - NB. Ring-3 ESP, EIP not automatically saved by sysenter
  - Sysenter pros:
    - Faster than INT
  - Sysenter cons:
    - Programmer needs to do more work
    - Not all CPUs support it
    - Passing EIP around can be inefficient
  - Linux impl:
    - Kernel supports both Sysenter and INT
    - Kernel figures out what's available on the CPU (cpuid)
    - Kernel sets up a virtual syscall page w/ the best syscall choice
    - Virtual Dynamic Shared Object (vDSO):
      - Originally mapped at a fixed user addr
      - Replaces int 0x80 w/ call <addr>
      - Supports fixed-point return (?)
      - Can support extra functionality (like gettimeofday)
- Syscall (AMD proposal):
  - Same idea as sysenter, but w/o a fixed return point
  - Adopted by everyone
  - vDSO then only really needed for extra funcs (like gettimeofday)

== 07. Paging ==
Page fault:
- Special exception
- Purpose:
  - MMU needs attention
  - Basis for demand paging
- Types:
  - Invalid: access to an invalid address
    - Ex. write to a read-only page
  - Major: access is valid, but the page is not in memory
    - Ex: the page has been swapped
  - Minor: access is valid and the page is in memory
    - Ex: first access to anonymous memory
- Kernel faults:
  - Invalid if kernel address
  - Major: kernel is paged out (eg. Windows)
  - Minor: lazy allocation (eg. vmalloc)
- User: everything goes
Page fault handling:
- The CR2 register contains the faulting addr
- Get info on the faulting process (get_current()->task)
- Get info on the task's address space ("->mm")
- Get VMA region info by looking up the faulting addr in the VMAs
- Get PTE info by walking the task's page tables
Demand paging:
- Only allocates a page frame on first access
- Allocation is now just a VMA
  - mmap permissions might cause frame allocation as well
- On PF, if the faulting addr is covered by a VMA:
  - Kernel assigns a page frame to the task
  - Kernel creates a PTE and sets it in the task's page tables
  - Switches back to task execution
  - The task is unaware that anything happened
VMA operations:
- mmap:
  - Maps a new region in the addr space
  - Flags include:
    - READ/WRITE/EXECUTE
    - POPULATE: immediately allocate frames
    - ANONYMOUS: generic, demand-paged (if used) region
  - Tries to find an accommodating gap in the VMAs, and creates a new one in that space if found
  - Faults new pages in if requested
- munmap:
  - Unmaps a region in the address space
  - Simply deletes the VMA in the best case:
    - Region contains no mapped page frames
  - Otherwise, needs to:
    - Remove mapped page frames
    - Remove any PTEs pointing to these pages
    - Flush any TLB entries pointing to encompassed pages
- mprotect:
  - Changes a region's protection bits
  - Simply updates a VMA's protection bits in the best case:
    - Region specified equals a single VMA
    - Region contains no mapped pages
  - Otherwise, needs to:
    - Split/merge VMAs, possibly making a new one to accommodate
    - Update affected PTEs
- madvise:
  - Gives the kernel advice on a region:
    - MADV_DONTNEED: not likely to be used again soon; the kernel can free resources associated with the region
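The demand-paging flow described above — record a VMA at mmap time, allocate a frame only on the first faulting access — can be sketched as a toy simulation (all names here are illustrative, not kernel APIs):

```python
# Toy demand paging: frames are allocated only on the first access
# that faults inside a registered VMA. Names are illustrative.

class AddressSpace:
    PAGE = 4096

    def __init__(self):
        self.vmas = []        # list of (start, end) mapped ranges
        self.page_table = {}  # virtual page number -> frame number
        self.next_frame = 0
        self.minor_faults = 0

    def mmap(self, start, length):
        # Demand paging: just record the VMA, allocate nothing yet.
        self.vmas.append((start, start + length))

    def access(self, addr):
        vpn = addr // self.PAGE
        if vpn in self.page_table:
            return self.page_table[vpn]          # PT hit: no fault
        if not any(s <= addr < e for s, e in self.vmas):
            raise MemoryError("segfault: no VMA covers %#x" % addr)
        # Minor fault: valid VMA, no frame yet -> allocate and map.
        self.minor_faults += 1
        frame = self.next_frame
        self.next_frame += 1
        self.page_table[vpn] = frame
        return frame

mm = AddressSpace()
mm.mmap(0x10000, 4 * AddressSpace.PAGE)   # maps 4 pages, 0 frames used
f0 = mm.access(0x10000)                   # first touch: minor fault
f1 = mm.access(0x10005)                   # same page: no fault
print(mm.minor_faults, f0 == f1)          # -> 1 True
```

The key point the sketch shows: mmap is cheap bookkeeping, and the fault handler, not the mapping call, is what commits physical frames.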
    - MADV_WILLNEED: likely to be used soon; the kernel can cache the page(s) to improve latency of future accesses
  - Used for simple paging policies
- More complex policies supported w/ eg. userfaultfd (UFFD):
  - User-level page fault handling
  - Allows arbitrary paging policies in userland:
    - App faults
    - Kernel calls the UFFD thread with PF info
    - UFFD handles the PF and returns
      - Potentially fancy policies
    - App thread resumes
Find and split:
- The above methods work on arbitrary memory ranges
- If range != VMA, we have to do more work
  - Potential for splitting and merging of VMAs
  - The above calls may therefore cause multiple splits/merges
- Without merging, #VMAs could balloon
  - VMA lookup would get slower and slower
- Solution:
  - Contiguous VMAs with identical properties are merged
VMAs and huge pages:
- Issues:
  - Requires application awareness
  - Static limit on the number available
  - Can't swap, or mix with regular pages in a VMA
- libhugetlbfs:
  - Overrides mmap, munmap, etc. to allocate huge pages
Transparent Huge Pages (THP):
- Designed to improve on libhugetlbfs
- Idea: the kernel automatically allocates THPs whenever possible (eg. when the alignment is correct)
- Applications don't need to be THP-aware; however:
  - An aware application can map with correct alignment
  - madvise's MADV_NOHUGEPAGE if only going to use a couple of bytes
- Operations:
  - Map:
    - Allocate compound pages and map them as THPs
      - "Compound page": group of physically contiguous pages
    - When to do it: at PF, periodically, never
      - Currently done at PF (may slow PFs)
    - What situations to do it in:
      - Mapped huge page fits within the owning VMA (assuming aligned)
      - Huge pmd available (pmd = page middle directory, the second-level table)
  - Collapse:
    - Replace a number of regular pages with a THP
    - When to do it: at PF, periodically, never
      - Currently done periodically (khugepaged)
      - Scans N pages every K intervals
    - How:
      - Replaces the pages with a compound page
      - Replaces the PTEs with a pmd
  - Split:
    - Split a THP into a number of regular pages
    - When to do it:
      - At VMA split (4k), if DONTNEED(4k) set, etc.
      - Any time THP-unaware kernel code executes
        - Prompted an issue with jemalloc
        - Result: THP not so transparent...
  - Compact:
    - Memory compaction s.t. compound pages can be allocated
    - Needed in the presence of high fragmentation
    - Both direct (failed alloc) & periodic compaction used in Linux
    - kcompactd:
      - Kernel thread for periodic compaction
      - Direct compaction waits for kcompactd
      - Used by PF handling code and khugepaged

== 08. Multiprocessing ==
Fork:
- Creates a new child process by duplicating the parent
- Implemented on top of clone() in Linux
- Duplicate task:
  - Copy most info, with some exceptions, eg. PID
  - Allocate and initialize a new kernel stack
  - Set up thread_info
  - Copy the trap frame and update it (eg. %eax)
- Copy memory:
  - Initialize an empty addr space (new page dir)
  - Duplicate the addr space:
    - Copy VMA information
    - Copy page tables
    - Fix up PTEs:
      - Eg. COW page frames
Exec:
- Executes the program pointed to by filename
- Implementation:
  - Check input & permissions
  - Load binary headers into memory
  - Determine the binary format
  - Flush old resources:
    - Reinit the task struct
    - Flush VMAs, PTs, page frames
  - Load the binary:
    - Parse headers & sections
    - Create VMAs for data, text, stack, etc.
  - Update %esp, %eip in the trap frame:
    - %esp = top of user stack
    - %eip = program entry point (for statically linked)
    - %eip = dynamic linker's entry point otherwise
  - Page tables initially empty:
    - Even binary file pages (eg. text) are demand paged
    - PF handler maps them from the cache using COW semantics for safe sharing
Copy on Write (COW):
- perm(VMA) = R/W, perm(PTE) = R
- At PF, the kernel sees the discrepancy in permissions
- Duplicates the page frame, remaps the new page into the faulting task's page tables w/ R/W perms
- Applications:
  - MAP_PRIVATE file pages:
    - Deduplicates binary pages for unrelated processes
    - Many (eg. text pages) never COWed
  - MAP_PRIVATE anon forked pages:
    - Deduplicates pages within a process hierarchy
    - Many (eg. fork, exec) never COWed
  - MAP_PRIVATE anon zero pages:
    - Deduplicates zero pages for unrelated processes
    - At first PF, map a single read-only zero page
    - COW at first write PF
    - Only used in specific cases (eg. THP) for scalability
Inter-Process Communication (IPC):
- Allows processes to communicate/synchronize with each other
- Many different methods/interfaces
- System V vs. POSIX:
  - Similar:
    - Shared memory regions
    - Semaphores: sync w/ post/wait primitives
    - Message queues: interprocess message exchange
  - Shared memory (Sys V):
    - Get/create a shmem segment by key/perm
    - Attach/detach by addr/key
  - Semaphores (Sys V):
    - ...
  - Message queues:
    - Get/create a message queue by key, perm
    - Send/recv. messages on a queue by queue id
    - Blocks if the queue is full (send) or empty (recv.)
    - Queue is size limited
  - POSIX IPC:
    - Uses names, not keys
    - Uses reference counting
    - Provides thread safety
    - Shared memory is file-oriented
Scheduling:
- How do we schedule multiple processes on a given CPU?
- Easy way (JOS):
  - Hardware raises timer interrupts
  - The interrupt handler invokes a simple round-robin scheduler
  - It's preemptive in that tasks get interrupted w/o their permission
  - No fairness, no priorities, etc.
- Time management:
  - Hardware offers clocks/timers
  - Exposes counters, and timers of varying precisions
  - These timers can issue interrupts, used for scheduling
  - Clocks include:
    - tsc: timestamp counter
    - hpet: high-precision event timer
    - acpi_pm: ACPI power management timer
  - Non-clock types:
    - RTC: real-time clock:
      - Persistent (battery powered)
      - Low precision
      - Good for maintaining the date
  - Misc:
    - Jiffies: ticks since startup
    - xtime: current date/time
- Tasks & scheduling:
  - Task state: Running, Runnable, Sleeping, etc.
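The JOS-style "easy way" above — a timer tick drives a preemptive round-robin scheduler, with each task preempted when its quantum expires — can be sketched as a small simulation (names and the quantum value are illustrative):

```python
from collections import deque

# Toy preemptive round robin: every task runs for QUANTUM ticks,
# then the "timer interrupt" forces a switch to the next runnable task.
QUANTUM = 2

def round_robin(tasks, ticks):
    """tasks: {name: ticks_of_work}; returns the per-tick execution trace."""
    queue = deque(tasks)          # runnable tasks, FIFO order
    work = dict(tasks)
    trace = []
    slice_left = QUANTUM
    while queue and len(trace) < ticks:
        cur = queue[0]
        trace.append(cur)         # cur runs for this tick
        work[cur] -= 1
        slice_left -= 1
        if work[cur] == 0:        # task finished: drop it from the queue
            queue.popleft()
            slice_left = QUANTUM
        elif slice_left == 0:     # quantum expired: preempt and requeue
            queue.rotate(-1)      # move cur to the tail
            slice_left = QUANTUM
    return trace

trace = round_robin({"A": 3, "B": 2, "C": 2}, ticks=10)
print(trace)  # -> ['A', 'A', 'B', 'B', 'C', 'C', 'A']
```

Note the properties the notes list: it is preemptive (A is forced off after its slice despite having work left) but has no fairness beyond FIFO order and no priorities.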
  - Quantum/time slice:
    - Max jiffies a task can run on the CPU
    - Set at creation
    - Decremented per tick
    - Good enough to ensure fairness for CPU-intensive tasks
  - Priority:
    - Set at creation
    - Could be modified at some point
  - Scheduling policy:
    - Normal, Batch, Idle -> completely fair scheduler
    - FIFO, RR -> real-time scheduler
Tickless kernel:
- Idea:
  - Tasks doing little will waste most of their slice
  - Can improve the system if we schedule them off "when done"
  - Improves responsiveness at the cost of overhead
- Create a new timer for the "end of quantum" event
- Reprogram the hardware to tick at the next timer to expire
Linux O(1) scheduler:
- Preemptive round-robin scheduler
- Once the default; now it's closer to the RT scheduler
- Maintains N run queues (N priorities)
- Strategy:
  - Find the highest-priority queue w/ a runnable task
  - Dequeue the first task on that queue
  - Calculate its time slice (based on its priority), and run it
  - At time up, enqueue it (at its priority level) and repeat
- Improvements:
  - Priorities adjusted based on 'sleep time'
    - Sleep time: the amount of time a task does nothing
  - Priority bonuses for I/O-heavy vs. CPU-heavy processes
Linux CFS (Completely Fair Scheduler):
- O(1) used lots of hard-to-maintain hacks for fairness
- Idea (w/ N tasks):
  - Record how much CPU time each task has been given
  - Schedule the task lagging furthest behind its fair share (tot_CPU_time/N)
  - Virtualize time (virtual runtime) to deal with priorities
    - Increase virtual runtime faster for lower-priority tasks
- No heuristics to distinguish tasks
- No run queues; uses an RB tree (indexed by virtual runtime)

== 09. Multicore ==
The kernel on multicore needs to:
- Start/stop cores
  - x86 APIC
- Deal with consistency/scaling issues
  - Kernel locking, partitioning, replication
- Schedule work
  - User/kernel threads
Starting/stopping cores:
- BIOS hands execution to one core
  - Called the bootstrap processor (BSP) in Intel lingo
- The BSP then starts the other cores (application processors, APs, in Intel lingo)
- The BSP starts APs by sending a special interrupt
Advanced Programmable Interrupt Controller (APIC):
- IRQ: interrupt request
- Sits in between IO devices (ethernet, RTC, keyboard, etc.) and the CPU
- Enabling:
  - Find the MP configuration table
  - Set a bit in the spurious-interrupt vector register
- Advanced features include:
  - Thermal management, internal error reporting
  - Interprocessor interrupts (IPIs):
    - Send interrupts to another core
    - Forward an interrupt
    - Start another core
- Sending IPIs:
  - Need to know the destination core's APIC ID
    - Read the config table from memory
  - Need to know the APIC register address
    - Read the config table from memory
  - Need to initiate the IPI
- Starting APs:
  - Send an INIT IPI
    - Resets a given core, specified by its APIC ID
    - Reset = reset of the core's subsystems
  - Send a start-up IPI (SIPI):
    - Includes an entry point
    - The started core begins executing at this entry point
Multicore scaling:
- Switching draws lots of power with many cores
- Can't keep all the transistors on all the time
  - At 8nm, 50% of a chip's transistors need to be off
- Solutions:
  - Lower frequency
  - Specialized cores
  - Turn cores on/off quickly
- Core hotplugging:
  - Slow on commodity OSs
  - Need to maintain shared state
Multicore consistency/scalability:
- Cores can execute in the kernel concurrently
- Need to keep kernel state consistent
  - eg. the frame allocator, IO devices (like the console), scheduler, etc.
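The consistency problem above — two cores doing an unsynchronized read-modify-write on shared kernel state, eg. a free-frame counter — can be reproduced in miniature with threads. This is a userspace analogy, not kernel code; the barrier just forces the bad interleaving deterministically:

```python
import threading

# Two "cores" each return one frame to a shared free counter.
# Unlocked version: both read the old value before either writes,
# so one update is lost. Locked version: updates serialize.

def racy_increment(state, barrier):
    tmp = state["free"]      # read
    barrier.wait()           # both cores have now read the same value
    state["free"] = tmp + 1  # write: one of the two updates is lost

state = {"free": 0}
barrier = threading.Barrier(2)
threads = [threading.Thread(target=racy_increment, args=(state, barrier))
           for _ in range(2)]
for t in threads: t.start()
for t in threads: t.join()
print(state["free"])  # -> 1, not 2: a lost update

lock = threading.Lock()

def locked_increment(state):
    with lock:               # whole read-modify-write is atomic now
        state["free"] += 1

state2 = {"free": 0}
threads = [threading.Thread(target=locked_increment, args=(state2,))
           for _ in range(2)]
for t in threads: t.start()
for t in threads: t.join()
print(state2["free"])  # -> 2
```

The second half is exactly the first of the protection options below: lock the state, trading performance for correctness.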
- Protecting state, options:
  - Lock the state (costs performance)
  - Lock the state only when needed (backward incompatible)
  - Partition the state (underutilization of resources)
  - Replicate it (stale state)
Locks:
- Spinlocks:
  - Pros: easy, fast
  - Cons: wasted cycles, scalability issues
- Mutexes:
  - Pros: easy, wasteless (a core can do something else while it waits)
  - Cons: high latency
- Read-copy-update (RCU):
  - Pros: fast, scalable in most situations
  - Cons: wastes memory, complex
- Big Kernel Lock (BKL):
  - Single spinlock protecting the entire kernel
  - Simple, but only one core can work in the kernel at a time
    - eg. the system can't write to disk while the webcam is running
- Fine-grained locking:
  - Partition the BKL
  - Separate locks for subsystems
  - Fine-graining can continue down to the structure level
  - Complex, but allows concurrent kernel execution
  - Requires the same lock ordering everywhere
Multikernel:
- Replicate state as a principle
- Combats locks not scaling well
- Produces a scalable kernel through replication
State replication in Linux:
- Partial
- Creates a copy of hot data structs per CPU
- Each core can access its own structures lock-free
- Per-core caches alleviate load on certain subsystems
Scheduling on multicore:
- Requires multiple threads of execution
- User-mode apps can spawn threads
- The kernel schedules these threads on idle cores
- Parallelism for kernel tasks:
  - Post-interrupt work
  - Background work:
    - Filling per-CPU frame caches
    - Writing dirty data to disk
Concurrency in Linux:
- Interrupts do as little as possible
  - Deferring work (bottom half)
- SoftIRQs:
  - High-priority work (w/ interrupts enabled)
  - Per-CPU bitmap of softIRQs needing work
  - Kernel checks the bitmap after interrupts/periodically
- Tasklets:
  - Dynamically created deferred work
  - Run as part of softIRQs
  - Serialized (no locking)
- Workqueues:
  - Unlike softIRQs/tasklets, these run in process context
  - Backed by kthreads
  - Graceful resource sharing, and they can sleep
  - Harder to program compared to tasklets...
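The softIRQ mechanism above — the interrupt "top half" only marks a bit in a per-CPU pending bitmap, and the deferred "bottom half" runs the registered handlers later — can be sketched like this (softirq numbers and function names are illustrative):

```python
# Toy softIRQ dispatch: raise_softirq() is the cheap top half run in
# "interrupt context"; do_softirq() is the bottom half run later.

NET_RX, TIMER = 0, 1          # softirq numbers (illustrative)
handlers = {}                 # softirq nr -> deferred-work function
pending = 0                   # per-CPU pending bitmap
log = []

def register_softirq(nr, fn):
    handlers[nr] = fn

def raise_softirq(nr):
    global pending
    pending |= (1 << nr)      # just set a bit: safe and fast

def do_softirq():
    global pending
    while pending:
        nr = (pending & -pending).bit_length() - 1  # lowest set bit
        pending &= ~(1 << nr)
        handlers[nr]()        # deferred work, "interrupts enabled"

register_softirq(NET_RX, lambda: log.append("net_rx"))
register_softirq(TIMER, lambda: log.append("timer"))

raise_softirq(TIMER)          # "interrupts" arrive...
raise_softirq(NET_RX)
do_softirq()                  # ...the real work happens on the way out
print(log, pending)           # -> ['net_rx', 'timer'] 0
```

The split keeps interrupt latency low: the handler touches one word, and everything slow is batched into the later bitmap scan.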
- KThreads:
  - Unit of parallelism in the Linux kernel
  - Back the bulk of softIRQs and workqueues
  - Use the waitqueue mechanism to get notified of work to do

== 10. Page Reclaiming ==
Page reclaiming techniques:
- Cache shrinking: shrink or drop OS-maintained caches
- Out-of-memory (OOM) killing: kill a process to reclaim its pages
- Swapping: swap likely-unused pages to disk
- Compression: compress likely-unused pages in memory
- Deduplication: merge and COW pages with identical content
Direct vs. indirect reclamation:
- Direct:
  - Triggered when the kernel fails to allocate memory
  - Exceptions apply (THP)
- Indirect (eg. kswapd):
  - Triggered when memory pressure is approaching
  - Frees pages to keep available memory above a threshold
Cache shrinking:
- Idea: eliminate "harmless" pages first
- Page cache: reclaim pages not referenced by any process
- Slab caches: reclaim pages from slabs with no allocated objects
- Dentry cache: reclaim dentries not referenced by any process
- Inode cache: reclaim inode objects with no controlling dentry
OOM killing:
- Crude, but:
  - A very effective way to recover pages
  - Acceptable for unimportant processes
  - A last resort
- Direct reclaiming:
  - Linux OOM killer
  - Assigns badness scores and kills the worst task
- Periodic reclaiming:
  - Android's low-memory killer
  - Similar to the OOM killer; targets background apps
Page swapping:
- Swap space on disk creates the illusion of huge virtual memory (backing store)
- Traditional solution for page reclaiming
- Losing prominence as RAM gets bigger/cheaper, IO is slow, etc.
- Still serves some very important features:
  - Hibernation on consumer platforms
  - Database backing on some server platforms
- Mechanics:
  - Swap-out:
    - At page reclaim, store the content in a swap entry on disk
    - The frame must be unmapped in all owning PTEs
  - Swap-in:
    - At PF, read the swap entry into a new frame and map it in
    - Need to locate the swap entry on disk
  - Requirements:
    - Reverse mapping: page -> PTEs (swap-out)
    - Direct mapping: PTE -> swap entry (swap-in)
    - Handle races with concurrent swap-ins/outs
      - Scratch area in the page cache
Page replacement:
- How to choose pages to swap?
  - Need a page replacement algorithm
- Ideal:
  - Replace the page that will be referenced as far in the future as possible
  - Can be approximated with LRU-like (Least Recently Used) strategies in practice
- FIFO:
  - Use a queue of faulting pages and replace the head
  - Costly: frequently accessed pages get evicted too
- Second chance/CLOCK:
  - Improved FIFO that preserves important pages
  - R set to 1 when a page is accessed
  - For each visited page:
    - If R=0, replace it
    - If R=1, set R=0 and move the page to the tail
  - CLOCK uses a circular list and moves a head pointer
Linux page reclaiming:
- Active/inactive LRUs:
  - Splits the LRU list into active and inactive lists
  - Both managed CLOCK-style
  - Active pages considered in use (working set)
  - Inactive pages unmapped (ready for reclaim)
  - Many per-zone LRUs in modern Linux
- Active/inactive updates:
  - New anon pages added to active
  - New file pages added to inactive
  - Inactive pages -> active on PF
  - Active pages periodically moved to inactive (kswapd)
- What ratio of active:inactive to maintain?
  - 1:1 proved not great
  - A better ratio should:
    - Keep the active list big enough to avoid page faults
    - Keep the inactive list big enough to give pages a second chance
- Active/inactive balancing:
  - A/I ratio determined adaptively using a global "refault distance":
    - Per-page distance: how much larger should the inactive list be to avoid evicting this page?
    - On inactive-list eviction, a distance counter tracks #evictions until the page faults again
Page cache:
- Stores the pages of a given owner
  - The swap cache is a subset
- Unified page cache serves many purposes:
  - Hold in-transit swap pages
  - Serve file reads/writes efficiently
  - Reverse file mappings
- Fundamental operations:
  - Lookup: used by the PF handler (disk read on miss)
  - Add: used when fetching a page from disk
  - Delete: used by page reclaiming
- N address_space objects, 1 for each owner
Reverse mapping (page descriptor -> PTEs):
- Each page descriptor stores a pointer to an rmap data structure indexing all VMAs which may map the page
- Given a VMA, we can walk the page tables via mm->pgd
- File pages:
  - Mapping points to an address_space data struct
  - Uses an RB tree to index the possibly many VMAs mapping the target page
- (Private) anon pages:
  - Mapping points to an anon_vma struct
  - Unless the page is in the swap cache
Direct mapping (PTE -> swap entry):
- The system maintains a number of swap areas, each with a priority and an array of available entries
- Each swap entry maps to a block on disk at a linear offset
- At swap-out, the PTE is set w/:
  - Area #
  - Swap entry offset
- At PF, entry lookup in the swap cache or on disk
Page compression:
- Idea: use the CPU for "swapping" rather than a slower I/O system
  - Page out to a compressed memory cache
- Can replace or complement swapping
- Compression algos include:
  - LZO (default)
  - LZ4 (better compression)
- Methods include:
  - ZRAM: swap replacement
  - ZSWAP: swap complement
  - ZCACHE: swap complement
- ZRAM:
  - Uses a compressed RAM disk as a swap device
  - Pages are compressed at swap-out
  - Pages are decompressed at swap-in
  - No changes in swap management required
  - Can work as the only swapping method, or next to other devices
    - ZRAM typically configured as the highest-priority swap device
    - Other devices are used when the highest-priority device is full
  - Disadvantages:
    - Requires manual swap device config
    - When the cache is full, the system no longer benefits from it
    - LRU inversion: the most recently swapped pages go to slow disk
      - Stale pages quickly fill the ZRAM cache, leaving more frequently accessed pages to be swapped to slower devices (disk)
- ZSWAP:
  - Write-back cache for regular swap devices
  - Stores compressed pages in the zswap pool
  - Uses the zbud allocator (default) to store 2 compressed pages per page frame
  - Swaps out incompressible pages directly
  - Compressed pages swapped to disk in LRU fashion when the pool reaches a preset size
  - Greatly reduces disk I/O due to swapping
  - Disadvantage:
    - Assumes a dedicated disk swap device
- ZCACHE:
  - Write-back cache similar to zswap
  - More ambitious than ZSWAP:
    - Compressed cache for swapped pages
    - Compressed cache for clean pages
  - More complicated, not commonly used
  - ...
Page deduplication:
- Proactively find pages with identical content
- Merge them, and COW for safe sharing
- Why bother if COW is already used for fork/exec?
  - eg. virtualization: dedupe anon pages across guests
  - eg. special workloads with lots of data duplication
- Problems:
  - Performance:
    - Proactive reclaiming may eliminate memory pressure, or just waste cycles
    - Possible to cause a COW storm
  - Security:
    - Cache attacks: enables flush+reload for otherwise-unrelated pages in different security domains
    - Dedup Est Machina: yields a "does this page exist?" oracle via COW timing
    - Flip Feng Shui: provides a massaging primitive to map a target page into a physical page vulnerable to hardware bit flips
Kernel Samepage Merging (KSM):
- In-kernel dedup system for private anon memory
- Designed for virtualization, but useful in general
- Opt-in policy: a VMA flag allows merging
- Mechanics:
  - A kthread (ksmd) scans P mergeable pages every S seconds
  - Maintains RB trees of merged/unmerged pages indexed by content
  - At every scan, ksmd walks the trees P times, looking for exact matches
  - For each exact match, ksmd merges the page frames and write-protects the owning PTEs
  - Merged pages can be COWed, or even swapped
- Stable tree:
  - Contains only already-merged pages
  - Their content cannot change (w/o a COW PF)
  - The RB tree is consistent by design
  - ksmd can safely walk it for each page, looking for exact matches
- Unstable tree:
  - Contains unmerged pages, whose content may change
  - Can be inconsistent; lookups can lead to false negatives
  - To mitigate needless walks, ksmd can rebuild the tree each scan
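The KSM merge/COW cycle above can be sketched with a content-indexed table standing in for the stable tree. This is a toy model with illustrative names — real KSM compares full page contents via RB-tree walks and enforces COW through write-protected PTEs, not Python dicts:

```python
# Toy samepage merging: frames with identical content are merged into
# one shared frame; a write to a shared frame triggers COW (un-merge).

frames = {}        # frame id -> page content
refcount = {}      # frame id -> number of mappings
by_content = {}    # content -> frame id ("stable tree" stand-in)
next_frame = 0

def alloc(content):
    global next_frame
    fid = next_frame
    next_frame += 1
    frames[fid] = content
    refcount[fid] = 1
    return fid

def scan_and_merge(page_table):
    """One ksmd-style pass; page_table: vaddr -> frame id."""
    for vaddr, fid in page_table.items():
        content = frames[fid]
        if content in by_content and by_content[content] != fid:
            shared = by_content[content]   # exact match found:
            refcount[fid] -= 1             # drop the private copy
            if refcount[fid] == 0:
                del frames[fid]
            refcount[shared] += 1          # remap to the shared frame
            page_table[vaddr] = shared     # (and "write-protect" it)
        else:
            by_content[content] = fid

def write(page_table, vaddr, content):
    fid = page_table[vaddr]
    if refcount[fid] > 1:                  # shared frame: COW first
        refcount[fid] -= 1
        page_table[vaddr] = fid = alloc(frames[fid])
    frames[fid] = content

pt = {0x1000: alloc("AAAA"), 0x2000: alloc("AAAA"), 0x3000: alloc("BBBB")}
scan_and_merge(pt)
print(pt[0x1000] == pt[0x2000])   # -> True: duplicates merged
write(pt, 0x2000, "CCCC")         # a write breaks the sharing via COW
print(pt[0x1000] == pt[0x2000])   # -> False
```

The COW step in `write` is also what makes dedup observable to an attacker: the extra copy on first write to a merged page is the timing signal exploited by Dedup Est Machina.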