*Most of these are works in progress
KP NOTES
== 02. Booting x86 ==
At Poweron:
- CPU begins executing BIOS code
- Detect and initialize devices/memory
- Find bootable device
- Read boot sector into memory & start executing it
- Memory Layout:
- Device Memory
- BIOS ROM
- Bootloader loaded into mem below dev. mem/ROM
At boot loader:
- CPU in 'Real' mode
- 20 bit memory addrs (up to 1 meg)
- Segmentation used, eg. '0x12:0x1234'
- CPU translates by computing segment * 16 + offset
Transition to 'Protected' mode:
- Need to setup GDT
- Defines memory areas, protections, tasks...
- Memory Segment Descriptors define memory range + protections
Segment Registers:
- Real Mode: used for addressing
- Prot. Mode: used for GDT offset
- GDT segment entry describes physical memory base & length
- Segmentation not used in most archs, instead they use paging
Protected Mode:
- Bootloader reads the kernel into memory
- Jumps to the kernel's entry func
Entering Kernel:
- Enables Paging
- Memory allocation and protection via virt. addr. spaces
- Initialize global kernel vars (BSS)
- Inits console
- Inits memory
== 03. Managing Physical Memory ==
UMA vs. NUMA:
- Uniform Memory Access (UMA):
- All processors share same latency to memory
- Single Memory for all cores
- Doesn't scale well w/ #cores
- Non-Uniform Memory Access (NUMA):
- Processors have own local memory with fast access times
- Processors can access other memories, but more slowly
- Scales better than UMA
Physical Memory in Linux:
- Logically divided into pages (page frames)
- Nodes:
- NUMA abstraction of N banks
- Zones:
- DMA: 0-16 MB DMA
- NORMAL: 16-896 MB Kernel
- HIGHMEM: 896MB-END Kmap
- Tagged regions for each node
- Pages:
- Physical page frame for each zone
- 4KB units
- page object structs in a memory_map array
Allocators in Linux:
- SLAB Allocator:
- Allows cache creation, which store objects of same size.
- Size can be smaller/bigger than a page
- Addresses internal fragmentation, ie. an entire page used for 10B of data
- VMALLOC Allocator:
- Deals w/ Non-physically contiguous memory
- Page Allocator:
- Deals /w contiguous areas of phys. pages
- MemBlock Allocator
Memblock Allocator:
- Early boot-time allocator
- Discarded after initializing Buddy allocator (linux)
- Maintains two arrays, memory (all mem), and reserved (allocated mem)
- Allocates by finding regions in "mem && !reserved"
Buddy Allocator:
- "Power of 2 allocator with free coalescing"
- Blocks arranged in 2^N (order) pages
- Allocations satisfied by exact N:
- Split larger 2^(N+1) blocks into 2x2^N blocks
- The two resulting blocks are "buddies"
- Deallocations return block to allocator:
- If the returned block's buddy is free, coalesce into larger block
- Keep coalescing until it's no longer possible
- Freelists per order (N) maintain free blocks of that size
Fragmentation:
- External: Inability to service allocations due to too many small blocks
- Addressed in buddy with free coalescing
- Internal: Space wastage due to too many large blocks assigned to small allocations
- Addressed w/ Slab allocators
Memory Errors:
- Out of Bounds Access:
- Accesses outside allocation
- Uninitialized Reads:
- Reads of uninitialized data might lead to previous allocation's data being read
- Use-after-free:
- Access to data after free
- Memory Leak...
Anti-Memory Errors:
- Page Guarding:
- Detects out of bounds access
- Each allocation looks like:
- [Guard Page][Data Page][Guard Page]
- Marks page as guard, allowing detection
- Page Canaries:
- Lazily detects out of bounds access
- Each alloc looks like:
- [Canary Page][Data Page][Canary Page]
- Canary pages contain specific data
- Can lazily detect out of bounds writes by mismatch w/ canary data
- Page Remapping:
- ?
- Page Poisoning:
- Lazily detects write-after-free
- When page is freed, it is written with known value
- Write after free detected by mismatch
- If the memory is reused, ie. written to normally, we can't detect it with page poisoning
Linux Memory Protection "Sanitizers":
- kmemcheck (uninitialized reads):
- Traps on every read to check for initialization
- kmemleak (memory leaks):
- Periodic Garbage collection
- Reports objects not pointed to by any "likely" pointers
- kasan (out of bounds & use after free):
- Combines canaries and poisoning ideas
- Checks each access to see if target is a live object
JOS Boot Memory Allocator:
- Just maintains an index to next free block
- Only used for a few long-lived structures, eg. initial page dir
JOS Page Frame Allocator:
- Simple freelist based allocator
- Allocates a single page at a time
- Maintains free and allocated list
== 05. Page Table Management ==
Address Spaces:
- Provides isolation and protection
- Very flexible
- Comes at the cost of address translation
x86_32 Page Table Entry:
- Describes a page (4KB)
- Format:
Bits: [31 - 12][11 - 9][8 - 0]
Values: [Phys. Addr][Avail.][Flags]
- Flags:
- Global: Keeps the TLB entry when cr3 is reloaded (eg. at context switch)
- Dirty: Indicates page has been written to
- Accessed: Indicates r/w executed on page
- Cache Disabled: Indicates not to cache this page
- Write Through: Indicates Write-Through caching enabled
- User/Super: Indicates avail to all, else only super
- Read/Write: Indicates R/W page, else only read
- Present: Indicates presence of page in physical memory (eg. not swapped)
- Page Dir Entry vs. Page Table Entry:
- Identical in format except:
- PgDir: Addr is page table addr
- PgDir: Global Bit ignored
- PgDir: No Dirty Bit
- PgTable: No Page-Size Bit
- In PgDir this indicates huge pages
x86_32 Virtual -> Physical Translation:
- Read cr3: Addr of top level page table (page directory)
- Use high 10 bits of VA as offset into table
- Assuming entry exists, addr stored in entry is addr of page table
- VA's next 10 bits (21-12) are offset into this table
- Entry contains the physical addr of the page, low 12 bits in VA offset inside frame
x86_32 Virtual Memory Misc:
- If top level page table (page dir) entry doesn't exist:
- No Virtual mapping exists, or
- Entire 4MB it maps has been swapped
- Page allocation:
- Allocate a new page, and set its phys. addr in page table entry
- If second level page table (page table) entry doesn't exist:
- No Virt. Mapping exists, or
- Page (4KB) has been swapped
- Freeing a page:
- Walk page tables, free physical page
- Destroy (zero) second level page table entry
- Could also destroy the top level entry IF it maps no other pages
- Linux doesn't bother with this
- This is all kinda slow:
- x86_32 vaddr translation requires 3 mem accesses
- Linux keeps special caches for page table pages
Translation Lookaside Buffer (TLB):
- Caches successful page table walks:
- Lines are vaddr x maps to phys addr y
- Managed by MMU
- Always looked at first:
- On hit, great success
- On miss, walk page tables
- Often Fully Associative, Small, Fast
- When to Flush?
- When switching address spaces
- When updating/deleting page-table entries
- Given more and more memory, the size of TLB becomes a problem
Improvements on/to TLB:
- Translation Caches:
- TLB has all vaddr bits
- Translation Cache has part of the vaddr bits
- Can then do a partial page walk
- x86_32: TLB: 32bit tag (index), TCache: 16bit tag
- Caching Page Table Pages:
- Can store PTEs in large caches to speed up TLB miss
- Bigger Pages:
- The bigger the pages, fewer mem access per page table walk
- Ex. 4MB huge pages reduce the number of accesses by 1
- Eliminates the bottom page table from having to be accessed
Huge Pages:
- Large chunks of contiguous, aligned physical memory:
- 4MB standard, can be bigger
- Performance benefit on TLB miss as described above
- Drawbacks:
- Potential for lots of wasted space
- Complex to manage
Page Tables and Security:
- Prime targets for attacker include:
- Kernel base addr (jump targets)
- Physical Memory Map (Linux)
- Enter KASLR (Kernel Address Space Layout Randomization):
- Randomizes different sections in kernel addr space
- Makes it harder to exploit:
- Harder to leak pointer
- Bruteforce leads to kernel crashes
- Entropy:
- Generally limited to simplify memory management
- Linux Kernel: 6 bits
- Windows Kernel: 13 bits
- OS X Kernel: 8 bits
- Can only be randomized at Boot (rand addr to place kern. at)
== 06. User-mode and Interrupts ==
NB. Operating System's Task:
- "The safe and efficient multiplexing of system resources to competing applications"
- CPU provides support for safety through privileged instructions
- OS uses this support to manage user applications
x86 Privilege Separation:
- Uses "Ring" terminology
- Ring 0 is kernel, Ring 3 is userland
- In general, only 0 and 3 used
What's done in Ring 0?
- GDT configuration (x86)
- Mediation of access to resources:
- CPU via per application quotas
- Memory via manipulating page tables
- Devices (control operation and expose interfaces)
x86 Switching Rings:
- Done via Segmentation hardware
- Last 2 bits of code segment indicate ring
- Kernel/User use different segments
- Isolation enforced via address spaces
- IRET instruction switches to higher (less privileged) rings
- INT instruction switches to lower (more privileged) rings
Address Spaces and User Apps (processes):
- Different page tables create different address spaces
- Allows OS to isolate processes from each other
- Most modern OSs have a 1-to-1 mapping of addr spaces to processes
Process Management:
- Each process may hold a number of resources (files, memory, etc.)
- Privileges of processes may vary
Address Space Layout Randomization (ASLR):
- Knowing the location of sensitive information (code, data) makes an attacker's job easier.
- Place this in random locations instead
- Entropy:
- Linux mmap provides 28 bits
- Varies greatly on windows depending on features/data structure
Interrupt Classes:
- An event that "interrupts" the execution flow of a program
- Can be external:
- Keyboard taps, network packets
- Can be internal:
- Divide by Zero, Page Faults
- Can be Sync/Async
Interrupts: Hardware & Software Interrupts:
- Hardware Generated:
- Device wants kernel to do something
- Signal the CPU via setting a pin
- Maskable or NonMaskable:
- Can I (kernel) ignore it?
- Thermal fault vs. NIC got a packet
- Software Generated:
- Software needs some service from kernel
- Eg. open a file, more memory
- Done by executing INT (instruction)
Interrupts: Exceptions, Traps, Faults:
- CPU Generated
- Result of unexpected behavior in a CPU system
- Eg. divide by zero, hardware errors, mem access w/o pgtables
- Software faults mostly synchronous
- Hardware faults mostly asynchronous
Interrupt Routine (what happens?):
- Safe transfer of execution to kernel mode
- Some user context saved (like eip)
- CPU elevates privilege level and switches to the kernel stack
- Interrupt handling function called
- After handling, CPU restores the saved context, and drops priv. level
Interrupt Descriptor Table:
- Interrupts may come from different sources, be of different types
- CPU needs to know how to handle them
- At interrupt, CPU looks up the interrupt in the IDT
- Max 256 entries
- First 32 entries reserved by Intel for exceptions (page fault etc)
- Kernel code seg. defined in IDT, but Kernel stack read from TSS
Task State Segment (TSS):
- A GDT entry containing, among other state, the address of the interrupt stack
- In JOS, interrupt stack == top of the kernel stack
- Task register points to the GDT entry
- LTR instruction loads the correct TSS
Returning from Interrupts:
- The execution's context has been saved on kernel stack
- IRET used to return execution to pre-interrupt location
- Pops the right amount of context off stack
- Restores the stack (if needed)
- Returns to saved EIP (and drop priv. if needed)
Problems with Interrupts:
- Livelock: execution may slow, or seem to freeze in the presence of many interrupts (interrupt storm)
- Solutions:
- Defer work
- Enter polling mode under high interrupts (ask for update instead of being told)
Software Interrupts (System Calls):
- Kernel support for servicing user applications
- Generated by the INT instruction (0x80 in Linux)
- Sysenter/Sysexit, Syscall/Sysret
- Dispatch routine (syscall handler) just an interrupt handler
- Kernel-User interaction per call dictated by calling convention
- Ex: INT in Linux takes the syscall number in %eax, arguments in %ebx, %ecx, %edx, etc. in order
- %eax contains ret value
Problems w/ INT:
- Modern processors are deeply pipelined
- Cache misses can cause pipeline stalls
- Pipeline stalls are expensive:
- IDT entry may not be in the cache
- Different permissions constrain instruction reordering
- Solution: Sysenter
- Cache the IDT entry for a syscall in a special CPU reg
- No more cache misses for IDT
- Intel added dedicated registers
- Assumes that syscalls are frequent enough to be worth extra regs
- NB. Ring-3 ESP, EIP not automatically saved by sysenter
- Sysenter Pros:
- Faster than int
- Sysenter Cons:
- Programmer needs to do more work
- Not all CPUs support it
- Passing EIP around can be inefficient
- Linux impl:
- Kernel supports both Sysenter and Int
- Kernel figures out what's available on CPU (cpuid)
- Kernel sets up a virtual syscall page w/ best syscall choice
- Virtual Dynamic Shared Object:
- Originally mapped at fixed user addr
- Replaces int 0x80 w/ call <addr>
- Supports fixed point return (?)
- Can support extra functionality (like gettimeofday)
Syscalls - AMD proposal:
- Same idea as sysenter, but w/o fixed return point
- Adopted by everyone
- VDSO only really needed then for extra funcs (like gettimeofday)
== 07. Paging ==
Page Fault:
- Special Exception
- Purpose:
- MMU needs attention
- Basis for demand paging
- Types:
- Invalid: Access to invalid address
- Ex. Write to Read only page
- Major: Access is valid, but page not in memory
- Ex: Page has been swapped
- Minor: Access is valid and page is in memory
- Ex: First access to anonymous memory
- Kernel Faults:
- Invalid if kernel address
- Major: kernel is paged out (eg. Windows)
- Minor: lazy allocation (eg. vmalloc)
- User: Everything goes
Page Fault Handling:
- CR2 register contains faulting addr
- Get info on faulting process (get_current()->task)
- Get info on task's address space (" "->mm)
- Get VMA region info by looking up faulting addr in VMAs
- Get PTE info by walking task's page tables
Demand Paging:
- Only allocates a page frame on first access
- Allocation is now just a VMA
- mmap permission might cause frame allocation as well
- On PF, if faulting addr is covered by a VMA:
- Kernel Assigns page frame to task
- Kernel create a PTE entry and set it in the tasks PTs
- Switches back to task execution
- Task is unaware that anything happened
VMA Operations:
- mmap:
- Maps new region in addr space
- Flags include:
- READ/WRITE/EXECUTE
- POPULATE: Immediately allocate frame
- ANONYMOUS: Generic, demand paged (if used) region
- Tries to find an accommodating gap between VMAs, and creates a new one in that space if found.
- Faults new pages in if requested
- munmap:
- Unmaps a region in the address space
- Simply deletes VMA in best case:
- Region contains no mapped page frames
- Otherwise, needs to:
- Remove mapped page frames
- Remove any PTEs pointing to these pages
- Flush any TLB entries pointing to encompassed pages
- mprotect:
- Changes a region's protection bits
- Simply updates a VMA's protection bits in best case:
- Region specified -eq to a single VMA
- Region contains no mapped pages
- Otherwise, needs to:
- Split/Merge VMAs, possibly making a new one to accommodate
- Update affected PTEs
- madvise:
- Gives kernel advice on a region:
- MADV_DONTNEED: not likely to be used again soon, kernel can free resources associated with region.
- MADV_WILLNEED: likely to be used soon, kernel can cache the page(s) to improve latency of future accesses
- Used for simple paging policies
- More complex policies supported w/ eg. Userfaultfd (UFFD):
- User-level page fault handling
- Allows arbitrary paging policies in userland:
- App faults
- Kernel calls UFFD thread with PF info
- UFFD handles PF and returns
- Potentially fancy polices
- App thread resumes
- Find and Split:
- The above methods work on arbitrary memory ranges
- If Range != VMA, we have to do more work
- Potential for splitting and merging of VMAs
- Above calls may therefore cause multiple splits/merges
- Without merging, #VMAs could balloon
- VMA lookup will get slower and slower
- Solution:
- Contiguous VMAs with identical properties are merged
VMAs and Huge Pages:
- Issues:
- Requires application awareness
- Static limit on number available
- Can't swap or mix with regular pages in VMA
- Libhugetlbfs:
- Overrides mmap, munmap, etc. to allocate huge pages
Transparent Huge Pages:
- Designed to improve on libhugetlbfs
- Idea: Kernel automatically allocates THPs whenever possible (eg. when the alignment is correct)
- Applications don't need to be THP aware, however:
- An aware application can map with correct alignment
- madvise MADV_NOHUGEPAGE if only going to use a couple bytes
- Operations:
- Map:
- Allocate compound pages and map them as THPs
- "Compound Page": group of phys-contiguous pages
- When to do it:
- At PF, periodically, never
- Currently done at PF (may slow PFs)
- What situations to do it in:
- Mapped Huge Page within owning VMA (assuming aligned)
- Huge pmd available (pmd?)
- Collapse:
- Replace a # of regular pages with THP
- When to do it:
- PF, periodically, never
- Currently done periodically (khugepaged)
- Scans N pages every K intervals
- How:
- Replaces pages with compound page
- Replace PTEs with pmd
- Split:
- Split THP into a # of regular pages
- When to do it:
- At VMA split(4k), if DONTNEED(4k) set, etc.
- Any time THP-unaware kernel code executes
- Prompted an issue with jemalloc
- Result: THP not so transparent...
- Compact:
- Memory compaction s.t. compound pages can be allocated
- Needed in presence of high fragmentation
- Both Direct (failed alloc) & periodic used in Linux
- kcompactd:
- Kernel thread for periodic compaction
- Direct compaction waits for kcompactd
- Used by PF handling code and khugepaged
== 08. Multiprocessing ==
Fork:
- Creates a new child process by duplicating the parent
- Implemented on top of clone() in Linux
- Duplicate Task:
- Copy most info, some exceptions, eg. PID
- Allocate and initialize new kernel stack
- Setup thread_info
- Copy trap frame and update (eg. $eax)
- Copy Memory:
- Initialize empty addr space (new page dir)
- Duplicate Addr space:
- Copy VMA information
- Copy page tables
- Fixup PTEs:
- Eg. COW page frames
Exec:
- Executes the program pointed to by filename
- Implementation:
- Check input & permissions
- Load binary headers into memory
- Determine binary format
- Flush old resources:
- Reinit task struct
- Flush VMAs, PTs, Page Frames
- Load the binary:
- Parse headers & Sections
- Create VMAs for Data, Text, Stack etc.
- Update %esp, %eip in trap frame:
- %esp = top of user stack
- %eip = program entry point (for statically linked)
- %eip = dynamic linker's entry point otherwise
- Page tables initially empty:
- Even binary file pages (eg. text) are demand paged
- PF handler maps them from cache using COW semantics for safe sharing (?)
Copy on Write (COW):
- perm(VMA) = R/W, perm(PTE) = R
- At PF, kernel sees discrepancy in permissions
- Duplicates page frame, remaps new page into faulting task's page tables w/ R/W perms
- Application:
- MAP_PRIVATE: file pages
- Deduplicates binary pages for unrelated processes
- Many (eg. Text pages) never COWed
- MAP_PRIVATE: anon forked pages:
- Deduplicates pages with process hierarchy
- Many (eg. fork, exec) never COWed
- MAP_PRIVATE: anon zero pages
- Deduplicates zero pages for unrelated processes
- At first PF, map single read-only zero page
- COW at first write PF
- Only used in specific cases (eg. THP) for scalability
Inter Process Communication (IPC):
- Allows processes to communicate/synchronize with each other
- Many different methods/interfaces
- System V vs. POSIX:
- Similar:
- Shared Memory Region
- Semaphores: sync w/ post/wait primitives
- Message queues: interprocess message exchange
- Shared Memory (Sys. V):
- Get/create shmem segment by key/perm
- Attach/detach by addr/key
- Semaphores (Sys. V):
- ...
- Message Queues:
- Get/create message queue by key, perm
- Send/recv. message on message queue by queue id
- Blocks if queue is full (send) or empty (recv.)
- Queue is size limited
- POSIX IPC:
- Uses names, not keys
- Uses reference counting
- Provides thread safety
- Shared Memory is file-oriented
Scheduling:
- How do we schedule multiple processes on a given CPU?
- Easy way (JOS):
- Hardware raises timer interrupts
- Interrupt handler invokes simple round-robin scheduler
- It's Preemptive in that tasks get interrupted w/o their permission
- No fairness, no priorities, etc
- Time Management:
- Hardware offers clocks/timers
- Exposes counters and timers of varying precisions
- These timers can issue interrupts, and used for scheduling
- Clocks include:
- tsc: Timestamp counter
- hpet: High-precision event timer
- acpi_pm: ACPI power management timer
- Non-clock types:
- RTC: Real time clock:
- Persistent (battery powered)
- Low-precision
- Good for maintaining date
- Misc:
- Jiffies: ticks since startup
- xtime: current date/time
- Tasks & Scheduling:
- Tasks's State: Running, Runnable, Sleeping, etc.
- Quantum/time slice:
- Max jiffies a task can run on CPU
- Set at creation
- Decremented per tick
- Good enough to ensure fairness for CPU intensive tasks
- Priority:
- Set at creation
- Could be modified at some point
- Scheduling policy:
- Normal, Batch, Idle -> Completely fair scheduler (?)
- FIFO, RR -> Real-time scheduler (?)
- Tickless Kernel:
- Idea:
- Tasks doing little will waste most of their slice
- Can improve system if we schedule them off "when done"
- Improve Responsiveness at cost of overhead
- Create a new timer for the "end of quantum" event
- Reprogram hardware to tick at next timer to expire
- Linux O(1) Scheduler:
- Preemptive round-robin scheduler
- Once the default, now it's closer to RT
- Maintains N run queues (N priorities)
- Strategy:
- Find highest pri. queue w/ runnable task
- Dequeue first task on that queue
- Calc. its time slice (based on its priority), and run
- At time up, enqueue it (at its priority level) and repeat
- Improvements:
- Priorities are adjusted based on 'sleep time'
- Sleep time: the amount of time a task does nothing (?)
- Priority bonuses for I/O vs. CPU heavy processes
- Linux CFS (Completely Fair Scheduling) Scheduler:
- O(1) used lots of hard-to-maintain hacks to achieve fairness
- Idea (w/ N tasks):
- Record how much CPU time each task has been given
- Schedule task with biggest delta tot_CPU_time/N
- Virtualize time (virtual runtime) to deal with priorities
- Increase virtual runtime faster for lower-priority queues
- No heuristics to distinguish tasks
- No run queues, uses RB tree (indexed by virt. runtime)
== 09. Multicore ==
Kernel on Multicore needs to:
- Start/Stop cores
- x86 APIC
- Deal with consistency/scaling issues
- Kernel Locking, partitioning, replication
- Schedule work
- User/Kernel threads
Starting/Stopping Cores:
- BIOS hands execution to one core
- Called the Bootstrap Processor (BSP) in Intel lingo
- BSP then starts other cores (Application Processors, APs, in Intel lingo)
- BSP starts APs by sending a special interrupt
Advanced Programmable Interrupt Controller:
- IRQ: Interrupt Request
- Sits between IO devices (ethernet, RTC, Keyboard, etc.) and CPU
- Enabling:
- Find the MP configuration Table
- Set a bit in the spurious-interrupt vector register
- Advanced Features include:
- Thermal Management, Internal Error Reporting
- Interprocessor interrupts (IPI):
- Send interrupts to another core
- Forward an interrupt
- Start another core
- Sending IPIs:
- Need to know the destination core APIC ID
- Read the config table from memory
- Need to know the APIC register address
- Read the config table from memory
- Need to initiate the IPI
- Starting APs:
- Send an IPI INIT
- Resets a given core specified with its APIC ID
- Reset = reset of core's subsystems
- Send a start-up IPI (SIPI):
- Includes an entry point
- Started core begins executing at this entry point
Multicore Scaling:
- Switching draws lots of power with many cores
- Can't keep all the transistors on all the time
- At 8nm, 50% of a chip's transistors need to be off
- Solutions:
- Lower frequency
- Specialized Cores
- Turn cores on/off quickly
- Core hotplugging:
- Slow on commodity OSs
- Need to maintain shared state
Multicore Consistency/Scalability:
- Cores can execute in the kernel concurrently
- Need to keep kernel state consistent
- eg. Frame allocator, IO devices (like console), scheduler, etc.
- Protecting state:
- Lock the state (costs performance)
- Lock the state only when needed (backward incompatible)
- Partition state (underutilization of resources)
- Replicate it (stale state)
Locks:
- Spinlocks:
- Pros: easy, fast
- Cons: wasted cycles, scalability issues
- Mutexes:
- Pros: easy, no wasted cycles (core can do something else while it waits)
- Cons: high latency
- Read-copy-update aka RCU
- Pros: fast, scalable in most situations
- Cons: wastes memory, complex
- Big Kernel Lock (BKL):
- Single spin lock to protect entire kernel
- Simple, but only one core can work in kernel at a time
- eg. system can't write to disk while webcam is running
- Fine-grained Locking:
- Partition the BKL
- Separate locks for subsystems
- Fine-graining can continue down to structure level
- Complex, but allows concurrent kernel execution
- Requires same lock ordering everywhere
Multikernel:
- Replicate the state as a principle
- Combats locks not scaling well
- Produces a scalable kernel through replication
State Replication in Linux:
- Partial
- Create a copy of hot data structs. per CPU
- Each core can access its own structures lock free
- Caches per core to alleviate load on certain subsystems
Scheduling on Multicore:
- Requires multiple threads of execution
- User mode apps can spawn threads
- Kernel schedules these threads on idle cores
- Parallelism for kernel tasks:
- Post interrupt work
- Background work:
- Filling per-CPU frame caches
- Writing dirty data to disk
Concurrency in Linux:
- Interrupts do as little as possible
- Deferring work (bottom half)
- SoftIRQs
- High Priority work (w/ interrupts enabled)
- Per CPU bitmap of softIRQs needing work
- Kernel checks bitmap after interrupt/periodically
- Tasklets
- Dynamically create deferred work
- Run as part of softIRQs
- Serialized (no locking)
- Workqueues
- Unlike softIRQs/tasklets, these run in process context
- Backed by kthreads
- Graceful resource sharing and they can sleep
- Harder to program compared to tasklets..
- KThreads
- Unit of parallelism in Linux kernel
- Back the bulk of softIRQs and workqueues
- Use the waitqueue mechanism to get notified of work to do
== 10. Page Reclaiming ==
Page Reclaiming techniques:
- Cache Shrinking: shrink or drop OS maintained caches
- Out-of-Memory (OOM) killing: kill a process to reclaim its pages
- Swapping: swap likely unused pages to disk
- Compression: compress likely unused pages in memory
- Deduplication: Merge and COW pages with identical content
Direct v. Indirect reclamation:
- Direct:
- Triggered when the kernel fails to allocate memory
- Exceptions apply (THP)
- Indirect (eg. kswapd):
- Triggered when memory pressure is approaching
- Frees pages to keep available memory above a threshold
Cache Shrinking:
- Idea: eliminate "harmless" pages first
- Page Cache: reclaim pages not referenced by any process
- Slab Caches: reclaim pages from slab with no allocated objects
- Dentry cache: Reclaim dentries not referenced by any process
- Inode cache: Reclaim inode objects with no controlling dentry
OOM Killing:
- Crude, but:
- Very effective way to recover pages
- Acceptable for unimportant processes
- Last resort
- Direct Reclaiming:
- Linux OOM Killer
- Assigns badness scores and kills worst task
- Periodic Reclaiming:
- Android's low-memory killer
- Similar to OOM killer, targets background apps
Page Swapping:
- Swap space on disk creates illusion of huge virt. memory (backing store)
- Traditional solution for page rec.
- Losing Prominence as ram gets bigger/cheaper, IO is slow, etc
- Serves some very important features however:
- Hibernation on consumer platforms
- Database backing on some server platforms
- Mechanics:
- Swap-out:
- At page reclaim, store content in a swap entry on disk
- Frame must be unmapped in all owning PTEs
- Swap-in:
- At PF, read swap entry into a new frame and map it in
- Need to locate the swap entry on disk
- Requirements:
- Reverse mapping: page -> PTEs (swap out)
- Direct mapping: PTE -> swap entry (swap in)
- Handle races with concurrent swap-ins/outs
- Scratch area in the page cache
Page Replacement:
- How to choose pages to swap?
- Need a page replacement algo
- Ideal:
- Replace page that will be referenced as far in the future as possible
- Can be approximated with LRU-like (Least Recently Used) strategies in practice
- FIFO:
- Use a queue of faulting pages and replace the head
- Costly on frequently accessed pages (?)
- Second Chance/CLOCK:
- Improved FIFO to preserve important pages
- For each visited page:
- if R=0, replace it
- if R=1, set R=0, and move to tail
- R set 1 if accessed
- CLOCK uses a circular list and moves head
Linux' Page Reclaiming:
- Active/Inactive LRUs
- Split LRU list into active and inactive LRUs
- Both managed CLOCK style
- Active pages considered in use (working set)
- Inactive pages unmapped (ready for reclaim)
- Many per zone LRUs in modern Linux
- Active/Inactive Updates:
- New anon pages added to active
- New file pages added to inactive
- Inactive pages -> active on PF
- Active pages periodically moved to inactive (kswapd)
- What Ratio of active:inactive to maintain?
- 1:1 proved not great
- Better ratio should:
- Keep active list big enough to avoid page faults
- keep inactive list to give pages a 2nd chance
- Active/Inactive Balancing:
- A/I ratio determined adaptively using global "refault distance":
- Per-page distance: how much larger should inactive list be to avoid evicting this page?
- On inactive-list eviction, a distance counter tracks #evictions until the page faults again
Page Cache:
- Stores pages of given owner
- The swap cache is a subset
- Unified page cache for many purposes:
- Hold in transit swap pages
- Serve file read/write efficiently
- Reverse file mappings
- Fundamental Operations:
- Lookup: used by PF handler (disk read on miss)
- Add: used when fetching page from disk
- Delete: used by page reclaiming
- N address_space objects, 1 for each owner (?)
Reverse Mapping: Page Desc. -> PTEs:
- Each page descriptor stores pointer to rmap data struct., indexing all VMAs which may map the page
- Given a VMA, we can PT walk via mm->pgd
- File pages:
- Mapping points to address_space data struct
- Uses RB tree to index possibly many VMAs mapping the target page
- (Private) Anon. Pages:
- Mapping points to anon_vma struct
- Unless page is in swap cache
Direct Mapping: PTE to Swap Entry:
- System maintains a number of swap areas, each with a priority and an array of available entries
- Each swap entry maps to a block on the disk at a linear offset
- At swap out, PTE is set w/:
- Area #
- Swap entry offset
- At PF, entry lookup on swap cache or disk
Page Compression:
- Idea: Use CPU for "swapping" rather than a slower I/O sys.
- Page out to a compressed memory cache
- Can replace or complement swapping
- Compression algos include:
- LZO (default)
- LZ4 (better compression)
- Methods include:
- ZRAM: Swap Replacement
- ZSWAP: Swap complement
- ZCACHE: Swap complement
- ZRAM:
- Uses a compressed RAM disk swap device
- Pages are compressed at swap out
- Pages are decompressed at swap in
- No changes in swap management required
- Can work as only swapping method, or next to other devices
- ZRAM typically configured as highest priority swap device
- Other devices are used when highest prio. device is full
- Disadvantages:
- Requires manual swap device config
- When cache is full, system no longer benefits from it
- LRU inversion: Most recently swapped pages go to slow disk
- Stalest pages quickly fill ZRAM cache, leaving more frequently accessed pages to be swapped on slower devices (disk)
- ZSWAP:
- Write-back cache for regular swap devices
- Stores compressed pages in zswap pool
- Uses zbud allocator (default) to store 2 pages in a compressed page
- Swaps out non-compressible pages directly
- Compressed pages swapped to disk in LRU fashion when pool reaches pre set size
- Greatly reduces disk I/O due to swapping
- Disadvantage:
- Assumes a dedicated disk swap device
- ZCACHE:
- Write-back cache similar to zswap
- More ambitious than ZSWAP:
- Compressed cache for swapped pages
- Compressed cache for clean pages
- More complicated, not commonly used
- ...
Page Deduplication:
- Proactively find pages with identical content
- Merge them, and COW for safe sharing
- Why bother if COW already used for fork/exec?
- eg. Virtualization: dedupe anon pages across guests
- eg. Special workloads with lots of data duplication
- Problems:
- Performance:
- Proactive reclaiming may eliminate memory pressure, or just waste cycles
- Possible to cause COW storm
- Security:
- Cache Attacks: enables flush+reload for otherwise unrelated pages in different security domains
- Dedup Est Machina: yields "does page exist?" oracle via COW timing
- Flip Feng Shui: provides massaging prim. to map a target page into a physical page vulnerable to hardware bit flips
Kernel Samepage Merging (KSM):
- In-kernel dedup system for priv anon memory
- Designed for virtualization, but useful in general
- Opt-in Policy: VMA flag allows merging
- Mechanics:
- Kthread (ksmd) scans P mergeable pages every S seconds
- Maintains RB tree of merged/unmerged pages indexed by content
- At every scan, ksmd walks the tree P times, looking for exact match
- For each exact match, ksmd merges the page frames and write-protects owning PTEs
- Merged pages can be COWed, or even swapped
- Stable Tree:
- Contains only already merged pages
- Their content cannot change (or COW PF)
- RB tree is consistent by design
- ksmd can safely walk it for each page looking for exact matches
- Unstable Tree:
- Contains unmerged pages; content can change
- Can be inconsistent, lookups can lead to FNs (?)
- To mitigate needless walks:
- ksmd can rebuild the tree at scan