Notes

Computer Science

Economics

Other

*Most of these are works in progress

KP NOTES


== 02. Booting x86 ==

At Poweron:
	- CPU begins executing BIOS code
	- Detect and initialize devices/memory
	- Find bootable device
	- Read boot sector into memory & start executing it

	- Memory Layout:
		- Device Memory
		- BIOS ROM
		- Bootloader loaded into mem below dev. mem/ROM

At boot loader:
	- CPU in 'Real' mode
	- 20 bit memory addrs (up to 1 meg)
	- Segmentation used ie. '0x12:0x1234'
		- CPU translates as by doing segment * 16 + addr

Transition to 'Protected' mode:
	- Need to setup GDT
		- Defines memory areas, protections, tasks...
		- Memory Segment Descriptors define memory range + protections

Segment Registers:
	- Real Mode: used for addressing
	- Prot. Mode: used for GDT offset
	- GDT segment entry describes physical memory base & length
	- Segmentation not used in most archs, instead they use paging

Protected Mode:
	- Bootloader reads the kernel into memory
	- Jumps to the kernel's entry func

Entering Kernel:
	- Enables Paging
		- Memory allocation and protection via virt. addr. spaces
	- Initialize global kernel vars (BSS)
	- Inits console
	- Inits memory

== 03. Managing Physical Memory ==

UMA vs. NUMA:
	- Uniform Memory Access (UMA):
		- All processors share same latency to memory
		- Single Memory for all cores
		- Doesn't scale well w/ #cores
	- Non-Uniform Memory Access (NUMA):
		- Processors have own local memory with fast access times
		- Processors can access other memories, but more slowly
		- Scales better than UMA

Physical Memory in Linux:
	- Logically divided into pages (page frames)
	- Nodes:
		- NUMA abstraction of N banks
	- Zones:
		- DMA: 		0-16 MB		DMA
		- NORMAL:	16-896 MB	Kernel
		- HIGHMEM:	896MB-END	Kmap
		- Tagged regions for each node
	- Pages:
		- Physical page frame for each zone
		- 4KB units
		- page object structs in a memory_map array

Allocators in Linux:
	- SLAB Allocator: 
		- Allows cache creation, which store objects of same size.
		- Size can be smaller/bigger than a page
		- Can cope with internal fragmentation, ie. entire page used for 10B of data
	- VMALLOC Allocator:
		- Deals w/ Non-physically contiguous memory
	- Page Allocator:
		- Deals /w contiguous areas of phys. pages
	- MemBlock Allocator


Memblock Allocator:
	- Early boot-time allocator
	- Discarded after initializing Buddy allocator (linux)
	- Maintains two arrays, memory (all mem), and reserved (allocated mem)
	- Allocates by finding regions in "mem && !reserved"

Buddy Allocator:
	- "Power of 2 allocator with free coalescing"
	- Blocks arranged in 2^N (order) pages
	- Allocations satisfied by exact N:
		- Split larger 2^(N+1) blocks into 2x2^N blocks
		- The two resulting blocks are "buddies"
	- Deallocations return block to allocator:
		- If the returned block's buddy is free, coalesce into larger block
		- Keep coalescing until it's no longer possible
	- Freelists per order (N) maintain free blocks of that size

Fragmentation:
	- External: Inability to service allocations due to too many small blocks
		- Addressed in buddy with free coalescing
	- Internal: Space wastage due to too many large blocks assigned to small allocations
		- Addressed w/ Slab allocators


Memory Errors:
	- Out of Bounds Access:
		- Accesses outside allocation
	- Uninitialized Reads:
		- Reads of uninitialized data might lead to previous allocation's data being read
	- Use-after-free:
		- Access to data after free
	- Memory Leak...

Anti-Memory Errors:
	- Page Guarding:
		- Detects out of bounds access
		- Each allocation looks like:
			- [Guard Page][Data Page][Guard Page]
		- Marks page as guard, allowing detection
	- Page Canaries:
		- Lazily detects out of bounds access
		- Each alloc looks like:
			- [Canary Page][Data Page][Canary Page]
		- Canary pages contain specific data
		- Can lazily detect out of bounds writes by mismatch w/ canary data

	- Page Remapping:
		- ?

	- Page Poisoning:
		- Lazily detects write-after-free
		- When page is freed, it is written with known value
		- Write after free detected by mismatch
		- If the memory is resused, ie. written to normally, we can't detect it with page poisoning

Linux Memory Protection "Sanitizers":
	- kmemcheck (uinitialized reads):
		- Traps on every read to check for initialization
	- kmemleak (memory leaks):
		- Periodic Garbage collection
		- Reports objects not pointed to by any "likely" pointers
	- kasan (out of bounds & use after free):
		- Combines canaries and poisoning ideas
		- Checks each access to see if target is a live object

JOS Boot Memory Allocator:
	- Just maintains an index to next free block
	- Only used for a few long-lived structures, eg. initial page dir

JOS Page Frame Allocator:
	- Simple freelist based allocator
	- Allocates a single page at a time
	- Maintains free and allocated list




== 05. Page Table Management ==

Address Spaces:
	- Provides isolation and protection
	- Very flexible
	- Comes at the cost of address translation

x86_32 Page Table Entry:
	- Describes a page (4KB)
	- Format:
		Bits:	[31	 - 	 12][11 - 9][8 - 0]
		Values: [Phys. Addr][Avail.][Flags]
	- Flags:
		- Global: Prevents TLB from updating line if cr3 reset (eg. at context switch)
		- Dirty: Indicates page has been written to
		- Accessed: Indicates r/w executed on page
		- Cache Disabled: Indicates not to cache this page
		- Write Through: Indicates Write-Through caching enabled
		- User/Super: Indicates avail to all, else only super
		- Read/Write: Indicates R/W page, else only read
		- Present: Indicates presence of page in physical memory (eg. not swapped)

	- Page Dir Entry vs. Page Table Entry:
		- Indentical in form except:
			- PgDir: Addr is page table addr
			- PgDir: Global Bit ignored
			- PgDir: No Dirty Bit
			- PgTable: No Page-Size Bit
				- In PgDir this indicates huge pages

x86_32 Virtual -> Phyiscal Translation:
	- Read cr3: Addr of top level page table (page directory)
	- Use high 10 bits of VA as offset into table
	- Assuming entry exists, addr stored in entry is addr of page table
	- VA's next high 10 bits (12-21) are offset into this table
	- Entry contains the physical addr of the page, low 12 bits in VA offset inside frame

x86_32 Virtual Memory Misc:
	- If top level page table (page dir) entry doesn't exist:
		- No Virtual mapping exists, or
		- Entire 4MB it maps has been swapped
	- Page allocation:
		- Allocate a new page, and set its phys. addr in page table entry
	- If second level page table (page table) entry doesn't exist:
		- No Virt. Mapping exists, or
		- Page (4KB) has been swapped
	- Freeing a page:
		- Walk page tables, free physical page
		- Destroy (zero) second level page table entry
		- Could also destroy the top level entry IF it maps no other pages
			- Linux doesn't bother with this

	- This is all kinda slow:
		- x86_32 vaddr translation requires 3 mem accesses
	- Linux keeps special caches for page table pages

Translation Lookaside Buffer (TLB):
	- Caches successful page table walks:
		- Lines are vaddr x maps to phys addr y
	- Managed by MMU
	- Always looked at first:
		- On hit, great success
		- On miss, walk page tables
	- Often Fully Associative, Small, Fast
	- When to Flush?
		- When switching address spaces
		- When updating/deleting page-table entries
	- Given more and more memory, the size of TLB becomes a problem

Improvements on/to TLB:
	- Translation Caches:
		- TLB has all vaddr bits
		- Translation Cache has part of the vaddr bits
		- Can then do a partial page walk
		- x86_32: TLB: 32bit tag (index), TCache: 16bit tag
	- Caching Page Table Pages:
		- Can store PTEs in large caches to speed up TLB miss
	- Bigger Pages:
		- The bigger the pages, fewer mem access per page table walk
		- Ex. 4MB huge pages reduce the number of access by 1
			- Eliminates the bottom page table from having to be accessed

Huge Pages:
	- Large chunks of contiguous, aligned physical memory:
	- 4MB standard, can be bigger
	- Performance benefit on TLB miss as described above
	- Drawbacks:
		- Potential for lots of wasted space
		- Complex to manage

Page Tables and Security:
	- Prime targets for attacker include:
		- Kernel base addr (jump targets)
		- Physical Memory Map (Linux)
	- Enter KASLR (Kernel Address Space Randomization):
		- Randomizes different sections in kernel addr space
		- Makes it harder to exploit:
			- Harder to leak pointer
			- Bruteforce lead to kernel crashes
		- Entropy:
			- Generally limited to simplify memory management
			- Linux Kernel: 6 bits
			- Windows Kernel: 13 bits
			- OS X Kernel: 8 bits
		- Can only be randomized at Boot (rand addr to place kern. at)


== 06. User-mode and Interrupts ==

NB. Operating System's Task:
	- "The safe and efficient multiplexing of system resources to competing applications"
	- CPU provides support for safety through privileged instructions
	- OS uses this support to manage user applications

x86 Privilage Seperation:
	- Uses "Ring" terminology
	- Ring 0 is kernel, Ring 3 is userland
	- In geneneral, only 0 and 3 used

What's done in Ring 0?
	- GDT configuration (x86)
	- Mediation of access to resources:
		- CPU via per application quotas
		- Memory via manipulating page tables
		- Devices (control operation and expose interfaces)

x86 Switching Rings:
	- Done via Segmentation hardware
	- Last 2 bits of code segment indicate ring
		- Kernel/User use different segments
	- Isolation enforced via address spaces
	- IRET instruction switches to higher rings
	- INT instruction switches to lower rings

Address Spaces and User Apps (processes):
	- Different page tables create different address spaces
	- Allows OS to isoloate processes from eachother
	- Most modern OSs have a 1-to-1 mapping of addr spaces to processes

Process Management:
	- Each process may hold a number of resources (files, memory, etc.)
	- Privilages of processes may vary

Address Space Layout Randomization (ASLR):
	- Knowing the location of sensitive information (code, data) makes an attacker's job easier.
	- Place this in random locations instead
	- Entropy:
		- Linux mmap provides 28 bits
		- Varies greatly on windows depending on features/data structure

Interrupts Class:
	- An event that "interrupts" the execution flow of a program
	- Can be external:
		- Keyboard taps, network packets
	- Can be internal:
		- Divide by Zero, Page Faults
	- Can be Sync/Async

Interrupts: Interrupt:
	- Hardware Generated:
		- Device wants kernel to do something
		- Signal the CPU via setting a pin
		- Maskable or NonMaskable:
			- Can I (kernel) ignore it?
			- Thermal fault vs. NIC got a packet
	- Software Generated:
		- Software needs some service from kernel
			- Eg. open a file, more memory
		- Done by executing INT (instruction)

Interrupts: Exceptions, Traps, Faults:
	- CPU Generated
	- Result of unexpected behavior in a CPU system
	- Eg. divide by zero, hardware errors, mem access w/o pgtables
	- Sofware faults mostly synchronous
	- Hardware faults mostly asynchronous

Interrupt Routine (what happens?):
	- Safe transfer of execution to kernel mode
	- Some user context saved (like eip)
	- CPU elevates privilage level and switches to the kernel stack
	- Interrupt handling function called
	- After handling, CPU restores the saved context, and drops priv. level

Interrupt Descriptor Table:
	- Interrupts may come from different sources, be of different types
	- CPU needs to know how to handle them
	- At interrupt, CPU looks up the interrupt in the IDT
	- Max 256 entries
	- First 32 entries reserved by Intel for exceptions (page fault etc)
	- Kernel code seg. defined in IDT, but Kernel stack read from TSS

Task Segment Selector (TSS):
	- A GDT entry: the address of the interrupt stack
		- In JOS, interrupt stack == top of the kernel stack
	- Task register points to the GDT entry
	- LTR instruction loads the correct TSS

Returning from Interrupts:
	- The execution's context has been saved on kernel stack
	- IRET used to return execution to pre-interrupt location
		- Pops the right amount of context off stack
		- Restores the stack (if needed)
		- Returns to saved EIP (and drop priv. if needed)

Problems with Interrupts:
	- Livelock: execution may slow, or seem to freeze in the presence of many interrupts (interrupt storm)
	- Solutions:
		- Defer work
		- Enter polling mode under high interrupts (ask for update instead of being told)

Sofware Interrupts (System Calls):
	- Kernel support for servicing user applications
	- Generated by the INT instruction (0x80 in Linux)
	- Sysenter/Sysexit, Syscall/Sysret
	- Dispatch routine (syscall hander) just an interrupt handler
	- Kernel-User interaction per call dictated by calling convention
	- Ex: INT in linux reads arguments from %eax, $ecx, $edx, etc in order
		- %eax contains ret value

Problems w/ INT:
	- Modern processors are deeply pipelined
	- Cache misses can cause pipeline stalls
	- Pipeline stalls are expensive:
		- IDT Entry may not be in the cache:
		- Different permissions constrain instruction reordering

	- Solution: Sysenter
		- Cache the IDT entry for a syscall in a special CPU reg
		- No more cache misses for IDT
		- Intel added dedicated registers
		- Assumes that syscalls are frequent enough to be worth extra regs
		- NB. Ring-3 ESP, EIP not automatically saved by sysenter

	- Sysenter Pros:
		- Faster than int
	- Sysenter Cons:
		- Programmer needs to do more work
		- Not all CPUs support it
		- Passing EIP around can be inefficient

	- Linux impl:
		- Kernel supports both Sysenter and Int
		- Kernel figures out what's available on CPU (cpuid)
		- Kernel sets up a virtual syscall page w/ best syscall choice

	- Virtual Dynamic Shared Object:
		- Originally mapped at fixed user addr
		- Replaces int 0x80 w/ call <addr>
		- Supports fixed point return (?)
		- Can support extra functionality (like gettimeofday)

Syscalls - AMD proposal:
	- Same idea as sysenter, but w/o fixed return point
	- Adopted by everyone
	- VDSO only really needed then for extra funcs (like gettimeofday)



== 07. Paging ==

Page Fault:
	- Special Exception
	- Purpose:
		- MMU needs attention
		- Basis for demand paging
	- Types:
		- Invalid: Access to invalid address
			- Ex. Write to Read only page
		- Major: Access is valid, but page not in memory
			- Ex: Page has been swapped
		- Minor: Access is valid and page is in memory
			- Ex: First access to anonymous memory
	- Kernel Faults:
		- Invalid if kernel address
		- Major: kernel is paged out (eg. Windows)
		- Minor: lazy allocation (eg. vmalloc)
	- User: Everything goes

Page Fault Handling:
	- CR2 register contains faulting addr
	- Get info on faulting process (get_current()->task)
	- Get info on task's address space (" "->mm)
	- Get VMA region info by looking up faulting addr in VMAs
	- Get PTE info by walking task's page tables

Demand Paging:
	- Only allocates a page frame on first access
	- Allocation is now just a VMA
		- mmap permission might cause fame allocation as well
	- On PF, if faulting addr is covered by a VMA:
			- Kernel Assigns page frame to task
			- Kernel create a PTE entry and set it in the tasks PTs
			- Switches back to task execution
				- Task is unaware that anything happened

VMA Operations:
	- mmap:
		- Maps new region in addr space
		- Flags include:
			- READ/WRITE/EXECUTE
			- POPULATE: Immidiately allocate frame
			- ANONYMOUS: Generic, demand paged (if used) region
		- Tries to find an accomodating gap in VMAs, and creates new on in that space if found.
		- Faults new pages in if requested

	- munmap:
		- Unmaps a region in the address space
		- Simply deletes VMA in best case:
			- Region contains no mapped page frames
		- Otherwise, needs to:
			- Remove mapped page frames
			- Remove any PTEs pointing to these pages
			- Flush any TLB entries pointing to encompassed pages

	- mprotect:
		- Changes a region's protection bits
		- Simply updates a VMA's protection bits in best case:
			- Region specified -eq to a single VMA
			- Region contains no mapped pages
		- Otherwise, needs to:
			- Split/Merge VMAs, possibly making a new one to accomdate
			- Update effected PTEs

	- madvise:
		- Gives Kernel adivise on a region:
			- MADV_DONTNEED: not likely to be used again soon, kernel can free resources associated with region.
			- MADV_WILLNEED: likely to be used soon, kernel can cache the page(s) to improve latency of future accesses
		- Used for simple paging policies
		- More complex policies supported w/ eg. Userfaultfd (UFFD):
			- User-level page fault handling
			- Allows arbitrary paging policies in userland:
				- App faults
				- Kernel calls UFFD thread with PF info
				- UFFD handles PF and returns
					- Potentially fancy polices
				- App thread resumes

	- Find and Split:
		- The above methods work on arbitrary memory ranges
			- If Range != VMA, we have to do more work
		- Potential for splitting and merging of VMAs
		- Above calls may therefor cause multiple splits/merges
		- Without merging, #VMAs could balloon
			- VMA lookup will get slower and slower
		- Solution:
			- Contigious VMAs with identical properties are merged

VMAs and Huge Pages:
	- Issues:
		- Requires application awareness
		- Static limit on number available
		- Can't swap or mix with regular pages in VMA
	- Libhugetlbfs:
		- Overrides mmap, munmap, etc. to allocate huge pages

Transparent Huge Pages:
	- Designed to improve on libhugetlbfs
	- Idea: Kernel automatically allocates THPs whenever possible (eg. when the alignment is correct)
	- Applications don't need to be THP aware, however:
		- An aware application can map with correct alignment
		- madvise's MADV_NOHUGEPAGES if only going to use a couple bytes

	- Operations:
		- Map:
			- Allocate compound pages and map them as THPs
			- "Compound Page": group of phys-contiguous pages
			- When to do it:
				- At PF, periodically, never
				- Currently done at PF (may slow PFs)
			- What situations to do it in:
				- Mapped Huge Page within owning VMA (assuming aligned)
				- Huge pmd available (pmd?)
		
		- Collapse:
			- Replace a # of regular pages with THP
			- When to do it:
				- PF, periodically, never
				- Currently done periodically (khugepaged)
				- Scans N pages every K intervals
			- How:
				- Replaces pages with compound page
				- Replace PTEs with pmd

		- Split:
			- Split THP into a # of regular pages
			- When to do it:
				- At VMA split(4k), if DONTNEED(4k) set, etc.
				- Any time THP-unaware kernel code executes
			- Promted issue with jemalloc
				- Result: THP not so transparent...

		- Compact:
			- Memory compaction s.t. compound pages can be allocated
			- Needed in presence of high fragmentation
			- Both Direct (failed alloc) & periodic used in Linux
			- kcompactd:
				- Kernel thread for periodic compaction
				- Direct compaction waits for kcompactd
				- Used by PF handling code and khugepaged

== 08. Multiprocessing ==

Fork:
	- Creates a new child process by duplicating the parent
	- Implemented on top of clone() in Linux
	- Duplicate Task:
		- Copy most info, some exceptions, eg. PID
	- Allocate and initialize new knerel stack
		- Setup thread_info
		- Copy trap frame and update (eg. $eax)
	- Copy Memory:
		- Initialize empty addr space (new page dir)
	- Duplicate Addr space:
		- Copy VMA information
		- Copy page tables
		- Fixup PTEs:
			- Eg. COW page frames

Exec:
	- Executes the program pointed to by filename
	- Implementation:
		- Check input & permissions
		- Load binary headers into memory
		- Determine binary format
		- Flush old resources:
			- Reinit task struct
			- Flush VMAs, PTs, Page Frames
		- Load the binary:
			- Parse headers & Sections
			- Create VMAs for Data, Text, Stack etc.`
			- Update %esp, %eip in trap frame:
				- %esp = top of user stack
				- %eip = program entry point (for statically linked)
				- %eip = dynamic linker's entry point otherwise
		- Page tables initially empty:
			- Even binary file pages (eg. text) demand paging
			- PF handler maps them from cache using COW semantics for safe sharing (?)

Copy on Write (COW):
	- perm(VMA) = R/W, perm(PTE) = R
	- At PF, kernel sees discrepency in permissions
		- Duplicates page frame, remaps new page into faulting task's page tables w/ R/W perms
	- Application:
		- MAP_PRIVATE: file pages
			- Deduplicates binary pages for unrelated processes
			- Many (eg. Text pages) never COWed
		- MAP_PRIVATE: anon forked pages:
			- Deduplicates pages with process hierarchy
			- Many (eg. fork, exec) never COWed
		- MAP_PRIVATE: anon zero pages
			- Deduplicates zero pages for unrelated processes
			- At first PF, map single read-only zero page
			- COW at first write PF
			- Only used in specific cases (eg. THP) for scalability

Inter Process Communication (IPC):
	- Allows processes to communicate/synchronize with eachother
	- Many different methods/interfaces
	- System V vs. POSIX:
		- Similar:
			- Shared Memory Region
			- Semaphores: sync w/ post/wait primitives
			- Message queues: interprocess message exchange
	- Shared Memory (Sys. V):
		- Get/create shmem segment by key/perm
		- Attach/detach by addr/key

	- Semaphones (Sys. V):
		- ...

	- Message Queues:
		- Get/create message queue by key, perm
		- Send/recv. message on message queue by queue id
		- Blocks if queue is full (send) or empty (recv.)
		- Queue is size limited

	- POSIX IPC:
		- Uses names, not keys
		- Uses reference counting
		- Provides thread safety
		- Shared Memory is file-oriented

Scheduling:
	- How do we schedule multiple processes on a given CPU?
	- Easy way (JOS):
		- Hardware raises timer interrupts
		- Interrupt handler invokes simple round-robin scheduler
		- It's Preemptive in that tasks get interrupted w/o their permission
		- No fairness, no priorities, etc

	- Time Management:
		- Hardware offers clocks/timers
		- Exposes counters, and timers or varying precesions
		- These timers can issue interrupts, and used for scheduling
		- Clock's include:
			- tsc: Timestamp counter
			- hpet: High-precesion event timer
			- acpi_pm: ACPI power management timer
		- Non-clock types:
			- RTC: Real time clock:
				- Persistent (battery powered)
				- Low-precesion
				- Good for maintaining date
		- Misc:
			- Jiffies: ticks since startup
			- xtime: current date/time

	- Tasks & Scheduling:
		- Tasks's State: Running, Runnable, Sleeping, etc.
		- Quantam/time slice:
			- Max jiffies a task can run on CPU
			- Set at creation
			- Decremented per tick
			- Good enough to ensure fairness for CPU intensive tasks
		- Priority:
			- Set at creation
			- Could be modified at some point
		- Scheduling policy:
			- Normal, Batch, Idle -> Completely fair scheduler (?)
			- FIFO, RR -> Real-time scheduler (?)

	- Tickless Kernel:
		- Idea:
			- Tasks doing little will waste most of their slice
			- Can improve system if we schedule them off "when done"
		- Improve Responsiveness at cost of overhead
		- Create a new timer for "end of quantam" event
		- Reprogram hardware to tick at next timer to expire

	- Linux O(1) Scheduler:
		- Preemptive round-robin scheduler
		- Once default, now its closer to RT
		- Maintains N run queues (N priorities)
		- Strategy:
			- Find highest pri. queue w/ runnable task
			- Dequeue first task on that queue
			- Calc. its time slice (based on its priority), and run
			- At time up, enqueue it (at its priority level) and repeat
		- Improvements:
			- Priorities are adjusted based on 'sleep time'
				- Sleep time: the amount of time a task does nothing (?)
			- Priority bonuses for I/O vs. CPU heavy processes
	
	- Linux CFS (Completely Fair Scheduling) Scheduler:
		- O(1) used lots of hard-to-maintain hacks to fairness
		- Idea (w/ N tasks):
			- Record how much CPU time each task has been given
			- Schedule task with biggest delta tot_CPU_time/N
			- Virtualize time (virtual runtime) to deal with priorities
			- Increase virtual runtime faster for lower-priority queues
		- No heuristics to distinguish tasks
		- No run queues, uses RB tree (indexed by virt. runtime)



== 09. Multicore ==

Kernel on Multicore needs to:
	- Start/Stop cores
		- x86 APIC
	- Deal with consistency/scaling issues
		- Kernel Locking, partitioning, replication
	- Schedule work
		- User/Kernel threads

Starting/Stopping Cores:
	- BIOS hands execution to one core
		- Called Bootstrapping core (BSP) in Intel lingo
	- BSP then starts other cores (Application Cores in Intel lingo)
	- BSP starts APs by sending a special interrupt

Advanced Programmable Interrupt Controller:
	- IRQ: Interrupt Request
	- Sits inbetween IO devices (ethernet, RTC, Keyboard, etc.) and CPU
	- Enabling:
		- Find the MP configuration Table
		- Set a bit in the spurious-interrupt vector register
	- Advanced Features include:
		- Thermal Management, Internal Error Reporting
	- Interprocessor interrupts (IPI):
		- Send interrupts to another core
		- Forward an interrupt
		- Start another core
	- Sending IPIs:
		- Need to know the destination core APIC ID
			- Read the config table from memory
		- Need to know the APIC register address
			- Read the config table from memory
		- Need to initiate the IPI
	- Starting APs:
		- Send an IPI INIT
			- Resets a given core specified with its APIC ID
			- Reset = reset of core's subsystems
		- Send a start-up IPI (SIPI):
			- Includes an entry point
			- Stared core begin executing at this entry point

Multicore Scaling:
	- Switching draws lots of power with many cores
	- Can't keep all the transistors on all the time
	- At 8nm 50% of chip's transistors need to be off
	- Solutions:
		- Lower frequency
		- Specialized Cores
		- Turn cores on/off quickly
	- Core hotplugging:
		- Slow on commodity OSs
		- Need to maintain shared state

Multicore Consistency/Scalability:
	- Cores can execute in the kernel concurrently
		- Need to keep kernel state consistent
		- eg. Frame allocator, IO devices (like console), scheduler, etc.
	- Protecting state:
		- Lock the state (costs performance)
		- Lock the the state when needed (backward incompatible)
		- Partition state (underutilization of resources)
		- Replicate it (stale state)

Locks:
	- Spinlocks:
		- Pros: easy, fast
		- Cons: wasted cycles, scalability issues
	- Mutexes:
		- Pros: easy, wasteless (core can do something while it waits)
		- Cons: high latency
	- Read-copy-update aka RCU
		- Pros: fast, scalable in most situation
		- Cons: waste memory, complex

	- Big Kernel Lock (BKL):
		- Single spin lock to protect entire kernel
		- Simple, but only one core can work in kernel at a time
			- eg. system can't write to disk while webcam is running

	- Fine-grained Locking:
		- Partition the BKL
			- Seperate locks for subsystems
			- Fine-graining can continue down to structure level
		- Complex, but allows concurrent kernel execution
		- Requires same lock ordering everywhere

Multikernel:
	- Replicate the state as a principle
	- Combats locks not scaling well
	- Produces a scalable kernel through replication

State Replication in Linux:
	- Partial
	- Create a copy of hot data structs. per CPU
	- Each core can access its own structures lock free
	- Caches per core to alleviate load on certain subsystems


Scheduling on Multicore:
	- Requires multiple threads of execution
	- User mode apps can spawn threads
	- Kernel schedules these threads on idle cores
	- Parallelism for kernel tasks:
		- Post interrupt work
		- Background work:
			- Filling per-CPU frame caches
			- Writing dirty data to disk

Concurrency in Linux:
	- Interrupts do as little as possible
	- Deferring work (bottom half)
		- SoftIRQs
			- High Priority work (w/ interrupts enabled)
			- Per CPU bitmap of softIRQs needing work
			- Kernel checks bitmap after interrupt/periodically
		- Tasklets
			- Dynamically create deferred work
			- Run as part of softIRQs
			- Serialized (no locking)
		- Workqueues
			- Unlike softIRQs/tasklets, these run in process context
				- Backed by kthreads
				- Graceful resource sharing and they can sleep
				- Harder to program compared to tasklets..
		- KThreads
			- Unit of parallelism in Linux kernel
			- Back the bulk of softIRQs and workqueues
			- Use the waitqueue mechanism to get notified of work to do


== 10. Page Reclaiming ==

Page Reclaiming techniques:
	- Cache Shrinking: shrink or drop OS maintained caches
	- Out-of-Memory (OOM) killing: kill a process to reclaim its pages
	- Swapping: swap likely unused pages to disk
	- Compression: compress likely unsed pages in memory
	- Deduplication: Merge and COW pages with idential content

Direct v. Indirect reclamation:
	- Direct:
		- Triggered when the kernel fails to allocate memory
		- Exceptions apply (THP)
	- Indirect (eg. kswapd):
		- Triggered when memory pressure is approaching
		- Free pages to stay below a available memory threshold

Cache Shrinking:
	- Idea: eliminate "harmless" pages first
	- Page Cache: reclaim pages not referenced by any process
	- Slab Caches: reclaim pages from slab with no allocated objects
	- Dentry cache: Reclaim dentries not referenced by any process
	- Inode cache: Reclaim inode objects with no controlling dentry

OOM Killing:
	- Crude, but:
		- Very effective way to recover pages
		- Acceptable for unimportant processes
		- Last resort
	- Direct Reclaiming:
		- Linux OOM Killer
		- Assigns badness scores and kills worst task
	- Periodic Reclaiming:
		 - Android's low-memory killer
		 - Similar to OOM killer, targets background apps

Page Swapping:
	- Swap space on disk creates illusion of huge virt. memory (backing store)
	- Traditional solution for page rec.
	- Losing Prominence as ram gets bigger/cheaper, IO is slow, etc
	- Serves some very important features however:
		- Hibernation on consumer platforms
		- Database backing on some server platforms

	- Mechanics:
		- Swap-out:
			- At page reclaim, store content in a swap entry on disk
			- Frame must be unmapped in all owning PTEs
		- Swap-in:
			- At PF, read swap entry into a new frame and map it in
			- Need to locate the swap entry on disk
		- Requirements:
			- Reverse mapping: page -> PTEs (swap out)
			- Direct mapping: PTE -> swap entry (swap in)
			- Handle races with concurrent swap-ins/outs
				- Scratch area in the page cache

Page Replacement:
	- How to choose pages to swap?
		- Need a page replacement algo
	- Ideal:
		- Replace page that will be referenced as far in the future as possible
		- Can be aproximated with LRU-like (Least Recently Used) strategies in practice
	- FIFO:
		- Use a queue of faulting pages and replace the head
		- Costly on frequently accessed pages (?)
	- Second Chance/CLOCK:
		- Improved FIFO to preserve important pages
		- For each visited page:
			- if R=0, replace it
			- if R=1, set R=0, and move to tail
			- R set 1 if accessed
		- CLOCK uses a circular list and moves head

Linux' Page Reclaiming:
	- Active/Inactive LRUs
		- Split LRU list into active and inactive LRUs
		- Both managed CLOCK style
		- Active pages considered in use (working set)
		- Inactive pages unmapped (ready for reclaim)
		- Many per zone LRUs in modern Linux

	- Active/Inactive Updates:
		- New anon pages added to active
		- New file pages added to inactive
		- Inactive pages -> active on PF
		- Active pages periodically moved to inactive (kswapd)
			- What Ratio of active:inactive to maintain?
			- 1:1 proved not great
		- Better ratio should:
			- Keep active list big enough to avoid page faults
			- keep inactive list to give pages a 2nd chance

	- Active/Inactive Balancing:
		- A/I ratio determined adaptively using global "refault distance":
			- Per-page distance: how much larger should inactive list be to avoid evicting this page?
			- On Inactive list evection, a distance counter tracks #evictions until page faults again

Page Cache:
	- Stores pages of given owner
	- The swap cache is a subset
	- Unified page cache for many purposes:
		- Hold in transit swap pages
		- Serve file read/write efficiently
		- Reverse file mappings
	- Fundamental Operations:
		- Lookup: used by PF handler (disk read on miss)
		- Add: used when fetching page from disk
		- Delete: used by page reclaiming
	- N address_space objects, 1 for each owner (?)

Reverse Mapping: Page Desc. -> PTEs:
	- Each page descriptor stores pointer to rmap data struct., indexing all VMAs which may map the page
	- Given a VMA, we can PT walk via mm->pgd
	- File pages:
		- Mapping points to address_space data struct
		- Uses RB tree to index possibly many VMAs mapping the target page
	- (Private) Anon. Pages:
		- Mapping points to anon_vma struct
		- Unless page is in swap cache

Direct Mapping: PTE to Swap Entry:
	- System maintains a number of swap areas, each with a priority and an array of available entries
	- Each swap entry maps to a block on the disk at a linear offset
	- At swap out, PTE is set w/:
		- Area #
		- Swap entry offset
	- At PF, entry lookup on swap cache of disk


Page Compression:
	- Idea: Use CPU for "swapping" rather than a slower I/O sys.
	- Page out to a compressed memory cache
	- Can replace or complement swapping
	- Compression algos include:
		- LZO (default)
		- LZ4 (better compression)
	- Methods include:
		- ZRAM: Swap Replacement
		- ZSWAP: Swap complement
		- ZCACHE: Swap complement

	- ZRAM:
		- Uses a compressed RAM disk swap device
		- Pages are compressed at swap out
		- Pages are decompressed at swap in
		- No changes in swap management required
		- Can work as only swapping method, or next to other devices
		- ZRAM typically configured as highest priority swap device
		- Other devices are used when highest prio. device is full
		- Disadvantages:
			- Requires manual swap device config
			- When cache is full, system no longer benefits from it
			- LRU inversion: Most recently swapped pages go to slow disk
				- Stalest pages quickly fill ZRAM cache, leaving more frequently accessed pages to be swapped on slower devices (disk)

	- ZSWAP:
		- Write-back cache for regular swap devices
		- Stores compressed pages in zswap pool
		- Uses zbud allocator (default) to store 2 pages in a compressed page
		- Swap out non compressible page directly
		- Compressed pages swapped to disk in LRU fashion when pool reaches pre set size
		- Greatly reduces disk I/O due to swapping
		- Disadvantage:
			- Assume dedicated disk swap device

	- ZCACHE:
		- Write-back cache similar to zswap
		- More ambitious than ZSWAP:
			- Compressed cache for swapped pages
			- Compressed cache for clean pages
		- More complicated, not commonly used
		- ...


Page Deduplication:
	- Proactively find pages with identical content
	- Merge them, and COW for safe sharing
	- Why bother if COW already used for fork/exec?
		- eg. Virtualization: dedupe anon pages across guests
		- eg. Special workloads with lots of data duplication

	- Problems:
		- Performance:
			- Proactive reclaiming may eleiminate memory pressure, or just waste cycles
			- Possible to cause COW storm
		- Security:
			- Cache Attacks: enables flush+reload for otherwise unrelated pages in different security domains
			- Dedup Est Machina: yields "does page exist?" oracle via COW timing
			- Flip Feng Shui: provides massaging prim. to map a target page into a physical page vulnerable to hardware bit flips

Kernel Samepage Merging (KSM):
	- In-kernel dedup system for priv anon memory
	- Designed for virtualization, but useful in general
	- Opt-in Policy: VMA flag allows merging
	- Mechanics:
		- Kthread (ksmd) scans P mergable pages every S seconds
		- Maintains RB tree of merged/unmerged pages indexed by content
		- At every scan, ksmd walks the tree P times, looking for exact match
		- For each exact match, ksmd merges the page frames and write-protects owning PTEs
		- Merged pages can be COWed, or even swapped
	- Stable Tree:
		- Contains only already merged pages
		- Their content cannot change (or COW PF)
		- RB tree is consistent by design
		- ksmd can safely walk it for each page looking for exact matches
	- Unstable Tree:
		- Contains unmerged page, content changes
		- Can be inconsistent, lookups can lead to FNs (?)
		- To mitigate needless walks:
			- ksmd can rebuild the tree at scan