From d4f6774c172ac1e7c193fc4e89230c873d179c2b Mon Sep 17 00:00:00 2001
From: Raymaekers Luca
Date: Wed, 12 Nov 2025 18:52:38 +0100
Subject: checkpoint

---
 computerenhance.md | 161 +++++++++++++++++++++++++----------------------------
 1 file changed, 77 insertions(+), 84 deletions(-)

diff --git a/computerenhance.md b/computerenhance.md
index 8eaa819..ed7c73a 100644
--- a/computerenhance.md
+++ b/computerenhance.md
@@ -2,11 +2,11 @@
# 2. [Performance-Aware Programming Series Begins February 1st](https://www.computerenhance.com/p/performance-aware-programming-series)
# 3. [Welcome to the Performance-Aware Programming Series!](https://www.computerenhance.com/p/welcome-to-the-performance-aware)
-# Optimization
+## Optimization
optimization = maximizing the performance of software on hardware.
- used to be complicated (years ago) = traditional optimization
-# Problem
+## Problem
People think performance is:
- not worth it
- too hard
@@ -14,19 +14,19 @@ People think performance is:
True for traditional optimization.
- this belief actually results in extremely slow programs.
-# What
+## What
Just learning how performance works is enough.
- What decisions affect performance?
-# Affecting performance
+## Affecting performance
1. number of instructions: reduce how many instructions are executed
2. speed of instructions: make the CPU process instructions faster
-# Decline
+## Decline
- CPUs became more complicated.
- programmers became unaware of the instructions that higher-level languages produce
-# Solution
+## Solution
- keep the resulting instructions in mind, not the code
- learn what the maximum speed of something should be

@@ -58,7 +58,7 @@ Python had 180x the instructions and was 130x slower.
# 5. [Instructions Per Clock](https://www.computerenhance.com/p/instructions-per-clock)
*speed of instructions*
-# IPC/ILP
+## IPC/ILP
- **Instructions Per Clock**
- instructions per clock cycle
- specific to one instruction
@@ -101,7 +101,7 @@ CPUs are designed for heavy computation, so boosting ILP in a loop that does not
do a lot of computation brings fewer benefits.

# 6. [Monday Q&A (2023-02-05)](https://www.computerenhance.com/p/monday-q-and-a-2023-02-05)
-# JIT
+## JIT
- compiles code "upfront", right before it runs
- has extra information
- knows the context in which the code is compiled
@@ -114,7 +114,7 @@ Waste:
Java may have waste as well, but its bytecode was designed as "something that could be executed faster".

-# No test harness yet.
+## No test harness yet.
Read Time Stamp Counter (RDTSC):
- lets you read a chip-wide counter
- tricky with turbo boost (the counter does not tick with the core clock)
@@ -125,7 +125,7 @@ We have to move in stages so we can focus on each thing step by step.
The reality is that there already is a large, well-established user base.
- who cares! We will make it performant.

-# Micro-OPs go to execution ports
+## Micro-OPs go to execution ports
- each port can do X operations
- different instructions can be contenders for the same port
- looking at port usage shows what unrolling a loop can gain (see the sketch below)
- sometimes there is a limit on how many instructions can be issued
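
To make the port discussion concrete, here is a minimal sketch of breaking the serial dependency chain when summing an array. This is my illustration, not code from the course, and the function names are made up. With one accumulator, every add waits on the previous one; with four independent accumulators, the CPU can schedule adds onto several ports in the same cycle:

```c
#include <stddef.h>
#include <stdint.h>

// One dependency chain: each add needs the previous Sum, so the loop
// cannot run faster than one add per cycle, no matter how many ports
// could do adds.
uint32_t SumSingle(uint32_t *Input, size_t Count)
{
    uint32_t Sum = 0;
    for (size_t I = 0; I < Count; ++I)
    {
        Sum += Input[I];
    }
    return Sum;
}

// Four independent dependency chains: these adds do not depend on each
// other, so the out-of-order core can issue several per cycle.
uint32_t SumQuad(uint32_t *Input, size_t Count)
{
    uint32_t A = 0, B = 0, C = 0, D = 0;
    size_t I = 0;
    for (; I + 4 <= Count; I += 4)
    {
        A += Input[I + 0];
        B += Input[I + 1];
        C += Input[I + 2];
        D += Input[I + 3];
    }
    for (; I < Count; ++I) // scalar tail for leftover elements
    {
        A += Input[I];
    }
    return (A + B) + (C + D); // tree-shaped final combine
}
```

Unsigned addition is associative modulo 2^32, which is why this reordering gives bit-identical results to the serial loop.
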
-# Unrolling
+## Unrolling
- most compilers can unroll the loop for you
- clang can screw this up

-# Why *minimum* adds per cycle
+## Why *minimum* adds per cycle
- thinking about "opportunities"
-- mean, median will not find out the average
-- + fastest you can show what you are pointing to
- - mean + fastest
-- analyzing the behaviour of the hardware
-- *stars need to align*
-- when using fastest you converge to the analysis
-- *educational*
-- mean and medium or for "mixtures"
- - used together for fastest when optimizing
-
+- mean & median will not converge to the actual cycle count the CPU can achieve
+- *fastest* will get to the actual adds/cycle the CPU can do
+- median & mean give you typical behavior

Assembly -> Micro-OPs -> CPU

```
input[0]
input[1]
input[2]
input[3]
input += 4
```

-# Tree-based addition:
+## Tree-based addition:
- common technique to break up a dependency chain

# 7. [Single Instruction, Multiple Data](https://www.computerenhance.com/p/single-instruction-multiple-data)
*Amount of instructions*

-# SIMD
+## SIMD
- *Single Instruction, Multiple Data*
- one instruction can act on multiple pieces of data
- SSE on x64
- can be used together with IPC

-# PADDD
+## PADDD
- Packed ADD D-word (32-bit elements)
- "wide" instruction
- can use multiple accumulators
  - IPC on top of SIMD
  - e.g. extracting dependency chains
- ![vector example](img/vector_paddd.png)

-# Subsets
+## Subsets
|--------+-----------+-----------|
| subset | bit width | supported |
|--------+-----------+-----------|

- using smaller element sizes packs more values into the same bit width
- typically you cannot get the full 2x/4x/... improvement

-# Difficulty
+## Difficulty
- SIMD does not care about how data is organized
- easy with adds

@@ -217,49 +210,49 @@ Because there are many dependencies on loads it is very important.
- *Cache*!
- way faster than main memory (DIMMs)
-# Cache
+## Cache
- Register file ::
  - produces values really quickly and feeds them to the execution units
  - maximum speed
  - a few hundred values at most

# 9. [Monday Q&A #2 (2023-02-12)](https://www.computerenhance.com/p/monday-q-and-a-2-2023-02-12)
-# Why would the register renamer not solve the dependencies?
+## Why would the register renamer not solve the dependencies?
- because there is a "literal" dependency
- the *register renamer* fixes "fake" dependencies

-# Python over cpp
+## Python over cpp
- C++ is quite bad, but allows control over the output
- Python is good for sketching and for its libraries

-# Hardware jungle
+## Hardware jungle
- coding towards the minimum specification
  - generally true
- design hotspots for platforms (e.g. Xbox, PS5, ...)
  - vectorizable, enough work in the loop for IPC
- focus on things that work on all CPUs

-# More complicated loops
+## More complicated loops
- for now it's demonstrations
- everything can be optimized (:

-# How can you tell if you are wasteful?
+## How can you tell if you are wasteful?
- profiling
- "how much time is spent in this loop?" (see the timing sketch below)
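
Connecting the RDTSC note above with the "why *minimum*" discussion: a minimal timing-harness sketch, my illustration rather than the course's harness. It assumes GCC or Clang on x86, where `__rdtsc()` comes from `x86intrin.h` (MSVC has it in `intrin.h` instead), and it reports the fastest of many runs:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h> // __rdtsc() on GCC/Clang

int main(void)
{
    enum { Count = 1 << 20 };
    static uint32_t Input[Count];
    volatile uint32_t Sink; // forces the sum to actually be computed

    uint64_t Best = UINT64_MAX;
    for (int Run = 0; Run < 100; ++Run)
    {
        Input[0] = (uint32_t)Run; // keeps the compiler from hoisting the loop

        uint64_t Start = __rdtsc();
        uint32_t Sum = 0;
        for (size_t I = 0; I < Count; ++I)
        {
            Sum += Input[I];
        }
        uint64_t Elapsed = __rdtsc() - Start;

        Sink = Sum;
        if (Elapsed < Best) Best = Elapsed; // keep the *fastest* run
    }

    printf("best: %llu ticks, %.2f adds/tick\n",
           (unsigned long long)Best, (double)Count / (double)Best);
    return 0;
}
```

One caveat, per the turbo-boost note above: on modern parts the TSC ticks at a fixed reference rate, not the current core clock, so ticks only correspond to core cycles when the frequency is stable.
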
-# Lanes and bottlenecks during design
+## Lanes and bottlenecks during design
- how many of "those" can I do per second?
- whichever is the limiting factor = the bottleneck

-# Asymptotic performance
+## Asymptotic performance
- also important

-# Power usage reduction
+## Power usage reduction
- in general achieved through reduced instruction counts
  - the same thing the *majority* of the time
- reducing waste

-# Signed and unsigned integers
+## Signed and unsigned integers
- are the same because of *two's complement*
- except for:
  - mul/div/gt
  - when the size changes
- the signedness tells the compiler which named instruction to use (e.g. MUL vs IMUL)
- signed/unsigned types are not strictly needed; the distinction could be expressed with different operators instead

-# Can compilers SIMD?
+## Can compilers SIMD?
- gcc and clang are aggressive at vectorization
- generally better than nothing

-# SIMD: AMD vs Intel
+## SIMD: AMD vs Intel
- no. (in the long term)

-# Are unused registers ignored?
+## Are unused registers ignored?
- modern chips (CPU/GPU) have two types of registers:
  - Scalar
  - slot

- tip: *SIMD if you can*

-# CPU vs GPU
+## CPU vs GPU
- GPUs offer the same kind of parallelism but with different trade-offs
  - 1024-bit vectors / wide execution units
- CPU

- switching can be difficult (talking between both)
  - unless it is an APU

-# Non-deterministic architectures
+## Non-deterministic architectures
- you cannot depend on timings
- potentially depends on the room's temperature
- things run as fast as they can without melting

-# Arm
+## Arm
- everything transfers
- instruction names change

-# SIMD without SIMD
+## SIMD without SIMD
- leave one bit for overflow
- real SIMD registers handle overflow and carrying per lane

-# Slowdown when using 256/512-bit registers
+## Slowdown when using 256/512-bit registers
- most machines downclock when using AVX

-# Hardware Jungle: SIMD edition
+## Hardware Jungle: SIMD edition
- `cpuid` tells which instruction sets the CPU supports
- set function pointers accordingly
- SHIM?

-# Micro-OP debugging
+## Micro-OP debugging
- when debugging, assembly instructions are already a layer on top of micro-ops
- you cannot debug at the micro-op level

-# Out-of-order CPUs
+## Out-of-order CPUs
- only reorder if no one can tell the order was violated
- inside a window
  - limited:
    - out of space
    - the instruction stream runs out
  - rolling window
- waiting for instructions to be done

-# SIMD dataset requirements
+## SIMD dataset requirements
- "short" datasets
- "tails"
- you can pad your inputs

- masks
- Vector: VLen (Vector Length) says how many elements
- scalar loop in the worst case

-# Cost of SIMD
+## Cost of SIMD
- 128-bit can always be used
- the clock penalty goes away over time
- more latency

-# Latency vs Pipelining
+## Latency vs Pipelining
- latency can be beaten by pipelining
- if the instructions are independent, then the latency does not matter

-# Instruction limits
+## Instruction limits
- there is a limit on the number of micro-ops per cycle
- 5 micro-ops cannot all be adds
- a load ties up an execution port
- registers are next to the lanes

-# Cache control
+## Cache control
- responsive, guesses
- "hints"
- prefetch instruction :: tries to get the memory into the cache ahead of time
- streaming :: write around all the caches
  - non-temporal store
  - "do not pollute the cache with this"
- only matters when you have data you *do* want to cache (see the sketch below)
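
A minimal sketch of both hints, assuming x86 SSE2 intrinsics as exposed by GCC, Clang, and MSVC. The function name and the prefetch distance of 8 blocks are made up for illustration; this is not code from the course:

```c
#include <stddef.h>
#include <emmintrin.h> // SSE2: _mm_load_si128, _mm_stream_si128
#include <xmmintrin.h> // _mm_prefetch, _mm_sfence

// Copy Count 16-byte blocks. Prefetch pulls future source blocks toward
// the cache to hide load latency; the streaming store writes the
// destination around the caches so it does not evict data we want kept.
// Assumes Src and Dst are 16-byte aligned and do not overlap.
void StreamCopy(__m128i *Dst, const __m128i *Src, size_t Count)
{
    for (size_t I = 0; I < Count; ++I)
    {
        // Hint only: never faults, even when it runs past the array end.
        _mm_prefetch((const char *)(Src + I + 8), _MM_HINT_T0);

        __m128i Value = _mm_load_si128(Src + I);
        _mm_stream_si128(Dst + I, Value); // non-temporal ("streaming") store
    }
    _mm_sfence(); // order the streaming stores before later normal stores
}
```

Prefetching only hides latency; it cannot create bandwidth. The streaming store pays off exactly in the case the notes describe: when the destination is data you do *not* want cached.
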
-# Data going around the cache
+## Data going around the cache
- the bandwidth is way bigger than the amount that needs to be passed through
- the bandwidth becomes narrower the further out you go
  - ![cache wideness](img/cache_wideness.png)
- the place where the data lives decides the bandwidth

-# Prefetches
+## Prefetches
- hardware and software
- hardware ::
  - looks at the pattern of pages
  - guesses the next page
  - eliminates latency
  - throughput stays the same (cache bandwidth)

-# Other programs
+## Other programs
- the processor does not care about *what* it caches
- you lose depending on the core
- programs waking up take up cache
- the speed decrease is not ^2, but in bytes per cycle

-# Cache runtime
+## Cache runtime
- the slower the routine, the less the cache matters

-# Cache lines
+## Cache lines
- every 64 bytes
- penalties for not being memory aligned :: (not very high)
  - 4096B (page boundaries)

- best to use all of each cache line (all 64 bytes)
  - via data structures

-# Checking the cache
+## Checking the cache
- checking and getting happen in one instruction

-# Cache fights
+## Cache fights
- in multithreaded code, beware of shared cache lines (see the sketch at the end of this chapter)
- pulling data into the cache can evict the current data

-# L1 cache supremacy
+## L1 cache supremacy
- constrained by distance

-# Instructions in cache
+## Instructions in cache
- Front end ::
  - ICache
- Back end ::
  - "uops" cache
  - keeps instructions that are going to be used soon
- unpredicted instructions can slow down the program
- in L2/L3, cached instructions take up space

-# Performance for different sizes
+## Performance for different sizes
- smaller sizes are more likely to be cached
- if the cache is *primed*, then yes

-# Cache behaviour
+## Cache behaviour
- branch predictor :: which way
  - sophisticated
- hardware prefetcher :: which memory is going to be touched
  - patterns
  - not smart
- "warmed up"

-# Persistent cache
+## Persistent cache
- the OS wipes it out with a new memory map

-# Ask for cache
+## Ask for cache
- evict train ::
  - evict, ask L2; evict, ask L3; evict, ask memory; fill evicts
- DMA (Direct Memory Access)

-# Inclusive vs Exclusive Caches
+## Inclusive vs Exclusive Caches
- exclusive cache ::
  - data is not in L2
  - only when evicted from L1
  - data moves between caches
- L3 victim cache :: keeps the L1 evictions

# 10. [Multithreading](https://www.computerenhance.com/p/multithreading)
*Increasing speed of instructions*

-# Multithreading
+## Multithreading
- Core :: different computers
  - physical
- Threads :: interface to access cores through the OS
  - OS

-# Speeding up
+## Speeding up
- not just 2x/4x: actually account for cache in the speedup
- more SIMD registers, instruction cache, ...
- shared caches add up
- memory bandwidth can be the bottleneck
- sometimes it does not add up

-# Forcing out of memory
+## Forcing out of memory
- bandwidth does not increase a lot when using main memory
  - depending on the chip
- L3 cache and main memory are shared (no big speedups)

# 11. [Python Revisited](https://www.computerenhance.com/p/python-revisited)
Assembly is what determines the speed.

-# Python
+## Python
- doing every sum in Python is slow
- numpy is faster when you have supplied the array with a type
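
Before the next Q&A, a sketch that ties the "Cache fights" and multithreading sections together. It is my illustration with made-up names, assuming POSIX threads. Two counters that share one 64-byte cache line keep stealing the line from each other's core; giving each counter its own line removes the fight:

```c
// Build with: cc -O2 -pthread false_sharing.c
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

struct PaddedCounter
{
    volatile uint64_t Value;
    char Pad[64 - sizeof(uint64_t)]; // one counter per 64-byte cache line
};

static struct PaddedCounter Counters[2];

static void *Worker(void *Arg)
{
    struct PaddedCounter *Counter = (struct PaddedCounter *)Arg;
    for (uint64_t I = 0; I < 100000000ull; ++I)
    {
        Counter->Value += 1; // each thread only touches its own counter
    }
    return 0;
}

int main(void)
{
    pthread_t Threads[2];
    for (int I = 0; I < 2; ++I)
    {
        pthread_create(&Threads[I], 0, Worker, &Counters[I]);
    }
    for (int I = 0; I < 2; ++I)
    {
        pthread_join(Threads[I], 0);
    }
    printf("%llu %llu\n",
           (unsigned long long)Counters[0].Value,
           (unsigned long long)Counters[1].Value);
    return 0;
}
```

Removing `Pad` puts both counters in one cache line and typically makes this run several times slower, even though the threads never logically share any data.
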
# 12. [Monday Q&A #3 (2023-02-20)](https://www.computerenhance.com/p/monday-q-and-a-3-2023-02-20)
-# Hyperthreading & Branch prediction
+## Hyperthreading & Branch prediction
- hyperthreads ::
  - ![hyperthreading](./img/hyperthreading.png)
  - pulling instructions from more than one instruction stream
  - saves die space
  - shares the execution units
  - two instruction streams to fill the units
  - depends on the workload
  - the back end makes use of front-end cycles that would otherwise be wasted
- *branch predictor* :: "the most important thing on the processor"
  - guesses which instructions are next
  - flushes if mispredicted
- IPC: more execution ports filled

-# Multithreaded
+## Multithreaded
- code so that threads do not talk to each other
- communication is a mistake
- sync the code

-# Max multiplier for multithreading
+## Max multiplier for multithreading
- fetching memory is slower than computation
- look at the all-core bandwidth
  - total bandwidth to all cores
  - divided by cores = max memory per cycle

-# Logical processors vs Cores
+## Logical processors vs Cores
- Cores = computers
- Logical processors
  - OS / threads / instruction streams

-# thread count > L1 cache
+## thread count > L1 cache
- oversubscription ::
  - when the program asks for more threads than are available
  - lots of eviction
- scheduler optimization
  - pick the same processor
  - cache aware
- the OS tries to run a thread as long as possible

-# Green threads / fibers
+## Green threads / fibers
- swapping controlled in software instead of by the OS

-# Multithreading with disks
+## Multithreading with disks
- micromanagement
- when the CPU has to decrypt
- depends on how the disk works
- most of the time: 2 threads per drive
- threads can be used to make code non-blocking

-# How to get memory bandwidth
+## How to get memory bandwidth
- https://github.com/cmuratori/blandwidth

# 13. [The Haversine Distance Problem](https://www.computerenhance.com/p/the-haversine-distance-problem)

...

The 8086 instruction set architecture is easier.
- better for understanding the concepts

-# Register
+## Register
- a place to store information
- 16 bits on the 8086

-# Operations
+## Operations
1. load memory
   - copy into a register
2. compute
3. write to memory

-# Instruction Decode
+## Instruction Decode
Turning the instruction stream into hardware operations.

-# Instructions
+## Instructions
- mov ::
  - "move", but actually a /copy/
  - instructions are assembled into binary that the /Instruction Decoder/ can use to

...

Special case: when MOD is 00 and R/M is 110, a 16-bit displacement follows.

Immediate :: a value that is immediately available (encoded directly in the instruction stream).

-# Assignment
+## Assignment
Also implement memory-to-register moves (a register-to-register starting point is sketched below).
- going to have to read the displacement bits (DISP)
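
As a starting point for the assignment, a minimal sketch that decodes only the register-to-register form of MOV. It is my illustration with made-up names, not the reference decoder. The bit layout (opcode `100010`, then the D and W bits, then MOD/REG/R/M) and the register tables follow the 8086 manual; the memory modes, which read the DISP bytes, are left out since that is the homework:

```c
#include <stdint.h>
#include <stdio.h>

// Register names indexed by [W][REG], as in the 8086 manual's table.
static const char *RegNames[2][8] =
{
    {"al", "cl", "dl", "bl", "ah", "ch", "dh", "bh"}, // W = 0 (byte)
    {"ax", "cx", "dx", "bx", "sp", "bp", "si", "di"}, // W = 1 (word)
};

// Decode a two-byte register-to-register MOV (100010DW, MOD=11, REG, R/M).
// Returns the number of bytes consumed, or 0 if this is not that form.
int DecodeMovRegReg(const uint8_t *Stream, char *Out, int OutSize)
{
    uint8_t Byte0 = Stream[0];
    uint8_t Byte1 = Stream[1];
    if ((Byte0 & 0xFC) != 0x88) return 0;   // opcode bits must be 100010xx
    int D = (Byte0 >> 1) & 1;               // D: REG field is the destination?
    int W = Byte0 & 1;                      // W: word (1) or byte (0) operands
    int Mod = (Byte1 >> 6) & 3;
    int Reg = (Byte1 >> 3) & 7;
    int Rm  = Byte1 & 7;
    if (Mod != 3) return 0;                 // memory modes would read DISP here
    const char *RegName = RegNames[W][Reg];
    const char *RmName  = RegNames[W][Rm];
    snprintf(Out, OutSize, "mov %s, %s",
             D ? RegName : RmName,          // destination
             D ? RmName  : RegName);        // source
    return 2;
}

int main(void)
{
    uint8_t Stream[] = {0x89, 0xD9};        // assembled from "mov cx, bx"
    char Line[32];
    if (DecodeMovRegReg(Stream, Line, sizeof(Line)))
    {
        printf("%s\n", Line);
    }
    return 0;
}
```

Extending this to the memory forms means handling MOD 00/01/10, which select an effective-address pattern plus 0, 1, or 2 displacement bytes, including the MOD=00, R/M=110 direct-address special case noted above.
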