# 1. [Table of Contents](https://www.computerenhance.com/p/table-of-contents)
# 2. [Performance-Aware Programming Series Begins February 1st](https://www.computerenhance.com/p/performance-aware-programming-series)
# 3. [Welcome to the Performance-Aware Programming Series!](https://www.computerenhance.com/p/welcome-to-the-performance-aware)
# Optimization
optimization = Maximizing performance of software on hardware.
- used to be complicated (years ago) = traditional

# Problem
People think performance is:
- not worth it
- too hard

True for traditional optimization.
- actually results in extremely slow programming.

# What
Only learning about how performance works is enough.
- What decisions affect performance?

# Affecting performance
1. amount of instructions: reduce number of instructions
2. speed of instructions: make CPU process instructions faster

# Decline
- CPUs became more complicated.
- unaware of the resulting instructions of higher-level languages

# Solution
- Keep result of instructions in mind, not code
- Learn what the maximum speed of something should be
# 4. [Waste](https://www.computerenhance.com/p/waste)
Instructions that do not need to be there.
- Often the biggest multiplier

```asm
LEA C, [A+B]
```
- Load Effective Address
- separated from the destination

You can find waste by looking at the output assembly code.

Eliminating waste is a form of *reducing instructions*.

In the case of python the interpreter code is the waste.
- do not use python for bulk operations (i.e. loops)

Key points:
- recognize waste
- in python: find ways to offload to code with less waste (e.g. C,
  numpy, ...)
- to measure overhead + loop we can measure cycles
  - more instructions != more time

Python had 180x the instructions and was 130x slower.
# 5. [Instructions Per Clock](https://www.computerenhance.com/p/instructions-per-clock)
*speed of instructions*

# IPC/ILP
- **Instructions Per Clock**
  - instructions per clock cycle
  - specific to one instruction
- **Instruction Level Parallelism**
  - accounting all instructions
```c
for (i = 0; i < count; i += 1)
{
    sum += input[i];
}
```
- more instructions than only "adds"
- no way to get to 1x add per cycle
- loop overhead

Reducing ratio of loop overhead / work
- example: loop unrolling
  ```c
    for (i = 0; i < count; i +=2)
    {
     sum += input[i];
     sum += input[i + 1];
    }
  ```
Strange that it would get all the way to 1x add per cycle.
- what are the chances the overhead lines up exactly?


Multiple instructions can be executed at the same time.
- CPU recognizes their *independence* (e.g. different locations)
- = parallelism
- "If the destination is the same as the input it cannot be executed at
  the same time."
  - = serial dependency chain

Multiple chains can help break through limits.
- "boosting the ILP"

CPUs are designed for lots of computation, so boosting ILP in a loop
that does not do much computation brings less benefit.
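
A minimal sketch of breaking the serial dependency chain with multiple accumulators (the function name is made up for illustration):

```c
#include <stddef.h>

// Two accumulators = two independent dependency chains; the CPU can
// execute an add from each chain in the same cycle.
unsigned int SumTwoChains(const unsigned int *input, size_t count)
{
    unsigned int sumA = 0;
    unsigned int sumB = 0;
    size_t i;
    for (i = 0; i + 1 < count; i += 2)
    {
        sumA += input[i];     // chain A
        sumB += input[i + 1]; // chain B, independent of chain A
    }
    if (i < count)
    {
        sumA += input[i];     // odd element left over
    }
    return sumA + sumB;
}
```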
# 6. [Monday Q&A (2023-02-05)](https://www.computerenhance.com/p/monday-q-and-a-2023-02-05)
# JIT
- compiles code at runtime rather than interpreting it
- have extra information
  - knows context in which it is compiled
  - can branch off functions
- Javascript uses it by default

Waste:
- Instructions that are not required

Java may have waste as well, but its bytecode was designed as
"something that could be executed faster".

# No test harness yet.
Read Time Stamp Counter (RDTSC):
- allows reading a chip-wide counter
- can be misleading under turbo boost (the clock frequency changes)


We have to move by stage so we can focus on everything step by step.

The reality is that there already is a well grown userbase.
- who cares!  We will make it performant.

# Micro-OPs go to execution ports
- each port can do X operations
- different instructions can be contenders for a port
- looking at port usage shows whether unrolling a loop will help
- there are tools that can simulate this
- sometimes there is a limit on how instructions can be sent


# Unrolling
- most compilers can unroll the loop for you
  - clang can screw this up


# Why *minimum* adds per cycle
- thinking about "opportunities"
- mean and median will not find the best case
- with the fastest run you can show exactly what you are pointing to
  - mean + fastest
- analyzing the behaviour of the hardware
- *stars need to align*
- when using the fastest run you converge to the analysis
- *educational*
- mean and median are for "mixtures"
  - used together with fastest when optimizing


Assembly -> Micro-OPs -> CPU

```c
input[index + 0]
input[index + 1]
input[index + 2]
input[index + 3]
```
Is slower than
```c
input[0]
input[1]
input[2]
input[3]
input += 4
```
# Tree-based addition:
- common technique to break up a dependency chain
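
A tiny sketch of the tree shape: grouping the adds halves the dependency depth, so the inner adds can execute in parallel.

```c
// Tree-shaped addition: ((a + b) + (c + d)) has dependency depth 2
// instead of the depth-3 serial chain (((a + b) + c) + d), so the two
// inner adds can execute in the same cycle.
int TreeAdd4(int a, int b, int c, int d)
{
    int ab = a + b; // independent of cd
    int cd = c + d; // can run in parallel with ab
    return ab + cd;
}
```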
# 7. [Single Instruction, Multiple Data](https://www.computerenhance.com/p/single-instruction-multiple-data)
*Amount of instructions*

# SIMD
- *Single Instruction, Multiple Data*
- One instruction can act on multiple data
- SSE in x64
- can be used together with IPC

# PADDD
- Packed ADD D-word
- "Wide" instruction
  - can use multiple accumulators
- Saves work
  - e.g. extracting dependency chains
- ![vector example](img/vector_paddd.png)
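
A sketch of what one PADDD-style wide add does, written with GCC/Clang vector extensions (an assumption about the toolchain; on x64 the loop body typically compiles to a single packed add):

```c
// Four 32-bit lanes added by one vector operation per iteration.
typedef int v4si __attribute__((vector_size(16)));

int SumFourLanes(const int *input, int count) // assumes count % 4 == 0
{
    v4si acc = {0, 0, 0, 0};
    for (int i = 0; i < count; i += 4)
    {
        v4si v = {input[i], input[i + 1], input[i + 2], input[i + 3]};
        acc += v; // one wide add covers all four lanes
    }
    return acc[0] + acc[1] + acc[2] + acc[3]; // horizontal sum at the end
}
```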

# Subsets
|---------+-----------+-----------|
| subset  | bit width | supported |
|---------+-----------+-----------|
| SSE     |        4x | Common    |
| AVX     |        8x | Common    |
| AVX-512 |       16x | Uncommon  |
|---------+-----------+-----------|
# 32-bit integer/float
- 4*32 = 128 bits
- smaller element types fit more values into the same register width
- typically you cannot get the full x improvement (x2, x4, ...)

# Difficulty
- SIMD does not care about how data is organized
- easy with adds
# 8. [Caching](https://www.computerenhance.com/p/caching)
*speed of instructions*

Load/Store:
- how CPU gets (load) or puts (store) data from memory

Every add is dependent on the load
- it needs data from previous load

Because so much depends on loads, their speed is very important.
- *Cache*!
  - Way faster than main memory (DIMMs)

# Cache
- Register file ::
  - produce values really quickly and feed them to registers
  - maximum speed
  - few hundred values at most

# 9. [Monday Q&A #2 (2023-02-12)](https://www.computerenhance.com/p/monday-q-and-a-2-2023-02-12)
# Why would the register renamer not solve the dependencies?
- Because there is a "literal" dependency
- *register renamer* fixes "fake" dependencies

# Python over cpp
- cpp is quite bad, but allows control over the output
- python is good for sketching and libraries

# Hardware jungle
- Coding towards the minimum specification
  - generally true
- Design hotspots for platforms (e.g. Xbox, PS5, ...)
- vectorizable, enough in loop for IPC
  - focus on things that work on all CPUs

# More complicated loops
- For now it's demonstrations
- everything can be optimized (:
    
# How can you tell if you are wasteful?
- profiling
  - "how much time spending in this loop?"

# Lanes and bottlenecks during design
- how many of "those" can I do per second
  - which is the limiting factor = bottleneck

# Asymptotic performance
- also important

# Power usage reduction
- in general achieved through reduced instructions
- same thing the *majority* of the time
  - reducing waste

# Signed and unsigned integers
- are the same because of *two's complement*
- except for:
  - mul/div/gt
- saturated add :: stops at lowest/highest value
- the operand types tell the compiler which instruction to emit
- separate signed/unsigned types are not strictly needed; different
  operators could select the instruction instead

# Can compilers SIMD?
- gcc and clang are aggressive at vectorization
  - generally better than nothing

# SIMD: AMD vs Intel
- no meaningful difference (long-term)

# Are unused registers ignored?
- modern chips (CPU/GPU) have two types of registers:
  - Scalar
    - slot
    - more than Vector ones
  - SIMD ("Vector")
    - 8/16/32/64/128/256 of the 256bits
    - Using larger bits is more expensive
    - VZeroUpper after using different sizes
  - Special considerations per register
  - tip: *SIMD if you can*


# CPU vs GPU
- GPUs are the same for parallelism but with different trade-offs
  - 1024bit Vectors / Wide execution units
  - CPU
    - high clock
    - high IPC/IPL
  - GPU
    - more ALUs (on the chip)
    - more queues (pipelining)
    - more hyperthreads
- CPU were designed for single core execution
- GPU does not look ahead, but is told
- *Both* are SIMD CPUs
- *benefits:*
  - massive parallelization
  - lots of math ops
- Switching can be difficult (talking between both)
  - unless APU

# Non-deterministic architectures
- You cannot depend on timings
- Potentially depends on the room's temperature
- chips run as fast as they can without melting

# Arm
- the concepts all transfer
- instruction names change

# SIMD without SIMD
- leave one bit for overflow
- SIMD registers handle overflows and carrying
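
A classic SWAR sketch of the "leave one bit for overflow" idea (the function name is mine): four bytes are added inside one 32-bit word, with masking so carries cannot cross lane boundaries.

```c
#include <stdint.h>

// Add four packed bytes inside one 32-bit word without SIMD
// instructions. Masking off the top bit of each lane keeps carries
// from crossing lanes; the top bits are fixed up with XOR.
uint32_t AddPackedBytes(uint32_t a, uint32_t b)
{
    uint32_t low = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu); // low 7 bits per lane
    uint32_t top = (a ^ b) & 0x80808080u;                 // top bits, carry-free
    return low ^ top;
}
```

Each lane wraps modulo 256 independently, e.g. a 0xFF byte plus 0x01 becomes 0x00 without disturbing the neighbouring byte.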
  
# Slowdown with 256/512-bit registers
- most machines downclock when using AVX
  
# Hardware Jungle: SIMD edition
- 'cpuid' tells what instruction sets the CPU supports
  - set function pointers accordingly
- SHIM? 
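
A dispatch sketch for that pattern. `CpuHasWideSimd` is a made-up placeholder for a real CPUID query, and `SumUnrolled` stands in for a SIMD build of the loop; only the shape matters here.

```c
// Pick an implementation once at startup; call through a pointer after.
typedef int (*SumFn)(const int *input, int count);

static int SumScalar(const int *input, int count)
{
    int sum = 0;
    for (int i = 0; i < count; i++) sum += input[i];
    return sum;
}

static int SumUnrolled(const int *input, int count) // stand-in for a SIMD path
{
    int a = 0, b = 0, i;
    for (i = 0; i + 1 < count; i += 2) { a += input[i]; b += input[i + 1]; }
    if (i < count) a += input[i];
    return a + b;
}

static int CpuHasWideSimd(void) { return 0; } // placeholder, not a real cpuid check

SumFn ChooseSumFn(void)
{
    return CpuHasWideSimd() ? SumUnrolled : SumScalar;
}
```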


# Micro-OP debugging
- assembly instructions are a layer of debugging on top of micro-OPs
- You cannot micro-op debug 

# Out-of-order CPUs
- only if no one can tell the order was violated
- inside a window
- limited:
  - how many things at once
- but, retiring finished instructions happen in order
- rolling window
  - waiting for instructions to be done

# SIMD dataset requirements
- "Shot"
- "Tail"
- You can pad your inputs
- Masked loads/stores (on modern instruction sets)
  - a mask bit per lane
  - with masks you can choose which lanes are written
- Vector/Packed instructions set
  - Packed: operate on x elements (with a mask)
  - Vector: VLen (Vector Length) says how many elements
- Scalar loop in the worst-case scenario

# Cost of SIMD
- 128-bit can always be used
- clock penalty goes away over time
- more latent

# Latency vs Pipelining
- Latency can be beaten by pipelining
- if the instructions are independent then the latency does not matter

# Instructions Limits
- there is a limit on the number of micro-ops a cycle
- the 5 micro-ops cannot all be adds
  - a load ties up an execution port
- Registers are next to the lanes

# Cache control
- responsive, guesses
- "hints"
  - prefetch instruction :: tries to get the memory in cache
    - what level (not always followed, hint!)
    - when look ahead is not going to see the loads
  - streaming instruction :: forbid to load the data in the cache
    - opposite of prefetch
  - streaming store :: a store that is not going to be read in the future
- only matters when you have data you *do* want to cache


# Data going around cache
- bandwidth is way bigger than the amount that needs to be passed through
- bandwidth becomes narrower further from the core
- [[file:./img/cache_wideness.png][cache_wideness]]
- the place having the data decides the bandwidth

# Prefetches
- hardware and software
- hardware ::
  - looks at the pattern of pages
  - linearly (ascending/descending/skipping/...) 
- eliminates latency
- throughput stays the same (cache bandwidth)

# Other programs
- processor does not care about *what* it caches
- you can lose cached data depending on which core you run on
- other programs waking up take up cache
- the slowdown is not quadratic; it follows the bytes per cycle each
  level can deliver

# Cache runtime
- The slower the routine, the less the cache is important

# Cache lines
- every 64 bytes
- penalties for straddling alignment boundaries :: (not very high)
  - 4096B (page boundaries)
- can pollute cache
  - waste of cache space
- [[file:./img/cache_lines.png][cache_lines]]
- best to use all the cache lines (all 64 bytes)
  - via data structures
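
A hot/cold split is one such data-structure sketch (names are mine): the loop below touches only the hot fields, so each 64-byte cache line it pulls in is full of useful bytes.

```c
// Keep the fields a hot loop actually reads packed together, instead of
// dragging rarely-used fields into the cache with every element.
typedef struct { float x, y, z; } Position;        // hot: scanned often
typedef struct { char name[48]; int flags; } Cold; // cold: touched rarely

float SumX(const Position *p, int count)
{
    float sum = 0.0f;
    for (int i = 0; i < count; i++)
    {
        sum += p[i].x; // sequential 12-byte strides, cache-line friendly
    }
    return sum;
}
```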

# Checking the cache
- checking and getting happens in one instruction

# Cache fights
- in multithreaded beware of shared caching
- pulling in the cache can evict the current data

# L1 cache supremacy
- constrained by distance

# Instructions in cache
- Front end ::
  - ICache
- Back end ::
  - DCache
- Separate for the L1
- Unpredicted instructions can slow down the program
- In L2/L3 cache instructions take space.

# Performance for different sizes
- smaller sizes are more likely to be cached
- if the cache is *primed* then yes

# Cache behaviour
- branch predictor :: which way
  - sophisticated
- hardware prefetcher :: which memory is going to be touched
  - recognizes patterns and puts *next* memory in cache
  - not smart
- "warmed up"

# Persistent cache
- OS wipes out with new memory map
    
# Ask for cache
- Evict train
  - evict, ask L2, evict, ask L3, evict ask memory, fill evicts
- DMA (Direct Memory Access)

# Inclusive vs Exclusive Caches
- exclusive cache ::
  - data is not in L2
  - only when evicted from L1
- inclusive cache ::
  - L1 and L2 are filled with the data
- per chip
# 10. [Multithreading](https://www.computerenhance.com/p/multithreading)
*Increasing speed of instructions*

# Multithreading
- Core :: different computers
  - physical
- Threads :: interface to access cores through the OS
  - OS

# Speeding up
- not exactly x2/x4; account for cache effects in the speedup
- more SIMD registers, instructions cache, ...
- shared caches add up
- memory bandwidth can be bottleneck
  - sometimes does not add up

# Forcing out of memory
- bandwidth does not increase a lot when using main memory
  - depending on the chip
- L3 cache and main memory are shared (no big speedups)
# 11. [Python Revisited](https://www.computerenhance.com/p/python-revisited)
Assembly is what determines the speed.

# Python
- doing every sum in python is slow
- numpy is faster when you have supplied the array with a type
# 12. [Monday Q&A #3 (2023-02-20)](https://www.computerenhance.com/p/monday-q-and-a-3-2023-02-20)
# Hyperthreading & Branch prediction
- hyperthreads ::
  - [[./img/hyperthreading.png][hyperthreading]]
  - pulling from more than one instruction stream
  - very important in GPUs
  - fill the execution ports with multiple instruction streams
    - both go to the front end
- branch prediction ::
  - [[file:./img/branch_prediction.png][branch_prediction]]
  - uops arrive faster than they are executed
    - they can be processed
  - 1. stall on jumps
    - flush uops (10-14 cycles)
    - bad for out-of-order/IPL
  - 2. guess
    - wrong = stall
- the front end decodes instructions into micro-ops and feeds them to the back end
- IPC: more execution ports filled


# Multithreaded
- code so that threads do not talk to each other
  - communication is a mistake
- sync the code


# Max multithreading multiplier
- fetching memory is slower than computation
- look at all-core bandwidth
  - total bandwidth to all cores
  - divided by cores = max memory per cycle per core
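
That division is simple enough to sketch with made-up numbers: 40 GB/s of all-core bandwidth across 8 cores at 4 GHz leaves at most 1.25 bytes per cycle per core from main memory.

```c
// Back-of-envelope memory budget: all-core bandwidth divided by core
// count and clock rate gives the best-case bytes per cycle per core.
double MaxBytesPerCyclePerCore(double allCoreBandwidth, int cores, double clockHz)
{
    return allCoreBandwidth / cores / clockHz;
}
```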
    

# Logical processors vs Cores
- Cores = computers
- Logical processors
  - OS / threads / instruction streams


# thread count > L1 cache
- oversubscription ::
  - when the program asks for more threads than available
  - lot of eviction
  - OS overhead
  - bad *always*
    - unless waiting thread
- OS tries to run thread as long as possible
  

# Green thread / fibers
- software control swapping of the OS


# Multithreading with disks
- micromanagement
  - when CPU has to decrypt
- depends on how disk works
  - autonomous/not
- threads can make code non-blocking


# How to get memory bandwidth
- https://github.com/cmuratori/blandwidth
# 13. [The Haversine Distance Problem](https://www.computerenhance.com/p/the-haversine-distance-problem)
- Computing arc length between two coordinates.
- You want to do the math first.
    - CPU is made for it
- Second is the *Input*
- Reading the data can take a long time.
# 14. ["Clean" Code, Horrible Performance](https://www.computerenhance.com/p/clean-code-horrible-performance)
# 15. [Instruction Decoding on the 8086](https://www.computerenhance.com/p/instruction-decoding-on-the-8086)
The 8086 instruction set architecture is simpler.
- Better for understanding concepts.

# Register
- place to store information
- 16 bits on the 8086

# Operations
1. load memory
   - copy into register
2. compute
3. write to memory
   
# Instruction Decode
Turning the instruction stream into hardware operations.

# Instructions
- mov ::
  - move, but actually a /copy/
- are assembled into binary that the /Instruction Decoder/ can use to
  execute the instruction
- stored in 2x 8bits
  - [[./img/instruction_encoding.png][image]]
  - instruction (6) :: code for the instruction
  - flags
    - D (1) :: whether REG is source or destination
    - W (1) :: 16bits or not
  - second byte:
    - MOD (2) :: memory or register operation
    - REG (3) :: encodes register
    - R/M (3) :: register/memory operation
      - operand
    - AX/AL/AH ::
      - X: wide
      - L: low bits
      - H: high bits

Binary Instruction stream
- only register to register moves

Exercise:
- read binary in
- bit manipulation to extract the bits
- reassemble the listings
- load 2 bytes and disassemble that instruction
  - outputs the instructions
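
The exercise steps above can be sketched like this (a minimal decoder for the register-to-register MOV only; table layout follows the encoding described above):

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

// Decode one register-to-register MOV:
//   byte 0: 100010 D W, byte 1: MOD REG R/M
static const char *RegName[2][8] = {
    {"al", "cl", "dl", "bl", "ah", "ch", "dh", "bh"}, // W = 0: 8-bit
    {"ax", "cx", "dx", "bx", "sp", "bp", "si", "di"}, // W = 1: 16-bit
};

// Returns e.g. "mov cx, bx", or NULL for anything but reg-to-reg MOV.
const char *DecodeMov(uint8_t b0, uint8_t b1)
{
    static char out[32];
    if ((b0 >> 2) != 0x22) return NULL; // opcode bits must be 100010
    int d   = (b0 >> 1) & 1;            // 1: REG is the destination
    int w   = b0 & 1;                   // 1: 16-bit operands
    int mod = (b1 >> 6) & 3;
    int reg = (b1 >> 3) & 7;
    int rm  = b1 & 7;
    if (mod != 3) return NULL;          // register mode only
    const char *r = RegName[w][reg];
    const char *m = RegName[w][rm];
    snprintf(out, sizeof out, "mov %s, %s", d ? r : m, d ? m : r);
    return out;
}
```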
# 16. [Decoding Multiple Instructions and Suffixes](https://www.computerenhance.com/p/decoding-multiple-instructions-and)
1st byte tells if there's a second, 2nd if there's a 3rd, ...
-> makes decoding a dependent process, a cost on the CPU

The D bit is the difference between a store and a load

- Effective address calculation :: an address that needs to be computed
  before it can be resolved, e.g. [BP + 75] (the 75 is a displacement)

The MOD field determines the displacement:

- displacement ::
  - [ ... + n] where n is a 0-, 8- or 16-bit number
  - defined by the MOD field
    - direct address still has a displacement (MOD = 00)
    - R/M = 110 (nominally BP) gets a 16-bit displacement
  - [[file:./img/displacement.png][displacement]]

- Some registers can be adressed as their low or high bits (L/H)
  - [[file:./img/l_h_registers.png][l_h_registers]]

The R/M field encodes which registers form the address (BP, BX, SI, DI).

There are two sets of registers: ones where you can freely address the
low and high parts (AH, AL, AX: A through D), and SP, BP, SI, DI, which
you cannot split.
*Some registers are not created equal.*

Special case when MOD is 00 and R/M is 110 -> 16 bit displacement.

An *immediate* is a value that is immediately available in the instruction stream.

# Assignment
Also implement memory to register.
- going to have to read the displacement bits (DISP)

When reassembling, signed/unsigned information will be lost.
- easier to test

## Challenge (extra)
- look up how negative displacements work in the manual
- byte/word is a different move
- different instruction for accumulator
  - to save a byte
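
For the negative-displacement part of the challenge, the key fact is that an 8-bit displacement is sign-extended to 16 bits; a one-line sketch:

```c
#include <stdint.h>

// 0xFB in the displacement byte means -5 (e.g. [bp - 5]), not +251.
int16_t SignExtend8(uint8_t disp)
{
    return (int16_t)(int8_t)disp;
}
```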
# 17. [Monday Q&A #4 (2023-03-06)](https://www.computerenhance.com/p/monday-q-and-a-4-2023-03-06)
# 18. [Opcode Patterns in 8086 Arithmetic](https://www.computerenhance.com/p/opcode-patterns-in-8086-arithmetic)
# 19. [Monday Q&A #5 (2023-03-13)](https://www.computerenhance.com/p/monday-q-and-a-5-2023-03-13)
# 20. [8086 Decoder Code Review](https://www.computerenhance.com/p/8086-decoder-code-review)
Enum + bit size, e.g. (Byte_Lit, 6).
Using a segmented access so access to memory can be controlled.
Printable instructions vs. non-printable ones like segment prefixes, which also require a context to be passed, since any number of them can prefix an instruction.
MOD R/M field is encoded as effective_address_expression from effective_address_base.
Register access with offset & count for accessing the low/high 8 bits or the full 16 bits.
Instruction operand as an union and type.
1MB memory buffer, with assert for out of bounds memory access.
Separate text code for printing out instructions into text assembly.
Compiler warns if not all enums are handled in a switch statement.
Appending superfluous "word" or "byte" simplifies the logic.
Implicit field for shortcut instructions like "mov to accumulator".
A shift value can be used to shift instructions around in case of instructions spreading multiple bytes like the escape instruction.
The manual's tables print bytes in big-endian order, but the instruction stream is little endian.
Last operand is the operand that was not used. 
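
The "register access with offset & count" idea can be sketched by modelling the register file as a byte array, so AX is {offset 0, count 2} while AL is {0, 1} and AH is {1, 1} over the same storage (names are mine; assumes a little-endian host):

```c
#include <stdint.h>
#include <string.h>

typedef struct { uint8_t bytes[16]; } RegisterFile; // 8 x 16-bit registers

// Read `count` bytes starting at `offset`; AX/AL/AH are just different
// (offset, count) views of the same two bytes.
uint16_t ReadRegister(const RegisterFile *rf, int offset, int count)
{
    uint16_t value = 0;
    memcpy(&value, rf->bytes + offset, (size_t)count); // little-endian read
    return value;
}
```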
# 21. [Monday Q&A #6 (2023-03-20)](https://www.computerenhance.com/p/monday-q-and-a-6-2023-03-20)
Bytecode is packed bytes that tells a CPU what to do.
# 22. [Using the Reference Decoder as a Shared Library](https://www.computerenhance.com/p/using-the-reference-decoder-as-a)
# 23. [Simulating Non-memory MOVs](https://www.computerenhance.com/p/simulating-non-memory-movs)
CPUs only understand moving memory into registers and simple operations on those bytes.
# 24. [Homework Poll!](https://www.computerenhance.com/p/homework-poll)
# 25. [New Schedule Experiment](https://www.computerenhance.com/p/new-schedule-experiment)
# 26. [Simulating ADD, SUB, and CMP](https://www.computerenhance.com/p/simulating-add-jmp-and-cmp)
The sign flag is set when the highest bit (sign bit) of the result is set.
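
That flag computation is a one-liner on a 16-bit result:

```c
#include <stdint.h>

// The sign flag mirrors the top bit of the 16-bit result.
int SignFlag16(uint16_t result)
{
    return (result >> 15) & 1;
}
```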

# 27. [Simulating Conditional Jumps](https://www.computerenhance.com/p/simulating-conditional-jumps)
# 28. [Response to a Reporter Regarding "Clean Code, Horrible Performance"](https://www.computerenhance.com/p/response-to-a-reporter-regarding)
# 29. [Monday Q&A #7 (2023-04-10)](https://www.computerenhance.com/p/monday-q-and-a-7-2023-04-10)
# 30. [Simulating Memory](https://www.computerenhance.com/p/simulating-memory)
# 31. [Simulating Real Programs](https://www.computerenhance.com/p/simulating-real-programs)
# 32. [Monday Q&A #8 (2023-04-17)](https://www.computerenhance.com/p/monday-q-and-a-8-2023-04-17)
# 33. [Other Common Instructions](https://www.computerenhance.com/p/other-common-instructions)
# 34. [The Stack](https://www.computerenhance.com/p/the-stack)
# 35. [Monday Q&A #9 (2023-04-24)](https://www.computerenhance.com/p/monday-q-and-a-9-2023-04-24)
# 36. [Performance Excuses Debunked](https://www.computerenhance.com/p/performance-excuses-debunked)
# 37. [Estimating Cycles](https://www.computerenhance.com/p/estimating-cycles)
# 38. [Monday Q&A #10 (2023-05-08)](https://www.computerenhance.com/p/monday-q-and-a-10-2023-05-08)
# 39. [From 8086 to x64](https://www.computerenhance.com/p/from-8086-to-x64)
# 40. [8086 Internals Poll](https://www.computerenhance.com/p/8086-internals-poll)
# 41. [How to Play Trinity](https://www.computerenhance.com/p/how-to-play-trinity)
# 42. [Monday Q&A #11 (2023-05-15)](https://www.computerenhance.com/p/monday-q-and-a-11-2023-05-15)
# 43. [8086 Simulation Code Review](https://www.computerenhance.com/p/8086-simulation-code-review)
# 44. [Part One Q&A and Homework Showcase](https://www.computerenhance.com/p/part-one-q-and-a-and-homework-showcase)
# 45. [The First Magic Door](https://www.computerenhance.com/p/the-first-magic-door)
# 46. [Monday Q&A #12 (2023-05-22)](https://www.computerenhance.com/p/monday-q-and-a-12-2023-05-22)
# 47. [Generating Haversine Input JSON](https://www.computerenhance.com/p/generating-haversine-input-json)
# 48. [Monday Q&A #13 (2023-05-29)](https://www.computerenhance.com/p/monday-q-and-a-13-2023-05-29)
# 49. [Writing a Simple Haversine Distance Processor](https://www.computerenhance.com/p/writing-a-simple-haversine-distance)
# 50. [Monday Q&A #14 (2023-06-05)](https://www.computerenhance.com/p/monday-q-and-a-14-2023-06-05)
# 51. [Initial Haversine Processor Code Review](https://www.computerenhance.com/p/initial-haversine-processor-code)
# 52. [Monday Q&A #15 (2023-06-12)](https://www.computerenhance.com/p/monday-q-and-a-15-2023-06-12)
# 53. [Introduction to RDTSC](https://www.computerenhance.com/p/introduction-to-rdtsc)
# 54. [Monday Q&A #16 (2023-06-19)](https://www.computerenhance.com/p/monday-q-and-a-16-2023-06-19)
# 55. [How does QueryPerformanceCounter measure time?](https://www.computerenhance.com/p/how-does-queryperformancecounter)
# 56. [Monday Q&A #17 (2023-06-26)](https://www.computerenhance.com/p/monday-q-and-a-17-2023-06-26)
# 57. [Instrumentation-Based Profiling](https://www.computerenhance.com/p/instrumentation-based-profiling)
# 58. [Monday Q&A #18 (2023-07-03)](https://www.computerenhance.com/p/monday-q-and-a-18-2023-07-03)
# 59. [Profiling Nested Blocks](https://www.computerenhance.com/p/profiling-nested-blocks)
# 60. [Monday Q&A #19 (2023-07-10)](https://www.computerenhance.com/p/monday-q-and-a-19-2023-07-10)
# 61. [Profiling Recursive Blocks](https://www.computerenhance.com/p/profiling-recursive-blocks)
# 62. [Monday Q&A #20 (2023-07-17)](https://www.computerenhance.com/p/monday-q-and-a-20-2023-07-17)
# 63. [A First Look at Profiling Overhead](https://www.computerenhance.com/p/a-first-look-at-profiling-overhead)
# 64. [New Q&A Process](https://www.computerenhance.com/p/new-q-and-a-process)
# 65. [A Tale of Two Radio Shacks](https://www.computerenhance.com/p/a-tale-of-two-radio-shacks)
# 66. [Comparing the Overhead of RDTSC and QueryPerformanceCounter](https://www.computerenhance.com/p/comparing-the-overhead-of-rdtsc-and)
# 67. [Monday Q&A #21 (2023-07-31)](https://www.computerenhance.com/p/monday-q-and-a-21-2023-07-31)
# 68. [The Four Programming Questions from My 1994 Microsoft Internship Interview](https://www.computerenhance.com/p/the-four-programming-questions-from)
# 69. [Microsoft Intern Interview Question #1: Rectangle Copy](https://www.computerenhance.com/p/microsoft-intern-interview-question)
# 70. [Microsoft Intern Interview Question #2: String Copy](https://www.computerenhance.com/p/microsoft-intern-interview-question-ab7)
# 71. [Microsoft Intern Interview Question #3: Flood Fill Detection](https://www.computerenhance.com/p/microsoft-intern-interview-question-a3f)
# 72. [Efficient DDA Circle Outlines](https://www.computerenhance.com/p/efficient-dda-circle-outlines)
# 73. [Q&A #22 (2023-08-15)](https://www.computerenhance.com/p/q-and-a-22-2023-08-15)
# 74. [Measuring Data Throughput](https://www.computerenhance.com/p/measuring-data-throughput)
# 75. [Q&A #23 (2023-08-21)](https://www.computerenhance.com/p/q-and-a-23-2023-08-21)
# 76. [Repetition Testing](https://www.computerenhance.com/p/repetition-testing)
# 77. [Q&A #24 (2023-08-28)](https://www.computerenhance.com/p/q-and-a-24-2023-08-28)
# 78. [Monitoring OS Performance Counters](https://www.computerenhance.com/p/monitoring-os-performance-counters)
# 79. [Q&A #25 (2023-09-04)](https://www.computerenhance.com/p/q-and-a-25-2023-09-04)
# 80. [Page Faults](https://www.computerenhance.com/p/page-faults)
# 81. [Q&A #26 (2023-09-11)](https://www.computerenhance.com/p/q-and-a-26-2023-09-11)
# 82. [Probing OS Page Fault Behavior](https://www.computerenhance.com/p/probing-os-page-fault-behavior)
# 83. [Game Development Post-Unity](https://www.computerenhance.com/p/game-development-post-unity)
# 84. [Q&A #27 (2023-09-18)](https://www.computerenhance.com/p/q-and-a-27-2023-09-18)
# 85. [Four-Level Paging](https://www.computerenhance.com/p/four-level-paging)
# 86. [Q&A #28 (2023-09-25)](https://www.computerenhance.com/p/q-and-a-28-2023-09-25)
# 87. [Analyzing Page Fault Anomalies](https://www.computerenhance.com/p/analyzing-page-fault-anomalies)
# 88. [Q&A #29 (2023-10-02)](https://www.computerenhance.com/p/q-and-a-29-2023-10-02)
# 89. [Powerful Page Mapping Techniques](https://www.computerenhance.com/p/powerful-page-mapping-techniques)
# 90. [Q&A #30 (2023-10-09)](https://www.computerenhance.com/p/q-and-a-30-2023-10-09)
# 91. [Faster Reads with Large Page Allocations](https://www.computerenhance.com/p/faster-reads-with-large-page-allocations)
# 92. [Q&A #31 (2023-10-23)](https://www.computerenhance.com/p/q-and-a-31-2023-10-23)
# 94. [Memory-Mapped Files](https://www.computerenhance.com/p/memory-mapped-files)
# 95. [Q&A #32 (2023-10-30)](https://www.computerenhance.com/p/q-and-a-32-2023-10-30)
# 96. [Inspecting Loop Assembly](https://www.computerenhance.com/p/inspecting-loop-assembly)
# 97. [Q&A #33 (2023-11-06)](https://www.computerenhance.com/p/q-and-a-33-2023-11-06)
# 98. [Intuiting Latency and Throughput](https://www.computerenhance.com/p/intuiting-latency-and-throughput)
# 99. [Q&A #34 (2023-11-13)](https://www.computerenhance.com/p/q-and-a-34-2023-11-13)
# 100. [Analyzing Dependency Chains](https://www.computerenhance.com/p/analyzing-dependency-chains)
# 101. [Q&A #35 (2023-11-20)](https://www.computerenhance.com/p/q-and-a-35-2023-11-20)
# 102. [Linking Directly to ASM for Experimentation](https://www.computerenhance.com/p/linking-directly-to-asm-for-experimentation)
# 103. [Q&A #36 (2023-11-27)](https://www.computerenhance.com/p/q-and-a-36-2023-11-27)
# 104. [CPU Front End Basics](https://www.computerenhance.com/p/cpu-front-end-basics)
# 105. [A Few Quick Notes](https://www.computerenhance.com/p/a-few-quick-notes)
# 106. [Q&A #37 (2023-12-04)](https://www.computerenhance.com/p/q-and-a-37-2023-12-04)
# 107. [Branch Prediction](https://www.computerenhance.com/p/branch-prediction)
# 108. [Q&A #38 (2023-12-11)](https://www.computerenhance.com/p/q-and-a-38-2023-12-11)
# 109. [Code Alignment](https://www.computerenhance.com/p/code-alignment)