| -rwxr-xr-x | build/sim86 | bin | 61656 -> 66448 bytes |
| -rwxr-xr-x | build/sim86_meta | bin | 177464 -> 181968 bytes |
| -rw-r--r-- | computerenhance.md | 46 | |
| -rw-r--r-- | src/sim86.cpp | 39 | |
4 files changed, 72 insertions, 13 deletions
diff --git a/build/sim86 b/build/sim86
Binary files differ
index 08e2c7c..3777fdb 100755
--- a/build/sim86
+++ b/build/sim86
diff --git a/build/sim86_meta b/build/sim86_meta
Binary files differ
index c7b2e28..d01767d 100755
--- a/build/sim86_meta
+++ b/build/sim86_meta
diff --git a/computerenhance.md b/computerenhance.md
index 1733704..8eaa819 100644
--- a/computerenhance.md
+++ b/computerenhance.md
@@ -28,7 +28,7 @@ Only learning about how performance works is enough.
 # Solution
 - Keep result of instructions in mind, not code
-- Learn what the maximum speed of something should be```
+- Learn what the maximum speed of something should be
 # 4. [Waste](https://www.computerenhance.com/p/waste)
 Instructions that do not need to be there.
@@ -53,7 +53,7 @@ Key points:
 - to measure overhead + loop we can measure cycles
 - more instructions != more time
-Python had 180x instructions and was 130x slower.```
+Python had 180x instructions and was 130x slower.
 # 5. [Instructions Per Clock](https://www.computerenhance.com/p/instructions-per-clock)
 *speed of instructions*
@@ -76,13 +76,13 @@ Python had 180x instructions and was 130x slower.```
 Reducing ratio of loop overhead / work
 - example: loop unrolling
- ```c
+```c
 for (i = 0; i < count; i +=2)
 {
     sum += input[i];
     sum += input[i + 1];
 }
- ```
+```
 Weird that it would go until to 1x add per cycle.
 - what are the chances? overhead??
@@ -98,7 +98,7 @@ Multiple chains can help break through limits.
 - "boosting the IPL"
 CPUs are designed for more computation so boosting IPL in a loop that
-does not do a lot of computation will bring less benefits.```
+does not do a lot of computation will bring less benefits.
 # 6. [Monday Q&A (2023-02-05)](https://www.computerenhance.com/p/monday-q-and-a-2023-02-05)
 # JIT
@@ -168,7 +168,7 @@ input[3]
 input += 4
 ```
 # Three-based addition:
-- common technique to work out a dependency chain```
+- common technique to work out a dependency chain
 # 7. [Single Instruction, Multiple Data](https://www.computerenhance.com/p/single-instruction-multiple-data)
 *Amount of instructions*
@@ -202,7 +202,7 @@ input += 4
 # Difficulty
 - SIMD does not care about how data is organized
-- easy with adds```
+- easy with adds
 # 8. [Caching](https://www.computerenhance.com/p/caching)
 *speed of instructions*
@@ -479,14 +479,14 @@ Because there are many dependencies on loads it is very important.
 # Forcing out of memory
 - bandwith does not increase a lot when using main memory
 - depending on the chip
-- L3 cache and main memory are shared (not big speed ups)```
+- L3 cache and main memory are shared (not big speed ups)
 # 11. [Python Revisited](https://www.computerenhance.com/p/python-revisited)
 Assembly is what determines the speed.
 # Python
 - doing every sum in python is slow
-- numpy is faster when you have supplied the array with a type```
+- numpy is faster when you have supplied the array with a type
 # 12. [Monday Q&A #3 (2023-02-20)](https://www.computerenhance.com/p/monday-q-and-a-3-2023-02-20)
 # Hyperthreading & Branch prediction
 - hyperthreads ::
@@ -550,14 +550,14 @@ Assembly is what determines the speed.
 # How to get memory bandwidth
-- https://github.com/cmuratori/blandwidth```
+- https://github.com/cmuratori/blandwidth
 # 13. [The Haversine Distance Problem](https://www.computerenhance.com/p/the-haversine-distance-problem)
 - Computing arc length between two coordinates.
 - You want to do the math first.
 - CPU is made for it
 - Second is the *Input*
-- Reading the data can take a long time.```
+- Reading the data can take a long time.
 # 14. ["Clean" Code, Horrible Performance](https://www.computerenhance.com/p/clean-code-horrible-performance)
@@ -740,16 +740,38 @@ By using estimation you can know what your performance *should* be.
 clocks=cycles
 # 38. [Monday Q&A #10 (2023-05-08)](https://www.computerenhance.com/p/monday-q-and-a-10-2023-05-08)
-With SIMD using smaller numbers will be faster.
+With SIMD using smaller bit-widths will be faster.
+For better cycle estimations it's better to try and simulate the microcode which has been reverse engineered from die shots.
+2 transfers can mean read + write, eg. `add [bx], 20`.
+There is microcode for loads and stores but some lines get processed and "skipped" which can account for a cycle.
 # 39. [From 8086 to x64](https://www.computerenhance.com/p/from-8086-to-x64)
+E prefix "widens" the register to 32 bits for backwards compatibility. Analogously, the R prefix "widens" the register to 64 bits.
+R8-15 or the new 64 bits registers. Suffixes: B-8, W-16, D-32, byte, word, double word, quad word
+You can use any registers for the 2 terms of effective addressing. One of the terms can be a scalar for multiplying. PTR is optional for specifying it's memory.
+
 # 40. [8086 Internals Poll](https://www.computerenhance.com/p/8086-internals-poll)
+
 # 41. [How to Play Trinity](https://www.computerenhance.com/p/how-to-play-trinity)
+
 # 42. [Monday Q&A #11 (2023-05-15)](https://www.computerenhance.com/p/monday-q-and-a-11-2023-05-15)
+`mov edi, edi` zeroes the upper bits of a register. Since the ABI specifies edi as a parameter it needs to be 0.
+Redundant register moves do not impact the backend performance (where dependency chains get resolved).
+There are instructions that are not useful anymore.
+Some instructions cannot be accessed.
+Theres is `sgx` extension that allows to do encrypted memory, transactional memory system.
+SIMD registers can be split in lanes. But in "normal" registers this is not supported anymore.
+A segfault is an interrupt from the interrupt table. Eg. paging in unmapped memory
+
 # 43. [8086 Simulation Code Review](https://www.computerenhance.com/p/8086-simulation-code-review)
+
 # 44. [Part One Q&A and Homework Showcase](https://www.computerenhance.com/p/part-one-q-and-a-and-homework-showcase)
+
 # 45. [The First Magic Door](https://www.computerenhance.com/p/the-first-magic-door)
+
 # 46. [Monday Q&A #12 (2023-05-22)](https://www.computerenhance.com/p/monday-q-and-a-12-2023-05-22)
+TODO: more information about 8086 misc.
+
 # 47. [Generating Haversine Input JSON](https://www.computerenhance.com/p/generating-haversine-input-json)
 # 48. [Monday Q&A #13 (2023-05-29)](https://www.computerenhance.com/p/monday-q-and-a-13-2023-05-29)
 # 49. [Writing a Simple Haversine Distance Processor](https://www.computerenhance.com/p/writing-a-simple-haversine-distance)
diff --git a/src/sim86.cpp b/src/sim86.cpp
index e5480d7..7c444bd 100644
--- a/src/sim86.cpp
+++ b/src/sim86.cpp
@@ -157,7 +157,6 @@ Run8086(psize MemorySize, u8 *Memory)
 if(Decoded.Op)
 {
     u32 OldIPRegister = IPRegister;
-    IPRegister += Decoded.Size;
 #if SIM86_INTERNAL
     printf("%s ;", Sim86_MnemonicFromOperationType(Decoded.Op));
@@ -300,6 +299,42 @@ Run8086(psize MemorySize, u8 *Memory)
 }
 #endif
 }
+else if(Decoded.Op == Op_ret)
+{
+    printf("\n");
+    printf("STOPONRET: Return encountered at address %d.\n", IPRegister);
+
+    break;
+}
+else if(Decoded.Op == Op_inc)
+{
+    Assert(DestinationOperand->Type == Operand_Register);
+    Assert(SourceOperand->Type == Operand_None);
+    *Destination += 1;
+}
+else if(Decoded.Op == Op_test)
+{
+
+    Assert(DestinationOperand->Type == Operand_Register);
+    Assert(SourceOperand->Type == Operand_Register || SourceOperand->Type == Operand_Immediate);
+
+    s32 Value =((Decoded.Flags & Inst_Wide) ?
+                (u16)((u16)*Destination & ((u16)*Source)) :
+                (u8)((u8)*Destination & ((u8)*Source)));
+    FlagsFromValue(&FlagsRegister, Decoded.Flags, Value);
+}
+else if(Decoded.Op == Op_xor)
+{
+
+    Assert(DestinationOperand->Type == Operand_Register);
+    Assert(SourceOperand->Type == Operand_Register || SourceOperand->Type == Operand_Immediate);
+
+    s32 Value =((Decoded.Flags & Inst_Wide) ?
+                (u16)((u16)*Destination ^ ((u16)*Source)) :
+                (u8)((u8)*Destination ^ ((u8)*Source)));
+    FlagsFromValue(&FlagsRegister, Decoded.Flags, Value);
+    *Destination = Value;
+}
 else if(Decoded.Op == Op_cmp)
 {
     Assert(DestinationOperand->Type == Operand_Register);
@@ -361,6 +396,8 @@ Run8086(psize MemorySize, u8 *Memory)
     Assert(0 && "Op not implemented yet.");
 }
+IPRegister += Decoded.Size;
+
 #if SIM86_INTERNAL
 printf(" ip:0x%x->0x%x", OldIPRegister, IPRegister);
 #endif
