The 4k uop cache is more impressive once you see where the ucode ROM sits: the four decoders probably output just a pointer (and can probably handle 4 complex instructions at once vs 1 on Intel), so only a pointer goes into the uop cache (vs the whole uop sequence on Intel). For complex instructions this is a huge advantage...
AMD had to do this because most x86 code is optimized for Intel's 4-1-1-1(-1) decode burst. All 4 decoders are large. To build 4 large decoders without duplicating the microcode ROM, you just output pointers and put the ROM further down the pipeline. Bonus: the uop cache holds the pointer and not the code.
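
To put a rough number on that bonus, here is a toy Python model of the guess above. It is only a sketch: `MICROCODE_THRESHOLD`, `cache_entries` and the uop counts in `stream` are made-up illustrative values, not figures from AMD or Intel documentation.

```python
# Toy model of the speculation above, not a description of real hardware.
# Hypothetical: instructions with more uops than this come from the ucode ROM.
MICROCODE_THRESHOLD = 4


def cache_entries(uop_counts, pointer_for_microcode):
    """Count uop-cache entries consumed by a stream of instructions.

    uop_counts: number of uops each instruction expands to.
    pointer_for_microcode: if True, a microcoded instruction costs one
    entry (a pointer into the ROM); if False, every uop is stored.
    """
    entries = 0
    for n in uop_counts:
        if pointer_for_microcode and n > MICROCODE_THRESHOLD:
            entries += 1  # a single entry holding the ROM pointer
        else:
            entries += n  # the whole uop sequence lives in the cache
    return entries


# Hypothetical uop counts for six instructions, two of them microcoded.
stream = [1, 1, 2, 12, 1, 20]

expanded = cache_entries(stream, pointer_for_microcode=False)
pointers = cache_entries(stream, pointer_for_microcode=True)
print(f"whole sequences: {expanded} entries, pointers: {pointers} entries")
```

With these invented counts the gap comes entirely from the two microcoded instructions, which is the point of the argument: simple instructions cost the same either way.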