Accelerate Your Mac! - the source for performance news and reviews
PowerPC G4 vs G3
How the PowerPC 7400/G4 differs from the PowerPC 750/G3
Published: 9/08/99

Note! This article was written in 1999. For the latest info and performance tests of G3 vs G4 CPUs, see the G4 Upgrade reviews page, which links to reviews that often compare G4 CPUs to G3s in applications, games and benchmarks. (The OWC G4/500 vs G3/500 review linked there is one example.) The Systems page also has several G4 system-related articles.

(1999 article follows)

Several readers commented on my first G4 vs G3 performance tests; those who were aware of the Motorola G4 and G3 specs didn't seem surprised at the results of tests in applications that didn't use AltiVec (G4-specific) instructions. One reader in particular wrote with details from Motorola documents on the CPUs, which I'm listing here to help explain the similar performance when running most current (non-AltiVec) software and to highlight the differences between the two processors.

Eric Harruff wrote (emphasis mine):

"This is from the MPC7400 RISC Microprocessor Technical Summary (available on their web site):"

1.13 Differences between the MPC7400 (G4) and the MPC750 (G3)

The design philosophy on MPC7400 is to change from the MPC750 base only where required to gain compelling multimedia and multiprocessor performance. MPC7400's core is essentially the same as the MPC750's, except that whereas the MPC750 has a 6-entry completion queue and has slower performance on some floating-point double-precision operations, MPC7400 has an 8-entry completion queue and a full double-precision FPU. MPC7400 also adds the AltiVec instruction set, has a new memory subsystem (MSS), and can interface to an improved bus, the MPX bus. The following sections discuss the major changes in more detail.

Core Sequencing
The MPC750 has a 6-entry instruction buffer and a 6-entry completion queue. For each clock, it can fetch four instructions, dispatch two instructions, fold one branch, and complete two instructions. MPC7400 is identical, except for an eight-entry completion buffer. The extra completion queue entries reduce the opportunity for bottlenecks by MPC7400's additional execution units.

FPU
On the MPC750, single-precision operations involving multiplication have a 3-cycle latency, while their double-precision equivalents take an additional cycle. Because MPC7400 has a full double-precision FPU, double-precision multiplies have the same latency as single-precision multiplies: 3 cycles. Floating-point divides, on the other hand, still have the same latency for the two designs (17 cycles for single-precision, 31 cycles for double-precision).

                                             MPC750      MPC7400
Double-precision floating-point multiply     4 cycles    3 cycles
All other floating-point add and multiply    3 cycles    3 cycles
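
As a rough illustration (my own sketch, not part of the Motorola document), the latencies above can be turned into a cycle count for a chain of dependent multiplies, where each operation needs the previous result and so cannot hide its latency. The function name and the chain length of 100 are made up for the example.

```python
# Multiply latencies in cycles, taken from the figures above.
LATENCY = {
    "MPC750": {"single_mul": 3, "double_mul": 4},
    "MPC7400": {"single_mul": 3, "double_mul": 3},
}

def dependent_chain_cycles(cpu, op, count):
    """Cycles for `count` back-to-back operations that each wait on the previous result.

    Assumption: no other work overlaps with the chain, so latencies simply add up.
    """
    return LATENCY[cpu][op] * count

g3 = dependent_chain_cycles("MPC750", "double_mul", 100)
g4 = dependent_chain_cycles("MPC7400", "double_mul", 100)
print(g3, g4)  # 400 300
```

Under that assumption, the gap only appears for double-precision multiplies; single-precision chains cost the same on both CPUs.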

AltiVec
MPC7400 implements all instructions defined by the AltiVec specification. Two dispatchable AltiVec functional units were added, a vector permute unit (VPU) and a vector ALU unit (VALU). The VALU comprises a simple integer unit, a complex integer unit, and a floating-point unit. MPC7400 also adds a vector register (VR) file consisting of 32 128-bit VRs, along with 6 VR rename registers. The VPU handles permute and shift operations and the VALU handles calculations. The LSU handles AltiVec load and store operations. To support AltiVec operations, all memory subsystem (MSS) data buses are 128 bits wide (as opposed to 64 bits in the MPC750). Also, additional queues have been added and queue sizes have been increased to sustain heavy AltiVec usage.

AltiVec is designed to improve the performance of vector-intensive code, as can be seen in such applications as multimedia and digital signal processing. AltiVec-targeted code can speed up many two-dimensional and three-dimensional graphics functions 3-5 times, especially core functions in 3-D engines and game-related 2-D functions.
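
To give a feel for where that parallelism comes from (again my own sketch, not from the Motorola text), here is how many elements fit in one 128-bit vector register for the 8-, 16- and 32-bit element sizes AltiVec works with. The lane count is only an upper bound on work per instruction, not a predicted speedup.

```python
VR_BITS = 128  # width of one AltiVec vector register, from the text above

def lanes(element_bits):
    """Number of elements of the given width that fit in one vector register."""
    return VR_BITS // element_bits

for bits in (8, 16, 32):
    print(f"{bits}-bit elements: {lanes(bits)} per register")
```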

Memory Subsystem (MSS)
MPC7400 has a new memory subsystem designed to support AltiVec work loads, the new MPX bus protocol, and 5-state multiprocessing capabilities. Queues and queue sizes are designed to support more efficient data flow. For example, the MPC750 has a three-entry LSU store queue, while MPC7400 has a six-entry LSU store queue. One such buffer that MPC7400 adds (which the MPC750 lacks) is an eight-entry reload buffer, where L1 data cache misses can reside while they wait for their data to be loaded. This enables two features: load miss folding and store miss merging.

Load miss folding
In the MPC750, if a second load misses to the same cache block, the second load must wait for the critical word of the first load before it can access its data, and subsequent accesses are also stalled. In MPC7400, the first load or store causes an entry to be allocated in the reload buffer. A subsequent load to the same cache block is placed aside in the load fold queue (LFQ), and it can return its data immediately when available. Also, subsequent accesses to the cache are not blocked and can be processed.

For example, on the MPC750, if a load or store (access A) misses in the data cache, then a subsequent load (access B) to the same cache block must wait until the critical word for A is retired. Because of this, any loads or stores after access B also cannot access the data cache until the reload for access A completes. On the other hand, with MPC7400, load or store access A misses in the data cache, and while the data is coming back, up to four subsequent misses to the same cache block can be folded into the LFQ, and subsequent instructions can access the data cache. Loads are blocked only when the reload table or the LFQ is full.
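
The folding behavior can be sketched as a toy model (my own illustration, not Motorola's). It only tracks loads to the one block that is being reloaded, uses the four-entry LFQ limit from the text, and ignores the reload table and all timing. The function name and block numbers are hypothetical.

```python
LFQ_CAPACITY = 4  # "up to four subsequent misses ... can be folded" (from the text)

def classify_loads(blocks, pending_block):
    """Classify loads issued while `pending_block` is still being reloaded.

    Returns one label per load: "folded", "blocked", or "other" for loads
    to a different cache block (which this toy model does not follow further).
    """
    folded = 0
    labels = []
    for block in blocks:
        if block != pending_block:
            labels.append("other")
        elif folded < LFQ_CAPACITY:
            folded += 1
            labels.append("folded")
        else:
            labels.append("blocked")  # LFQ is full
    return labels

print(classify_loads([7, 7, 3, 7, 7, 7], pending_block=7))
# ['folded', 'folded', 'other', 'folded', 'folded', 'blocked']
```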

Store miss merging
In the MPC750, if a second store misses to the same cache block, it must wait for the critical word of the first store before it can write its data. MPC7400 can merge several stores to the same cache block into the same entry in its reload buffer. If enough stores merge to write all 32 bytes of the cache block (usually via two back-to-back AltiVec store misses), then no data needs to be loaded from the bus and an address-only transaction (KILL) is broadcast instead.
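
A minimal sketch of the merging idea (my own, with a hypothetical function name): track which of the 32 bytes in a cache block have been written by merged stores, and report whether the block still has to be loaded from the bus. Two 16-byte stores stand in for the back-to-back AltiVec store misses mentioned above.

```python
BLOCK_BYTES = 32  # cache block size, from the text above

def merge_stores(stores):
    """Merge (offset, length) stores to one cache block.

    Returns "KILL" if every byte of the block was written (address-only
    transaction), otherwise "LOAD" (the rest of the block must come from the bus).
    """
    written = [False] * BLOCK_BYTES
    for offset, length in stores:
        if offset < 0 or length < 0 or offset + length > BLOCK_BYTES:
            raise ValueError("store falls outside the cache block")
        for i in range(offset, offset + length):
            written[i] = True
    return "KILL" if all(written) else "LOAD"

print(merge_stores([(0, 16), (16, 16)]))  # KILL
print(merge_stores([(0, 16)]))            # LOAD
```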

Cache Allocate on reload
Both designs have the same L1 cache size, but differ in their block allocation policy: The MPC750 has an allocate-on-miss policy, while MPC7400 has allocate-on-reload policy, which allows better cache allocation and replacement and more efficient use of data bus bandwidth.

If access A misses in the cache, the MPC750 immediately identifies the victim block (call it X) if there is one and allocates its space for the new data (call it Y) to be loaded. If a subsequent access (access B) needs this victim block, even if access B occurs before Y has been loaded, then it will miss because as soon as X is allocated it is no longer valid. After Y has loaded (and, if X is modified, after X has been cast out), X must be reloaded, and B must wait until its data is valid again.

MPC7400, on the other hand, delays allocation/victimization until the block reload occurs. In the example above, while Y is being loaded, B can hit block X, and a different block is victimized. This allows more efficient use of the cache and can reduce thrashing. On MPC7400, allocation occurs in parallel with reload, which uses the cache more efficiently.

MPC750                          MPC7400
1-cycle load arbitration        1-cycle load arbitration
1-cycle allocate/victimize      -
4-beat reload (64 bits/beat)    4-beat reload
Total = 6 cycles                Total = 5 cycles
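
The totals above are simple sums, shown here as a small sketch of my own (the function names are made up). It also checks that four 64-bit beats on the MPC750 add up to one 32-byte cache block.

```python
RELOAD_BEATS = 4  # both CPUs reload a block in four beats

def mpc750_reload_cycles():
    # arbitration + separate allocate/victimize cycle + reload beats
    return 1 + 1 + RELOAD_BEATS

def mpc7400_reload_cycles():
    # arbitration + reload beats; allocation overlaps with the reload
    return 1 + RELOAD_BEATS

def mpc750_bytes_per_reload(bits_per_beat=64):
    return RELOAD_BEATS * bits_per_beat // 8

print(mpc750_reload_cycles(), mpc7400_reload_cycles())  # 6 5
print(mpc750_bytes_per_reload())                        # 32
```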

Outstanding misses
The MPC750 allows one D-side miss and one I-side miss to be outstanding (accessing the L2 or the bus) at any given time. MPC7400 allows one I-side miss and up to eight D-side misses (maximum of 8 total). Note that the L2 can queue up to four hits, but with a fast L2 (1:1 mode) it is impossible to fill this queue with data cache misses. The L2 miss queue can queue four transactions that are waiting to access the processor address bus.

Miss under miss
While processing a miss, the MPC750's data cache allows subsequent loads and stores to hit in the data cache (hit under miss), but it blocks on the next miss until the first miss finishes reloading. MPC7400 allows subsequent accesses that miss in the data cache to propagate to the L2 and beyond (miss under miss).
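
To see why overlapping misses matters, here is a heavily idealized sketch of my own. It assumes independent data misses that all take the same time and ignores bus bandwidth, the L2, and queue limits other than the outstanding-miss counts given above. The 40-cycle latency is a hypothetical number picked only for illustration.

```python
import math

MPC750_MAX_D_MISSES = 1   # one D-side miss outstanding (from the text)
MPC7400_MAX_D_MISSES = 8  # up to eight D-side misses (from the text)

def miss_cycles(num_misses, latency, max_outstanding):
    """Idealized cycles to service independent misses, `max_outstanding` at a time."""
    return math.ceil(num_misses / max_outstanding) * latency

MISS_LATENCY = 40  # hypothetical latency in cycles, not a measured figure

print(miss_cycles(16, MISS_LATENCY, MPC750_MAX_D_MISSES))   # 640
print(miss_cycles(16, MISS_LATENCY, MPC7400_MAX_D_MISSES))  # 80
```

Real code will fall well short of this best case, since misses are rarely independent and the bus is shared.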

L2 cache
Compared to the MPC750, MPC7400 has twice the number of on-chip L2 tags per way (8192), which can support twice the L2 cache size (up to 2 Mbyte). The sectoring configuration is slightly different, too:

MPC750                       MPC7400
1 Mbyte: 4 sectors/tag       2 Mbyte: 4 sectors/tag
512 Kbyte: 2 sectors/tag     1 Mbyte: 2 sectors/tag
256 Kbyte: 2 sectors/tag     512 Kbyte: 1 sector/tag

Fewer sectors per tag allow the cache to be used more efficiently. MPC7400 and the MPC750 also have different cache reload policies. On the MPC750, an L1 cache miss that also misses in the L2 causes a reload from the bus to both L1 and L2. On MPC7400, misses to the L1 instruction cache behave the same way, but misses to the L1 data cache behave differently: data is reloaded into the L1 only. Thus, with respect to the L1 data cache, the L2 holds only blocks that are cast out; it acts as a giant victim cache for the L1 data cache. This improves performance because the same data is duplicated in the L1 data cache and L2 less often.
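
Going back to the MPC7400 tag and sector figures above, the supported L2 sizes can be checked with a little arithmetic (my own sketch, not from the Motorola text). Two things here are assumptions on my part: that the L2 is two-way set associative, and that each sector is one 32-byte cache block.

```python
TAGS_PER_WAY = 8192  # MPC7400 on-chip L2 tags per way, from the text
WAYS = 2             # assumption: two-way set associative L2
SECTOR_BYTES = 32    # assumption: one sector is one 32-byte cache block

def l2_bytes(sectors_per_tag):
    """L2 size covered when every tag tracks `sectors_per_tag` sectors."""
    return TAGS_PER_WAY * WAYS * sectors_per_tag * SECTOR_BYTES

for sectors in (4, 2, 1):
    print(sectors, "sectors/tag:", l2_bytes(sectors) // 1024, "Kbyte")
# 4 sectors/tag: 2048 Kbyte
# 2 sectors/tag: 1024 Kbyte
# 1 sectors/tag: 512 Kbyte
```

With those assumptions the results line up with the 2 Mbyte, 1 Mbyte and 512 Kbyte MPC7400 configurations listed above.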

60x bus/MPX (MaxBus) bus
MPC7400 supports the 60x bus used by the MPC750, but it also supports a new bus (MPX bus). It implements a 5-state cache-coherency protocol (MERSI) and the MESI and MEI subsets. This provides better hardware support of multiprocessing.

For example, MPX bus supports data intervention. On the 60x bus, if one processor does a read of data that is marked Modified in another processor's cache, the transaction is retried and the data is pushed to memory, after which the transaction is restarted. MPX bus allows the data to be forwarded directly to the requesting processor from the processor that has it cached. (MPC7400 also supports intervention for data marked Exclusive or Shared.)
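
The difference can be sketched as the sequence of steps a read goes through when another processor holds the data as Modified. This is my own simplified illustration with made-up step names; it is not the actual bus protocol.

```python
def read_of_modified_data(bus):
    """Steps for a read that hits Modified data in another processor's cache."""
    if bus == "60x":
        # Retry, write the data back to memory, then restart the read.
        return ["read", "retry", "push data to memory", "read restarted"]
    if bus == "MPX":
        # Data intervention: the owning processor supplies the data directly.
        return ["read", "data forwarded from owning processor"]
    raise ValueError(f"unknown bus: {bus}")

for bus in ("60x", "MPX"):
    print(bus, "->", read_of_modified_data(bus))
```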

The MPC7400 can support up to seven simultaneous transactions on the bus interface (60x or MPX bus) while the MPC750 supports only two.


For more information on the G4 (PowerPC 7400) or G3 (PowerPC 750) CPUs see the related links below.


Related Links:



Copyright © Mike, 1999.

No part of this site's content or images is to be reproduced or distributed in any form without written permission.
All brand or product names mentioned here are properties of their respective companies.

Users of the web site must read and are bound by the terms and conditions of use.