The Architecture of PlayStation 3


December 5, 2015 | Article

PlayStation 3, abbreviated as PS3, is the third home video game console made by Sony Computer Entertainment and the successor to the PlayStation 2. The PS3 is part of the PlayStation series and competes with Microsoft's Xbox 360 and Nintendo's Wii. It was released on November 11, 2006 in Japan.

The PlayStation 3 includes a unified online gaming service, the PlayStation Network. It also has multimedia capabilities and can connect with the PlayStation Portable (PSP) and PlayStation Vita (PS Vita). The PS3 uses the Blu-ray Disc format.

This article discusses the PlayStation 3 architecture and some of its important aspects.

Currently there are two versions of this machine: the Original and the Slim model.

General Specification

The PlayStation 3 features a slot-loading 2x-speed Blu-ray Disc drive for games, Blu-ray movies, DVDs, CDs, and other optical media. It was originally available with 20 GB and 60 GB hard drives and is now offered in various sizes up to 320 GB. All PS3 models have user-upgradeable 2.5″ SATA hard drives.

The PlayStation 3 uses the Cell microprocessor, jointly designed by Sony, Toshiba, and IBM, as its CPU. The Cell is made up of one 3.2 GHz PowerPC-based "Power Processing Element" (PPE) and eight Synergistic Processing Elements (SPEs). One of the eight SPEs is disabled to improve chip yields, and only six of the remaining seven are accessible to developers, as the seventh SPE is reserved by the console's operating system. Graphics processing is handled by the NVIDIA RSX 'Reality Synthesizer', which can produce resolutions from 480i/576i SD up to 1080p HD. The PlayStation 3 has 256 MB of XDR DRAM main memory and 256 MB of GDDR3 video memory for the RSX.

XDR DRAM, or eXtreme Data Rate Dynamic Random Access Memory, is a high-performance RAM interface and successor to the Rambus RDRAM it is based on, competing with the rival DDR2 SDRAM and GDDR4 technologies. XDR DRAM is designed to be effective in small, high-bandwidth consumer systems, high-performance memory applications, and high-end GPUs. It eliminates the unusually high latency problems that plagued early forms of RDRAM and places heavy emphasis on per-pin bandwidth, which can further reduce PCB production costs because fewer lanes are needed for the same amount of bandwidth.
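The "fewer lanes for the same bandwidth" point can be illustrated with a rough calculation. The sketch below assumes a 3200 Mbit/s per-pin rate for XDR (matching the PS3's XIO link described later) and 800 Mbit/s for DDR2-800 as a contemporary comparison; the helper function is purely illustrative, not part of any real tool.

```python
# Rough model: data pins needed for a target bandwidth, given a
# per-pin signalling rate. 3200 Mbit/s matches the PS3's XIO link;
# 800 Mbit/s (DDR2-800) is an assumed contemporary rate used only
# for comparison.
def pins_needed(target_mb_per_s, mbit_per_s_per_pin):
    bits_per_s = target_mb_per_s * 8
    # ceiling division, all in integer Mbit/s to avoid float error
    return (bits_per_s + mbit_per_s_per_pin - 1) // mbit_per_s_per_pin

TARGET = 25600  # MB/s: the PS3's 25.6 GB/s main-memory bandwidth
print(pins_needed(TARGET, 3200))  # XDR  -> 64 data pins
print(pins_needed(TARGET, 800))   # DDR2 -> 256 data pins
```

Fewer pins means fewer PCB traces to route and match, which is where the cost saving comes from.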

The machine also has Bluetooth 2.0 (with support for up to 7 Bluetooth devices), gigabit Ethernet, USB 2.0, and HDMI 1.4 built in on all currently shipping models. Wi-Fi networking is also included in all except the 20 GB model. The 60 GB and CECHExx 80 GB models include a flash card reader (compatible with Memory Stick, SD/MMC, and CompactFlash/Microdrive media).

Each PS3 is normally packaged with a DualShock controller and some other peripherals.

The Central Processing Unit

The PlayStation 3 utilizes the Cell microprocessor, which, as described before, has a 3.2 GHz PowerPC-based "Power Processing Element" and eight Synergistic Processing Elements (SPEs). The PlayStation 3's Cell CPU can achieve a theoretical maximum of 230.4 GFLOPS (giga floating-point operations per second) in single precision, which makes it well suited as a computational machine.
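As a sanity check, the 230.4 GFLOPS figure can be reproduced by assuming each core (the PPE's vector unit and each of the eight SPEs) retires a 4-wide single-precision fused multiply-add per cycle; that per-cycle rate is an assumption on my part, but it is consistent with the 25.6 GFLOPS-per-SPE figure quoted later.

```python
# Back-of-the-envelope check of the 230.4 GFLOPS figure, assuming
# each core retires one 4-wide single-precision fused multiply-add
# per cycle (4 SIMD lanes x 2 flops = 8 flops/cycle).
CLOCK_GHZ = 3.2
FLOPS_PER_CYCLE = 4 * 2

per_core = CLOCK_GHZ * FLOPS_PER_CYCLE   # 25.6 GFLOPS per core
total = per_core * (8 + 1)               # 8 SPEs + the PPE
print(round(per_core, 1), round(total, 1))  # 25.6 230.4
```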

The Cell Broadband Engine, more commonly known as Cell, is a microprocessor designed to bridge the gap between conventional desktop processors, such as those from Intel and AMD, and more specialized high-performance processors, such as NVIDIA and ATI graphics processors.

The Cell processor can be split into four components: external input and output structures; the main processor (the Power Processing Element, or PPE); eight fully functional co-processors (the Synergistic Processing Elements, or SPEs); and a specialized high-bandwidth circular data bus connecting the PPE, the I/O elements, and the SPEs. This bus is called the Element Interconnect Bus, or EIB.

To achieve the high performance needed for mathematically intensive tasks, such as decoding/encoding MPEG streams, generating or transforming three-dimensional data, or performing Fourier analysis, the Cell processor marries the SPEs and the PPE via the EIB to give them access to main memory via fully cache-coherent DMA (direct memory access).

The PPE (Power Processing Element) is a Power Architecture-based, two-way multithreaded core acting as the controller for the eight SPEs, which handle most of the computational workload. The PPE works with conventional operating systems due to its similarity to other 64-bit PowerPC processors, while the SPEs are designed for vectorized floating-point code execution. The PPE contains a 64 KiB level 1 cache (32 KiB instruction and 32 KiB data) and a 512 KiB level 2 cache. The size of a cache line is 128 bytes.

Each SPE (Synergistic Processing Element) consists of a Synergistic Processing Unit (SPU) and a Memory Flow Controller, or MFC (DMA, MMU, and bus interface). The SPU runs a specially developed instruction set (ISA) with a 128-bit SIMD organization for single- and double-precision instructions. The current generation of Cell contains 256 KiB of embedded SRAM per SPE for instructions and data, called the "Local Store". The SPU cannot directly access system memory; the 64-bit virtual memory address formed by the CPU must be passed from the SPU to the SPE's memory flow controller to set up a DMA operation within the system address space.

In a typical usage scenario, the system loads the SPEs with small programs (threads), chaining the SPEs together to handle each step in a complex operation. At 3.2 GHz, each SPE gives a theoretical 25.6 GFLOPS of single-precision performance.
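The chained-SPE pattern can be modelled, very loosely, as a pipeline of small stage programs where each stage's output feeds the next. The stage names and the work they do below are illustrative only, not actual SPE code:

```python
# Loose software model of the chained-SPE pattern: each "SPE" runs
# a small program (here, a plain function) on a chunk of data and
# hands its result to the next stage in the chain.
def stage_decode(chunk):     # pretend decode step
    return [v ^ 0xFF for v in chunk]

def stage_transform(chunk):  # pretend transform step
    return [v * 2 for v in chunk]

def stage_reduce(chunk):     # pretend final reduction
    return sum(chunk)

pipeline = [stage_decode, stage_transform, stage_reduce]

def run(chunk, stages):
    for stage in stages:
        chunk = stage(chunk)
    return chunk

print(run([1, 2, 3], pipeline))  # -> 1518
```

On real hardware each stage would live in a different SPE's Local Store, with DMA transfers moving the chunks between them.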

Compared to its personal computer contemporaries, the relatively high overall floating-point performance of the Cell processor seemingly dwarfs the abilities of the SIMD units in CPUs like the Pentium 4 and the Athlon 64. However, comparing only the floating-point abilities of a system is a one-dimensional and application-specific metric. Unlike the Cell processor, such desktop CPUs are better suited to the general-purpose software usually run on personal computers. In addition to executing multiple instructions per clock, processors from Intel and AMD feature branch predictors; the Cell is designed to compensate for their absence with compiler assistance, in which prepare-to-branch instructions are created. For double-precision floating-point operations, as sometimes used in personal computers and often used in scientific computing, Cell performance drops by an order of magnitude but still reaches 20.8 GFLOPS (1.8 GFLOPS per SPE, 6.4 GFLOPS for the PPE). The PowerXCell 8i variant, which was specifically designed for double precision, reaches 102.4 GFLOPS in double-precision calculations.

The EIB (Element Interconnect Bus) is a communication bus internal to the Cell processor which connects the various on-chip system elements: the PPE processor, the memory controller (MIC), the eight SPE coprocessors, and two off-chip I/O interfaces, for a total of 12 participants in the PS3 (the number of SPUs can vary in industrial applications). The EIB also includes an arbitration unit which functions as a set of traffic lights.

The EIB is presently implemented as a circular ring consisting of four 16-byte-wide unidirectional channels which counter-rotate in pairs. When traffic patterns permit, each channel can convey up to three transactions concurrently. As the EIB runs at half the system clock rate, the effective channel rate is 16 bytes every two system clocks. At maximum concurrency, with three active transactions on each of the four rings, the peak instantaneous EIB bandwidth is 96 bytes per clock (12 concurrent transactions × 16 bytes wide / 2 system clocks per transfer).
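The peak-concurrency arithmetic can be spelled out directly:

```python
# The peak instantaneous EIB bandwidth from the text, made explicit.
RINGS = 4
TXN_PER_RING = 3          # up to three concurrent transfers per ring
WIDTH_BYTES = 16          # each channel is 16 bytes wide
CLOCKS_PER_TRANSFER = 2   # EIB runs at half the system clock

peak = RINGS * TXN_PER_RING * WIDTH_BYTES // CLOCKS_PER_TRANSFER
print(peak)  # -> 96 bytes per system clock
```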

Each participant on the EIB has one 16-byte read port and one 16-byte write port. The limit for a single participant is to read and write at a rate of 16 bytes per EIB clock (for simplicity, often regarded as 8 bytes per system clock). Each SPU contains a dedicated DMA management queue capable of scheduling long sequences of transactions to various endpoints without interfering with the SPU's ongoing computations; these DMA queues can be managed locally or remotely as well, providing additional flexibility in the control model.

Data flows on an EIB channel stepwise around the ring. Since there are twelve participants, the total number of steps around the channel back to the point of origin is twelve. Six steps is the longest distance between any pair of participants. An EIB channel is not permitted to convey data requiring more than six steps; such data must take the shorter route around the circle in the other direction. The number of steps involved in sending the packet has very little impact on transfer latency: the clock speed driving the steps is very fast relative to other considerations. However, longer communication distances are detrimental to the overall performance of the EIB as they reduce available concurrency.
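The six-step rule is just the shortest-path distance on a 12-node ring, which is easy to express directly:

```python
# Shortest-path rule on the 12-participant EIB ring: data may
# travel at most six hops, taking the shorter direction around
# the circle.
PARTICIPANTS = 12

def hops(src, dst):
    forward = (dst - src) % PARTICIPANTS
    return min(forward, PARTICIPANTS - forward)

print(hops(0, 5))   # -> 5: keep going forward
print(hops(0, 9))   # -> 3: shorter to go the other way
print(max(hops(0, d) for d in range(PARTICIPANTS)))  # -> 6, worst case
```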

Bandwidth assessment

For the sake of quoting performance numbers, we will assume a Cell processor running at 3.2 GHz, the clock speed most often cited.

At this clock frequency each channel flows at a rate of 25.6 GB/s. Viewing the EIB in isolation from the system elements it connects, achieving twelve concurrent transactions at this flow rate works out to an abstract EIB bandwidth of 307.2 GB/s. Based on this view many IBM publications depict available EIB bandwidth as “greater than 300 GB/s”. This number reflects the peak instantaneous EIB bandwidth scaled by processor frequency.

However, other technical restrictions are involved in the arbitration mechanism for packets accepted onto the bus.

Each unit on the EIB can simultaneously send and receive 16 bytes of data every bus cycle. The maximum data bandwidth of the entire EIB is limited by the maximum rate at which addresses are snooped across all units in the system, which is one per bus cycle. Since each snooped address request can potentially transfer up to 128 bytes, the theoretical peak data bandwidth on the EIB at 3.2 GHz is 128 B × 1.6 GHz = 204.8 GB/s.
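The two ways of quoting EIB bandwidth — the abstract concurrency figure and the snoop-limited figure — can be compared side by side:

```python
# Two ways of quoting EIB bandwidth at a 3.2 GHz system clock.
SYSTEM_CLOCK_GHZ = 3.2
BUS_CLOCK_GHZ = SYSTEM_CLOCK_GHZ / 2   # the EIB runs at half speed

channel_rate = 16 * BUS_CLOCK_GHZ      # 25.6 GB/s per channel
abstract_peak = 12 * channel_rate      # 307.2 GB/s ("greater than 300")
snoop_limited = 128 * BUS_CLOCK_GHZ    # 204.8 GB/s: one 128-byte
                                       # snooped request per bus cycle
print(round(channel_rate, 1), round(abstract_peak, 1), round(snoop_limited, 1))
```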

In practice, effective EIB bandwidth can also be limited by the ring participants involved. While each of the nine processing cores can sustain 25.6 GB/s of reads and writes concurrently, the memory interface controller (MIC) is tied to a pair of XDR memory channels permitting a maximum flow of 25.6 GB/s for reads and writes combined, and the two I/O controllers are documented as supporting a peak combined input speed of 25.6 GB/s and a peak combined output speed of 35 GB/s.

Optical interconnect

Sony is currently working on the development of an optical interconnection technology for use in device-to-device and internal interfaces of various types of Cell-based digital consumer electronics and game systems.

Memory and I/O Controllers

Cell contains a dual channel Rambus XIO macro which interfaces to Rambus XDR memory. The memory interface controller (MIC) is separate from the XIO macro and is designed by IBM. The XIO-XDR link runs at 3.2 Gbit/s per pin. Two 32-bit channels can provide a theoretical maximum of 25.6 GB/s.
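The 25.6 GB/s figure follows directly from the per-pin rate and the channel widths:

```python
# The XIO-XDR link arithmetic: 3.2 Gbit/s per pin over two 32-bit
# channels, i.e. 64 data pins in total.
GBIT_PER_PIN = 3.2
PINS = 2 * 32
total_gb_per_s = GBIT_PER_PIN * PINS / 8   # bits -> bytes
print(round(total_gb_per_s, 1))  # -> 25.6
```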

The I/O interface, also a Rambus design, is known as FlexIO. The FlexIO interface is organized into 12 lanes, each lane being a unidirectional 8-bit-wide point-to-point path. Five of these lanes are inbound to the Cell, while the remaining seven are outbound. This provides a theoretical peak bandwidth of 62.4 GB/s (36.4 GB/s outbound, 26 GB/s inbound) at 2.6 GHz. The FlexIO interface can be clocked independently, typically at 3.2 GHz. Four inbound and four outbound lanes support memory coherency.
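The quoted totals imply a per-lane rate of 5.2 GB/s; the sketch below assumes this comes from byte-wide lanes signalling at 2.6 GHz double data rate, which is consistent with both the outbound and inbound figures:

```python
# FlexIO lane arithmetic behind the 62.4 GB/s figure. The per-lane
# rate of 5.2 GB/s is inferred from the quoted totals (assumed:
# a byte-wide lane at 2.6 GHz, double data rate).
LANE_GB_PER_S = 2.6 * 2

outbound = 7 * LANE_GB_PER_S   # 36.4 GB/s over 7 outbound lanes
inbound = 5 * LANE_GB_PER_S    # 26.0 GB/s over 5 inbound lanes
print(round(outbound, 1), round(inbound, 1), round(outbound + inbound, 1))
```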

Graphics Processing Unit

The PlayStation 3 utilizes the RSX 'Reality Synthesizer', a proprietary graphics processing unit (GPU) co-developed by Nvidia and Sony. The specification given by Sony is as follows:

  • 550 MHz on 90 nm process (shrunk to 65 nm in 2008 and to 40 nm in 2010)
  • Based on the G71 chip, in turn based on the 7800, but with cut-down features such as lower memory bandwidth and only as many ROPs as the lower-end 7600.
    • 300+ million transistors
    • Multi-way programmable parallel floating-point shader pipelines
      • Independent pixel/vertex shader architecture
      • 24 parallel pixel-shader ALU pipes clocked @ 550 MHz
        • 5 ALU operations per pipeline, per cycle (2 vector4, 2 scalar/dual/co-issue and fog ALU, 1 Texture ALU)
        • 27 floating-point operations per pipeline, per cycle
      • 8 parallel vertex pipelines @550 MHz
        • 2 ALU operations per pipeline, per cycle (1 vector4 and 1 scalar, dual issue)
        • 10 floating-point operations per pipeline, per cycle
      • Floating Point Operations: 400.4 Gigaflops (24 * 27 Flops * 550 + 8 * 10 Flops * 550)
        • 74.8 billion shader operations per second (24 Pixel Shader Pipelines*5 ALUs*550 MHz) + (8 Vertex Shader Pipelines*2 ALUs*550 MHz)
    • 24 texture filtering units (TF) and 8 vertex texture addressing units (TA)
      • 24 filtered samples per clock
        • Maximum texel fillrate: 13.2 GigaTexels per second (24 textures * 550 MHz)
      • 32 unfiltered texture samples per clock, ( 8 TA x 4 texture samples )
    • 8 Render Output units / pixel rendering pipelines
      • Peak pixel fillrate (theoretical): 4.4 Gigapixel per second
      • Maximum Z sample rate: 8.8 GigaSamples per second (2 Z-samples * 8 ROPs * 550 MHz)
    • Maximum Dot product operations: 56 billion per second (combined with Cell CPU)
    • 128-bit pixel precision offers rendering of scenes with High dynamic range rendering (HDR)
    • 256 MB GDDR3 RAM at 700 MHz
      • 128-bit memory bus width
      • 22.4 GB/s read and write bandwidth
    • Cell FlexIO bus interface
      • 20 GB/s read to the Cell and XDR memory
      • 15 GB/s write to the Cell and XDR memory
    • Support for PSGL (OpenGL ES 1.1 + Nvidia Cg)
    • Support for S3TC texture compression
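Several of the headline numbers in the list can be cross-checked with simple arithmetic; the GDDR3 bandwidth line assumes double data rate on the 128-bit (16-byte) bus, as the list implies:

```python
# Sanity-checking a few of the RSX figures listed above at 550 MHz.
CLOCK_MHZ = 550

pixel_flops = 24 * 27 * CLOCK_MHZ / 1000   # 356.4 GFLOPS (pixel pipes)
vertex_flops = 8 * 10 * CLOCK_MHZ / 1000   # 44.0 GFLOPS (vertex pipes)
total_flops = pixel_flops + vertex_flops   # 400.4 GFLOPS combined

texel_rate = 24 * CLOCK_MHZ / 1000         # 13.2 GTexels/s
pixel_fill = 8 * CLOCK_MHZ / 1000          # 4.4 GPixels/s
mem_bw = 700e6 * 2 * 16 / 1e9              # 22.4 GB/s (700 MHz DDR, 128-bit)

print(round(total_flops, 1), texel_rate, pixel_fill, mem_bw)
```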

To be honest, all the specifications given were far superior to those of a common desktop PC or laptop, at a relatively cheaper price. What's your opinion?


About Author

xathrya

A man obsessed with low-level technology.
