HBM (High Bandwidth Memory) is a newer type of CPU/GPU memory ("RAM") in which many DRAM dies are stacked and packaged together with the processor, forming a large-capacity, high-bit-width memory array.
HBM plan view
The die in the middle is the GPU/CPU; the four smaller dies on either side are stacks of DRAM chips. Such three-dimensional stacks generally come in heights of 2, 4, or 8 dies.
HBM as GPU memory no longer seems uncommon. Many people know that HBM is expensive, so even though it is not rare, it only appears on high-end products, such as Nvidia's data-center GPUs; AMD's use of HBM on consumer GPUs is one of the few exceptions.
Some gamers know HBM as a high-speed memory whose bandwidth far exceeds DDR/GDDR, and whose internal structure is 3D-stacked DRAM. Some PC users have wondered whether HBM could be used in ordinary desktop and notebook products. The cost is high, but this industry has plenty of well-heeled enthusiasts, and after all, don't GPUs already use HBM?
In fact, CPUs paired with HBM do exist. The A64FX chip in Fujitsu's Fugaku supercomputer is paired with HBM2 memory; Intel's upcoming Sapphire Rapids Xeon processors will get an HBM version next year; and there is also the NEC SX-Aurora TSUBASA.
So a CPU with HBM is at least feasible (even if, strictly speaking, chips such as the A64FX go beyond the scope of an ordinary CPU), yet these products are still aimed at data centers and HPC. Is it simply because HBM is expensive that it has not trickled down to the consumer market? That is an important reason, and fairly close to the root of the matter. In this article, we take HBM as an opportunity to discuss the characteristics and usage scenarios of this kind of memory, and whether it will eventually replace the DDR memory so common in computers.
Looking at HBM from above, source: Fujitsu
In its common form, HBM appears on the package as a few small dies sitting very close to the main chip (the CPU or GPU). The A64FX pictured above is a typical example: the four packages surrounding the processor are all HBM memory. This arrangement is quite different from ordinary DDR memory.
One characteristic of HBM is that it achieves higher transmission bandwidth than DDR/GDDR in a smaller footprint and (partly) with better efficiency. Each HBM package is itself a 3D structure: multiple DRAM dies stacked on top of one another and connected by TSVs (Through-Silicon Vias) and microbumps. Beneath the stacked DRAM dies sits a logic die containing the HBM controller, and the bottom of the stack is then interconnected with the CPU/GPU and other components through a base layer such as a silicon interposer.
Looking at HBM from the side, source: AMD
From this structure it is easy to see that the interconnect is far wider than that of DDR/GDDR: the number of contacts under the stack greatly exceeds the number of traces connecting DDR memory to the CPU. The HBM2 PHY is not on the same level as a DDR interface in implementation scale, and its connection density is much higher. In terms of bit width, each DRAM die provides two 128-bit channels, so a stack four DRAM dies high presents a 1024-bit-wide interface. Many GPUs and CPUs place four such HBM stacks around themselves, for a total width of 4096 bits.
For comparison, each GDDR5 channel is 32 bits wide, and 16 channels total 512 bits. The current mainstream second-generation HBM2 can stack up to 8 DRAM dies per stack, improving both capacity and speed. Each HBM2 stack supports up to 1024 data pins, and each pin can transfer at 2000 Mbit/s, giving a total bandwidth of 256 GByte/s; at 2400 Mbit/s per pin, one HBM2 stack package delivers 307 GByte/s.
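To make the arithmetic concrete, here is a minimal sketch of the bandwidth calculation, using the pin count and per-pin rates quoted above (the helper function and names are purely illustrative):

```python
# Back-of-the-envelope HBM2 bandwidth from the figures quoted in the text.
PINS_PER_STACK = 1024  # 2 x 128-bit channels x 4 DRAM dies

def stack_bandwidth_gbytes(pin_rate_mbits: float, pins: int = PINS_PER_STACK) -> float:
    """Bandwidth of one stack in GByte/s, given the per-pin rate in Mbit/s."""
    return pin_rate_mbits * pins / 8 / 1000  # Mbit/s -> GByte/s

print(stack_bandwidth_gbytes(2000))      # 256.0 GByte/s
print(stack_bandwidth_gbytes(2400))      # 307.2 GByte/s
# Four stacks around one GPU give a 4096-bit interface:
print(4 * stack_bandwidth_gbytes(2400))  # ~1229 GByte/s aggregate
```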
Source: Synopsys
The picture above is Synopsys's comparison of DDR, GDDR, and HBM. Looking at the Max I/F BW column, the other technologies are not even in the same order of magnitude as HBM2. For highly parallel computing, scientific computing, computer vision, and AI, such bandwidth is a huge boon. And intuitively, with HBM sitting so close to the main chip, higher transmission efficiency can be expected in theory; measured by energy consumed per bit transferred, HBM2 does indeed have a clear advantage.
Setting aside cost and total capacity, if HBM really were used as main memory in a personal computer, wouldn't that be perfect?
HBM was first initiated by AMD in 2008, with the original aim of changing the power consumption and physical size of computer memory. Over the following years AMD worked through the technical problems of die stacking and later found industry partners with experience in stacking storage media, including SK Hynix, along with manufacturers in the interposer and packaging fields.
SK Hynix first manufactured HBM in 2013, the same year JEDEC adopted it as the JESD235 standard. The first GPU to use HBM was AMD's Fiji (Radeon R9 Fury X) in 2015. The following year Samsung began mass production of HBM2, and NVIDIA's Tesla P100 became the first GPU to use HBM2.
The physical form of HBM reveals its first shortcoming: a lack of flexibility in system configuration. Expanding memory capacity has long been a routine capability for PCs, but HBM is packaged with the main chip, so there is no possibility of expansion; the specification is fixed at the factory. This also differs from today's notebooks with DDR soldered to the motherboard: HBM is integrated into the package by the chip manufacturer itself, which leaves even less flexibility, especially for OEMs.
Most chip manufacturers selling processors into the mass market (including the infrastructure market) are unlikely, for cost and other reasons, to launch SKUs differentiated by memory capacity. These processors already come in many configurations (consider how many models of Intel Core processors exist); further subdividing each by memory capacity would be hard to justify in manufacturing cost.
The second problem with HBM is that its capacity is more limited than DDR's. A single HBM package can stack 8 DRAM dies, but at 8 Gbit per die that amounts to only 8 GByte. A supercomputing chip like the A64FX provides 4 HBM interfaces, i.e. 4 HBM stack packages, for a total of 32 GByte per chip.
Such a capacity is small by DDR standards. Ordinary consumer PCs commonly carry more than 32 GByte of memory. Not only do PC and server motherboards offer plenty of expandable memory slots, some DDR4/5 DIMMs also stack DRAM dies: with relatively high-end stacked dies, a 2-rank RDIMM (registered DIMM) can reach 128 GByte, and a high-end server with 96 DIMM slots can therefore hold up to 12 TByte.
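The capacity gap is easy to quantify from the same figures; a quick sketch (values taken from the text above):

```python
# Capacity arithmetic from the figures above.
hbm_stack_gbyte = 8 * 8 / 8            # 8 dies x 8 Gbit / 8 bits  = 8 GByte per stack
a64fx_hbm_gbyte = 4 * hbm_stack_gbyte  # 4 stacks                  = 32 GByte per chip

rdimm_gbyte = 128                      # one 2-rank stacked RDIMM
server_max_gbyte = 96 * rdimm_gbyte    # 96 DIMM slots             = 12288 GByte (12 TByte)

print(hbm_stack_gbyte, a64fx_hbm_gbyte, server_max_gbyte)  # 8.0 32.0 12288
```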
HBM DRAM die, source: Wikipedia
Of course, as mentioned, HBM and DDR can also be mixed: HBM2 handles high bandwidth at small capacity, while DDR4 handles somewhat lower bandwidth at large capacity. From a system design perspective, the HBM2 in such a processor acts more like an L4 cache.
For PCs, an important reason HBM has not been adopted as CPU main memory is its high latency. On this point, many popular articles claim its latency is good, and Xilinx has described the latency of its HBM-equipped FPGAs as similar to DDR, but the "latency" in these articles is often not the same latency.
Contemporary DDR memory is generally rated by CL (CAS latency: the number of clock cycles needed for column addressing, an indication of read latency). The CAS latency in question is the wait between the moment the read command (with its Column Address Strobe) is issued and the moment the data is ready.
After the memory controller tells the memory which location it needs to access, it takes this many cycles to reach that location and carry out the controller's command. CL is the most important parameter in memory latency. Note that the actual delay is the number of cycles multiplied by the duration of each cycle; the higher the operating frequency, the shorter each cycle.
GDDR5 vs. HBM, source: AMD
As mentioned earlier, one of HBM's defining characteristics is its ultra-wide interconnect, which in turn means its transmission frequency cannot be very high: otherwise total power consumption and heat would become unmanageable, and with such a wide interface it does not need a high frequency to reach its target bandwidth anyway.
HBM's clock is indeed much lower than DDR/GDDR's. Samsung's earlier Flarebolt HBM2 transfers 2 Gbit/s per pin, corresponding to a clock of roughly 1 GHz; later products raised this to 1.2 GHz. Samsung mentioned that doing so also required reducing clock interference among the more than 5000 parallel TSVs and adding more heat-dissipation bumps between the DRAM dies to ease the thermal problem. In the figure above, AMD lists HBM's clock at only 500 MHz.
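Combining the CL definition above with these clock figures shows why a lower clock translates into a longer absolute delay. A minimal sketch, using illustrative (assumed) CL and clock values rather than datasheet figures:

```python
# Convert a CAS latency in cycles into an absolute delay in nanoseconds.
def cas_latency_ns(cl_cycles: int, clock_mhz: float) -> float:
    """Actual delay = cycles x cycle time, where cycle time = 1 / clock frequency."""
    return cl_cycles * 1000 / clock_mhz

# Assumed values for illustration: a DDR4-3200 CL16 module (1600 MHz clock)
# versus an HBM stack with a comparable cycle count but a 500 MHz clock.
print(cas_latency_ns(16, 1600))  # 10.0 ns
print(cas_latency_ns(14, 500))   # 28.0 ns -- similar cycle count, slower clock
```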
High bandwidth combined with high latency makes HBM very well suited as GPU memory, because games and graphics processing are highly predictable, highly concurrent workloads: they demand bandwidth but are not particularly sensitive to latency. That is why HBM shows up on high-end GPUs, and by the same logic it suits HPC and AI computing, which is why the A64FX and next-generation Xeon processors, CPUs though they are, also choose HBM as memory.
On a personal computer, however, the CPU's workload is extremely unpredictable, full of random memory accesses, and inherently more sensitive to latency; the demand for low latency usually outweighs the demand for high bandwidth, to say nothing of HBM's cost. So, at least in the short term, it will be difficult for HBM to replace DDR on PCs; the question is much like asking whether GDDR could serve as PC main memory.
In the long run, though, nothing is certain. A hybrid solution, as mentioned above, is one option, and the different levels of the memory hierarchy are themselves changing significantly. Not long ago, for example, we wrote about AMD stacking the L3 cache on its processors up to 192 MB. On-die cache exists precisely to hide the latency of external storage, so as processor caches grow ever larger, the latency requirements placed on system memory become less strict.
From the PC era to the mobile and AI era, chip architecture has also shifted from CPU-centric to data-centric. AI tests not only a chip's computing power but also its memory bandwidth: even though DDR and GDDR rates are fairly high, many AI algorithms and neural networks keep running into memory bandwidth limits. HBM, built around large bandwidth, has therefore become the preferred DRAM for high-performance chips.
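One way to see why such workloads hit bandwidth limits is to compare the time a layer spends moving data with the time it spends on arithmetic. The sketch below uses assumed layer sizes and hardware figures (they are not from this article) and shows a large matrix-vector multiply that ends up bandwidth-bound rather than compute-bound:

```python
# Rough bandwidth-vs-compute check for one neural-network layer (assumed figures).
def bound_times(flops: float, bytes_moved: float,
                peak_flops: float, peak_bw: float) -> tuple[float, float]:
    """Return (compute-limited time, bandwidth-limited time); the larger one dominates."""
    return flops / peak_flops, bytes_moved / peak_bw

# Example: a 4096 x 4096 matrix-vector multiply in FP16 (weight traffic dominates).
flops = 2 * 4096 * 4096        # one multiply-add per weight
bytes_moved = 2 * 4096 * 4096  # each FP16 weight read once (2 bytes)

compute_t, memory_t = bound_times(flops, bytes_moved,
                                  peak_flops=100e12,  # ~100 TFLOPS accelerator (assumed)
                                  peak_bw=900e9)      # ~900 GB/s HBM2-class memory (assumed)
print(compute_t, memory_t)  # memory time is ~100x larger: the layer is bandwidth-bound
```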
At the moment JEDEC has not yet finalized the HBM3 standard, but the IP vendors taking part in drafting it are already prepared. Not long ago Rambus was the first to announce a memory subsystem supporting HBM3, and recently Synopsys announced the industry's first complete HBM3 IP and verification solution.
As early as the beginning of 2021, SK Hynix gave a forward-looking outlook for HBM3 products: bandwidth above 665 GB/s and I/O speeds above 5.2 Gbps, though these were only interim figures. Also in 2021, numbers released by IP vendors raised the ceiling further; Rambus, for example, announced an HBM3 memory subsystem with I/O speeds up to 8.4 Gbps and memory bandwidth up to 1.075 TB/s.
In June of this year, Taiwan's Creative Electronics released an AI/HPC/networking platform based on TSMC's CoWoS technology, equipped with an HBM3 controller and PHY IP running at I/O speeds up to 7.2 Gbps. The company is also applying for an interposer routing patent that supports zigzag routing at arbitrary angles and allows the HBM3 IP to be split across two SoCs.
Synopsys's complete HBM3 IP solution provides controller, PHY, and verification IP for 2.5D multi-die package systems, letting designers put lower-power, higher-bandwidth memory into their SoCs. The DesignWare HBM3 controller and PHY IP build on Synopsys's silicon-proven HBM2E IP, with the HBM3 PHY implemented on a 5 nm process; each pin can run at 7200 Mbps, raising memory bandwidth to 921 GB/s.
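These headline numbers follow from the same per-pin arithmetic used for HBM2 earlier, assuming the interface stays 1024 bits wide per stack (my assumption for the Rambus figure, since the standard was not yet final):

```python
# Quoted HBM3 bandwidths = per-pin rate x 1024 pins / 8 bits per byte.
def hbm_bw_gbytes(pin_rate_gbps: float, pins: int = 1024) -> float:
    return pin_rate_gbps * pins / 8  # GB/s

print(hbm_bw_gbytes(7.2))  # 921.6  GB/s -- the Synopsys figure
print(hbm_bw_gbytes(8.4))  # 1075.2 GB/s -- the Rambus figure (~1.075 TB/s)
```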
At present, memory makers including Micron, Samsung, and SK Hynix are already following the new DRAM standard. SoC designer Socionext has worked with Synopsys to bring HBM3 into its multi-die designs; beyond the x86 platforms that will naturally support it, Arm's Neoverse N2 platform also plans HBM3 support, and SiFive has added HBM3 IP to its RISC-V SoCs. But even if JEDEC does not get "stuck" and releases the official HBM3 standard at the end of this year, we may have to wait until the second half of 2022 to see actual HBM3 products.