Somewhat limited improvements of client and server CPUs coupled with delays of new process technologies made Intel look pale. But being a large company has its perks and Intel's roadmap reveals a path to regaining market share and mind share...
At its Architecture Day last month, Intel Corp. disclosed additional details about its upcoming process technologies and how they will affect its product lineup from 2020 onwards. Furthermore, Intel revealed its packaging technologies roadmap, which is crucial for its next-generation CPUs, GPUs, FPGAs, and other products for different market segments. Finally, the company disclosed its new chiplet ideology that promises to change the way Intel builds its chips.
In search of new Intel
Throughout its history, Intel has had multiple key building blocks that enabled its growth and prosperity: microarchitectures that offered the right balance between simplicity and performance; top-notch process technologies and production capacities; and corporate strength coupled with a well-known brand. But the world is constantly evolving, and what was good enough 10 years ago is not enough today. Intel tends to recognize strategic inflection points early enough to respond to new challenges. So, when the company realized that it needed to offer more than CPUs to stay ahead of the competition, it introduced client and server platforms in the early 2000s and acquired FPGA and various AI/ML/DL companies in the second half of the 2010s to gain appropriate assets.
But getting new assets is only a part of the job for a multi-billion-dollar corporation like Intel. To fully reinvent itself for the coming years, Intel needs the right set of building blocks to make new products. This is where new process technologies come in, only this time it is not enough for Intel to have the right manufacturing nodes. The company needs to build heterogeneous multi-chip/multi-tile solutions to address new workloads. Building them requires advanced 2.5D and 3D packaging technologies. Furthermore, Intel might need to embrace third-party manufacturing technologies at a larger scale than it does today, beyond its chipsets, FPGAs, and other products (Mobileye, Movidius, etc.).
Process technologies: 10nm evolution
As fabrication technologies get more complex, it gets harder to iterate them and introduce a new manufacturing node every 18 – 24 months. A new major node today means new materials, new device architectures, and a host of other things that make everything work. Foundries have always taken an iterative approach to their process technologies in a bid to offer an improvement on a yearly (or so) cadence, but Intel did not need exactly that, since its Tick-Tock model worked perfectly and at some point its customers ceased to announce their products at traditional seasons (we will talk about it later in this article).
Everything changed with the arrival of Intel’s 14nm process, which got off to such a rough start with the Broadwell processor in 2014 that the company had to revamp it for the subsequent Skylake generation launched in 2015. Over time, Intel had to improve its 14nm node several times, which eventually enabled the company to boost the performance provided by the node by over 20%. Intel used several methods to improve 14nm performance, though it has never revealed all of its secrets.
It should be noted that all semiconductor makers perform continuous process improvements (CPI) to improve yields and use statistical process control (SPC) to reduce performance variations, which in general means lower costs. What Intel did went beyond regular CPI-related improvements: the company changed the FinFET structure and some other things, which required new libraries and an almost complete redesign of chips. It appears that Intel uses the same approach for its 10nm improvements.
With 10nm, Intel set such aggressive goals that the first version of the process was largely a commercial failure: its Cannonlake processor was used for a limited number of products. But the technology set the ground rules and a path for further innovation. Its three key features, namely contact over active gate (COAG), the use of cobalt for local interconnects, and self-aligned quadruple patterning (SAQP) at the M0 and M1 layers, are still considered the main pillars of the 10nm node at large. The company significantly redesigned the technology for its Ice Lake processors. According to TechInsights, Intel’s 2nd Generation 10nm node features a new fin architecture to boost performance; an altered BEOL process to simplify integration as well as improve performance; changed process integration to boost margins; and a different contact layout (think of an improved implementation of COAG at the design level) to improve yields. Meanwhile, usage of the 2nd Generation 10nm process was limited to mobile CPUs with a relatively small die size, an indicator that the node still needed refinement to meet Intel’s standards for margins.
For its Tiger Lake family of mobile processors and a number of other products, Intel is set to use its 3rd generation 10nm process that is officially called 10nm SuperFin. The new node is said to bring an 18% performance improvement when compared to Intel’s ‘base’ 10nm technology. Intel’s 10nm SuperFin process will be used much more widely than the company’s 1st and 2nd generation 10nm technologies, which is why Intel needed across-the-board improvements of performance and energy efficiency.
“After years of refining the FinFET platform, we are redefining it to deliver an unprecedented level of performance uplift,” said Ruth Brain, an Intel Fellow and director of interconnect technology and integration. “We achieved this through a combination of innovations across the entire process stack. From the bottom of the transistor channel, all the way to the top interconnect metal layers.”
10nm SuperFin platform: for Tiger Lake, Xe-LP GPU, and more
Intel says that the new technology is an evolution of the entire process stack, but there are three key areas that Intel highlighted in its announcement:
Within the transistor, Intel changed the epitaxial growth of crystal structures on the source and drain to increase strain and reduce resistance. This also lowers parasitic capacitance and increases the current flow through the channel. Intel further improved the source/drain architecture to achieve higher channel mobility and enable higher current. Lower resistance and higher currents typically mean the ability to reduce power consumption when required and improve performance when needed. In addition, Intel increased gate pitch for higher drive currents (because of lower parasitic capacitance) in a bid to enable even higher performance. Relaxed pitches could also help to improve yield. Interestingly, Intel says that the looser gate pitch did not affect transistor density because it also optimized its libraries for the new process.
“But even if you go into these high frequency tight blocks, even with a slightly looser gate pitch, you can still gain in area,” said Brain. “I know that sounds counterintuitive. But when you have the right types of library heights and gate pitches, each cell is actually a higher performance itself. So it takes less cells of buffers and inverters basically to build onto that to get the performance that we are actually achieving. So, despite the fact that we did do a looser gate pitch, [transistor density was] equivalent or better on area for those blocks, even with a looser gate pitch.”
The performance and power of transistors depend heavily on the performance of the chip’s power delivery network (PDN). The PDN must be tailored for a particular process technology and a particular design in a bid to efficiently supply power and quickly respond to the behavior of a modern processor (which changes drastically depending on the workload). This is where the new barrier for transistor contact vias, the new metal-insulator-metal capacitor, and the redesigned lower metal stack (M0 – M3) come into play.
Resistance at transistor contact vias (conductors) greatly impacts the performance and power of modern transistors, so with virtually every new node semiconductor makers try to fight it, because as the physical dimensions of contacts shrink, their resistance increases. In fact, process technologies that use EUVL now use a selective tungsten deposition process for vias in a bid to radically reduce resistance at transistor contacts by removing dielectric layers that consist of a titanium/titanium nitride (TiN) liner/barrier and tungsten nucleation layers. This is not the case with Intel’s 10nm node, for several reasons. The company’s existing fabs for the 10nm process are not configured for selective tungsten deposition, and reconfiguring them would change the whole process flow, which is risky overall, and additional risks are not something that Intel can afford these days. As a result, the company has to invent new methods to fight via resistance using available tools.
One of these methods is to make the dielectric (cladding) layers of conductors thinner by either changing the way they are formed or using a different material. The cobalt transistor contacts Intel used for its 10nm process already enabled the company to make the TiN liner/barrier thinner and remove a nucleation layer. Apparently, Intel’s Novel Thin Barrier makes the cladding layers even thinner, which improves the conductors and reduces via resistance by 30%. Ultimately, better transistor contact vias mean higher performance and lower power.
Modern processors are very complex devices. They draw a lot of power when they use vector units and consume much less when they perform simple arithmetic using integer units. To make matters more complex, processors for client PCs and datacenters are optimized for different behavior. Server CPUs operate under significant loads all the time, but can burst to higher frequencies when demand spikes. Client processors spend most of their time in idle mode or under light loads. But when you fire up an application or perform a task that requires compute horsepower, these CPUs need to burst from idle to maximum performance and beyond in a matter of microseconds to guarantee a fine user experience. In fact, client CPUs are optimized for burst behavior. To enable these ultra-fast bursts for its 10nm SuperFin designs, Intel incorporated its so-called Super MIM capacitor into the BEOL metallization stack.
“It looks like they really redesigned the lower metal layers,” said David Kanter, president of Real World Tech and inference and power co-chair at MLPerf. “It wouldn’t surprise me if they relaxed the pitch from 36nm to 40nm, which would generally improve resistance and also yield.”
Intel says that the Super MIM is a metal-to-metal parallel plate structure consisting of ‘different Hi-K materials each just a few Angstroms thick, stacked in a repeating superlattice.’ The supercapacitor is said to increase areal decoupling density by 5 times and greatly improve power supply stability. The latter has an effect not only on a processor’s behavior, but also on yields, which means costs.
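Why extra decoupling capacitance helps power supply stability can be seen with a first-order estimate. The sketch below assumes that, for the brief interval before the rest of the power delivery network reacts, on-die capacitance alone feeds a sudden load step, so the rail droops by roughly dV = I * dt / C. The current step, response time, and baseline capacitance are hypothetical values picked to make the arithmetic visible, not Intel figures; only the 5x ratio comes from Intel's claim.

```python
def droop_volts(step_current_a: float, response_time_s: float, capacitance_f: float) -> float:
    """First-order rail droop when only decoupling capacitance feeds a load step: dV = I * dt / C."""
    if capacitance_f <= 0:
        raise ValueError("capacitance must be positive")
    return step_current_a * response_time_s / capacitance_f


# All three values below are hypothetical and purely illustrative.
STEP_CURRENT_A = 10.0     # assumed sudden increase in load current
RESPONSE_TIME_S = 1e-9    # assumed time before the rest of the PDN responds
BASELINE_CAP_F = 100e-9   # assumed on-die decoupling capacitance before the 5x increase

baseline = droop_volts(STEP_CURRENT_A, RESPONSE_TIME_S, BASELINE_CAP_F)
with_5x = droop_volts(STEP_CURRENT_A, RESPONSE_TIME_S, 5 * BASELINE_CAP_F)

print(f"baseline droop: {baseline * 1000:.0f} mV")
print(f"droop with 5x capacitance: {with_5x * 1000:.0f} mV")
```

In this simplified model the droop shrinks by the same factor as the capacitance grows, whatever the absolute numbers are. A real PDN has inductance and resistance that this estimate ignores.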
“I think if you look back 10 years ago, the focus was mostly on the transistor, but what we see today is that the metal stack is equally important for performance and increasingly yield,” said Kanter. “Part of it is that over time, we have had to add more and more metal layers, so the metal stack becomes a larger part of the cost and yield equation.”
The very first product to use Intel’s 10nm SuperFin node is the company’s 11th generation Core CPU codenamed Tiger Lake. Based on the company’s claims, the new process technology, along with the Willow Cove CPU core architecture and the Xe-LP graphics core architecture, enables Tiger Lake to hit higher frequencies at lower voltages and burst to higher clocks at higher voltages for greater performance when it is needed most. In fact, Intel says that Tiger Lake CPUs can scale all the way to 55W, which is good enough for a desktop CPU.
This higher voltage/higher frequency capability is a double-edged sword though. On the one hand, it helps to offer higher performance and a better user experience. On the other hand, a lot has to be done at the system level to make use of it. Cooling is an obvious example, but in addition there are a quality voltage regulator module (VRM), firmware, and fine-tuning to take care of. All of these are hard to design and they come at a cost. In fact, the challenging designs are among the reasons why big PC makers are no longer inclined to make big announcements at trade shows.
In addition to Tiger Lake CPUs, Intel uses its 10nm SuperFin to produce its Xe-LP-based DG1 discrete GPU and the SG1 card that carries four DG1 GPUs. Tiger Lake SoCs in their current UP3 and UP4 forms can scale from 12W to 28W and from 7W to 15W, respectively. Meanwhile, the architecture can even hit 55W, which is rather good scalability. At this point, we do not know much about the discrete Xe-LP GPU except that it is aimed primarily at laptops as well as cloud gaming and video streaming applications. But in any case, all GPUs are tailored for burst performance and variable clocks.
It should be noted that Intel’s 2nd generation 10nm process technology can also boast good scaling. Intel’s 10th generation Core ‘Ice Lake’ processors for notebooks have a TDP between 8W and 28W, whereas the Atom x6000E-series ‘Elkhart Lake’ SoCs start at 4.5W (entry-level) and top out at 12W (high-end), so it is evident that the process technology can also be used for ultra-low-power chips at this point. Since the technology will also be used for the upcoming Xeon Scalable ‘Ice Lake-SP’ processors that will have to hit hundreds of Watts, it is clear that Intel has been meticulously working on its 2nd generation 10nm process to make both ultra-low-power and ultra-high-performance CPUs possible.
Process technologies: 10nm Enhanced SuperFin
Intel’s 10nm SuperFin fabrication process was primarily developed for client CPUs and GPUs that happen to be relatively small yet have to burst to the max rapidly. Large chips for datacenters have different requirements, so Intel has also developed its 10nm Enhanced SuperFin technology to meet these requirements.
“Our friends in the data center saw what we were developing with the new 10nm SuperFin technology with the client group,” said Brain. “They asked us to make some further enhancements for datacenter product specifically. Servers benefit from interconnect enhancements due to the large amount of data that needs to be shared across the chip. So in addition to continued transistor optimization to deliver more internode performance, we also focused on improving the metal stack with interconnect layer optimizations that make datacenter scale fabrics for CPU and GPU more easily routable.”
With its 10nm Enhanced SuperFin, Intel again goes beyond traditional CPI and SPC. This node promises to bring additional performance and new interconnects, though Intel has not disclosed all the details just yet.
Modern server processors tend to feature high core counts that need to work very efficiently under typical loads to keep power bills down. These cores also have to be able to work at high clocks to ensure quality-of-service when demand is high. Given how server CPUs and GPUs are used and made nowadays, process libraries have to be optimized for all types of IP.
“Generally, I expect that in a server chip you are less sensitive to density and really value performance per Watt,” said Kanter. “The server process may have some additional modules and features, its transistors may be tuned differently. […] In many cases, tuning the transistor is about picking parameters like the threshold voltage and max voltage. Also, the process characterization and control will be better.”
Since datacenter chips tend to be huge, their power delivery circuitry has to be very advanced in a bid to feed all those transistors, so Intel will again enhance its MEOL and BEOL.
So far, Intel has disclosed three products that will be made using its 10nm Enhanced SuperFin technology: the Sapphire Rapids processor for servers (which is allegedly based on Golden Cove cores), the upcoming Xe-HP GPU for datacenters, and the Rambo cache tile for its Ponte Vecchio GPU for supercomputers. Ice Lake-SP processors for datacenters still use Intel’s 2nd generation 10nm technology, the same one used for Ice Lake CPUs for client PCs.
“So the Sapphire Rapids uses the Enhanced SuperFin that we alluded to today,” said Raja Koduri. “The Ice Lake[-SP] is based on a prior version of 10nm, so it is not a SuperFin. When Ice Lake[-SP] was conceived and optimized in terms of architecture and process, it was co-optimized in a collaboration with 10nm.”
What is intriguing is which process technology will be used to make 10nm CPUs for desktops.
“Believe it or not, we actually want a different process for notebooks, servers, and desktops,” said Kanter.
Unlike notebooks, desktops do not have to be very energy efficient, yet they have to be optimized for high frequencies and bursty performance. Unlike servers, they do not need to process large chunks of data 24/7, but they still need to have enough horsepower to run demanding workloads. Meanwhile, costs matter for desktop client CPUs, so high transistor density has to be maintained.
Process technologies: 7nm
Given the delay of its 7nm high-volume manufacturing, Intel does not really want to talk about the node. In fact, back in August Intel did not even call the successor to its 10nm process ‘7nm’, but used the term ‘next generation’ instead. Meanwhile, Intel did confirm that compute tiles of its ambitious Ponte Vecchio supercomputer GPU will be made both at an external fab and internally using its ‘next generation’ node.
Since we know almost nothing about Ponte Vecchio’s compute tile, it is hard to guess how Intel’s 7nm is optimized. It is reasonable to expect Intel to use its next-generation datacenter node for the compute tile. Meanwhile, given the nature of supercomputers that work under high loads all the time, Intel might use a custom datacenter process with some optimizations for its Xe-HPC GPU.
What is particularly noteworthy about Intel’s Ponte Vecchio is that Intel will use four or five different process technologies as well as Foveros and EMIB packaging technologies to build this processor.
Chiplet designs change SoC methodology, devtime, and costs
In fact, Intel’s Ponte Vecchio is a great example of how complex chips will be built in the future.
The rise of AI/ML as well as increasing demand for performance by exascale and cloud datacenters has largely fueled the die size race in recent years. Nvidia’s A100 GPU includes 54.2 billion transistors with a die size of 826 mm², whereas Graphcore’s Colossus MK2 GC200 intelligence processing unit (IPU) packs 59.4 billion transistors, probably in an even larger area. The maximum reticle size of today’s DUV and EUV scanners is 26 mm by 33 mm, or 858 mm², so if someone wants to go beyond that, they have to use some novel packaging technologies.
But there is a big problem with these large monolithic chips: it takes three to four years and hundreds of millions of dollars to design and verify them (not counting architecture development), according to Intel. In addition, at tens of billions of transistors, these chips are so complex that there are hundreds of bugs in silicon that have to be mitigated somehow, which adds costs beyond the hundreds of millions spent on silicon design. Furthermore, even at low defect densities, it is hard to reach truly high yields for such chips. The latter is not a huge problem if you sell your final product for $10,000+ per chip (and much more per system), but at some point low yields might constrain your business.
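The yield point can be illustrated with the simple Poisson yield model, Y = exp(-A * D0), where A is die area and D0 is defect density. This is a textbook approximation rather than anything Intel disclosed, and the defect density used below is an assumed, illustrative value. The 826 mm² figure is the A100 die size quoted above; splitting it into four equal tiles is a hypothetical comparison.

```python
import math


def poisson_yield(die_area_mm2: float, defects_per_cm2: float) -> float:
    """Fraction of defect-free dies under the Poisson model Y = exp(-A * D0)."""
    area_cm2 = die_area_mm2 / 100.0  # 1 cm^2 equals 100 mm^2
    return math.exp(-area_cm2 * defects_per_cm2)


# Assumed defect density for illustration only; not an Intel or foundry figure.
ASSUMED_D0 = 0.1  # defects per cm^2

monolithic_area = 826.0            # mm^2, the A100 die size quoted in the text
tile_area = monolithic_area / 4.0  # hypothetical split into four equal tiles

monolithic_yield = poisson_yield(monolithic_area, ASSUMED_D0)
tile_yield = poisson_yield(tile_area, ASSUMED_D0)

print(f"monolithic {monolithic_area:.0f} mm^2 die: {monolithic_yield:.1%} defect-free")
print(f"single {tile_area:.1f} mm^2 tile: {tile_yield:.1%} defect-free")
```

Under this model a smaller tile is always more likely to be defect-free than the large die it came from, and a defective tile wastes far less silicon. The model ignores defect clustering, redundancy, and packaging yield, so the absolute numbers should not be read as estimates for any real product.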
For processors, Intel has put two pieces of silicon into the package for years now, whereas AMD went even further and adopted a nine-piece design for its Epyc ‘Rome’ CPU to increase core count (keep Amdahl’s law in mind though: not every software application scales). According to Intel, even a multi-die design takes two to three years to design because things like CPUs are very complex. The semiconductor giant says there are still tens of bugs in such designs that have to be mitigated.
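The caveat about Amdahl's law can be made concrete. The law states that if a fraction p of a workload can be parallelized across n cores, the speedup is 1 / ((1 - p) + p / n). The sketch below uses made-up parallel fractions and core counts purely for illustration; they do not describe any particular application or CPU.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Speedup predicted by Amdahl's law for a given parallel fraction and core count."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Illustrative values only: three hypothetical workloads on three core counts.
for fraction in (0.50, 0.95, 0.99):
    row = ", ".join(
        f"{cores} cores: {amdahl_speedup(fraction, cores):.1f}x" for cores in (8, 16, 64)
    )
    print(f"parallel fraction {fraction:.2f} -> {row}")
```

Even a workload that is 95% parallel gains far less than 64x from 64 cores, because the serial 5% dominates once the parallel part has been spread thin.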
Intel says that it needs to change its SoC design methodology further. At present, Intel is thinking about disaggregating certain components at the IP level, producing IP cores using the most suitable process technology for each, and then purpose-building processors in accordance with their usage model to provide the best user experience. For example, an office worker needs enough performance for productivity along with a compact form-factor, but does not need the amount of general-purpose compute and graphics horsepower a content creator needs.
“Our purpose has changed from building monolithic general-purpose SoC to building scalable purpose-built devices, to provide rich user experience,” said Brijesh Tripathi, general manager of Client Architecture and Immersive Computing at Intel. “We have used all the technologies we talked about today to build this architecture, including disaggregation, advanced packaging tools, new memory technologies, and extreme innovation in virtualization and One software API.”
Intel’s Lakefield processor, launched earlier this year, uses some of the principles outlined by Intel: its compute and graphics die is made using 2nd generation 10nm technology, its base/interposer die is produced using the 22FFL node, and it has a package-on-package LPDDR4X memory module. The upcoming Ponte Vecchio will use these chiplet principles more extensively, yet right now the company calls the architecture ‘Client 2.0.’
“Overall, Client 2.0 is about delivering winning products at an annual cadence with these vectors in mind: experience first, scalable, energy efficient, and efficient use of Moore’s law,” said Tripathi. “I am super excited with what is possible with Client 2.0.”
To some degree, disaggregation of SoCs and chiplet design will help Intel to overcome its challenges with process technology simply because it is easier to produce smaller homogeneous dies than to make big heterogeneous processors. By disaggregating its SoCs, Intel hopes to ensure that it can deliver products to partners on a predictable annual cadence, which will greatly help its customers (and Intel itself).
But chiplet designs have their own peculiarities and limitations. The use of heterogeneous tiles made using different process technologies (as in the case of Lakefield and Ponte Vecchio) and by different producers makes power management of such designs a challenge. Intel says that the disaggregated chiplet approach will enable it to outsource production of a component it cannot produce itself to a third party, but capacity bookings have to be made early enough and this will hurt margins. The logistics of multi-chip designs will be different from what Intel has today. Last but not least, the advanced packaging and interconnect technologies Intel is talking about use actual semiconductor fabs, which are both expensive and not particularly easy to use.
Packaging technologies: Going to sub-10 micron bump pitches
At present, Intel has two advanced packaging methods: the 2.5D Embedded Multi-Die Interconnect Bridge (EMIB) interconnect as well as the 3D logic-to-logic stacking Foveros technology. Both technologies are used for commercial products and both are being refined for future designs.
EMIB features 55µm to 36µm (micron) bump pitches to enable a bump density from 330 to 772 bumps per mm², and uses 0.50 picojoules per bit transferred. Foveros brings in 50µm to 25µm bump pitches, a bump density from 400 to 1,600 bumps per mm², and uses 0.15 picojoules per bit transferred. These technologies are good enough to build products like Lakefield, but for its longer-term plans the company needs to increase bump density, reduce power consumption, reduce capacitance, mix and match 2.5D and 3D packaging, and ensure that everything is manufacturable and works.
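The relationship between bump pitch and bump density follows from simple geometry. The sketch below assumes bumps sit on a uniform square grid, which is an idealization rather than a statement about Intel's actual bump layouts; under that assumption the density is (1000 / pitch in µm)² bumps per mm², which lands on or close to the densities quoted for these pitches.

```python
def bumps_per_mm2(pitch_um: float) -> float:
    """Bumps per square millimeter for an idealized square grid with the given pitch."""
    if pitch_um <= 0:
        raise ValueError("pitch must be positive")
    bumps_per_mm = 1000.0 / pitch_um  # bumps along one millimeter of a row
    return bumps_per_mm ** 2


# Pitches mentioned in this section; the square-grid layout itself is an assumption.
PITCHES_UM = {
    "EMIB, 55 um": 55.0,
    "EMIB, 36 um": 36.0,
    "Foveros, 50 um": 50.0,
    "Foveros, 25 um": 25.0,
    "Hybrid Bonding target, 10 um": 10.0,
}

for label, pitch in PITCHES_UM.items():
    print(f"{label}: about {bumps_per_mm2(pitch):,.0f} bumps per mm^2")
```

Because density scales with the inverse square of the pitch, halving the pitch quadruples the number of connections in the same area, which is why small pitch reductions matter so much for die-to-die bandwidth.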
“Ultimately, our packaging technologies are about increasing density and decreasing power allowing chiplets to be connected in a package with functionality that matches or exceeds the functionality of a monolithic SoC,” said Ramune Nagisetty, director of product and process integration at Intel. “The benefits to this approach include lower cost, greater flexibility, and quicker time to market. In order to gain these benefits, chiplet architectures need to be developed and co-optimized with silicon design and the overall system.”
For denser, higher-bandwidth, and more energy-efficient logic-on-logic vertical 3D integration, Intel is working on its Hybrid Bonding technology that is set to feature sub-10µm bump pitches to enable bump densities of over 10,000 interconnects per mm² at 0.05 pJ/bit. Intel has already taped out the first multi-die SRAM chip enabled by Hybrid Bonding. Considering that it usually takes several years to perfect a technology, it is likely that Hybrid Bonding will be used by Intel in its 7nm era, yet the chip giant does not commit to any timeframes at this point.
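The energy-per-bit figures translate directly into interconnect power once a bandwidth is fixed: power equals bits per second multiplied by joules per bit. The sketch below compares the three quoted figures at an assumed die-to-die bandwidth of 1 TB/s. That bandwidth is a hypothetical round number chosen for illustration, not a disclosed specification of any Intel product.

```python
def link_power_watts(bandwidth_gbytes_per_s: float, picojoules_per_bit: float) -> float:
    """Power spent moving data across a die-to-die link at the given energy per bit."""
    bits_per_second = bandwidth_gbytes_per_s * 1e9 * 8
    joules_per_bit = picojoules_per_bit * 1e-12
    return bits_per_second * joules_per_bit


# Hypothetical bandwidth for illustration only (1 TB/s).
ASSUMED_BANDWIDTH_GBYTES_PER_S = 1000.0

# Energy-per-bit figures as quoted in this section.
ENERGY_PJ_PER_BIT = {"EMIB": 0.50, "Foveros": 0.15, "Hybrid Bonding": 0.05}

for name, pj_per_bit in ENERGY_PJ_PER_BIT.items():
    watts = link_power_watts(ASSUMED_BANDWIDTH_GBYTES_PER_S, pj_per_bit)
    print(f"{name}: {watts:.1f} W at {ASSUMED_BANDWIDTH_GBYTES_PER_S:.0f} GB/s")
```

At a fixed bandwidth the power scales linearly with energy per bit, so going from 0.50 to 0.05 pJ/bit cuts the link power tenfold.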
In addition to very complex interconnection design, logic-on-logic integration has two more challenges: thermals and power delivery. Intel says that the most bandwidth and power hungry chiplet has to be placed on top of the stack. To ensure that the top layer gets enough bandwidth and power, Intel has developed its Omni-Directional Interconnect (ODI) technology that uses copper pillars to connect the top layer to the package substrate. Such pillars are better for power delivery than TSVs (used by Foveros) and yet they can also be used as high-bandwidth interconnects. Also, they enable chiplet designers to optimize floorplans of different chiplets in the stack independently and without caring about die sizes. Intel has not disclosed bump densities for ODI.
Intel’s Hybrid Bonding and ODI technologies will enable vertical scaling of Intel’s chiplet designs. For ‘horizontal’ larger-than-reticle scaling, Intel proposes to use Co-EMIB, a technology that promises to offer a lot of benefits for multi-chip designs aimed at HPC and other performance-demanding applications.
“Co-EMIB is essentially a combination of EMIB and Foveros which allows us to create a larger than reticle size base for high-density interconnect between Foveros die stacks,” said Brain. “This increases our partitioning opportunities, which is especially attractive for datacenter and HPC applications.”
Intel has a rather solid roadmap for its 2.5D and 3D packaging and interconnection methods that will support its chiplet designs for various purposes. Intel naturally outlines only advantages that its forthcoming technologies bring and never mentions challenges that it faces as well as costs of such designs.
Common sense implies that the chiplet approach is meant to be cheaper than a large monolithic die approach. Yet, the industry is still struggling with the costs of HBM, which stacks DRAM layers on a relatively simple base die and features 55µm bump pitches and 25µm bumps. With Hybrid Bonding, Intel goes to sub-10µm bump pitches and probably sub-5µm bumps, and this is going to be expensive and hard to implement. In fact, at some point, Intel might face physical problems with its packaging technologies similar to those it faces with its semiconductor technologies.
In the semiconductor world, there are decisions that have a long-lasting effect. Setting aggressive goals for its initial 10nm process technology many years ago required Intel to equip its fabs with certain tools and adopt certain production flows that use these tools. When the 1st generation 10nm process technology demonstrated way too high defect densities, Intel could not just abandon it and develop a new node from scratch because of time, development costs, and CapEx invested in fabs and equipment. As a result, the company has had to refine its 10nm technology. It is necessary to note that so far the results of Intel’s refinement have been rather spectacular, as not all new nodes from foundries bring a 20% performance improvement.
In addition to refining its 10nm node, Intel also had to diversify its process technologies: there are now 10nm SuperFin for client/small chip designs tailored for high clocks as well as 10nm Enhanced SuperFin for larger datacenter-grade solutions tailored for large chips and ultimate power efficiency. Intel still has to disclose which version of the process will be used for its 10nm desktop parts.
Having observed that processors for different applications have dramatically different requirements these days and that building a large monolithic SoC for every kind of application is both time consuming and expensive, Intel seems to have reconsidered the way it wants to build its SoCs in the future. Intel’s chiplet approach, already used for its Lakefield SoC and set to be used for the Ponte Vecchio supercomputer GPU, promises benefits for Intel’s customers, end users, and the chip giant itself.
Chiplets require advanced packaging and interconnection technologies for 2.5D and 3D integration. Intel does have a solid packaging roadmap that includes advancements to the existing EMIB and Foveros technologies as well as all-new Hybrid Bonding, Omni-Directional Interconnect (ODI), and Co-EMIB technologies that will come to fruition in the coming years. 2.5D and 3D packaging methods are much more complex and expensive than traditional chip packaging, but they are so crucial to the company’s roadmap that the company just has to succeed.