

Chip Scale Review July • August • 2019


for multiple levels of cache in a die. The L1 cache is generally the smallest and integrated into a given processing unit. Once there is a need to move data from a processor to cache, a price in power is paid. This is especially true when there is a need to add a network on chip (NoC) or an arbiter to connect to many memories and/or processors. Instead, consider treating the memory in the adjacent die as if it were bonded or located on top of the same die, yet the routes will travel a shorter distance in the 3D case, given that they can route just a few microns between die, but would have to travel millimeters within the same die in a 2D situation. Using this mindset, a design can be envisioned with L1 cache completely over the processor, and L2 elsewhere. There is little need for L3 or external memory unless it is needed for storage capacity instead of low-latency bandwidth.
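The microns-versus-millimeters argument above can be made concrete with a back-of-the-envelope comparison. The sketch below uses assumed, illustrative values for wire capacitance, supply voltage, and route lengths (none come from the article); the point is only that dynamic wire energy scales with route length, so a few-micron vertical hop is far cheaper than a millimeter-scale on-die route.

```python
# Illustrative comparison of the dynamic energy needed to drive a ~2 mm
# on-die route to a cache versus a ~5 um face-to-face hop between bonded
# die. Capacitance, voltage, and lengths are assumptions for this sketch.

C_PER_MM = 0.2e-12   # assumed wire capacitance, ~0.2 pF per mm
VDD = 0.8            # assumed supply voltage, volts

def switching_energy(length_mm, activity=0.5):
    """Dynamic switching energy per bit: alpha * C * V^2."""
    return activity * C_PER_MM * length_mm * VDD ** 2

e_2d = switching_energy(2.0)     # ~2 mm route across the same die
e_3d = switching_energy(0.005)   # ~5 um vertical hop through the bond

print(f"2D route uses {e_2d / e_3d:.0f}x the energy of the 3D hop")
```

With these assumptions the ratio is simply the ratio of route lengths, a factor of several hundred, which is why keeping L1 directly over the processor pays off.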

Historically, scaling predictably improved device transistor density with each new process node. For the last few nodes, starting around the 28nm node, however, this scaling factor was no longer linear. This was because a given high-performance design became routing-limited instead of transistor-limited. As seen in Figure 9, designs have not increased density in recent nodes because of this phenomenon. Instead, consider that in a dense DBI implementation, there is the opportunity to have far more routing layers because the routing can be shared between the face-to-face bonded die. This could pull the density scaling back to the original linear proportion and reduce area in designs using emerging nodes.



One question that often comes up is regarding test and KGD. In a W2W bonding scenario, there is little use for wafer probe prior to bonding, as there is no simple way to eliminate bad sites; the whole wafer will still be bonded. Some approaches have been proposed to alleviate this in a W2W environment, but none has been implemented effectively. Now, with the advent of DBI Ultra, there is the opportunity to assemble tested chiplets onto a tested host wafer, utilizing only the good sites.

The fine pitch (~1µm) interconnect used for these truly 3D designs would not accommodate probing. In addition, the probe mark left behind may make these pads unbondable. That said, there are two main approaches that can be used to get around this issue, as discussed below.

Test a subset of pads. Leveraging test compression for built-in self-test (BIST), a significant portion of the device can be tested. These tests would not cover the external interfaces unless there is a pre-drive loopback or something similar. The circuits that require the multiple layers of die to be bonded in order to complete the circuit would also not be available. The balance of tests performed at wafer probe through a BIST engine would be available through this reduced pad count. These extra pads can be accommodated in the DBI® Ultra process without an impact to the bonded interface.

Known good die. If the devices bonded together have a composite area less than the equivalent monolithically designed part, then the yield should be expected to be similar. In other words, assuming that defects are randomly scattered on the wafer surface, half the area will capture half the defects. As long as only the area and performance benefits are observed, and no more is added to the 3D design than what was in the 2D equivalent, the yield should be similar, and testing prior to 3D bonding may not be necessary.

On the other hand, if the die are much larger, approaching the size of a reticle, then the composite yield of both die may be poor. Probing prior to bonding, leveraging DBI Ultra with KGD by testing a subset of pads, may make more sense.
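The yield reasoning in the last two paragraphs can be checked with the classic Poisson yield model, Y = exp(-A·D0). The defect density and die areas below are illustrative assumptions, not figures from the article; the sketch shows both cases: two half-area die matching the monolithic yield, and two near-reticle-sized die yielding poorly without KGD.

```python
import math

def poisson_yield(area_mm2, d0):
    """Classic Poisson yield model: Y = exp(-A * D0)."""
    return math.exp(-area_mm2 * d0)

D0 = 0.002  # assumed defect density, defects per mm^2

# A 400 mm^2 monolithic 2D die versus two stacked 200 mm^2 die:
y_mono = poisson_yield(400, D0)
y_stack = poisson_yield(200, D0) * poisson_yield(200, D0)
# Halving the area halves the captured defects, so the composite
# yield of the stack matches the monolithic part exactly.

# Two near-reticle-sized (~800 mm^2) die bonded W2W without KGD:
y_big = poisson_yield(800, D0) ** 2

print(round(y_mono, 3), round(y_stack, 3), round(y_big, 3))
```

Under these assumptions the stacked pair yields the same as the monolithic die, while the composite yield of two reticle-sized die collapses to a few percent, which is exactly when probing a subset of pads before bonding earns its keep.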

Still, a better approach may be to leverage architectural improvements that take advantage of the closer proximity of compute elements enabled by DBI and DBI Ultra. Memory self-repair is widely used currently, but logical self-repair is not as broadly used on account of the timing requirements in an SoC. Blocks, such as processors, are large due to the efficiency of computing in a 2D layout [10]. This adds to the timing difficulty in large SoCs. In 3D, instead, the x-y area can be a fraction of that, depending upon the number of die layers used, making blocks more densely packed and more suitable for having spare blocks for repair. Considering an array of compute elements arranged over an array of complementary elements, where microarchitectures are more densely packed, a repair would have a shorter reach and be more likely to fit within the necessary timing window for a usable repair. This can be seen in Figure 10, where an array of processors (PRC) and memories (MEM) can be mapped such that a memory over a faulty processor can be remapped to a neighboring processor, and vice-versa. This is only effective if the remapping can be made within the same number of cycles, which is enabled by the close proximity of this logical repair.
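A Figure 10-style remapping can be sketched as a small grid algorithm. This is a hypothetical illustration, not the article's implementation: each memory sits over its own processor, and a memory over a faulty processor is reassigned to the nearest working neighbor, with the one-hop limit standing in for the "same number of cycles" timing constraint.

```python
# Hypothetical sketch of self-healing in a co-designed PRC/MEM array:
# a memory over a faulty processor is remapped to the nearest working
# neighbor, but only if the extra distance fits the timing window
# (modeled here by the assumed max_hops limit).

def remap(faulty, rows, cols, max_hops=1):
    """Map each site (r, c) to a working processor within max_hops."""
    working = {(r, c) for r in range(rows) for c in range(cols)} - set(faulty)
    mapping = {}
    for r in range(rows):
        for c in range(cols):
            if (r, c) in working:
                mapping[(r, c)] = (r, c)   # healthy: memory stays in place
                continue
            # nearest working neighbor by Manhattan distance
            best = min(working, key=lambda p: abs(p[0] - r) + abs(p[1] - c),
                       default=None)
            if best is not None and abs(best[0] - r) + abs(best[1] - c) <= max_hops:
                mapping[(r, c)] = best     # repair fits the timing window
            else:
                mapping[(r, c)] = None     # no usable repair
    return mapping

m = remap(faulty=[(0, 0)], rows=2, cols=2)
# the memory over the faulty (0, 0) processor now uses an adjacent one
```

The one-hop constraint is what makes the repair "usable" in this sketch: a spare farther away would exist, but the longer reach would miss the cycle budget, mirroring the article's point that dense 3D packing is what makes logical self-repair practical.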

Figure 9: Gates per mm² with evolving nodes.

Figure 10: Self-healing methodology for co-designed arrays.