for multiple levels of cache in a die. The L1 cache is generally the smallest and is integrated into a given processing unit. Whenever data must move between a processor and a cache, a price is paid in power. This is especially true when a network on chip (NoC) or an arbiter must be added to connect many memories and/or processors. Instead, consider treating the memory in the adjacent die as if it were located on the same die, bonded directly on top; the routes travel a far shorter distance in the 3D case, crossing just a few microns between die where they would have to travel millimeters within the same die in a 2D situation. With this mindset, a design can be envisioned with L1 cache directly over the processor and L2 elsewhere in the stack. There is little need for L3 or external memory unless it is needed for storage capacity rather than low-latency bandwidth.
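To put rough numbers on the distance argument, wire delay in a distributed RC model grows with the square of route length, so a few-micron hop between face-to-face bonded die beats a millimeter-scale route across a single die by orders of magnitude. The sketch below is a back-of-the-envelope illustration only; the per-micron resistance and capacitance values are assumptions, not measured process data.

```python
# Back-of-the-envelope distributed-RC wire delay (Elmore approximation).
# The per-micron R and C values below are illustrative assumptions.

R_PER_UM = 1.0      # wire resistance, ohms per micron (assumed)
C_PER_UM = 0.2e-15  # wire capacitance, farads per micron (assumed)

def elmore_delay_s(length_um: float) -> float:
    """0.5 * R * C delay of a uniform RC wire; grows as length squared."""
    return 0.5 * (R_PER_UM * length_um) * (C_PER_UM * length_um)

delay_2d = elmore_delay_s(2000.0)  # ~2 mm route across the same die
delay_3d = elmore_delay_s(5.0)     # ~5 um hop to the bonded die above

print(f"2D route: {delay_2d * 1e12:.1f} ps")
print(f"3D hop:   {delay_3d * 1e15:.4f} fs")
print(f"ratio:    {delay_2d / delay_3d:,.0f}x (quadratic in length)")
```

With these assumed values the 2 mm route costs hundreds of picoseconds while the 5 µm hop costs femtoseconds; repeaters would soften the quadratic in practice, but the gap remains large.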
Historically, scaling predictably improved device transistor density with each new process node. For the last few nodes, however, starting around the 28nm node, this scaling factor was no longer linear. This was because a given high-performance design became routing-limited instead of transistor-limited. As seen in the accompanying figure (gates per mm² with evolving nodes), achieved densities have not increased in recent nodes because of this phenomenon. Instead, consider that in a dense DBI implementation, there is the opportunity to have far more routing layers, because the routing can be shared between the face-to-face bonded die. This could pull the density scaling back to the original linear proportion and reduce area in designs using this approach.
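One way to picture the routing-limited effect is to treat a block's placed area as whichever is larger: the area its transistors need, or the area its wiring needs at a given number of routing layers. Sharing metal across the bonded interface raises the usable layer count and shrinks a routing-limited block until transistors dominate again. The numbers in this sketch are invented purely for illustration.

```python
# Hedged sketch of routing- vs transistor-limited block area
# (illustrative numbers only; not from the article or a real process).

def block_area(transistor_area_mm2: float,
               wire_demand_mm2: float,
               routing_layers: int) -> float:
    """Placed area is the larger of transistor demand and wiring demand,
    with wiring demand spread across the available routing layers."""
    routing_area = wire_demand_mm2 / routing_layers
    return max(transistor_area_mm2, routing_area)

# Hypothetical block: 1 mm^2 of transistors, 12 mm^2 of total wire demand.
for layers in (6, 12, 24):
    area = block_area(1.0, 12.0, layers)
    limit = "routing" if 12.0 / layers > 1.0 else "transistor"
    print(f"{layers:2d} layers -> {area:.2f} mm^2 ({limit}-limited)")
```

In this toy model, doubling the layer count by sharing routing between bonded die halves the area of a routing-limited block, which is the sense in which density scaling could return to its original proportion.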
One question that often comes up is regarding test and KGD. In a W2W bonding scenario, there is little use for wafer probe prior to bonding: there is no simple way to eliminate bad sites, as the whole wafer will still be bonded. Schemes have been proposed to alleviate this in a W2W environment, but none have been implemented effectively. Now, with the advent of DBI Ultra, there is the opportunity to assemble tested chiplets onto a tested host wafer, utilizing only the good sites.
The fine pitch (~1µm) interconnect
used for these truly 3D designs would not
accommodate probing. In addition, the
probe mark left behind may make these
pads unbondable. That said, there are
two main approaches that can be used to
get around this issue as discussed below.
Test a subset of pads. Using compression for built-in self-test (BIST), a significant portion of the device can be tested. These tests would not exercise the external interfaces unless there is a pre-drive loopback or something similar; circuits that require the multiple layers of die to be bonded in order to complete the circuit would also not be available. The balance of tests performed at wafer probe through a BIST engine would be available through this reduced pad count. The extra pads can be accommodated in the DBI® Ultra process without an impact to the design.
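The idea of compacting test access down to a few probe-able pads can be sketched with a generic signature register, in the spirit of a BIST engine's response compactor. This is a hypothetical illustration of the concept, not any specific vendor's test logic; the tap polynomial and widths are assumed.

```python
# Minimal, generic signature-register sketch: a long internal test-response
# stream collapses to a short signature readable over a few pads.
# (Real BIST engines are hardware; this only illustrates the idea.)

def sisr_step(state: int, data_bit: int, width: int = 16,
              taps: int = 0b1000000000010110) -> int:
    """Advance a linear-feedback signature register by one response bit.
    The tap polynomial here is an illustrative assumption."""
    feedback = (state >> (width - 1)) ^ data_bit
    state = (state << 1) & ((1 << width) - 1)
    if feedback:
        state ^= taps
    return state

def signature(response_bits) -> int:
    """Compact an arbitrarily long test-response stream to 16 bits."""
    state = 0
    for bit in response_bits:
        state = sisr_step(state, bit)
    return state

# A known-good device produces the expected ("golden") signature.
golden = signature([1, 0, 1, 1, 0, 0, 1, 0] * 100)
dut = signature([1, 0, 1, 1, 0, 0, 1, 0] * 100)
print("pass" if dut == golden else "fail", hex(dut))
```

The point of the compaction is that an 800-bit response stream is judged through a single 16-bit readout, which is the kind of access that survives a drastically reduced pad count.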
Known good die. If the devices bonded together have a composite area no greater than that of the equivalent 2D part, then the yield should be expected to be similar. In other words, assuming that defects are randomly scattered on the wafer surface, half the area will capture half the defects. As long as only the area and performance benefits are observed, and no more is added to the 3D design than what was in the 2D equivalent, the yield should be similar, and testing prior to 3D bonding may not be necessary.
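This reasoning matches the standard Poisson yield model, Y = exp(-A·D0), where A is die area and D0 is defect density (a textbook model assumed here; the article does not name one). Splitting a design across two half-area die leaves the composite yield unchanged, while blindly bonding two near-reticle-sized die multiplies two already-low yields, which is exactly the case where KGD testing pays off, as the next paragraph notes.

```python
# Poisson yield model sketch (standard textbook model, assumed here).
import math

D0 = 0.1  # defect density, defects per cm^2 (illustrative assumption)

def yield_poisson(area_cm2: float, d0: float = D0) -> float:
    """Fraction of die with zero defects under a Poisson defect model."""
    return math.exp(-area_cm2 * d0)

full_area = 1.0  # cm^2, hypothetical 2D design

# Two bonded die of half the area each: the composite yield (product
# of the two) matches the 2D equivalent exactly.
y_2d = yield_poisson(full_area)
y_3d = yield_poisson(full_area / 2) ** 2
print(f"2D part: {y_2d:.4f}   two half-area die bonded: {y_3d:.4f}")

# Reticle-sized die tell a different story: bonding two large die
# blind multiplies two already-low yields.
big = 8.0  # cm^2 each, near the reticle limit (illustrative)
print(f"one big die: {yield_poisson(big):.3f}   "
      f"two bonded blind: {yield_poisson(big) ** 2:.3f}")
```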
On the other hand, if the die are much
larger, approaching the size of a reticle,
then the composite yield of both die
may be poor. Probing prior to bonding,
leveraging DBI Ultra with KGD by testing
a subset of pads, may make more sense.
Still, a better approach may be to leverage architectural improvements that take advantage of the closer proximity of compute elements enabled by DBI and DBI Ultra. Memory self-repair is widely used currently, but logical self-repair is not as broadly used on account of the timing requirements in an SoC. Blocks such as processors are large due to the efficiency of computing in a 2D layout, which adds to the timing difficulty in large SoCs. In 3D, instead, the x-y area can be a fraction of that, depending upon the number of die layers used, making blocks more densely packed and more suitable for having spares for repair. Considering an array of compute elements arranged over an array of complementary elements, where microarchitectures are more densely packed, a repair would have a shorter reach and be more likely to fit within the necessary timing window for a usable repair. This can be seen in the accompanying figure (self-healing methodology for co-designed arrays), where an array of processors (PRC) and memories (MEM) can be mapped such that a memory over a faulty processor can be remapped to a neighboring processor, and vice versa. This is only effective if the mapping can be made within the same number of cycles, which is enabled by the close proximity of this logical repair.
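A toy sketch of such a remapping policy follows (hypothetical; the article describes the concept, not an implementation). Each memory is normally paired with the processor directly beneath it, and when that processor is marked faulty at test, the memory is reassigned to the nearest good neighbor in the array.

```python
# Toy sketch of neighbor remapping in a co-designed PRC/MEM array
# (hypothetical illustration of the self-healing idea).

ROWS, COLS = 3, 3
FAULTY = {(1, 1)}  # processors flagged bad at test, e.g. PRC(1,1)

def neighbors(r: int, c: int):
    """Orthogonal neighbors, one hop away, so the repair reach stays short."""
    for dr, dc in ((0, 1), (0, -1), (1, 0), (-1, 0)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < ROWS and 0 <= nc < COLS:
            yield nr, nc

def map_memory(r: int, c: int):
    """Map MEM[r][c] to the PRC beneath it, or to the nearest good neighbor."""
    if (r, c) not in FAULTY:
        return (r, c)
    for nr, nc in neighbors(r, c):
        if (nr, nc) not in FAULTY:
            return (nr, nc)  # one-hop repair: short reach, similar latency
    return None  # no usable repair within timing reach

for r in range(ROWS):
    for c in range(COLS):
        print(f"MEM({r},{c}) -> PRC{map_memory(r, c)}")
```

A real array would reserve spare elements so that a neighbor is not double-booked; the point here is only that a one-hop remap across a few microns of bond keeps the path short enough to meet the same cycle count.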
Figure: Gates per mm² with evolving nodes.
Figure: Self-healing methodology for co-designed arrays.