It also has a bandwidth problem: if 64 threads are vying for access, you either build it with few access ports and it gets choked, or with many access ports, which is costly in area, power, & speed.
Two separate peer caches automatically have twice the bandwidth of one similar double-size cache, for the price of NUMA & cache coherency challenges.
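The port-limited bandwidth argument can be sketched as a toy model (the numbers and function below are illustrative assumptions, not from the text): peak throughput is capped by the number of access ports, so splitting one cache into two peers doubles the aggregate port count and thus the peak bandwidth.

```python
def effective_bandwidth(threads, ports, per_port_bw=1.0):
    """Toy model: at most one access per port per cycle,
    so peak bandwidth is min(threads, ports) * per-port rate."""
    return min(threads, ports) * per_port_bw

# One double-size cache with 2 ports, 64 threads contending:
single = effective_bandwidth(threads=64, ports=2)

# Two peer caches, each with 2 ports, threads split between them:
split = 2 * effective_bandwidth(threads=32, ports=2)

# The split configuration has twice the peak bandwidth --
# the cost (NUMA, coherency traffic) doesn't appear in this model.
```

This ignores everything that makes the trade-off hard in practice (coherency misses, uneven thread placement), which is exactly the point: the raw bandwidth win is structural, and the costs show up elsewhere.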
There is no one right answer here. Bandwidth is far more important and coherency much easier in a small L1; as you go down the hierarchy, bandwidth needs shrink and coherency grows more expensive.