Reading the various papers on synthesis as a non-hardware guy, I'd think this job shouldn't be as hard in that respect as SoCs, whose components vary considerably in their individual attributes.
http://store.digilentinc.com/arty-artix-7-fpga-development-b...
As noted, the Digilent Arty is $99 and hosts up to 32 cores. The XC7Z020 Zynq devices should host 80. That includes the Zedboard, the original Parallella Kickstarter edition, the forthcoming Snickerdoodle Black (?), and the Digilent Pynq, which is $65 in Q1 for students. It is my intention to put out a version of GRVI Phalanx for 7020s, at least a bitstream and SDK, perhaps more, but much to do. Note the 7 series parts (including the XC7A35T of the Arty and the XC7Z020) have BRAMs but not UltraRAMs, so those clusters have 4K instruction RAMs and 32K shared cluster RAMs. The 4-8K/128K clusters possible on the new UltraScale+ devices afford more breathing room for code and data per cluster.
They presented a short paper at FCCM '16: http://fpga.org/wp-content/uploads/2016/05/grvi_phalanx_fccm... Section VI lists possible applications.
"GRVI Phalanx aspires to make it easier to develop and maintain an FPGA accelerator for a parallel software workload. Some workloads will fit its mold, i.e. highly parallel SPMD or MIMD code with small kernels, local shared memory, and global message passing. Here are some parallel models that should map fairly well to a GRVI Phalanx framework:
• OpenCL kernels: run each work group on a cluster;
• ‘Gatling gun’ parallel packet processing: send each new packet to an idle cluster, which may exclusively work on that packet for up to (#clusters) packet-time-periods;
• OpenMP/TBB: run MIMD tasks within a cluster;
• Streaming data through process networks: pass streams as messages within a cluster, or between clusters;
• Compositions of such models.
Since GRVI Phalanx is implemented in an FPGA, these and other parallel models may then be further accelerated via custom GRVI and cluster function units; custom memories and interconnects; and custom standalone accelerator cores on cluster RAM or directly connected on the NOC."