Reading the various papers on synthesis as a non-hardware guy, I'd think this job shouldn't be as hard in that respect as SoCs, whose components vary considerably in their individual attributes.
http://store.digilentinc.com/arty-artix-7-fpga-development-b...
As noted, the Digilent Arty is $99 and hosts up to 32 cores. The XC7Z020 Zynq devices should host 80. That includes the Zedboard, the original Parallella Kickstarter edition, the forthcoming Snickerdoodle Black (?), and the Digilent Pynq, which is $65 in Q1 for students. It is my intention to put out a version of GRVI Phalanx for 7020s, at least a bitstream and SDK, perhaps more, but much to do. Note the 7 series parts (including the XC7A35T of the Arty and the XC7Z020) have BRAMs but not UltraRAMs, so those clusters have 4K instruction RAMs and 32K shared cluster RAMs. The 4-8K/128K clusters possible on the new UltraScale+ devices afford more breathing room for code and data per cluster.
They presented a short paper at FCCM '16: http://fpga.org/wp-content/uploads/2016/05/grvi_phalanx_fccm... Section VI lists possible applications.
"GRVI Phalanx aspires to make it easier to develop and maintain an FPGA accelerator for a parallel software workload. Some workloads will fit its mold, i.e. highly parallel SPMD or MIMD code with small kernels, local shared memory, and global message passing. Here are some parallel models that should map fairly well to a GRVI Phalanx framework:
• OpenCL kernels: run each work group on a cluster;
• ‘Gatling gun’ parallel packet processing: send each new packet to an idle cluster, which may exclusively work on that packet for up to (#clusters) packet-time-periods;
• OpenMP/TBB: run MIMD tasks within a cluster;
• Streaming data through process networks: pass streams as messages within a cluster, or between clusters;
• Compositions of such models.
Since GRVI Phalanx is implemented in an FPGA, these and other parallel models may then be further accelerated via custom GRVI and cluster function units; custom memories and interconnects; and custom standalone accelerator cores on cluster RAM or directly connected on the NOC."