I wish I had an Alveo board. Sheesh, now would be the time. You could fit the whole DAG into HBM.
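Even with HBM on the card, it will hash fast, but you're still going through the memory controller on every lookup. What you need to do is skip that. Get rid of the memory hardness entirely: generate DAG items on demand from the much smaller cache, the way a light client does for verification, so you're paying in compute instead of memory bandwidth.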
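To make "generate the DAG on demand" concrete, here is a rough Python sketch of the idea, loosely modeled on how Ethash derives a dataset item from the cache. It is a toy, not real Ethash: `hashlib.sha3_512` stands in for Keccak-512 (they differ in padding), `PARENTS` is cut down from Ethash's 256 so it runs quickly, and the names `make_cache` and `dag_item` are my own.

```python
import hashlib

FNV_PRIME = 0x01000193
WORDS = 16     # 16 x 32-bit words = one 64-byte item
PARENTS = 8    # toy value (assumption); Ethash uses 256 parents per item


def fnv(a, b):
    # FNV-style mix of two 32-bit words, as used in Ethash.
    return ((a * FNV_PRIME) ^ b) & 0xFFFFFFFF


def to_bytes(words):
    return b"".join(w.to_bytes(4, "little") for w in words)


def h(data):
    # Stand-in hash (assumption): SHA3-512, split into 16 little-endian words.
    d = hashlib.sha3_512(data).digest()
    return [int.from_bytes(d[i:i + 4], "little") for i in range(0, 64, 4)]


def make_cache(seed, n):
    # Small cache: a simple hash chain of n items (real Ethash adds extra rounds).
    cache = [h(seed)]
    for _ in range(1, n):
        cache.append(h(to_bytes(cache[-1])))
    return cache


def dag_item(cache, i):
    # Compute DAG item i directly from the cache; no stored DAG needed.
    n = len(cache)
    mix = list(cache[i % n])
    mix[0] ^= i
    mix = h(to_bytes(mix))
    for j in range(PARENTS):
        parent = cache[fnv(i ^ j, mix[j % WORDS]) % n]
        mix = [fnv(a, b) for a, b in zip(mix, parent)]
    return h(to_bytes(mix))


cache = make_cache(b"toy seed", 64)
item = dag_item(cache, 7)  # computed when needed, never stored
```

The tradeoff is visible in `dag_item`: each lookup costs `PARENTS` cache reads plus two hashes instead of one read from a stored table, which only pays off if the cache sits in fast on-chip memory and the hashing is cheap in hardware.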
[1] https://github.com/ethereum-cat-herders/progpow-audit/blob/m...