You can check whether the model reproduces the canonical solutions from HumanEval. I understand this is not an ideal test, but at least you can run it yourself.
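Here is a minimal sketch of what such a check could look like. It assumes each problem is a dict with `prompt` and `canonical_solution` keys (the field names used by the HumanEval dataset), and that `generate` is a hypothetical wrapper you write around the model's greedy decoding. The function names and the 0.9 similarity threshold are my own choices for illustration, not anything from the model card.

```python
import difflib
from typing import Callable, Dict, Iterable, Mapping


def normalize(code: str) -> str:
    """Drop trailing whitespace and blank lines so formatting noise is ignored."""
    lines = [line.rstrip() for line in code.strip("\n").splitlines()]
    return "\n".join(line for line in lines if line)


def contamination_report(
    problems: Iterable[Mapping[str, str]],
    generate: Callable[[str], str],  # hypothetical: prompt -> model completion
    threshold: float = 0.9,          # assumed cutoff for "near" matches
) -> Dict[str, int]:
    """Count completions that match the canonical solution exactly or nearly."""
    total = exact = near = 0
    for problem in problems:
        predicted = normalize(generate(problem["prompt"]))
        canonical = normalize(problem["canonical_solution"])
        total += 1
        if predicted == canonical:
            exact += 1
        elif difflib.SequenceMatcher(None, predicted, canonical).ratio() >= threshold:
            near += 1
    return {"total": total, "exact": exact, "near": near}
```

A large share of exact or near matches would suggest the canonical solutions were memorized. A low share does not prove the opposite, which is part of why this check is not ideal.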
There are a bunch of other benchmarks too; see the model page: https://huggingface.co/smallcloudai/Refact-1_6B-fim
Also, feel free to run any new benchmarks.