Some sort of "generate descriptions of novel tasks, including ways to evaluate performance on them; evaluate the quality of the generated tasks and evaluation metrics; split tasks into subtasks; estimate each task's difficulty, with that estimate judged by how it compares both to the combined estimated difficulty of the generated subtasks and to the actual success rate and quality" sort of deal?
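
To make the difficulty-estimation piece concrete, here is a rough Python sketch of how such an estimate might be scored. Everything in it is an assumption for illustration: difficulty is expressed as a predicted success probability, subtask estimates are combined by multiplying them (which assumes the task succeeds only if every subtask succeeds, independently), and the function names `combined_subtask_estimate` and `difficulty_estimate_penalty` are hypothetical. Output quality, as opposed to the bare success rate, is left out.

```python
from math import prod


def combined_subtask_estimate(subtask_success_probs):
    """Combine subtask estimates into one task-level estimate.

    Assumption: the task succeeds only if every subtask succeeds,
    and subtask outcomes are independent, so probabilities multiply.
    """
    return prod(subtask_success_probs)


def difficulty_estimate_penalty(task_success_prob, subtask_success_probs, outcomes):
    """Penalty for a difficulty estimate; lower is better.

    task_success_prob: predicted success probability for the whole task.
    subtask_success_probs: predicted success probabilities of its subtasks.
    outcomes: list of 0/1 results from actual attempts at the task.
    """
    if not outcomes:
        raise ValueError("need at least one observed attempt")

    observed_rate = sum(outcomes) / len(outcomes)
    combined = combined_subtask_estimate(subtask_success_probs)

    # How far the task-level estimate is from the combined subtask estimate.
    consistency_error = (task_success_prob - combined) ** 2
    # How far the task-level estimate is from the observed success rate.
    calibration_error = (task_success_prob - observed_rate) ** 2

    # Assumption: both terms are weighted equally.
    return consistency_error + calibration_error


if __name__ == "__main__":
    # One success in four attempts, two subtasks each estimated at 50%.
    outcomes = [0, 0, 0, 1]
    print(difficulty_estimate_penalty(0.25, [0.5, 0.5], outcomes))
    print(difficulty_estimate_penalty(0.90, [0.5, 0.5], outcomes))
```

An estimate that agrees with both the subtask decomposition and the observed results gets a penalty of zero, while an overconfident one is penalized on both terms. Other combination rules or weightings would fit the same description equally well.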