we considered asking which one adheres to the prompt more, which one has the best overall aesthetics, etc., but ended up with a simple "which one is overall better" type of question. it is easier for people to vote on and pick one, and the result is still usable as preference data at a larger scope (trading volume for simplicity).
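to make the "still usable as preference data" part concrete, here is a minimal sketch of how single "overall better" votes could be collapsed into chosen/rejected pairs. the record layout, the field names and the majority rule are all assumptions for illustration, not the actual schema of the dataset.

```python
from collections import Counter

# hypothetical vote records: (prompt, image_a, image_b, winner), winner is "a" or "b"
votes = [
    ("a red fox", "img_1", "img_2", "a"),
    ("a red fox", "img_1", "img_2", "a"),
    ("a red fox", "img_1", "img_2", "b"),
]


def to_preference_pairs(votes):
    """Collapse raw votes into one chosen/rejected row per comparison (majority vote)."""
    tallies = {}
    for prompt, img_a, img_b, winner in votes:
        key = (prompt, img_a, img_b)
        tallies.setdefault(key, Counter())[winner] += 1

    pairs = []
    for (prompt, img_a, img_b), counts in tallies.items():
        if counts["a"] == counts["b"]:
            continue  # tie: no clear preference, so skip this comparison
        if counts["a"] > counts["b"]:
            chosen, rejected = img_a, img_b
        else:
            chosen, rejected = img_b, img_a
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs


pairs = to_preference_pairs(votes)
```

a single question keeps each record down to one winner per comparison, which is the shape most preference-training setups expect anyway.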
the dataset is open source and we plan to train an aesthetics picker on it, but obviously we have to do proper evals (with at least 1M data points) before coming to a reasonable conclusion.