undefined | Better HN

0 pointsbarking_biscuit3y ago0 comments

At this point, if it goes from being in the bottom 10% on a simulated bar exam to top 10% on a simulated bar exam, then who cares if that's all they're doing???

0 comments

cma3y ago

OpenAI writes in the post:

> A minority of the problems in the exams were seen by the model during training

A minority can be 49%. They do mention they tested against newly available practice exams, but those are often based on older real exam questions which may have been discussed extensively in forums that were in the training data. Now that it is for-profit ClosedAI we have to somewhat treat each claim as if it were made adversarially, assuming minority may mean 49% when it would benefit them one way and .1% when it serves their look better for sales pitch to the Microsoft board, etc.

MarioMan3y ago

There's no need to be quite so adversarial in this case though. The methodology is explained by the report:

> A minority of the problems in the exams were seen by the model during training; for each exam we run a variant with these questions removed and report the lower score of the two. We believe the results to be representative. For further details on contamination (methodology and per-exam statistics), see Appendix C.

cma3y ago

I hadn't seen the technical report: it is good they made an attempt to remove them, but they only use substring match of 50 characters to find duplicates. Forum discussions after an exam are usually peoples more fuzzy memories of the question (it is impressive if it can convert back at test time from people's fuzzy memories, but still potentially taking an exam from the past where it has had access to the questions, especially the hard ones which get discussed the most).

From the results before and after removing some of the duplicates it doesn't seem to have hurt its performance badly though. Sometimes the score increases, so the substring approach may be helping it by excluding question variants with matching substring that it memorized but then the real test varied somewhere outside of the sampled substrings and had a different answer (or it random chance that the extrapolated score increased with some questions removed).

itake3y ago

If they are overfitting, then its not very interesting.

l33t2333723y ago

Humans overfit when they go to law school.

j / k navigate · click thread line to collapse

0 comments

cma3y ago

OpenAI writes in the post:

> A minority of the problems in the exams were seen by the model during training

MarioMan3y ago

There's no need to be quite so adversarial in this case though. The methodology is explained by the report:

cma3y ago

itake3y ago

If they are overfitting, then its not very interesting.

l33t2333723y ago

Humans overfit when they go to law school.

j / k navigate · click thread line to collapse