A write-up is here: https://abstractnonsense.com/crosswords.html
And you can play the crosswords here: https://crosswordracing.com (They should work well on both desktop and mobile, and there's a leaderboard for each crossword if you want to leave your name when you solve one.)
[1]: Just in case anyone is interested, my very first attempt at this problem was way back in 2006! I used multiple wordlists (e.g. list of British monarchs, with reign dates), and wrote little functions to generate clues from each list (e.g. "British monarch who ruled from {date1} to {date2}"). Even with randomized synonym substitution and similar tricks, this approach was too labor-intensive, and the results too robotic, for it to work well. Can't complain though, that project led to me getting hired as the first engineer at Justin.TV!
As someone who has dabbled in AI-generated crosswords, I found that providing samples of "good crossword clues" (which I curated from historical NYT Monday puzzles) as part of the LLM context helped tremendously in generating better clues.
There was also a Show HN for a generative AI crossword puzzle system a few months ago so I'll include what I mentioned there:
Part of the deep satisfaction in solving a crossword puzzle is the specificity of the answer. It's far more gratifying to answer a question with something like "Hawking" than with "scientist", or with "Mandelbrot" than with "shape".
So ideally, you want to lean towards "specificity" wherever possible, and use "generics" as filler.
Link:
In some of my crosswords I get clues that are specific in clever ways (e.g. one of these has "Extreme, not camping" which I thought was really strange until I found the answer "intense" and was very impressed by that level of wordplay from an LLM!)
Funny, I just posted this to X
2025 GenAI challenge
Create a 5x5 crossword puzzle with two distinct solutions. Each clue must work for both solutions. Do not use the same word in both solutions. No black squares.
I try with each new model that lands. Still can’t get it.
Generating a 5x5 word square (with different words across and down, so not of the "Sator Arepo" variety) is already really hard for a human. I plugged the Wordle target word list into https://github.com/Quuxplusone/xword/blob/master/src/xword-f... to get a bunch of plausible squares like this:
SCALD
POLAR
ARTSY
CEASE
ERROR
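As a quick sanity check on the square above, here is a minimal Python sketch that reads off the Down entries and tests the "different words across and down" property. The function names are made up for illustration, and the `wordlist` argument stands in for whatever list you use (the Wordle target list in this case).

```python
GRID = ["SCALD", "POLAR", "ARTSY", "CEASE", "ERROR"]

def down_words(rows):
    """Read the columns of a square grid from top to bottom."""
    return ["".join(column) for column in zip(*rows)]

def is_double_word_square(rows, wordlist):
    """True if all Across and Down entries are in wordlist and no entry repeats."""
    entries = list(rows) + down_words(rows)
    return all(w in wordlist for w in entries) and len(set(entries)) == len(entries)

print(down_words(GRID))  # ['SPACE', 'CORER', 'ALTAR', 'LASSO', 'DRYER']
```

A "Sator Arepo" style square would fail the second condition, because its Down entries repeat its Across entries.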
But you want two word squares that can plausibly be clued together, which is (not impossible, but) difficult if matching entries aren't the same part of speech. For example, cluing "POLAR" together with "ARTSY" (both adjectives) seems likely more doable than cluing "POLAR" together with "LASSO" (noun or verb).
Anyway, here's my attempt at a human solution, using the grid above and another grid, which I'll challenge you to find from these clues. Hint: All but two of the ten pairs match, part-of-speech-wise.
1A. Remove the outer layer of, perhaps
2A. Region on a globe
3A. Like some movie theaters
4A. Command to a lawbreaker
5A. Rhyme for Tom Lehrer?
1D. ____yard (sometime sci-fi setting)
2D. It goes something like this: Ꮎ
3D. Feature of liturgy, often
4D. It's vacuous, in a sense
5D. Fino, vis-a-vis Pedro Ximénez
Ask the LLM to generate a program to solve the problem.
That's a wonderfully hard problem, I'd love to see it get solved.
> Once we have a grid, we try to fill it with words! I use simple backtracking search for that, with a timeout to stop the search on grids that are likely impossible to fill. In practice it's easy to generate a new filled grid from scratch about once every two minutes.
Have you explored other search techniques?
> After the grid is full of words, we use an LLM to generate some clues. I've iterated over many models and prompts for this.
Could you share the prompts and the models you tried?
Shameless plug: I've been interested in crossword generation for a while as well and made that toy: https://github.com/super7ramp/croiseur. No grid generation but automatic filling and clue generation. Clues are not really good, currently using gpt-4o-mini.
Warning: post contains a spoiler for a recent Xordle.
Xordle is Wordle with two target words that share no letters in common. Additionally, there is a "free clue" given at the start, and all three words are thematically linked. It's not always a straightforward link, for example a recent puzzle had the starter word 'grief' and targets 'empty' and 'chair'. All puzzles today are selected from user submissions.
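The mechanical part of that rule is easy to check in code, which is what makes the "taste" part stand out. A minimal sketch (the function name is made up, and I'm assuming five-letter targets as in Wordle):

```python
def is_valid_xordle_pair(a, b, length=5):
    """True if both targets are alphabetic, the right length, and share no letters."""
    a, b = a.lower(), b.lower()
    return (
        len(a) == length
        and len(b) == length
        and a.isalpha()
        and b.isalpha()
        and set(a).isdisjoint(b)
    )

print(is_valid_xordle_pair("empty", "chair"))  # True
print(is_valid_xordle_pair("polar", "lasso"))  # False: they share l, a, o
```

Whether the thematic link is satisfying is the part no such check can cover.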
o1 is the first model that's been able to solve Xordles reliably, or to generate valid puzzles at all. It's well-known that these things are massively handicapped for this type of task due to tokenization.
But since o1 can in fact achieve it, I wanted to see if I could get it to make puzzles that are at all satisfying. Instead it makes very bland puzzles, with straightforward connections and extremely broad themes.
Prompting can swing the pendulum too far in the other direction, to puzzles where the connection is contrived and impossible to see even after it's solved. As I've often experienced with LLMs, being able to hit either side of a target with prompting does not necessarily mean you can get it to land in the middle, and in fact I have had no success in doing so with this task.
This is one of the most basic examples I know of a lack of creativity or "taste" in an LLM. It is a little hard for a human to generate two 5-letter words with no overlap, but it is extremely easy for a human to look for a thematic connection among 2-3 words and say if it's satisfying. But so far I've been totally unable to get the LLM to make satisfying puzzles.
edit: Nothin' like making a claim about LLMs to get one up off one's ass and try to prove it wrong immediately. I'm getting some much better results with better examples now.
W P G
H I S T O R Y
E O
R T Y U M
L E S S P
I O C A T
U S E R A D D
L T R D C
Of course this is a pretty small grid and it gets more difficult with size. I've thought about making a competition from this sort of challenge. Would anyone be interested?
- Every cell must be "keyed," i.e., part of a word Across and a word Down. Unkeyed cells are strictly forbidden.
- No word may be less than 3 letters. Two-letter words are strictly forbidden.
- The grid must be rotationally symmetric. (But this rule can be broken for fun. Bilaterally symmetric grids are relatively common these days. Totally asymmetric grids are very rare and always in service of some kind of fun — see https://www.xwordinfo.com/Thumbs?select=symmetry )
- No more than one-sixth of the squares can be black. (But this rule can be broken, usually either to make the puzzle less challenging by shortening the average word length, or to make the creator's life easier in order to achieve some other feat.)
- If a single black square is bordered on two adjoining sides by other black squares, then it could be turned white without destroying the other properties of the grid. Such black squares are called "cheaters" and are frowned upon. (Though they might serve a purpose, e.g. to fit a specific theme entry's length.)
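The first four rules above are mechanical enough to check in a few lines. Here is a rough Python sketch, assuming a grid given as a list of strings with `#` for black squares. The function names and messages are invented for illustration, and the cheater rule is deliberately left out.

```python
def run_length(grid, r, c, dr, dc):
    """Length of the white run through (r, c) along direction (dr, dc)."""
    rows, cols = len(grid), len(grid[0])
    n = 1
    for sign in (1, -1):
        rr, cc = r + sign * dr, c + sign * dc
        while 0 <= rr < rows and 0 <= cc < cols and grid[rr][cc] != "#":
            n += 1
            rr, cc = rr + sign * dr, cc + sign * dc
    return n

def check_grid(grid, min_len=3, max_black_fraction=1 / 6):
    """Return a list of rule violations (an empty list means the grid passes)."""
    rows, cols = len(grid), len(grid[0])
    problems = []

    black = sum(row.count("#") for row in grid)
    if black > max_black_fraction * rows * cols:
        problems.append("too many black squares")

    # 180-degree rotational symmetry of the black-square pattern.
    symmetric = all(
        (grid[r][c] == "#") == (grid[rows - 1 - r][cols - 1 - c] == "#")
        for r in range(rows) for c in range(cols)
    )
    if not symmetric:
        problems.append("not rotationally symmetric")

    # A white cell is properly keyed only if both its Across run and its
    # Down run are at least min_len long; this covers rules 1 and 2 together.
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] == "#":
                continue
            across = run_length(grid, r, c, 0, 1)
            down = run_length(grid, r, c, 1, 0)
            if across < min_len or down < min_len:
                problems.append(f"cell ({r}, {c}) is unkeyed or in a short word")
    return problems

print(check_grid(["....."] * 5))  # []
```

Since the symmetry and black-square rules "can be broken for fun," a real tool would probably report those as warnings and not as hard failures.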
The link at the bottom doesn’t work.
The grids shown do not follow the well-known rules of (American) crosswords: every square is part of two words of three or more letters each.
Coming up with a pattern of black squares, and writing good clues, are two parts of making a crossword puzzle that are IMO fun and benefit from a human touch, and are not overly difficult. There are also databases of past clues used in crossword puzzles (e.g., every NY Times clue ever, and various crossword dictionaries) for reference and possible training. If you don’t care about originality (or copyright) and want quality clues, you can just pull clues from these. If you do care about all those things, you can surface the list of clues used in the past to the human constructor and let them write the final clue. Or you can try to perfect LLM clue-writing. In my experience, LLMs are terrible at clues. Sometimes, if I try to give one feedback about a clue, it will just work the feedback into the clue… it’s a little hard to describe without an example, but basically it doesn’t seem to understand the requirements of a clue and the process of a solver looking at a clue and trying to come up with an answer.
Coming up with an interlocking set of fun, high-quality words and phrases is the hard part. I agree that LLM wordlist curation is a great idea, and I started playing around with that once.
Beyond that, I don’t think LLMs can help with grid construction, which is a more classic combinatorial problem.
Can you clarify which link is broken and how? What browser and OS?
> In my experience, LLMs are terrible at clues.
That hasn't been my experience. Without good prompting they give you clues that are too bland and literal, but it is quite possible to get them to give you clues with interesting and creative wordplay. I wish it was easier to get clues like that more consistently, but it's certainly doable. I still believe within a year it'll be easy.