AI World Clocks

(clocks.brianmoore.com)

1383 points · waxpancake · 7mo ago · 384 comments

"Every minute, a new clock is rendered by nine different AI models."

239 comments · 101 top-level

baltimore7mo ago· 34 in thread

Since the first (good) image generation models became available, I've been trying to get them to generate an image of a clock with 13 instead of the usual 12 hour divisions. I have not been successful. Usually they will just replace the "12" with a "13" and/or mess up the clock face in some other way.

I'd be interested if anyone else is successful. Share how you did it!

Scene_Cast2 · 7mo ago

I've noticed that image models are particularly bad at modifying popular concepts in novel ways (way worse "generalization" than what I observe in language models).

emp17344 · 7mo ago

Maybe LLMs always fail to generalize outside their data set, and it’s just less noticeable with written language.

3 more replies

CobrastanJorji7mo ago

Also, they're fundamentally bad at math. They can draw a clock because they've seen clocks, but going further requires some calculations they can't do.

For example, try asking Nano Banana to do something simpler, like "draw a picture of 13 circles." It likely will not work.

deathanatos7mo ago

  Generate an image of a clock face, but instead of the usual 12 hour numbering, number it with 13 hours.

Gemini, 2.5 Flash or "Nano Banana" or whatever we're calling it these days. https://imgur.com/a/1sSeFX7

A normal (ish) 12h clock. It numbered it twice, in two concentric rings. The outer ring is normal, but the inner ring numbers the 4th hour as "IIII" (fine, and a thing that clocks do) and the 8th hour as "VIIII" (wtf).

bar000n7mo ago

It should be pretty clear already that anything which is based (limited?) to communicating words/text can never grasp conceptual thinking.

We have yet to design a language to cover that, and it might be just a donquijotism we're all diving into.

4 more replies

andix7mo ago

I gave this "riddle" to various models:

> The farmer and the goat are going to the river. They look into the sky and see three clouds shaped like: a wolf, a cabbage and a boat that can carry the farmer and one item. How can they safely cross the river?

Most of them just give the answer to the well-known river crossing riddle. Some "feel" that something is off, but still have a hard time figuring out that the wolf, boat and cabbage are just clouds.

jampa7mo ago

There are a few examples of this as well:

https://www.reddit.com/r/singularity/comments/1fqjaxy/contex...

1 more reply

Recursing7mo ago

Claude has no problem with this: https://imgur.com/a/ifSNOVU

Maybe older models?

1 more reply

userbinator7mo ago

Basically a variation of https://en.wikipedia.org/wiki/Age_of_the_captain

echelon7mo ago

That's just a patch to the training data.

Once companies see this starting to show up in the evals and criticisms, they'll go out of their way to fix it.

rideontime7mo ago

What would the "patch" be? Manually create some images of 13-hour clocks and add them to the training data? How does that solution scale?

godelski7mo ago

s/13/17/g ;)

BrandoElFollito7mo ago

This is really cool. I tried to prompt gemini but every time I got the same picture. I do not know how to share a session (like it is possible with Chatgpt) but the prompts were

If a clock had 13 hours, what would be the angle between two of these 13 hours?

Generate an image of such a clock

No, I want the clock to have 13 distinct hours, with the angle between them as you calculated above

This is the same image. There need to be 13 hour marks around the dial, evenly spaced

... And its last answer was

You are absolutely right, my apologies. It seems I made an error and generated the same image again. I will correct that immediately.

Here is an image of a clock face with 13 distinct hour marks, evenly spaced around the dial, reflecting the angle we calculated.

And the very same clock, with 12 hours, and a 13th above the 12...

ryandrake7mo ago

This is probably my biggest problem with AI tools, having played around with them more lately.

"You're absolutely right! I made a mistake. I have now comprehensively solved this problem. Here is the corrected output: [totally incorrect output]."

None of them ever seem to have the ability to say "I cannot seem to do this" or "I am uncertain if this is correct, confidence level 25%" The only time they will give up or refuse to do something is when they are deliberately programmed to censor for often dubious "AI safety" reasons. All other times, they come back again and again with extreme confidence as they totally produce garbage output.

2 more replies

notatoad7mo ago

you can click the share icon (the two-way branch icon, it doesn't look like apple's share icon) under the image it generates to share the conversation.

i'm curious if the clock image it was giving you was the same one it was giving me

https://gemini.google.com/share/780db71cfb73

1 more reply

edub7mo ago

I was able to get AI to generate such an image, not with a diffusion/autoregressive image model but by having it write Python code to create the image.

ChatGPT made a nice looking clock with matplotlib that had some bugs that it had to fix (hours were counter-clockwise). Gemini made correct code one-shot, it used Pillow instead of matplotlib, but it didn't look as nice.

giancarlostoro7mo ago

Weird, I never tried that, I tried all the usual tricks that usually work including swearing at the model (this scarily works surprisingly well with LLMs) and nothing. I even tried to go the opposite direction, I want a 6 hour clock.

nl7mo ago

I do playing card generation and almost all struggle beyond the "6 of X"

My working theory is that they were trained really hard to generate 5 fingers on hands but their counting drops off quickly.

IAmGraydon7mo ago

That's because they literally cannot do that. Doing what you're asking requires an understanding of why the numbers on the clock face are where they are and what it would mean if there was an extra hour on the clock (ie that you would have to divide 360 by 13 to begin to understand where the numbers would go). AI models have no concept of anything that's not included in their training data. Yet people continue to anthropomorphize this technology and are surprised when it becomes obvious that it's not actually thinking.

energy123 · 7mo ago

The hope was for this understanding to emerge as the most efficient solution to the next-token prediction problem.

Put another way, it was hoped that once the dataset got rich enough, developing this understanding is actually more efficient for the neural network than memorizing the training data.

The useful question to ask, if you believe the hope is not bearing fruit, is why. Point specifically to the absent data or the flawed assumption being made.

Or more realistically, put in the creative and difficult research work required to discover the answer to that question.

bobbylarrybobby7mo ago

It's interesting because if you asked them to write code to generate an SVG of a clock, they'd probably use a loop from 1 to 12, using sin and cos of the angle (given by the loop index over 12 times 2pi) to place the numerals. They know how to do this, and so they basically understand the process that generates a clock face. And extrapolating from that to 13 hours is trivial (for a human). So the fact that they can't do this extrapolation on their own is very odd.
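For reference, the loop described above might look roughly like this in Python. It is only a minimal sketch: the function name `clock_face_svg`, the sizes, and the choice to put the highest numeral at the top are illustrative assumptions, not output from any model in the thread.

```python
import math


def clock_face_svg(hours: int = 12, size: int = 200) -> str:
    """Return an SVG string for a clock face with `hours` evenly spaced numerals."""
    center = size / 2
    radius = size * 0.4  # numerals sit on a circle inside the rim
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{size}" height="{size}">',
        f'<circle cx="{center}" cy="{center}" r="{size * 0.48}" fill="white" stroke="black"/>',
    ]
    for i in range(1, hours + 1):
        # Fraction of a full turn, measured clockwise from the top of the dial.
        angle = 2 * math.pi * i / hours
        x = center + radius * math.sin(angle)
        y = center - radius * math.cos(angle)  # SVG y grows downward
        parts.append(
            f'<text x="{x:.1f}" y="{y:.1f}" text-anchor="middle" '
            f'dominant-baseline="middle">{i}</text>'
        )
    parts.append("</svg>")
    return "\n".join(parts)
```

Calling `clock_face_svg(13)` instead of `clock_face_svg(12)` is the whole "extrapolation": only the divisor changes, and the numerals land 360/13 degrees apart.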

echelon7mo ago

gpt-image-1 and Google Imagen understand prompts, they just don't have training data to cover these use cases.

gpt-image-1 and Imagen are wickedly smart.

The new Nano Banana 2 that has been briefly teased around the internet can solve incredibly complicated differential equations on chalk boards with full proof of work.

1 more reply

ryandrake7mo ago

I wonder if you would have more success if you painstakingly described the shape and features of a clock in great detail but never used the words clock or time or anything that might give the AI the hint that they were supposed to output something like a clock.

1 more reply

Workaccount2 · 7mo ago

The problem is more likely the tokenization of images than anything. These models do their absolute worst when pictures are involved, but are seemingly miraculous at generalizing with just text.

1 more reply

godelski7mo ago

Yes, the problem is that these so called "world models" do not actually contain a model of the world, or any world

chanux7mo ago

Ah! This is so sad. The manager types won't be able to add an hour (actually, two) to the day even with AI.

snek_case7mo ago

From my experience they quickly fail to understand anything beyond a superficial description of the image you want.

atorodius7mo ago

That's less and less true

https://minimaxir.com/2025/11/nano-banana-prompts/

1 more reply

usui7mo ago

I've been trying for the longest time and across models to generate pictures or cartoons of people with six fingers and now they won't do it. They always say they accomplished it, but the result always has 5 fingers. I hate being gaslit.

coffeecoders7mo ago

LLMs are terrible for out-of-distribution (OOD) tasks. You should use chain of thought suppression and give constraints explicitly.

My prompt to Grok:

---

Follow these rules exactly:

- There are 13 hours, labeled 1–13.

- There are 13 ticks.

- The center of each number is at angle: index * (360/13)

- Do not infer anything else.

- Do not apply knowledge of normal clocks.

Use the following variables:

HOUR_COUNT = 13

ANGLE_PER_HOUR = 360 / 13 // 27.692307°

Use index i ∈ [0..12] for hour marks:

angle_i = i * ANGLE_PER_HOUR

I want html/css (single file) of a 13-hour analog clock.

---

Output from grok.

https://jsfiddle.net/y9zukcnx/1/

chemotaxis7mo ago

> Follow these rules exactly:

"Here's the line-by-line specification of the program I need you to write. Write that program."

2 more replies

BrandoElFollito7mo ago

Well, that's cheating :) You asked it to generate code, which is ok because it does not represent a direct generated image of a clock.

Can grok generate images? What would the result be?

I will try your prompt on chatgpt and gemini

1 more reply

chiwilliams7mo ago

I'll also note that the output isn't quite right --- the top number should be 13 rather than 1!

1 more reply

NooneAtAll3 · 7mo ago

close enough, but digit at the top should be the highest, not 1 :/
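The off-by-one that both replies point out comes from labeling tick `i` as `i + 1`. A small sketch of the mapping that puts the highest number at the top, reusing the prompt's `HOUR_COUNT` and index convention (the helper names `label_for_index` and `tick_table` are made up for illustration):

```python
HOUR_COUNT = 13


def label_for_index(i: int, hour_count: int = HOUR_COUNT) -> int:
    """Map tick index i (0 at the top, counting clockwise) to its printed number."""
    return hour_count if i % hour_count == 0 else i % hour_count


def tick_table(hour_count: int = HOUR_COUNT) -> list[tuple[int, float]]:
    """Return (label, angle in degrees clockwise from the top) for each tick."""
    step = 360 / hour_count
    return [(label_for_index(i, hour_count), i * step) for i in range(hour_count)]
```

With this mapping index 0 (angle 0) is labeled 13, index 1 is labeled 1, and so on, the same way 12 sits at the top of an ordinary dial.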

lanewinfield7mo ago· 25 in thread

hi, I made this. thank you for posting.

I love clocks and I love finding the edges of what any given technology is capable of.

I've watched this for many hours and Kimi frequently gets the most accurate clock but also the least variation and is most boring. Qwen is often times the most insane and makes me laugh. Which one is "better?"

jdietrich7mo ago

Clock drawing is widely used as a test for assessing dementia. Sometimes the LLMs fail in ways that are fairly predictable if you're familiar with CSS and typical shortcomings of LLMs, but sometimes they fail in ways that are less obvious from a technical perspective but are exactly the same failure modes as cognitively-impaired humans.

I think you might have stumbled upon something surprisingly profound.

https://www.psychdb.com/cognitive-testing/clock-drawing-test

overfeed7mo ago

> Clock drawing is widely used as a test for assessing dementia

Interestingly, clocks are also an easy tell for when you're dreaming, if you're a lucid dreamer; they never work normally in dreams.

4 more replies

xrisk7mo ago

Maybe explainable via the fact that these tests are part of the LLM training set?

jorgesborges7mo ago

Conceptual deficit is a great failure mode description. The inability to retrieve "meaning" about the clock -- having some understanding about its shape and function but not its intent to convey time to us -- is familiar from a lot of bad LLM output.

BHSPitMonkey7mo ago

I would think the way humans draw clocks has more in common with image generation models (which probably do a bit better with this task overall) than a language model producing SVG markup, though.

ACCount37 · 7mo ago

LLMs don't do this because they have "people with dementia draw clocks that way" in their data. They do it because they're similar enough to human minds in function that they often fail in similar ways.

An amusing pattern that dates back to "1kg of steel is heavier of course" in GPT-3.5.

1 more reply

TheJoeMan7mo ago

Figure 6 with the square clock would be a cool modern art piece.

1 more reply

bspammer7mo ago

If you're keeping all the generated clocks in a database, I'd love to see a Facemash style spin-off website where users pick the best clock between two options, with a leaderboard. I want to know what the best clock Qwen ever made was!

abixb7mo ago

We might be on to creating a new crowd-ranked LLM benchmark here.

1 more reply

nightpool7mo ago

Yes! Please do this

layer8 · 7mo ago

Not the best, but the most amusing.
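One common way to turn "pick the better of two clocks" votes into a leaderboard is an Elo-style rating update. The sketch below is only an illustration of that idea: the starting rating of 1000, the K-factor of 32, and the names `expected_score` and `record_vote` are assumptions, and nothing here reflects how the site stores its clocks.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def record_vote(ratings: dict[str, float], winner: str, loser: str, k: float = 32.0) -> None:
    """Update `ratings` in place after a user picks `winner` over `loser`."""
    rating_w = ratings.setdefault(winner, 1000.0)  # unseen clocks start at 1000
    rating_l = ratings.setdefault(loser, 1000.0)
    # An upset (low-rated clock wins) moves more points than an expected win.
    gain = k * (1.0 - expected_score(rating_w, rating_l))
    ratings[winner] = rating_w + gain
    ratings[loser] = rating_l - gain
```

Sorting the `ratings` dict by value then gives the leaderboard; sorting by vote count or by disagreement between voters would be one way to surface the "most amusing" clocks instead.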

smusamashah7mo ago

Please make it show last 5 (or some other number) of clocks for each model. It will be nice to see the deviation and variety for each model at a glance.

charliewallace7mo ago

Very cool! I also love clocks, especially weird ones, and recently put up this 3D Moebius Strip clock, hope you like it: https://www.mobiusclock.com

chemotaxis7mo ago

This is honestly the best thing I've seen on HN this month. It's stupid, enlightening... funny and profound at the same time. I have a strong temptation to pick some of these designs and build them in real life.

I applaud you for spending money to get it done.

AnonHP7mo ago

Could you please change and adjust the positions of the titles (like GPT 5)? On Firefox Focus on iOS, the spacing is inconsistent (seems like it moves due to the space taken by the clock). After one or two of them, I had to scroll all the way down to the bottom and come back up to understand which title is linked to which clock.

anigbrowl7mo ago

I really like this. The broken ones are sometimes just failures, but sometimes provide intriguing new design ideas.

jdiff7mo ago

This same principle is why my favorite image generation models are the earlier ones from 2019-2020 that could only reliably generate soup. It's like Rorschach tests: it's not about what's there, it's about what you see in them. I don't want a bot to make art for me, sometimes I just want some shroom-induced inspirational smears.

1 more reply

ks2048 · 7mo ago

Nice job! Maybe let users click an example to see the raw source (LLM output)

brianjking7mo ago

This is an awesome benchmark. Officially one of my favorites now. Thank you for making this.

csours7mo ago

LOVE IT!

It would be really cool if I could zoom out and have everything scale properly!

Fabricio20 · 7mo ago

Why is this different per user? I sent this to a few friends and they all see different things from what i'm seeing, for the same time..?

samtheprogram7mo ago

It regenerates on page load. I find that pretty useful.

Grok 4 and Kimi nailed it the first time for me, then only Kimi on the second pass.

1 more reply

layer8 · 7mo ago

It’s different per minute, not per user.

hakcermani7mo ago

.. would you mind sharing the prompt .. in a gist perhaps .

ceroxylon7mo ago

They have it available on the site under the (?) button:

"Create HTML/CSS of an analog clock showing ${time}. Include numbers (or numerals) if you wish, and have a CSS animated second hand. Make it responsive and use a white background. Return ONLY the HTML/CSS code with no markdown formatting."
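Whatever a model does with that prompt, showing `${time}` correctly comes down to three rotations. A minimal sketch of the usual arithmetic follows; `hand_angles` is a hypothetical helper name and this is not taken from the site's code.

```python
def hand_angles(hour: int, minute: int, second: int = 0) -> tuple[float, float, float]:
    """Return (hour, minute, second) hand rotations in degrees clockwise from 12."""
    second_angle = second * 6.0                    # 360 degrees / 60 seconds
    minute_angle = minute * 6.0 + second / 10.0    # minute hand creeps as seconds pass
    # 30 degrees per hour, plus the fraction of the hour already elapsed.
    hour_angle = (hour % 12) * 30.0 + minute * 0.5 + second / 120.0
    return hour_angle, minute_angle, second_angle
```

For 10:10:00 this gives 305 degrees for the hour hand and 60 degrees for the minute hand; each value would go into a CSS `rotate(...)` on the corresponding hand.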

otterley7mo ago· 23 in thread

Watching this over the past few minutes, it looks like Kimi K2 generates the best clock face most consistently. I'd never heard of that model before today!

Qwen 2.5's clocks, on the other hand, look like they never make it out of the womb.

jquery7mo ago

I’ve been using Kimi K2 a lot this month. Gives me Japanese->English translations at near human levels of quality, while respecting rules and context I give it in a very long, multi-page system prompt to improve fidelity of translation for a given translation target (sometimes markup tags need to be preserved, sometimes deleted, etc.). It doesn’t require a thinking step to generate this level of translation quality, making it suitable for real-time translation. It doesn’t start getting confused when I feed it a couple dozen lines of previous translation context, like certain other LLMs do… instead the translation actually improves with more context instead of degrading. It’s never refused a translation for “safety” purposes either (GPT and Gemini love to interrupt my novels and tell me certain behavior is illegal or immoral, and censor various anatomical words).

komali2 · 7mo ago

> GPT and Gemini love to interrupt my novels and tell me certain behavior is illegal or immoral, and censor various anatomical words

Lol, are you using ai to create fan translations of エロ漫画 ?

1 more reply

frizlab7mo ago

I knew of Kimi K2 because it’s the model used by Kagi to generate the AI answers when query ends with an interrogation point.

OJFord7mo ago

It's also one of the few 'recommended' models in Kagi Assistant (multi-model ChatGPT basically, available on paid plans).

Bolwin7mo ago

Really? They must've switched recently cause that was around before kimi came out

1 more reply

frankfrank13 · 7mo ago

I find that Kimi K2 looks the best, but i've noticed the time is often wrong!

Mistletoe7mo ago

Qwen's clocks are highly entertaining. Like if you asked an alien "make me a clock".

bArray7mo ago

It could be that the prompt is accidentally (or purposefully) more optimised for Kimi K2, or that Kimi K2 is better trained on this particular data. LLMs need "prompt engineers" for a reason: to get the most out of a particular model.

bigfishrunning7mo ago

How much engineering do prompt engineers do? Is it engineering when you add "photorealistic. correct number of fingers and teeth. High quality." to the end of a prompt?

we should call them "prompt witch doctors" or maybe "prompt alchemists".

9 more replies

andix7mo ago

I think the selection of models is a bit off. Haiku instead of Sonnet for example. Kimi K2's capabilities are closer to Sonnet than to Haiku. GPT-5 might be in the non-reasoning mode, which routes to a smaller model.

1 more reply

energy123 · 7mo ago

Goes to show the "frontier" is not really one frontier. It's a social/mathematical construct that's useful for a broad comparison, but if you have a niche task, there's no substitute for trying the different models.

woodson7mo ago

Just use something like DSPy/Ax and optimize your module for any given LLM (based on sample data and metrics) and you’re mostly good. No need to manually wordsmith prompts.

observationist7mo ago

It's not fair to use prompts tailored to a particular model when doing comparisons like this - one shot results that generalize across a domain demonstrate solid knowledge of the domain. You can use prompting and context hacking to get any particular model to behave pseudo-competently in almost any domain, even the tiny <1B models, for some set of questions. You could include an entire framework and model for rendering clocks and times that allowed all 9 models to perform fairly well.

This experiment, however, clearly states the goal with this prompt: `Create HTML/CSS of an analog clock showing ${time}. Include numbers (or numerals) if you wish, and have a CSS animated second hand. Make it responsive and use a white background. Return ONLY the HTML/CSS code with no markdown formatting.`

An LLM should be able to interpret that, and should be able to perform a wide range of tasks in that same style - countdown timers, clocks, calendars, floating quote bubble cycling through list of 100 pithy quotations, etc. Individual, clearly defined elements should have complex representations in latent space that correspond to the human understanding of those elements. Tasks and operations and goals should likewise align with our understanding. Qwen 2.5 and some others clearly aren't modeling clocks very well, or maybe the html/css rendering latents are broken. If you pick a semantic axis(like analog clocks), you can run a suite of tests to demonstrate their understanding by using limited one-shot interactions.

Reasoning models can adapt on the fly, and are capable of cheating - one shots might have crappy representations for some contexts, but after a lot of repetition and refinement, as long as there's a stable, well represented proxy for quality somewhere in the semantics it understands, it can deconstruct a task to fundamentals and eventually reach high quality output.

These types of tests also allow us to identify mode collapses - you can use complex sophisticated prompting to get most image models to produce accurate analog clocks displaying any time, but in the simple one shot tests, the models tend to only be able to produce the time 10:10, and you'll get wild artifacts and distortions if you try to force any other configuration of hands.

Image models are so bad at hands that they couldn't even get clock hands right, until recently anyway. Nano banana and some other models are much better at avoiding mode collapses, and can traverse complex and sophisticated compositions smoothly. You want that same sort of semantic generalization in text generating models, so hopefully some of the techniques cross over to other modalities.

I keep hoping they'll be able to use SAE or some form of analysis on static weight distributions in order to uncover some sort of structural feature of mode collapse, with a taxonomy of different failure modes and causes, like limited data, or corrupt/poisoned data, and so on. Seems like if you had that, you could deliberately iterate on, correct issues, or generate supporting training material to offset big distortions in a model.

2 more replies

nightpool7mo ago

It would be cool to also AI generate the favicon using some sort of image model.

paulddraper7mo ago

Kimi K2 is legitimately good.

oaktowner7mo ago

Perhaps Qwen 2.5 should be known as Dali 2.‽

stogot7mo ago

When I clicked, everything was garbage except Grok and DeepSeek. kimi was the worst clock

abixb7mo ago

>Qwen 2.5's clocks, on the other hand, look like they never make it out of the womb.

More like fell headfirst into the ground.

I'm disappointed with Gemini 2.5 (not sure Pro or Flash) -- I've personally had _fantastic_ results with Gemini 2.5 Pro building PWA, especially since the May 2025 "coding update." [0]

[0] https://blog.google/products/gemini/gemini-2-5-pro-updates/

dilap7mo ago

I'm a huge K2 fan, it has a personality that feels very distinct from other models (not sycophantic at all), and is quite smart. Also pretty good at creative writing (tho not 100% slop free).

K2 hosted on groq is pretty crazy for intelligence/second. (Low rate limits still, tho.)

basch7mo ago

my GPT-4o was 100% perfect on the first click. Since then, garbage. Gemini 2.5 perfect on the 3rd click.

buffaloPizzaBoy7mo ago

Right as you said that, I checked kimi k2’s “clock” and it was just the ascii art: ¯\_(ツ)_/¯

I wonder if that is some type of fallback for errors querying the model, or k2 actually created the html/css to display that.

kbar13 · 7mo ago

i noticed the second hand is off tho. gemini has the most accurate one.

wowczarek7mo ago

Interestingly, either I'm _hallucinating_ this, or DeepSeek started to consistently show a clock without failures and with good time, where it previously didn't. ...aaand as I was typing this, it barfed a train wreck. Never mind, move along... No, wait, it's good again, no, wait...

munro7mo ago· 7 in thread

Amazing. Some people who use LLMs for soft outcomes are so enamored with them that they disagree with me when I say to be careful, they're not perfect -- this is such a great non-technical way to explain the reality I'm seeing when using them on hard-outcome coding/logic tasks. "Hey this test is failing", LLM deletes test, "FIXED!"

derbOac7mo ago

Something that struck me when I was looking at the clocks is that we know what a clock is supposed to look and act like.

What about when we don't know what it's supposed to look like?

Lately I've been wrestling with the fact that unlike, say, a generalized linear model fit to data with some inferential theory, we don't have a theory or model for the uncertainty about LLM products. We recognize when it's off about things we know are off, but don't have a way to estimate when it's off other than to check it against reality, which is probably the exception to how it's used rather than the rule.

ehnto7mo ago

I need to be delicate with wording here, but this is why it's a worry that all the least intelligent people you know could be using AI.

It's why non-coders think it's doing an amazing job at software.

But it's worryingly why using it for research, where you necessarily don't know what you don't know, is going to trip up even smarter people.

1 more reply

munro7mo ago

I built an ML classifier for product categories way back, as I added more classes/product types, individual class PR metrics improved--I kept adding more and more until I ended up with ~2,000 classes.

My intuition is at the start when I was like "choose one of these 10 or unknown", that unknown left a big gray area, so as I added more classes the model could say "I know it's not X, because it's more similar to Y"

I feel like in this case though, the broken clocks are broken because they don't serve the purpose of visually transmitting information, they do look like clocks tho. I'm sure if you fed the output back into the LLM and ask what time it is it would say IDK, or more likely make something up and be wrong. (at least the egregious ones where the hands are flying everywhere)

worldsayshi7mo ago

Yeah it seems crazy to use LLM on any task where the output can't be easily verified.

palmotea7mo ago

> Yeah it seems crazy to use LLM on any task where the output can't be easily verified.

I disagree, those tasks are perfect for LLMs, since a bug you can't verify isn't a problem when vibecoding.

mopsi7mo ago

  > "Hey this test is failing", LLM deletes test, "FIXED!"

A nice continuation of the tradition of folk stories about supernatural entities like teapots or lamps that grant wishes and take them literally. "And that's why, kids, you should always review your AI-assisted commits."

markatkinson7mo ago

To be fair I'd probably also delete the test.

ryandrake7mo ago· 6 in thread

I've been struggling all week trying to get Claude Code to write code to produce visual (not the usual, verifiable, text on a terminal) output in the form of a SDL_GPU rendered scene consisting of the usual things like shaders, pipelines, buffers, textures and samplers, vertex and index data and so on, and boy it just doesn't seem to know what it's doing. Despite providing paragraphs-long, detailed prompts. Despite describing each uniform and each matrix that needs to be sent. Despite giving it extremely detailed guidance about what order things need to be done in. It would have been faster for me to just write the code myself.

When it fails a couple of times it will try to put logging in place and then confidently tell me things like "The vertex data has been sent to the renderer, therefore the output is correct!" When I suggest it take a screenshot of the output each time to verify correctness, it does, and then declares victory over an entirely incorrect screenshot. When I suggest it write unit tests, it does so, but the tests are worthless and only test that the incorrect code it wrote is always incorrect in the same ways.

When it fails even more times, it will get into this what I like to call "intern engineer" mode where it just tries random things that I know are not going to work. And if I let it keep going, it will end up modifying the entire source tree with random "try this" crap. And each iteration, it confidently tells me: "Perfect! I have found the root cause! It is [garbage bullshit]. I have corrected it and the code is now completely working!"

These tools are cute, but they really need to go a long way before they are actually useful for anything more than trivial toy projects.

poszlem7mo ago

I’m not sure if it's just me, but I've also noticed Claude becoming even more lazy. For example, I've asked it several times to fix my tests. It'll fix four or five of them, then start struggling with the next couple, and suddenly declare something like: "All done, fixed 5 out of 10 tests. I can’t fix the remaining ones", followed by a long, convoluted explanation about why that’s actually a good thing.

__MatrixMan__7mo ago

I don't know if it has gotten worse, but I definitely find Claude is way too eager to celebrate success when it has done nothing.

It's annoying but I prefer it to how Gemini gets depressed if it takes a few tries to make progress. Like, thanks for not gaslighting me, but now I'm feeling sorry for a big pile of numbers, which was not a stated goal in my prompt.

rossant7mo ago

Have you tried OpenAI Codex with GPT5.1? I'm using it for similar GPU rendering stuff and it appears to do an excellent job.

fancy_pantser7mo ago

Have you given using MCPs to provide documentation and examples a shot? I always have to bring in docs since I don't work in Python and TS+React (which it seems more capable at) and force it to review those in addition to any specification. e.g. Context7

ryandrake7mo ago

Haven't looked into MCPs yet. Thanks for the suggestion!

jamilton7mo ago

I know this has been said many times before, but I wonder why this is such a common outcome. Maybe from negative outcomes being underrepresented in the training data? Maybe that plus being something slightly niche and complex?

The screenshot method not working is unsurprising to me, VLLMs visual reasoning is very bad with details because they (as far as I understand) do not really have access to those details, just the image embedding and maybe an OCR'd transcript.

porphyra7mo ago· 5 in thread

LLMs can't "look" at the rendered HTML output to see if what they generated makes sense or not. But there ought to be a way to do that right? To let the model iterate until what it generates looks right.

Currently, at work, I'm using Cursor for something that has an OpenGL visualization program. It's incredibly frustrating trying to describe bugs to the AI because it is completely blind. Like I just wanna tell it "there's no line connecting these two points but there ought to be one!" or "your polygon is obviously malformed as it is missing a bunch of points and intersects itself" but it's impossible. I end up having to make the AI add debug prints to, say, print out the position of each vertex, in order to convince it that it has a bug. Very high friction and annoying!!!

firtoz7mo ago

Cursor has this with their "browser" function for web dev, quite useful

You can also give it an MCP setup so that it can send a screenshot to the conversation, though I'm unsure if anyone has made an easy enough "take screenshot of a specific window id" kind of MCP, so it may need to be built first

I guess you could also ask it to build that mcp for you...

EMM_386 · 7mo ago

You can absolutely do this. In fact, with Claude Anthropic encourages you to send it screenshots. It works very well if you aren't expecting pixel-perfection.

YMMV with other models but Sonnet 4.5 is good with things like this - writing the code, "seeing" the output and then iterating on it.

pil0u7mo ago

I had some success providing screenshots to Cursor directly. It worked well for web UIs as well as generated graphs in Python. It makes them a bit less blind, though I feel more iterations are required.

fragmede7mo ago

Claude totally can, same with ChatGPT. Upload a picture to either one of them via the app and tell it there's no line where there should be. There’s some plumbing involved to get it to work in Claude code or codex, but yes, computers can "see". If you have lm-server, there's tons of non-text models you can point your code at.

TheKidCoder7mo ago

Kinda - hand-waving over the question of whether an LLM can really "look", you can connect Cursor to a Puppeteer MCP server, which will allow it to iterate with "eyes" by using Puppeteer to screenshot its own output. Still has issues, but it does solve really silly mistakes often simply by having this MCP available.

kburman7mo ago· 5 in thread

These types of tests are fundamentally flawed. I was able to create a perfect clock using Gemini 2.5 Pro - https://gemini.google.com/share/136f07a0fa78

Drew_7mo ago

The website is regenerating the clocks every minute. When I opened it, Gemini 2.5 was the only working one. Now, they are all broken.

Also, your example is not showing the current time.

1 more reply

dwringer7mo ago

Even Gemini Flash did really well for me[0] using two prompts - the initial query and one to fix the only error I could identify.

> Please generate an analog clock widget, synchronized to actual system time, with hands that update in real time and a second hand that ticks at least once per second. Make sure all the hour markings are visible and put some effort into making a modern, stylish clock face.

Followed by:

> Currently the hands are working perfectly but they're translated incorrectly making then uncentered. Can you ensure that each one is translated to the correct position on the clock face?

[0] https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...

allenu7mo ago

I don't think this is a serious test. It's just an art piece to contrast different LLMs taking on the same task, and against themselves since it updates every minute. One minute one of the results was really good for me and the next minute it was very, very bad.

jmdeon7mo ago

Aren't they attempting to also display current time though? Your share is a clock starting at midnight/noon. Kimi K2 seems to be the best on each refresh.

sinak7mo ago

How are they flawed?

1 more reply

em3rgent0rdr7mo ago· 4 in thread

Most look like they were done by a beginner programmer on crack, but every once in a while a correct one appears.

shafoshaf7mo ago

It's interesting how drawing a clock is one of the primary signals for dementia. https://www.verywellhealth.com/the-clock-drawing-test-98619

2 more replies

pixl977mo ago

DeepSeek and Kimi seem to have correct ones most of the time I've looked.

2 more replies

morkalork7mo ago

I'd say more like a blind programmer in the early stages of dementia. Able to write code, unable to form a mental image of what it would render as and can't see the final result.

energy1237mo ago

If they can identify which one is correct, then it's the same as always being correct, just with an expensive compute budget.

ugh1237mo ago· 3 in thread

Cool, and marginally informative on the current state of things, but kind of a waste of energy given everything is re-done every minute to compare. We'd probably only need a handful of each to see the meaningful differences.

whoisjuan7mo ago

It's actually quite fascinating if you watch it for 5 minutes. Some models are overall bad, but others nail it in one minute and butcher it in the next.

It's perhaps the best example I have seen of model drift driven by just small, seemingly unimportant changes to the prompt.

4 more replies

energy1237mo ago

I sort of assumed they cached like 30 inferences and just repeat them, but maybe I'm being too cynical.

ascorbic7mo ago

The energy usage is minuscule.

2 more replies

anon_cow11117mo ago· 2 in thread

I'm having a hard time believing this site is honest, especially with how ridiculous the scaling and rotation of numbers is for most of them. I dumped his prompt into chatgpt to try it myself and it did create a very neat clock face with the numbers at the correct position+animated second hand, it just got the exact time wrong, being a few hours off.

Edit: the time may actually have been perfect now that I account for my isp's geo-located time zone

Zopieux7mo ago

On the contrary, in my experience this is very typical of the average failure mode / output of early 2025 LLMs for HTML or SVG.

perfmode7mo ago

i read that the OP limited the output to 2000 tokens.

2 more replies

earth2mars7mo ago· 2 in thread

https://gemini.google.com/share/00967146a995 works perfectly fine with gemini 2.5 pro

lanewinfield7mo ago

nice. I restrict to 2000 tokens for mine, how many was that?

esafak7mo ago

how do you do that?

2 more replies

Waterluvian7mo ago· 2 in thread

How do they do time without JavaScript? Is there an API I’m not aware of?

bloppe7mo ago

CSS animation. It's not the real time. Just a hypothetical time.

1 more reply

bhandziuk7mo ago

Looks like css keyframes

PeterStuer7mo ago· 2 in thread

Why? This is diagonal to how LLMs work, and trivially solved by a minimal hybrid front/sub system.

bayindirh7mo ago

Because LLMs are touted to be the silver bullet of silver bullets. Built upon the world's knowledge, and with the capacity to call upon updated information with agents, they ought to have rivaled the top programmers 3 days ago.

1 more reply

em3rgent0rdr7mo ago

To gauge.

kylecazar7mo ago· 1 in thread

Non-determinism at its finest. The clock is perfect, the refresh happens, the clock looks like a Dali painting.

jeremycarter7mo ago

Last year I wrote a simple system using Semantic Kernel, backed by functions inside Microsoft Orleans, which for the most part was a business logic DSL processor by LLM. Your business logic was just text, and you gave it the operation as text.

Nothing could be relied upon to be deterministic, it was so funny to see it try to do operations.

Recently I re-ran it with newer models and it was drastically better, especially with temperature tweaks.

anotheryou7mo ago· 1 in thread

Claude Sonnet 4.5 with a little thinking: https://imgur.com/a/zcJOnKy

no thinking: better clock but not current time (the prompt is confusing here though): https://imgur.com/a/kRK3Q18

themgt7mo ago

Just saw Gemini 2.5 with a little thinking: https://imgur.com/a/nypRD7x

ada19817mo ago· 1 in thread

Sonnet 4.5 did this easily https://claude.ai/public/artifacts/c1bb5d57-573b-49e0-9539-7...

fouc7mo ago

The catch was that it was limited to 2000 tokens, i.e. the results get cut off once it hits that.

1 more reply

S0y7mo ago· 1 in thread

To be fair, this is a deceptively hard task.

bobbylarrybobby7mo ago

Without AI assistance, this should take ~10–15 minutes for a human. Maybe add 5 minutes if you're not allowed to use d3.

3 more replies

adi_kurian7mo ago· 1 in thread

Think this is just prompt eng tbh. One shot Haiku 3.5 (https://claude.ai/share/66c17968-485e-4d15-974b-4f6958e1e2fd) decent looking too.

Got it to work on gpt 3.5T w modified prompt (albeit not as good - https://pastebin.com/gjEVSEcJ)

`single html file, working analog clock showing current time, numbers positioned (aligned) correctly via trig calc (dynamic), all three hands, second hand ticks, 400px, clean AF aesthetic R/Greenberg Associates circa 2017. empathy, hci, define > design > implement.`

fouc7mo ago

The catch was that it was limited to 2000 tokens, i.e. the results get cut off once it hits that.

1 more reply

anonzzzies7mo ago· 1 in thread

Sonnet 4.5 does it flawlessly. Tried 8 times.

fouc7mo ago

The catch was that it was limited to 2000 tokens, i.e. the results get cut off once it hits that.

syx7mo ago· 1 in thread

I’m very curious about the monthly bill for such a creative project, surely some of these are pre rendered?

coffeecoders7mo ago

Napkin math:

9 AIs × 43,200 minutes = 388,800 requests/month

388,800 requests × 200 tokens = 77,760,000 tokens/month ≈ 78M tokens

Cost varies from 10 cents to $1 per 1M tokens.

Using the mid-price, the cost is around $50/month.
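The napkin math above can be redone as a small script so the assumptions are easy to vary. This is only a sketch: the 30-day month, the $0.55 midpoint price, and the function name `monthly_cost` are illustrative assumptions, and the 2000-token case simply reuses the per-clock cap mentioned elsewhere in the thread.

```python
MODELS = 9
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200, assuming a 30-day month


def monthly_cost(tokens_per_request: int, usd_per_million_tokens: float) -> float:
    """Estimated monthly spend in USD for one request per model per minute."""
    requests = MODELS * MINUTES_PER_MONTH           # 388,800 requests/month
    tokens = requests * tokens_per_request
    return tokens / 1_000_000 * usd_per_million_tokens


if __name__ == "__main__":
    # 200 tokens is the figure used in the comment; 2000 is the stated output cap.
    for tokens in (200, 2000):
        for price in (0.10, 0.55, 1.00):            # assumed low / mid / high prices
            cost = monthly_cost(tokens, price)
            print(f"{tokens} tokens/request at ${price:.2f}/M: ${cost:,.2f}/month")
```

At 200 tokens per request and the $0.55 midpoint this works out to about $43/month; if every clock used the full 2000-token cap at $1 per 1M tokens it would be about $778/month.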

---

Hopefully, the OP has this endpoint protected - https://clocks.brianmoore.com/api/clocks?time=11:19AM

2 more replies

rtcode_io7mo ago· 1 in thread

See https://clock.rt.ht/::code

AI-optimized <analog-clock>!

People expect perfection on first attempt. This took a brief joint session:

HI: define the custom element API design (attribute/property behavior) and the CSS parts

AI: draw the rest of the f… owl

speedgoose7mo ago

This is a white page, am I missing something?

1 more reply

kfarr7mo ago· 1 in thread

Add some voting and you got yourself an AI World Clock arena! https://artificialanalysis.ai/image/arena

BrandoElFollito7mo ago

Thank you very much.... It was a fun game until I got to the prompt

Place a baby elephant in the green chair

I cannot unsee what I saw and it is 21:30 here so I have an hour or so to eliminate the picture from my mind or I will have nightmares.

hansmayer7mo ago· 1 in thread

Very funny. It seems Qwen generates the funniest outputs :)

csours7mo ago

Oh, Qwen, buddy, you sure are TRYING

system27mo ago· 1 in thread

Ask Claude or ChatGPT to write it in Python, and you will see what they are capable of. HTML + CSS has never been the strong suit of any of these models.

camalouu7mo ago

Claude generates some js/css stuff even when i don't ask for it. I think Claude itself at least believes he is good at this.

xyproto7mo ago· 1 in thread

Try adding to the prompt that it has a PhD in Computer Science and has many methods for dealing with complexity.

This gives better results, at least for me.

bigfishrunning7mo ago

Why does that give better results? Is this phenomenon measurable? How would "you have a phd in computer science" change its ability to interpret prose? Every interaction with an LLM seems like superstition.

1 more reply

abathologist7mo ago· 1 in thread

This is great. If you think that the phenomenon of human-like text generation evinces human-like intelligence, then this should be taken to evince that the systems likely have dementia. https://en.wikipedia.org/wiki/Montreal_Cognitive_Assessment

AIorNot7mo ago

Imagine if I asked you to draw as pixels and operate a clock via html or create a jpeg with a pencil and paper and have it be accurate.. I suspect your handcoded work would be off by an order of magnitude in comparison

zkmon7mo ago· 1 in thread

Was Claude banned from this Olympics?

giancarlostoro7mo ago

Haiku is the lightweight Claude model, I'm not sure why they picked the weaker model.

RugnirViking7mo ago· 1 in thread

What's going on with Kimi K2 being reasonable/so unique in so many of these benchmarks I've seen recently? I will have to try it out further for stuff. Is it any good at programming?

Bolwin7mo ago

Yes, it trades blows with glm for the best open source model

bigbluedots7mo ago· 1 in thread

Is there a "draw a pelican riding a bicycle" version?

padolsey7mo ago

We've done this! https://weval.org/analysis/visual__pelican/f141a8500de7f37f/...

accrual7mo ago· 1 in thread

I love that GPT-5 is putting the clock hands way outside the frame and just generally is a mess. Maybe we'll look back on these mistakes just like watching kids grow up and fumble basic tasks. Humorous in its own unique way.

palmotea7mo ago

> Maybe we'll look back on these hilarious mistakes just like watching kids grow up and fumble basic tasks.

Or regret: "why didn't we stop it when we could?"

larodi7mo ago· 1 in thread

would be gr8t to also see the prompt this was done with

creade7mo ago

The ? has "Create HTML/CSS of an analog clock showing ${time}. Include numbers (or numerals) if you wish, and have a CSS animated second hand. Make it responsive and use a white background. Return ONLY the HTML/CSS code with no markdown formatting."
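For reference, the geometry a correct answer to this prompt has to get right is small. A minimal sketch of the hand rotations for a given time, assuming angles are measured in degrees clockwise from the 12 o'clock position (the function name `hand_angles` is made up for illustration and is not taken from the site):

```python
def hand_angles(hour: int, minute: int, second: int = 0) -> tuple[float, float, float]:
    """Return (hour, minute, second) hand rotations in degrees clockwise from 12."""
    second_angle = second * 6.0                       # 360 degrees / 60 seconds
    minute_angle = minute * 6.0 + second / 10.0       # minute hand creeps as seconds pass
    hour_angle = (hour % 12) * 30.0 + minute * 0.5 + second / 120.0
    return hour_angle, minute_angle, second_angle


if __name__ == "__main__":
    h, m, s = hand_angles(11, 19)
    print(f"hour hand: {h} deg, minute hand: {m} deg, second hand: {s} deg")
```

For 11:19 this gives 339.5° for the hour hand and 114° for the minute hand; a model that drops the `minute * 0.5` term would park the hour hand exactly on the 11.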

1 more reply

zkmon7mo ago

Why are DeepSeek and Kimi beating other models by such a margin? Is this to do with their specialization for this task?

1 more reply

bongodongobob7mo ago

Weird. Sonnet 4.5 one shotted it with:

Create an interactive artifact of an analog clock face that keeps time properly.

https://claude.ai/public/artifacts/75daae76-3621-4c47-a684-d...

paxys7mo ago

Something I'm not able to wrap my head around is that Kimi K2 is the only model that produces a ticking second hand on every attempt while the rest of them are always moving continuously. What fundamental differences in model training or implementation can result in this disparity? Or was this use case programmed in K2 after the fact?

mandolingual7mo ago

Always interesting/uncanny when AI is tested with human cognitive tests https://www.psychdb.com/cognitive-testing/clock-drawing-test.

busymom07mo ago

Because a new clock is generated every minute, looks like simply changing the time by a digit causes the result to be significantly different from the previous iteration.

edfletcher_t1377mo ago

Lack of Claude is a glaring oversight given how popular it is as an agentic coding model...

gwbas1c7mo ago

Reminds me of the Alzheimer's "draw a clock" test.

Makes me think that LLMs are like people with dementia! Perhaps it's the best way to relate to an LLM?

chaosprint7mo ago

This is such a great idea! Surprisingly, Kimi K2 is the only one without any obvious problems. And it is not even the complete K2 Thinking version? This made me reread this article from a few days ago:

https://entropytown.com/articles/2025-11-07-kimi-k2-thinking...

amelius7mo ago

Maybe they can ask Sora to make variations of:

https://slate.com/human-interest/2016/07/martin-baas-giant-r...

cornonthecobra7mo ago

I like Deepseek v3.1's idea of radially-aligning each hour number's y-axis ("1" is rotated 30° from vertical, "2" at 60°, etc.). It would be even better if the numbers were rotated anticlockwise.

I'm not sure what Qwen 2.5 is doing, but I've seen similar in contemporary art galleries.

wanderingmind7mo ago

The more I look at it, the more I realise the reason for the cognitive overload I feel when using LLMs for coding. The same prompt to the same model for a pretty straightforward task produces such wildly different outputs. Now, imagine how wildly different the code outputs are when trying to generate two different logical functions. The casings are different, commenting is different, no semantic continuity. Now maybe if I give detailed prompts and ask it to follow, it might follow, but from my experience prompt adherence is not so great either. I am at the stage where I just use LLMs as auto correct, rather than using it for any generation.

buzzm7mo ago

Wonderful. I don’t particularly care if it is or is not a valid test. I like the “wrong” renderings better. Some are hilarious, some … inspired.

Bengalilol7mo ago

Qwen doesn't care about clocks, it goes the Dali way, without melting.

It even made a Nietzsche clock (I saw one <body> </body> which was surprisingly empty).

It definitely wins the creative award.

collimarco7mo ago

In any case those clocks are all extremely inaccurate, even if AI could build a decent UI (which is not the case).

Some months ago I published this site for fun: https://timeutc.com There's a lot of code involved to make it precise to the ms, including adjusting based on network delay, frame refresh rate instead of using setTimeout and much more. If you are curious take a look at the source code.
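One common way to adjust for network delay is the NTP-style offset estimate from four timestamps. The sketch below illustrates that general technique only; it is an assumption that it resembles anything timeutc.com actually does, and the function name `estimate_offset` is made up for the example.

```python
def estimate_offset(t0: float, t1: float, t2: float, t3: float) -> tuple[float, float]:
    """Estimate clock offset and round-trip delay, NTP style.

    t0: client clock when the request was sent
    t1: server clock when the request arrived
    t2: server clock when the response was sent
    t3: client clock when the response arrived
    Returns (offset, delay) in the same unit as the inputs, where offset is
    how far the server clock is ahead of the client clock. The estimate
    assumes the outbound and return trips take roughly the same time.
    """
    offset = ((t1 - t0) + (t2 - t3)) / 2
    delay = (t3 - t0) - (t2 - t1)
    return offset, delay


if __name__ == "__main__":
    # Server is 50 ms ahead, each network leg takes 10 ms, server spends 1 ms.
    print(estimate_offset(1000, 1060, 1061, 1021))
```

The client would then display `local_time + offset`, and could repeat the measurement and keep the sample with the smallest delay to reduce jitter.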

ticulatedspline7mo ago

This is cool, interesting to see how consistent some models are (both in success and failure)

I tried gpt-oss-20b (my go-to local) and it looks ok though not very accurate. It decided to omit numbers. It also took 4500 tokens while thinking.

I'd be interested in seeing it with some more token leeway as well as comparing two or more similar prompts. like using "current time" instead of "${time}" and being more prescriptive about including numbers

eastbound7mo ago

Security-wise, this is a website that takes the straight output of AI and serves it for execution on their website.

I know, developers do the same, but at least they check it in Git to notice their mistakes. Here is an opportunity for AI to call a Google Authentication on you, or anything else.

nasir7mo ago

where's opus/sonnet! very curious on that!

whimsicalism7mo ago

Kimi K2 is obviously the best, but gpt-5 has the most gorgeous ones when it works

Vera_Wilde7mo ago

It's really beautiful! Super clean UI.

The thing I always want from timezone tools is: “Let me simulate a date after one side has shifted but the other hasn’t.”

Humans do badly with DST offset transitions; computers do great with them.

orly017mo ago

What does it mean that each model is allowed 2000 tokens to generate its clock?

arendtio7mo ago

Pretty cool already!

I use 'Sonnet 4.5 thinking' and 'Composer 1' (Cursor) the most, so it would be interesting to see how such SOTA models perform in this task.

fschuett7mo ago

Reminds me of this: https://www.youtube.com/watch?v=OGbhJjXl9Rk

bpt37mo ago

It's wild how much the output varies for the same model for each run.

I'm not sure if this was the intent or not, but it sure highlights how unreliable LLMs are.

bigbluedots7mo ago

I just realized I'm running late, it's almost -2!

More seriously, I'd love to see how the models perform the same task with a larger token allowance.

aavshr7mo ago

just curious, why not the sonnet models? In my personal experience, Anthropic's Sonnet models are the best when it comes to things like this!

bwhiting23567mo ago

You should render it, show an image to the model and allow it to iterate. No person has to one-shot code without seeing what it looks like.
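The render-and-iterate idea can be sketched as a small loop. Everything here is hypothetical: `generate`, `render`, and `review` are placeholders for a model call, a headless-browser screenshot, and a vision check, passed in as plain callables so the sketch does not pretend to know any particular API.

```python
from typing import Callable, Optional


def refine(
    generate: Callable[[Optional[str]], str],
    render: Callable[[str], bytes],
    review: Callable[[bytes], Optional[str]],
    max_rounds: int = 3,
) -> str:
    """Generate code, render it, and feed visual feedback back until review passes.

    generate(feedback) returns code; feedback is None on the first round.
    render(code) returns a screenshot as bytes.
    review(image) returns None if the image looks right, else a complaint.
    """
    feedback: Optional[str] = None
    code = ""
    for _ in range(max_rounds):
        code = generate(feedback)          # first call gets no feedback
        feedback = review(render(code))    # "look" at the rendered result
        if feedback is None:               # nothing to complain about: done
            break
    return code
```

The loop returns the last attempt even if it never passes review, so a caller that cares would want to check the final feedback as well.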

boxedemp7mo ago

That's super neat. I'll keep checking back to this site as new models are released. It's an interesting benchmark.

3oil37mo ago

I wonder which model will silently be updated and suddenly start drawing clocks with Audemars-Piguet-level kind of complications.

wewtyflakes7mo ago

It is funny to see the performance improve across many of the models, somewhat miraculously, throughout the day today.

JamesAdir7mo ago

I believe that in a day or two, the companies will address this and it would be solved by them for that use case

bitwize7mo ago

I'm reminded of the "draw a clock" test neurologists use to screen for dementia and brain damage.

maxdo7mo ago

Selection of Western models is weird: no GPT-5.1 or Opus 4.1 (which nailed it perfectly in something I quickly tested).

shahzaibmushtaq7mo ago

Interesting idea!

Why is a new clock being rendered every minute? Or are AI models evolving and improving every minute?

josfredo7mo ago

Watching these gives me a strong feeling of unease. Art-wise, it is a very beautiful project.

DeathArrow7mo ago

How can Deepseek and Kimi get it right while Haiku, Gemini and GPT are making a mess?

__fst__7mo ago

This is why we need TeraWatt DCs, to generate code for world clocks every minute.

imchillyb7mo ago

I love qwen, it tries so hard with its little paddle and never gets anywhere.

HarHarVeryFunny7mo ago

Looks like we've got a new Turing test here: "draw me a clock"

Imanari7mo ago

Qwen's clocks are hilarious

esotericwarfare7mo ago

This is an AD for Kimi K2

novemp7mo ago

Oh cool, it's the schizophrenia clock-drawing test but for AI.

Zeraous7mo ago

How Kimi is better than other BILLION$ companies is really fun

teaearlgraycold7mo ago

Qwen 2.5 doing a surprisingly good job (as of right now).

0xCE07mo ago

Seems like Will's clock drawing test in Hannibal :)

stym067mo ago

If a human had done this, these would be at a museum

AlfredBarnes7mo ago

It's cool to see them get it right .....sometimes

ssl-37mo ago

This really needs to be an xscreensaver hack.

jcmontx7mo ago

Grok is impressive, I should give it a shot

lovegrenoble7mo ago

Are they part of the LLM training set?

gloosx7mo ago

anyone tried opening this from mobile? not a single clock renders correctly, almost looks like a joke on LLMs

surfingdino7mo ago

What a wonderfully visual example of the crap LLMs turn everything into. I am eagerly awaiting the collapse of the LLM bubble. JetBrains added this crap to their otherwise fine series of IDEs and now I have to keep removing randomly inserted import statements and keep fixing hallucinated names of functions suggested instead of the names of functions that I have already defined in the same file. Lack of determinism where we expect it (most of the things we do, tbh) is creating more problems than it is solving.

mstipetic7mo ago

GPT-5 is embarrassing itself. Kimi and DeepSeek are very consistently good. Wild that you can just download these models.

bananatron7mo ago

grok's looks like one of those clocks you'd find at a novelty shop

hollow-moe7mo ago

obviously they're all broken on firefox, no one uses firefox anyways

fnord777mo ago

whatever model Cursor uses was telling me the date was March 12, 2023

woopwoop7mo ago

The qwen clocks are art.

miohtama7mo ago

The new Turing time test

shubham_zingle7mo ago

not sure about the accuracy though, although shooting in the dark

lxe7mo ago

Honestly, I think if you track the performance of each over time, since these get regenerated once in a while, you can then have a very, very useful and cohesive benchmark.

silexia7mo ago

Grok is hilarious

baidoct7mo ago

GPT-5 looks broken

1yvino7mo ago

I wonder if the Qwen prompt would look like hallucination?

cyberjill7mo ago

666

adriatp7mo ago

deepseek representing

shevy-java7mo ago

Now that is actually creative.

Granted, it is not a clock - but it could be art. It looks like a Picasso. When he was drunk. And took some LSD.

jonplackett7mo ago

kimi is kicking ass

kwanbix7mo ago

What a waste of energy.

warpspin7mo ago

Lol. This is supposed to replace me at my job already?

Great experiment!

jsmo7mo ago

lol

awkwam7mo ago

Limiting the model to only use 2000 tokens while also asking it to output ONLY HTML/CSS is just stupid. It's like asking a programmer to perform the same task while removing half their brain and also forget about their programming experience. This is a stupid and meaningless benchmark.

baltimore7mo ago· 34 in thread

I'd be interested if anyone else is successful. Share how you did it!

Scene_Cast27mo ago

I've noticed that image models are particularly bad at modifying popular concepts in novel ways (way worse "generalization" than what I observe in language models).

emp173447mo ago

Maybe LLMs always fail to generalize outside their data set, and it’s just less noticeable with written language.

3 more replies

CobrastanJorji7mo ago

Also, they're fundamentally bad at math. They can draw a clock because they've seen clocks, but going further requires some calculations they can't do.

For example, try asking Nano Banana to do something simpler, like "draw a picture of 13 circles." It likely will not work.

deathanatos7mo ago

  Generate an image of a clock face, but instead of the usual 12 hour numbering, number it with 13 hours.

Gemini, 2.5 Flash or "Nano Banana" or whatever we're calling it these days. https://imgur.com/a/1sSeFX7

bar000n7mo ago

It should be pretty clear already that anything which is based (limited?) to communicating words/text can never grasp conceptual thinking.

We have yet to design a language to cover that, and it might be just a donquijotism we're all diving into.

4 more replies

andix7mo ago

I gave this "riddle" to various models:

jampa7mo ago

There are few examples of this as well:

https://www.reddit.com/r/singularity/comments/1fqjaxy/contex...

1 more reply

Recursing7mo ago

Claude has no problem with this: https://imgur.com/a/ifSNOVU

Maybe older models?

1 more reply

userbinator7mo ago

Basically a variation of https://en.wikipedia.org/wiki/Age_of_the_captain

echelon7mo ago

That's just a patch to the training data.

Once companies see this starting to show up in the evals and criticisms, they'll go out of their way to fix it.

rideontime7mo ago

What would the "patch" be? Manually create some images of 13-hour clocks and add them to the training data? How does that solution scale?

godelski7mo ago

s/13/17/g ;)

BrandoElFollito7mo ago

This is really cool. I tried to prompt gemini but every time I got the same picture. I do not know how to share a session (like it is possible with Chatgpt) but the prompts were

If a clock had 13 hours, what would be the angle between two of these 13 hours?

Generate an image of such a clock

No, I want the clock to have 13 distinct hours, with the angle between them as you calculated above

This is the same image. There need to be 13 hour marks around the dial, evenly spaced

... And its last answer was

You are absolutely right, my apologies. It seems I made an error and generated the same image again. I will correct that immediately.

Here is an image of a clock face with 13 distinct hour marks, evenly spaced around the dial, reflecting the angle we calculated.

And the very same clock, with 12 hours, and a 13th above the 12...

ryandrake7mo ago

This is probably my biggest problem with AI tools, having played around with them more lately.

"You're absolutely right! I made a mistake. I have now comprehensively solved this problem. Here is the corrected output: [totally incorrect output]."

2 more replies

notatoad7mo ago

you can click the share icon (the two-way branch icon, it doesn't look like apple's share icon) under the image it generates to share the conversation.

i'm curious if the clock image it was giving you was the same one it was giving me

https://gemini.google.com/share/780db71cfb73

1 more reply

edub7mo ago

I was able to have AI generate an image of this, not by diffusion/autoregressive generation but by having it write Python code to create the image.

giancarlostoro7mo ago

nl7mo ago

I do playing card generation and almost all struggle beyond the "6 of X"

My working theory is that they were trained really hard to generate 5 fingers on hands but their counting drops off quickly.

IAmGraydon7mo ago

energy1237mo ago

The hope was for this understanding to emerge as the most efficient solution to the next-token prediction problem.

Put another way, it was hoped that once the dataset got rich enough, developing this understanding is actually more efficient for the neural network than memorizing the training data.

The useful question to ask, if you believe the hope is not bearing fruit, is why. Point specifically to the absent data or the flawed assumption being made.

Or more realistically, put in the creative and difficult research work required to discover the answer to that question.

bobbylarrybobby7mo ago

echelon7mo ago

gpt-image-1 and Google Imagen understand prompts, they just don't have training data to cover these use cases.

gpt-image-1 and Imagen are wickedly smart.

The new Nano Banana 2 that has been briefly teased around the internet can solve incredibly complicated differential equations on chalk boards with full proof of work.

1 more reply

ryandrake7mo ago

1 more reply

Workaccount27mo ago

The problem is more likely the tokenization of images than anything. These models do their absolute worst when pictures are involved, but are seemingly miraculous at generalizing with just text.

1 more reply

godelski7mo ago

Yes, the problem is that these so called "world models" do not actually contain a model of the world, or any world

chanux7mo ago

Ah! This is so sad. The manager types won't be able to add an hour (actually, two) to the day even with AI.

snek_case7mo ago

From my experience they quickly fail to understand anything beyond a superficial description of the image you want.

atorodius7mo ago

That's less and less true

https://minimaxir.com/2025/11/nano-banana-prompts/

1 more reply

usui7mo ago

coffeecoders7mo ago

LLMs are terrible for out-of-distribution (OOD) tasks. You should use chain of thought suppression and give constraints explicitly.

My prompt to Grok:

---

Follow these rules exactly:

- There are 13 hours, labeled 1–13.

- There are 13 ticks.

- The center of each number is at angle: index * (360/13)

- Do not infer anything else.

- Do not apply knowledge of normal clocks.

Use the following variables:

HOUR_COUNT = 13

ANGLE_PER_HOUR = 360 / 13 // 27.692307°

Use index i ∈ [0..12] for hour marks:

angle_i = i * ANGLE_PER_HOUR

I want html/css (single file) of a 13-hour analog clock.

---

Output from grok.

https://jsfiddle.net/y9zukcnx/1/
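The angle rule in the prompt is easy to check by computing the mark positions directly. A rough sketch, assuming angles run clockwise from the top of the dial and screen-style coordinates (y grows downward); the function name `hour_marks` is invented for the example, and putting the highest number at index 0 is an assumption that mirrors where 12 sits on a normal dial, not something the prompt specifies.

```python
import math


def hour_marks(hour_count: int = 13, radius: float = 100.0) -> list[tuple[int, float, float, float]]:
    """Return (label, angle_deg, x, y) for each hour mark.

    Angles are measured clockwise from the top of the dial; x grows to the
    right and y grows downward, as in CSS/SVG coordinates.
    """
    step = 360.0 / hour_count                     # 27.692307... degrees for 13 hours
    marks = []
    for i in range(hour_count):
        angle = i * step
        label = hour_count if i == 0 else i       # assumption: highest number at the top
        x = radius * math.sin(math.radians(angle))
        y = -radius * math.cos(math.radians(angle))
        marks.append((label, angle, x, y))
    return marks


if __name__ == "__main__":
    for label, angle, x, y in hour_marks():
        print(f"{label:>2}: {angle:7.3f} deg at ({x:7.2f}, {y:7.2f})")
```

Calling `hour_marks(12)` reproduces the familiar 30° spacing, which makes it a quick sanity check before trying 13.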

chemotaxis7mo ago

> Follow these rules exactly:

"Here's the line-by-line specification of the program I need you to write. Write that program."

2 more replies

BrandoElFollito7mo ago

Well, that's cheating :) You asked it to generate code, which is ok because it does not represent a direct generated image of a clock.

Can grok generate images? What would the result be?

I will try your prompt on chatgpt and gemini

1 more reply

chiwilliams7mo ago

I'll also note that the output isn't quite right --- the top number should be 13 rather than 1!

1 more reply

NooneAtAll37mo ago

close enough, but digit at the top should be the highest, not 1 :/

lanewinfield7mo ago· 25 in thread

hi, I made this. thank you for posting.

I love clocks and I love finding the edges of what any given technology is capable of.

jdietrich7mo ago

I think you might have stumbled upon something surprisingly profound.

https://www.psychdb.com/cognitive-testing/clock-drawing-test

overfeed7mo ago

> Clock drawing is widely used as a test for assessing dementia

Interestingly, clocks are also an easy tell for when you're dreaming, if you're a lucid dreamer; they never work normally in dreams.

4 more replies

xrisk7mo ago

Maybe explainable via the fact that these tests are part of the LLM training set?

jorgesborges7mo ago

BHSPitMonkey7mo ago

I would think the way humans draw clocks has more in common with image generation models (which probably do a bit better with this task overall) than a language model producing SVG markup, though.

ACCount377mo ago

An amusing pattern that dates back to "1kg of steel is heavier of course" in GPT-3.5.

1 more reply

TheJoeMan7mo ago

Figure 6 with the square clock would be a cool modern art piece.

1 more reply

bspammer7mo ago

abixb7mo ago

We might be on to creating a new crowd-ranked LLM benchmark here.

1 more reply

nightpool7mo ago

Yes! Please do this

layer87mo ago

Not the best, but the most amusing.

smusamashah7mo ago

Please make it show last 5 (or some other number) of clocks for each model. It will be nice to see the deviation and variety for each model at a glance.

charliewallace7mo ago

Very cool! I also love clocks, especially weird ones, and recently put up this 3D Moebius Strip clock, hope you like it: https://www.mobiusclock.com

chemotaxis7mo ago

I applaud you for spending money to get it done.

AnonHP7mo ago

anigbrowl7mo ago

I really like this. The broken ones are sometimes just failures, but sometimes provide intriguing new design ideas.

jdiff7mo ago

1 more reply

ks20487mo ago

Nice job! Maybe let users click an example to see the raw source (LLM output)

brianjking7mo ago

This is an awesome benchmark. Officially one of my favorites now. Thank you for making this.

csours7mo ago

LOVE IT!

It would be really cool if I could zoom out and have everything scale properly!

Fabricio207mo ago

Why is this different per user? I sent this to a few friends and they all see different things from what i'm seeing, for the same time..?

samtheprogram7mo ago

It regenerates on page load. I find that pretty useful.

Grok 4 and Kimi nailed it the first time for me, then only Kimi on the second pass.

1 more reply

layer87mo ago

It’s different per minute, not per user.

hakcermani7mo ago

.. would you mind sharing the prompt .. in a gist perhaps .

ceroxylon7mo ago

They have it available on the site under the (?) button:

otterley7mo ago· 23 in thread

Watching this over the past few minutes, it looks like Kimi K2 generates the best clock face most consistently. I'd never heard of that model before today!

Qwen 2.5's clocks, on the other hand, look like they never make it out of the womb.

jquery7mo ago

komali27mo ago

> GPT and Gemini love to interrupt my novels and tell me certain behavior is illegal or immoral, and censor various anatomical words

Lol, are you using AI to create fan translations of erotic manga?

frizlab7mo ago

I knew of Kimi K2 because it’s the model used by Kagi to generate the AI answers when a query ends with a question mark.

OJFord7mo ago

It's also one of the few 'recommended' models in Kagi Assistant (multi-model ChatGPT basically, available on paid plans).

Bolwin7mo ago

Really? They must've switched recently cause that was around before kimi came out

frankfrank137mo ago

I find that Kimi K2 looks the best, but I've noticed the time is often wrong!

Mistletoe7mo ago

Qwen's clocks are highly entertaining. Like if you asked an alien "make me a clock".

bArray7mo ago

bigfishrunning7mo ago

How much engineering do prompt engineers do? Is it engineering when you add "photorealistic. correct number of fingers and teeth. High quality." to the end of a prompt?

we should call them "prompt witch doctors" or maybe "prompt alchemists".

andix7mo ago

energy1237mo ago

woodson7mo ago

Just use something like DSPy/Ax and optimize your module for any given LLM (based on sample data and metrics) and you’re mostly good. No need to manually wordsmith prompts.

observationist7mo ago

nightpool7mo ago

It would be cool to also AI generate the favicon using some sort of image model.

paulddraper7mo ago

Kimi K2 is legitimately good.

oaktowner7mo ago

Perhaps Qwen 2.5 should be known as Dali 2.‽

stogot7mo ago

When I clicked, everything was garbage except Grok and DeepSeek. Kimi was the worst clock.

abixb7mo ago

>Qwen 2.5's clocks, on the other hand, look like they never make it out of the womb.

More like fell headfirst into the ground.

I'm disappointed with Gemini 2.5 (not sure Pro or Flash) -- I've personally had _fantastic_ results with Gemini 2.5 Pro building PWA, especially since the May 2025 "coding update." [0]

[0] https://blog.google/products/gemini/gemini-2-5-pro-updates/

dilap7mo ago

I'm a huge K2 fan, it has a personality that feels very distinct from other models (not sycophantic at all), and is quite smart. Also pretty good at creative writing (tho not 100% slop free).

K2 hosted on Groq is pretty crazy for intelligence/second. (Low rate limits still, tho.)

basch7mo ago

My GPT-4o was 100% perfect on the first click. Since then, garbage. Gemini 2.5 was perfect on the 3rd click.

buffaloPizzaBoy7mo ago

Right as you said that, I checked kimi k2’s “clock” and it was just the ascii art: ¯\_(ツ)_/¯

I wonder if that is some type of fallback for errors querying the model, or k2 actually created the html/css to display that.

kbar137mo ago

i noticed the second hand is off tho. gemini has the most accurate one.

wowczarek7mo ago

munro7mo ago· 7 in thread

derbOac7mo ago

Something that struck me when I was looking at the clocks is that we know what a clock is supposed to look and act like.

What about when we don't know what it's supposed to look like?

ehnto7mo ago

I need to be delicate with wording here, but this is why it's a worry that all the least intelligent people you know could be using AI.

It's why non-coders think it's doing an amazing job at software.

But it's worryingly why using it for research, where you necessarily don't know what you don't know, is going to trip up even smarter people.

munro7mo ago

worldsayshi7mo ago

Yeah it seems crazy to use LLM on any task where the output can't be easily verified.

palmotea7mo ago

> Yeah it seems crazy to use LLM on any task where the output can't be easily verified.

I disagree, those tasks are perfect for LLMs, since a bug you can't verify isn't a problem when vibecoding.

mopsi7mo ago

  > "Hey this test is failing", LLM deletes test, "FIXED!"

markatkinson7mo ago

To be fair I'd probably also delete the test.

ryandrake7mo ago· 6 in thread

These tools are cute, but they really need to go a long way before they are actually useful for anything more than trivial toy projects.

poszlem7mo ago

__MatrixMan__7mo ago

I don't know if it has gotten worse, but I definitely find Claude is way too eager to celebrate success when it has done nothing.

rossant7mo ago

Have you tried OpenAI Codex with GPT5.1? I'm using it for similar GPU rendering stuff and it appears to do an excellent job.

fancy_pantser7mo ago

ryandrake7mo ago

Haven't looked into MCPs yet. Thanks for the suggestion!

jamilton7mo ago

porphyra7mo ago· 5 in thread

firtoz7mo ago

Cursor has this with their "browser" function for web dev, quite useful

I guess you could also ask it to build that mcp for you...

EMM_3867mo ago

You can absolutely do this. In fact, with Claude, Anthropic encourages you to send it screenshots. It works very well if you aren't expecting pixel-perfection.

YMMV with other models but Sonnet 4.5 is good with things like this - writing the code, "seeing" the output and then iterating on it.

pil0u7mo ago

fragmede7mo ago

TheKidCoder7mo ago

kburman7mo ago· 5 in thread

These types of tests are fundamentally flawed. I was able to create a perfect clock using Gemini 2.5 Pro - https://gemini.google.com/share/136f07a0fa78

Drew_7mo ago

The website is regenerating the clocks every minute. When I opened it, Gemini 2.5 was the only working one. Now, they are all broken.

Also, your example is not showing the current time.

dwringer7mo ago

Even Gemini Flash did really well for me[0] using two prompts - the initial query and one to fix the only error I could identify.

Followed by:

> Currently the hands are working perfectly but they're translated incorrectly, making them uncentered. Can you ensure that each one is translated to the correct position on the clock face?

[0] https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...

allenu7mo ago

jmdeon7mo ago

Aren't they attempting to also display current time though? Your share is a clock starting at midnight/noon. Kimi K2 seems to be the best on each refresh.

sinak7mo ago

How are they flawed?

em3rgent0rdr7mo ago· 4 in thread

Most look like they were done by a beginner programmer on crack, but every once in a while a correct one appears.

shafoshaf7mo ago

It's interesting how drawing a clock is one of the primary signals for dementia. https://www.verywellhealth.com/the-clock-drawing-test-98619

pixl977mo ago

DeepSeek and Kimi seem to have correct ones most of the time I've looked.

morkalork7mo ago

I'd say more like a blind programmer in the early stages of dementia. Able to write code, unable to form a mental image of what it would render as and can't see the final result.

energy1237mo ago

If they can identify which one is correct, then it's the same as always being correct, just with an expensive compute budget.
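A rough sketch of that idea (best-of-n sampling with a verifier), in case it helps. `generate` and `is_correct` are hypothetical callables standing in for a model call and a clock checker; nothing here is what the site actually does. The probability helper assumes each attempt succeeds independently with the same probability `p`, which real models may not satisfy.

```python
from typing import Callable, Optional


def best_of_n(
    generate: Callable[[], str],
    is_correct: Callable[[str], bool],
    n: int,
) -> Optional[str]:
    """Sample up to n candidates and return the first one the verifier accepts."""
    for _ in range(n):
        candidate = generate()      # hypothetical: one model inference
        if is_correct(candidate):   # hypothetical: a reliable checker
            return candidate
    return None                     # every attempt was rejected


def success_probability(p: float, n: int) -> float:
    """Chance that at least one of n independent attempts is correct."""
    return 1 - (1 - p) ** n
```

The compute cost grows linearly with `n`, while the failure rate shrinks geometrically, which is the "expensive compute budget" trade-off described above.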

ugh1237mo ago· 3 in thread

whoisjuan7mo ago

It's actually quite fascinating if you watch it for 5 minutes. Some models are overall bad, but others nail it in one minute and butcher it in the next.

It's perhaps the best example I have seen of model drift driven by just small, seemingly unimportant changes to the prompt.

energy1237mo ago

I sort of assumed they cached like 30 inferences and just repeat them, but maybe I'm being too cynical.

ascorbic7mo ago

The energy usage is minuscule.

anon_cow11117mo ago· 2 in thread

Edit: the time may actually have been perfect now that I account for my ISP's geo-located time zone

Zopieux7mo ago

On the contrary, in my experience this is very typical of the average failure mode / output of early 2025 LLMs for HTML or SVG.

perfmode7mo ago

I read that the OP limited the output to 2000 tokens.

earth2mars7mo ago· 2 in thread

https://gemini.google.com/share/00967146a995 works perfectly fine with gemini 2.5 pro

lanewinfield7mo ago

nice. I restrict to 2000 tokens for mine, how many was that?

esafak7mo ago

how do you do that?

Waterluvian7mo ago· 2 in thread

How do they do time without JavaScript? Is there an API I’m not aware of?

bloppe7mo ago

CSS animation. It's not the real time. Just a hypothetical time.

bhandziuk7mo ago

Looks like css keyframes
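For anyone curious how a JavaScript-free clock can still start at a given time, here is a minimal sketch. It assumes the time is known when the page is generated (the site appears to pass it in with the prompt) and uses a negative `animation-delay`, which makes a CSS animation begin part-way through its cycle. The class names `.hour`, `.minute`, `.second` and the helper names are made up for illustration; this is not the models' actual output.

```python
from datetime import time


def hand_angles(t: time) -> dict[str, float]:
    """Clockwise rotation in degrees from 12 o'clock for each hand."""
    seconds = t.second
    minutes = t.minute + seconds / 60
    hours = (t.hour % 12) + minutes / 60
    return {
        "hour": hours * 30.0,      # 360 degrees / 12 hours
        "minute": minutes * 6.0,   # 360 degrees / 60 minutes
        "second": seconds * 6.0,   # 360 degrees / 60 seconds
    }


def clock_css(t: time) -> str:
    """Build keyframe CSS whose hands start at time t and then keep turning."""
    # Seconds each hand needs for one full revolution.
    periods = {"hour": 43200, "minute": 3600, "second": 60}
    angles = hand_angles(t)
    rules = [
        "@keyframes spin { from { transform: rotate(0deg); } "
        "to { transform: rotate(360deg); } }"
    ]
    for name, period in periods.items():
        # A negative delay skips ahead by the fraction of a turn already elapsed.
        delay = -angles[name] / 360 * period
        rules.append(
            f".{name} {{ transform-origin: 50% 100%; "
            f"animation: spin {period}s linear infinite; "
            f"animation-delay: {delay:.1f}s; }}"
        )
    return "\n".join(rules)
```

After load the hands advance at the right rate, but the starting point is frozen at generation time, so the clock drifts from the real time if the page is served later.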

PeterStuer7mo ago· 2 in thread

Why? This is diagonal to how LLMs work, and trivially solved by a minimal hybrid front/sub system.

bayindirh7mo ago

em3rgent0rdr7mo ago

To gauge.

kylecazar7mo ago· 1 in thread

Non-determinism at its finest. The clock is perfect, the refresh happens, the clock looks like a Dali painting.

jeremycarter7mo ago

Nothing could be relied upon to be deterministic, it was so funny to see it try to do operations.

Recently I re-ran it with newer models and it was drastically better, especially with temperature tweaks.

anotheryou7mo ago· 1 in thread

Claude Sonnet 4.5 with a little thinking: https://imgur.com/a/zcJOnKy

no thinking: better clock but not current time (the prompt is confusing here though): https://imgur.com/a/kRK3Q18

themgt7mo ago

Just saw Gemini 2.5 with a little thinking: https://imgur.com/a/nypRD7x

ada19817mo ago· 1 in thread

Sonnet 4.5 did this easily https://claude.ai/public/artifacts/c1bb5d57-573b-49e0-9539-7...

fouc7mo ago

The catch was that it was limited to 2000 tokens, i.e. the results get cut off once it hits that.

S0y7mo ago· 1 in thread

To be fair, this is a deceptively hard task.

bobbylarrybobby7mo ago

Without AI assistance, this should take ~10–15 minutes for a human. Maybe add 5 minutes if you're not allowed to use d3.

adi_kurian7mo ago· 1 in thread

Think this is just prompt eng tbh. One shot Haiku 3.5 (https://claude.ai/share/66c17968-485e-4d15-974b-4f6958e1e2fd) decent looking too.

Got it to work on gpt 3.5T w modified prompt (albeit not as good - https://pastebin.com/gjEVSEcJ)

fouc7mo ago

The catch was that it was limited to 2000 tokens, i.e. the results get cut off once it hits that.

anonzzzies7mo ago· 1 in thread

Sonnet 4.5 does it flawlessly. Tried 8 times.

fouc7mo ago

The catch was that it was limited to 2000 tokens, i.e. the results get cut off once it hits that.

syx7mo ago· 1 in thread

I’m very curious about the monthly bill for such a creative project, surely some of these are pre rendered?

coffeecoders7mo ago

Napkin math:

9 AIs × 43,200 minutes = 388,800 requests/month

388,800 requests × 200 tokens = 77,760,000 tokens/month ≈ 78M tokens

Cost varies from 10 cents to $1 per 1M tokens.

Using the mid-price, the cost is around $50/month.

---

Hopefully, the OP has this endpoint protected - https://clocks.brianmoore.com/api/clocks?time=11:19AM
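The napkin math above can be reproduced with a few lines. This is only a sketch of the same arithmetic: the 200 tokens per request and the $0.10 to $1.00 price range are the figures from the comment, not measured values, and the function name is made up. Elsewhere in the thread the output cap is said to be 2000 tokens, so the real number could be up to ten times higher.

```python
MODELS = 9
MINUTES_PER_MONTH = 30 * 24 * 60   # 43,200 minutes in a 30-day month
TOKENS_PER_REQUEST = 200           # assumption taken from the comment


def monthly_cost(price_per_million: float,
                 tokens_per_request: int = TOKENS_PER_REQUEST):
    """Return (requests, tokens, dollars) for one month of per-minute clocks."""
    requests = MODELS * MINUTES_PER_MONTH
    tokens = requests * tokens_per_request
    dollars = tokens / 1_000_000 * price_per_million
    return requests, tokens, dollars


if __name__ == "__main__":
    for price in (0.10, 0.55, 1.00):   # low, midpoint, high price per 1M tokens
        requests, tokens, dollars = monthly_cost(price)
        print(f"${price:.2f}/1M tokens: {requests:,} requests, "
              f"{tokens:,} tokens, ${dollars:,.2f}/month")
```

At the $0.55 midpoint this works out to roughly $43 per month, in the same ballpark as the "around $50" estimate.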

rtcode_io7mo ago· 1 in thread

See https://clock.rt.ht/::code

AI-optimized <analog-clock>!

People expect perfection on first attempt. This took a brief joint session:

HI: define the custom element API design (attribute/property behavior) and the CSS parts

AI: draw the rest of the f… owl

speedgoose7mo ago

This is a white page, am I missing something?

kfarr7mo ago· 1 in thread

Add some voting and you got yourself an AI World Clock arena! https://artificialanalysis.ai/image/arena

BrandoElFollito7mo ago

Thank you very much.... It was a fun game until I got to the prompt

Place a baby elephant in the green chair

I cannot unsee what I saw and it is 21:30 here so I have an hour or so to eliminate the picture from my mind or I will have nightmares.

hansmayer7mo ago· 1 in thread

Very funny. It seems Qwen generates the funniest outputs :)

csours7mo ago

Oh, Qwen, buddy, you sure are TRYING

system27mo ago· 1 in thread

Ask Claude or ChatGPT to write it in Python, and you will see what they are capable of. HTML + CSS has never been the strong suit of any of these models.

camalouu7mo ago

Claude generates some JS/CSS stuff even when I don't ask for it. I think Claude itself at least believes he is good at this.

xyproto7mo ago· 1 in thread

Try adding to the prompt that it has a PhD in Computer Science and has many methods for dealing with complexity.

This gives better results, at least for me.

bigfishrunning7mo ago

abathologist7mo ago· 1 in thread

AIorNot7mo ago

zkmon7mo ago· 1 in thread

Was Claude banned from this Olympics?

giancarlostoro7mo ago

Haiku is the lightweight Claude model, I'm not sure why they picked the weaker model.

RugnirViking7mo ago· 1 in thread

What's going on with Kimi K2 being so reasonable/unique in so many of these benchmarks I've seen recently? I will have to try it out further for stuff. Is it any good at programming?

Bolwin7mo ago

Yes, it trades blows with GLM for the best open source model

bigbluedots7mo ago· 1 in thread

Is there a "draw a pelican riding a bicycle" version?

padolsey7mo ago

We've done this! https://weval.org/analysis/visual__pelican/f141a8500de7f37f/...

accrual7mo ago· 1 in thread

palmotea7mo ago

> Maybe we'll look back on these hilarious mistakes just like watching kids grow up and fumble basic tasks.

Or regret: "why didn't we stop it when we could?"

larodi7mo ago· 1 in thread

Would be great to also see the prompt this was done with

creade7mo ago

zkmon7mo ago

Why are DeepSeek and Kimi beating the other models by such a margin? Is this to do with their specialization for this task?

bongodongobob7mo ago

Weird. Sonnet 4.5 one shotted it with:

Create an interactive artifact of an analog clock face that keeps time properly.

https://claude.ai/public/artifacts/75daae76-3621-4c47-a684-d...

paxys7mo ago

mandolingual7mo ago

Always interesting/uncanny when AI is tested with human cognitive tests https://www.psychdb.com/cognitive-testing/clock-drawing-test.

busymom07mo ago

Because a new clock is generated every minute, looks like simply changing the time by a digit causes the result to be significantly different from the previous iteration.

edfletcher_t1377mo ago

Lack of Claude is a glaring oversight given how popular it is as an agentic coding model...

gwbas1c7mo ago

Reminds me of the Alzheimer's "draw a clock" test.

Makes me think that LLMs are like people with dementia! Perhaps it's the best way to relate to an LLM?

chaosprint7mo ago

https://entropytown.com/articles/2025-11-07-kimi-k2-thinking...

amelius7mo ago

Maybe they can ask Sora to make variations of:

https://slate.com/human-interest/2016/07/martin-baas-giant-r...

cornonthecobra7mo ago

I like Deepseek v3.1's idea of radially-aligning each hour number's y-axis ("1" is rotated 30° from vertical, "2" at 60°, etc.). It would be even better if the numbers were rotated anticlockwise.

I'm not sure what Qwen 2.5 is doing, but I've seen similar in contemporary art galleries.
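A small sketch of the radial layout described above, for the curious: each hour number sits on a circle and is rotated by its own angle (30° per hour). The radius, center, and the `clockwise` switch are made-up parameters for illustration, not anything taken from Deepseek's output; the switch is one possible reading of the "anticlockwise" suggestion.

```python
import math


def hour_label(n: int, radius: float = 80.0,
               center: tuple[float, float] = (100.0, 100.0),
               clockwise: bool = True) -> tuple[float, float, float]:
    """Return (x, y, rotation_degrees) for hour number n on an SVG clock face."""
    theta = (n % 12) * 30.0               # degrees clockwise from 12 o'clock
    rad = math.radians(theta)
    x = center[0] + radius * math.sin(rad)
    y = center[1] - radius * math.cos(rad)  # SVG's y axis points down
    rotation = theta if clockwise else -theta
    return round(x, 2), round(y, 2), rotation


def svg_labels(clockwise: bool = True) -> str:
    """Emit one <text> element per hour, rotated about its own position."""
    parts = []
    for n in range(1, 13):
        x, y, rot = hour_label(n, clockwise=clockwise)
        parts.append(
            f'<text x="{x}" y="{y}" text-anchor="middle" '
            f'transform="rotate({rot} {x} {y})">{n}</text>'
        )
    return "\n".join(parts)
```

In SVG a positive `rotate` angle turns clockwise, so `clockwise=True` gives the "1 at 30°, 2 at 60°" layout and `clockwise=False` flips the direction.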

wanderingmind7mo ago

buzzm7mo ago

Wonderful. I don’t particularly care if it is or is not a valid test. I like the “wrong” renderings better. Some are hilarious, some … inspired.

Bengalilol7mo ago

Qwen doesn't care about clocks, it goes the Dali way, without melting.

It even made a Nietzsche clock (I saw one <body> </body> which was surprisingly empty).

It definitely wins the creative award.

collimarco7mo ago

In any case those clocks are all extremely inaccurate, even if AI could build a decent UI (which is not the case).

ticulatedspline7mo ago

This is cool, interesting to see how consistent some models are (both in success and failure)

I tried gpt-oss-20b (my go-to local) and it looks ok though not very accurate. It decided to omit numbers. It also took 4500 tokens while thinking.

eastbound7mo ago

Security-wise, this is a website that takes the straight output of AI and serves it for execution on their website.

I know, developers do the same, but at least they check it in Git to notice their mistakes. Here is an opportunity for AI to call a Google Authentication on you, or anything else.

nasir7mo ago

where's opus/sonnet! very curious on that!

whimsicalism7mo ago

Kimi K2 is obviously the best, but gpt-5 has the most gorgeous ones when it works

Vera_Wilde7mo ago

It's really beautiful! Super clean UI.

The thing I always want from timezone tools is: “Let me simulate a date after one side has shifted but the other hasn’t.”

Humans do badly with DST offset transitions; computers do great with them.

orly017mo ago

What does it mean that each model is allowed 2000 tokens to generate its clock?

arendtio7mo ago

Pretty cool already!

I use 'Sonnet 4.5 thinking' and 'Composer 1' (Cursor) the most, so it would be interesting to see how such SOTA models perform in this task.

fschuett7mo ago

Reminds me of this: https://www.youtube.com/watch?v=OGbhJjXl9Rk

bpt37mo ago

It's wild how much the output varies for the same model for each run.

I'm not sure if this was the intent or not, but it sure highlights how unreliable LLMs are.

bigbluedots7mo ago

I just realized I'm running late, it's almost -2!

More seriously, I'd love to see how the models perform the same task with a larger token allowance.

aavshr7mo ago

just curious, why not the sonnet models? In my personal experience, Anthropic's Sonnet models are the best when it comes to things like this!

bwhiting23567mo ago

You should render it, show an image to the model and allow it to iterate. No person has to one-shot code without seeing what it looks like.

boxedemp7mo ago

That's super neat. I'll keep checking back to this site as new models are released. It's an interesting benchmark.

3oil37mo ago

I wonder which model will silently be updated and suddenly start drawing clocks with Audemars-Piguet-level kind of complications.

wewtyflakes7mo ago

It is funny to see the performance improve across many of the models, somewhat miraculously, throughout the day today.

JamesAdir7mo ago

I believe that in a day or two, the companies will address this and it would be solved by them for that use case

bitwize7mo ago

I'm reminded of the "draw a clock" test neurologists use to screen for dementia and brain damage.

maxdo7mo ago

The selection of Western models is weird: no GPT-5.1, no Opus 4.1 (which nailed it perfectly). Something I quickly tested

shahzaibmushtaq7mo ago

Interesting idea!

Why is a new clock being rendered every minute? Or are AI models evolving and improving every minute?

josfredo7mo ago

Watching these gives me a strong feeling of unease. Art-wise, it is a very beautiful project.

DeathArrow7mo ago

How can Deepseek and Kimi get it right while Haiku, Gemini and GPT are making a mess?

__fst__7mo ago

This is why we need TeraWatt DCs, to generate code for world clocks every minute.

imchillyb7mo ago

I love qwen, it tries so hard with its little paddle and never gets anywhere.

HarHarVeryFunny7mo ago

Looks like we've got a new Turing test here: "draw me a clock"

Imanari7mo ago

Qwen's clocks are hilarious

esotericwarfare7mo ago

This is an AD for Kimi K2

novemp7mo ago

Oh cool, it's the schizophrenia clock-drawing test but for AI.

Zeraous7mo ago

How Kimi is better than other BILLION$ companies is really fun

teaearlgraycold7mo ago

Qwen 2.5 doing a surprisingly good job (as of right now).

0xCE07mo ago

Seems like Will's clock drawing test in Hannibal :)

stym067mo ago

If a human had done this, these would be at a museum

AlfredBarnes7mo ago

It's cool to see them get it right... sometimes

ssl-37mo ago

This really needs to be an xscreensaver hack.

jcmontx7mo ago

Grok is impressive, I should give it a shot

lovegrenoble7mo ago

Are they part of the LLM training set?

gloosx7mo ago

anyone tried opening this from mobile? not a single clock renders correctly, almost looks like a joke on LLMs

surfingdino7mo ago

mstipetic7mo ago

GPT-5 is embarrassing itself. Kimi and DeepSeek are very consistently good. Wild that you can just download these models.

bananatron7mo ago

grok's looks like one of those clocks you'd find at a novelty shop

hollow-moe7mo ago

obviously they're all broken on firefox, no one uses firefox anyways

fnord777mo ago

whatever model Cursor uses was telling me the date was March 12, 2023

woopwoop7mo ago

The qwen clocks are art.

miohtama7mo ago

The new Turing time test

shubham_zingle7mo ago

not sure about the accuracy though, although shooting in the dark

lxe7mo ago

Honestly, I think if you track the performance of each over time, since these get regenerated once in a while, you can then have a very, very useful and cohesive benchmark.

silexia7mo ago

Grok is hilarious

baidoct7mo ago

GPT-5 looks broken

1yvino7mo ago

I wonder if the Qwen prompt would look like a hallucination?

cyberjill7mo ago

666

adriatp7mo ago

deepseek representing

shevy-java7mo ago

Now that is actually creative.