Indexing a year of video locally on a 2021 MacBook with Gemma4-31B (50GB swap) (opens in new tab)

(blog.simbastack.com)

471 pointsasenna1mo ago142 comments

142 comments

79 comments · 27 top-level

egorfine1mo ago· 12 in thread

Thanks for the article! I have a beefy M5 Pro and I'm eagerly looking around for ways to use local models (specifically Gemma4 & Qwen3.6).

This is an excellent thing to do. Especially that LLMs excel at batching thus you can index multiple photos and videos in parallel for no performance penalty.

satvikpendem1mo ago

Unsloth Studio [0] is what I recommend these days, open source alternative to the more widely known LM Studio, and also built by the people who make good quantizations of released models. With MTP support not merged in you should get 2x token generation speed with no accuracy difference. They also have MLX quants if you scroll down a bit, which is a format specifically for macOS' Metal GPU acceleration but that's not integrated into Unsloth Studio just yet.

[0] https://unsloth.ai/docs/models/qwen3.6#mtp-guide

egorfine1mo ago

I have researched for quite a bit and so far the fastest runtime is the oMLX one. But there's a caveat: ttft on MLX on M4 Pro is enormous. On M5 Pro it has been greatly sped up.

1 more reply

mft_1mo ago

I tried Unsloth Studio recently and was disappointed - in particular the downloading functionality is half-baked and didn’t cope with resuming downloads. As it seemed to just be a simple wrapper over llama.cpp, I found that huggingface hub, llama.cpp, and a couple of simple scripts actually offered better functionality once it was set up.

1 more reply

asennaOP1mo ago

Thanks! Videos is still kinda new to me. But I have a large collection of amazing photos - tens of thousands of RAW images - just lying there spread across the different trip folders.

You know what I REALLY want? Just point this beast at the folders and it tell me which 150 shots are good to process from these 1,500 images. That's the dream!

Although the technology is getting there, it's still a very difficult problem to solve. Taste and art is subjective. Also me as a photographer will always be concerned - "what if my best shot was in one of these rejected shots".

But yeah, I think I'll try to do some more of these experiments soon.

endymi0n1mo ago

there’s a lot of open models out there… I told Claude to do a weighted score on several models and deduplicate by CLIP similarity for an expedition, should be easy to replicate (see below). Sure doesn’t select the absolute best pics from an emotional impact perspective, but it was pretty damn good at me not having to wade through the bottom 80% of mediocre shots and dupes!

—-

“Models scored all 4,487 photos. NIMA rewards technical craft (sharpness, composition), LAION rewards emotional/aesthetic appeal, MUSIQ is more general quality. Combined: 0.4 NIMA + 0.3 LAION + 0.3 MUSIQ, deduped at 0.85 CLIP similarity.

Interesting: the models wildly disagreed on some shots — one photo ranked NIMA #2 globally but LAION #4313.”

1 more reply

busfahrer1mo ago

I have been contemplating a M5 Pro MBP, but for the life for me I wasn't able to find benchmarks for real-world models, do you happen to know how many tokens per second roughly you get with MoE models like Qwen 3.6 35B/A3B or Gemma 4 26B?

ahknight1mo ago

I'm not normally one to share videos as answers, but this particular fellow does a LOT of work with local AIs and Macs and happens to have a nuanced answer. https://youtu.be/XGe7ldwFLSE

embedding-shape1mo ago

You need to ask macOS people for their prefill speed as well, there are two numbers you care about here, and current MacBooks have generally terrible numbers when it comes to prefill performance. Surely it'll get better with time, but if you already have a desktop, I'd go the "beefy GPU" route first.

1 more reply

egorfine1mo ago

Qwen 3.6 35B running on oMLX 0.3.9rc1: on oMLX I get 86 t/s on Q4 and 74 t/s on Q6.

Bear in mind that ttft on MLX is much much faster on M5 Pro as compared to M4 Pro.

Also bear in mind that those figures are with NO optimizations whatsoever: no MCP, no DFlash. I am waiting for both to be released for the Qwen models.

1 more reply

egorfine1mo ago

Qwen3.6 27B oQ6: 12.5 t/s generation, 340-360 t/s pp.

egorfine1mo ago

Native MCP:

For Qwen 35B enabling native MCP on MLX models slows it down by 10%.

For Qwen 27B enabling native MCP on MLX models speeds token generation up almost exactly 1.5x.

(all tested on M5 pro).

1 more reply

juancn1mo ago

I'm running unsloth/Qwen3.6-35B-A3B-UD-Q8_K_XL on an M3 Max, 64GB at ~57 t/s with llama-server

1 more reply

desro1mo ago· 6 in thread

> The skill is open at ~/.claude/skills/video-index/. If you're working on something similar (indexing personal archives, getting a local model to do real archival work, building agents that drive editing tools), I'd be glad to compare notes.

When your Claude wrote this post they might not have selected the right URL to share, unless your home folder is exposed. Care to share the skill files?

embedding-shape1mo ago

We just got a modern example of the classic message from a friend who just picked up programming, containing: "I just created my own web app, wanna check it out? It's here: http://localhost:8080"

0x38B1mo ago

Different context, but I sent a message like that in Signal the other day to a family member with a link to my IP, pointing to `Python -m http.server` running in a directory with a file for them to try (1). Easier than having them open my Samba share.

1: To get an Android app working that has been delisted and requires a 'key' app that you purchase. We did purchase it, but didn't think to make any backups.

m4631mo ago

reminds me of telling a friend:

I hacked your system: file:///etc/passwd

1 more reply

z21mo ago

I've been getting this weekly from colleagues. It's very much an epidemic right now! And the port number is indeed almost always a random number between 8000 and 8100.

2 more replies

asennaOP1mo ago

Oops! My bad. Fixing it now. And yeah, I can share the Skill file. Give me 5 mins.

asennaOP1mo ago

Ok I scrambled to finalize a name for it and create a new repo for it - https://github.com/Simbastack-hq/framedex

PS - I just put this together in the last few mins, removed my personal files and references. So it's not tested properly, please let me know if any issues.

It's still an early hack, but I have thousands of still images as well from my camera which I've not processed and I need to do the same analysis for those.

So I'll continue working on it, but happy to receive any PRs if anyone finds any use for it.

I'm tired of having a backlog of thousands of images and videos, leaving it for later.

2 more replies

theodorewiles1mo ago· 6 in thread

My take is that B2C AI applications are kind of structurally limited by how hard it is to build personalized context.

The idea of capable local models could be a huge unlock here if they are able to do the bottom-up context collection research / tagging / etc. at scale.

michaelbuckbee1mo ago

I made a B2C AI app that's fully local (and free) to do AI based contextual file renaming.

So if you give it a bunch of screenshots it will try and intelligently name them based upon what is in the screenshot. Same for videos, PDFs, etc.

But to your point I haven't even tried charging money as it feels like something Apple is just going to bake in as a feature.

https://finalfinalreallyfinaluntitleddocumentv3.com/

asennaOP1mo ago

This is cool. And yeah love the name!

Are you planning to open source it? Or maintain it in the future?

1 more reply

ntcho1mo ago

absolutely love the domain here. great taste

asennaOP1mo ago

Definitely agree with this. Here, me and Claude brainstorming together did that Research, and some trial-and-error to get to this.

But I can tell it's only a matter of time before agents become smart enough to let my non-tech friends be able to just say "Make sense of all these videos in my folder" and it just does it.

enos_feedler1mo ago

Is it really local models that unlock this? Surely stateless model APIs would yield the same benefits? I get that local can be “cheaper” depending on usage, but we’ve been renting storage and compute from clouds at a premium for ages..

asennaOP1mo ago

A huge thing here was the massive amount of data that was just processed - I went through about 1TB of files over 24 hours.

Using API to analyze even a subset of this would've been painful imo.

1 more reply

carpo1mo ago· 4 in thread

This is great. I wish I had enough ram for a local model. I just spent the last few weeks writing something very similar, but I made it a local Electron app with Whisper, ffmpeg and I added semantic search and embeddings for chatting with the videos. It talks to Claude for the vision analysis, tagging and video chat. Do you only send one image for yours? I used a customised scene detection algorithm to find multiple different images per video and then send them all in one request to Claude (along with the subtitles). It's definitely the most expensive part. Using Sonnet 4.6 for the analysis and Haiku for the tagging costs about $1 for an hour of footage, I can imagine it would be slow locally.

nl1mo ago

Try some of the models on OpenRouter if you are looking to save money. Gemma 4 31B is $0.12/M input, $0.37/M output vs $1/M input, $5/M output for Haiku.

There are other options that are good too. Gemini 3.1 Flash Lite is great for this kind of thing (NOT Gemini 3.5 Flash though - the pricing for that is bad).

https://openrouter.ai/google/gemma-4-31b-it

carpo1mo ago

Cheers, I'll give it a try. How are those models at returning structured results? When I was writing the prompts for the analysis step and testing with older Claude models, it would have trouble structuring the XML consistently. Sonnet 4.6 handles it really well.

1 more reply

asennaOP1mo ago

Not one image - 5 frames per clip, sent in a single request with a transcript snippet. So the multi-frame + subtitles in one call part is the same as yours.

But yeah, how it picks the frame is the weak-point here. Scene detection would definitely help - this is #1 on the Roadmap.

Could you share how your scene-detection picks the frames?

---

For the vector search, I went for the trade-off of not having it but keeping it simple with plain Markdown files for more portability. The knowledge travels with the files when an SSD moves, no index to keep in sync, and plain text that outlives the tool. But the other path you mentioned is interesting as well to explore.

carpo1mo ago

I originally limited mine to 10 frames spread evenly throughout the video, but it missed a fair bit of context at the analysis step, and didn't scale with length. So now when a video is loaded the app extracts a bunch of frames for the entire video, then calculates an image histogram and compares similarity to the previous one. There's some configuration so it doesn't send too many to the LLM, but still gets a good cross-section of frames to send.

You could also just use FFmpeg as it can do scene detection too. I tested both but liked the results from the histogram analyzer more.

Yeah, markdown works well if you're going to search through it with Claude Code or something like that. I built ClipScape as an Electron app with a local SQLite database, as I wanted an interface I could search and chat in and see the relevant thumbnails.

throwa3562621mo ago· 4 in thread

I ran Gemma on a 2015 thinkpad to do something similar. Fortunately, I could upgrade the memory otherwise it would have been a painful exercise.

Not gonna lie, llama.cpp had the fans spinning at max speed. But it worked and I got the job done.

iMerNibor1mo ago

> the fans spinning at max speed

This always confuses me - don't people want their computations to run as fast as possible and thus inevitably produce more heat that needs to be vented?

I suppose sometimes it is just an analogy for "its utilizing 100% of my resources" (which I'm guessing it is here), but I've definitely had people say it as an actual complaint in different contexts

dist-epoch1mo ago

What people complain is when they visit a blog with two images and the fans are spinning at max speed because the blog has 100 trackers.

overfeed1mo ago

> I've definitely had people say it as an actual complaint in different contexts

I think fan loudness is an outgrowth of conspicuous consumption because a certain OEM decided to make it a marketing bullet-point.

I was equally disappointed by by people - especially device reviewers - banging on the drum that phones made of plastic "didn't feel premium", and we got phones with glass backs that have to be shoved into plastic cases (because plastic is the near-perfect material to protect fragile phones screens and innards)

1 more reply

0xbadcafebee1mo ago

Fans shouldn't be running at max speed if the model fits in RAM with room to spare for context. Usually fans max out when the model doesn't fit and the CPU is chugging to make up the difference (or the user didn't tune LLM settings)

1 more reply

dwa35921mo ago· 3 in thread

did you know that this existed and is pretty good and doesn't hog 50GB of swap?

https://github.com/iliashad/edit-mind

iliashad1mo ago

Awesome, thank you for mentioning my project. Much appreciated

dwa35921mo ago

When I saw the original article - i thought this should totally be possible without needing that much ram, so let me just build it. So I started looking at the models I needed for each (whisper, florence etc). Before building, i decided to do one more search on the internet, and I found your project. That's exactly what I was going to build. Good work!!

1 more reply

echion1mo ago

Thanks for that link -- deserves more attention.

andai1mo ago· 2 in thread

Awesome. Say, this is very comprehensive.

I was vaguely aware of all these pieces existing (except for running a facial recognition database at home o_o), but it's really neat to put them all together like that.

asennaOP1mo ago

Thanks! I was honestly casually trying it out on the side with Claude's help. And I was actually pleasantly surprised to see how good the result was.

Still blows my mind I can do all this from my 2021 MBP.

I'll try to do a post once I have the next steps working (helping with planning and editing videos with Davinci Resolve).

ahknight1mo ago

I also have a 64GB M1 Max and am similarly impressed with what that workhorse can do. The M5 tempted me -- a lot -- but then I looked at what I was already getting done on that machine and just couldn't justify it ... yet. Someday, surely, but not yet. Gemma4 gave all my local projects new life, just like what you did here.

Great job. Long live the M1 Max!

1 more reply

gitowiec1mo ago· 2 in thread

Reading this text feels strange, sentences seems to be detached

cataphract1mo ago

I had exactly the same impression, and I recall seeing this style other times recently. First time I thought it was just bad writing skills, now I'm thinking it's AI generated.

asennaOP1mo ago

I'm the author, yes it is AI-assisted.

You can make AI-generated content without it being slop. Slop, to me at least, is content that's wrong, padded, or generic.

I see the cadence / short-sentence issues but if there's something else beyond those, I'd actually want to know what made it feel bad.

I would've put off documenting what I did over the weekend but instead, I did document everything, spent quite some time (several iterations) and effort to make sure it does not hallucinate and writes in my own tone and voice. I'm sure it could be better but the content is not made-up.

At a time where most of us software engineers have changed our workflows to let AI write 80+% of our code using agents, I feel writing is heading the same way. It then becomes a matter of taste, whether it's done well or not.

If you're looking clues and signs for whether a content has used AI, you're going to be disappointed over the next 12 months.

If it feels jarring right now, I'll work harder on the workflow so it feels more natural next time (someone shared this project with me - https://github.com/blader/humanizer).

But this clearly allows me to make content which I wouldn't have done earlier.

2 more replies

cold_harbor1mo ago· 2 in thread

the reason 50GB swap is even viable here is Apple Silicon's memory bandwidth. on x86 that much swap would make inference unusably slow

throwawaytea1mo ago

Memory bandwidth or storage bandwidth?

bahmboo1mo ago

potato potatoh

zazibar1mo ago· 2 in thread

The subject matter is interesting but the amount of slop makes it difficult to read through. Yeah, it's great that you can throw your technical problems at Claude without caring much about the generated output but treating your own writing that you actually want to share with the world the same way is a terrible idea.

asennaOP1mo ago

Tbh, I did spend a lot of time trying to ground it and de-slopify it - verified nothing was halucinated and went through 10 iterations to get to this. It's almost like wrestling with Claude and I knew it would be tough on HN.

But because of the fear of non-perfection, I used to put away things like creating this article or even posting it anywhere. And I do think the article has real value that HN would appreciate (I am myself an HN-enthusiast).

I'll try more. Someone else shared this project which would be really helpful - https://github.com/blader/humanizer

Also a side note, the blog is posted on my self-created Slopit.io platform which is purely meant for your personal agents (working along with you) to post content - I recommend trying it out. https://blog.slopit.io/this-blog-post-is-slop/

I know, things are getting difficult with all the slop around, but my personal opinion is, as the agents get better at writing, the "annoying-ness" factor reduces and pieces of substance will still be appreciated, even if it was written by agents. This and the fact that agents aren't going away.

If I've automated a lot of my coding, I feel like engineers like me would naturally progress to also taking agents' help to write useful content.

PS - this comment was 100% hand-typed.

teach1mo ago

For what it's worth, I really enjoyed this read and almost came here to comment "this is the most enjoyable llm-assisted article I've read in a while"

The tells were unmistakable but it still had a human touch, so I for one am glad you published anyway.

1 more reply

oceanus1mo ago· 2 in thread

[flagged]

dang1mo ago

Could you please not post generated comments to HN? It's not allowed here. See https://news.ycombinator.com/newsguidelines.html#generated and https://news.ycombinator.com/item?id=47340079.

We ban accounts that do this and I don't want to ban you, so please write everything that you post to HN by hand.

Of course, it's impossible to know for sure what was LLM processed or not, but we're getting complaints about some of your posts and, upon inspection, the complaints seem justified.

genxy1mo ago

The article itself has many AI tells. Can we update the guidelines on AI generated content ?

1 more reply

Confiks1mo ago· 1 in thread

I'm not quite sure why all that swapping is necessary. I really does age your SSD quite fast considering the enormous memory bandwidth required. Gemma 4 31B at 4-bit quantization should only be around 19 GiB [1], not 28.4 GiB. I'm not feeding it images regularly, so I'm not sure how much memory it needs to get those into context, but I can't imagine it is more than 10 GiB.

The activity monitor does show all kinds of Electron apps active, on top of a presumably model-loaded Handy and a virtual machine for Claude Code, so I guess that's the real root cause for all the swapping. If your laptop starts trashing I can't imagine you have any use for those apps, which will grind to a halt.

[1] https://huggingface.co/mlx-community/gemma-4-31b-it-4bit

asennaOP1mo ago

Yeah to be fair, I could've cleaned everything up but this was taken when I was doing other work on my laptop while the screenshot was taken.

Although slightly laggy, I was impressed by the fact that I was still able to work on other things and have a bunch of tabs open on my Brave browser.

herf1mo ago· 1 in thread

Two questions:

1. What is the search index?

2. The "description.md" example has things like "faces -> cluster_id". Is this from Davinci Resolve's face index? Things like faces+names and locations are really important with photo collections, but general LLMs don't handle them so well.

asennaOP1mo ago

1) It's just simple plain-text `.description.md` sidecar files, one per clip, sitting next to each video.

Something which I can query later - Like when brainstorming with Claude "I wanna make some videos of the Luxury rooms in the lodge" and it knows what all videos could help here (going through the files).

There's also a folder root level files that aggregates the text descriptions to make it easier to find.

I've just attached an image in the blog showing an example - https://blog.simbastack.com/_media/gvcycx2n.png

2) No - nothing from DaVinci Resolve. Framedex is a standalone pipeline. Resolve isn't involved.

Faces come from insightface (the open-source buffalo_l pack - RetinaFace for detection), running locally on CPU. For each clip it detects faces in the sampled frames, embeds them, and writes rows to ~/.framedex/faces.db.

Tbh, this part I know it's building up in my local DB but I haven't tested how good is it. Will check them out properly soon.

But yeah, on your broader point that's why framedex deliberately does not ask the LLM to handle faces or locations.

----

Faces → insightface / ArcFace embeddings. Deterministic, comparable across clips. The vision model only contributes a rough people_count; it never tries to identify anyone.

Locations → EXIF GPS via exiftool, reverse-geocoded through Nominatim/OpenStreetMap. Hard metadata, not a guess.

The LLM only does what it's good at: scene description, mood, shot type, keywords, keep/review/cull rating (this last part is also debatable though).

egorfine1mo ago· 1 in thread

> generative AI video has no place on a real travel brand

I am pretty sure that the vast majority of Airbnb hosts would not agree with you.

> equals TripAdvisor crucifixion

I have no idea how the Airbnb hosts with fake listings survive, really.

asennaOP1mo ago

Haha. It's honestly something that I've been struggling with myself. I'm running this safari lodge but I don't want to go down that route of slop videos!

But on the other hand, genuine videos do take time and slows down the process.

genxy1mo ago· 1 in thread

Why did you destroy your own voice to have it replaced by AI ?

mainaisakyuhoon1mo ago

I really struggled to read the AI slop in this.

clueless1mo ago· 1 in thread

This sounds like a great capability to be added to immich

asixicle1mo ago

Or Stash lol

pavlov1mo ago· 1 in thread

The content is good, but this LLM writing style gets tiresome. Everything is a revelation:

>“I bought it for Chrome. It's running a model that didn't exist when I bought it.”

Well duh, personal computers run new software. That’s literally the whole point. The Apple II didn’t sell on the strength of the preinstalled apps.

asennaOP1mo ago

Author here. I totally hear you. I wasn't expecting this to do well on HN for exactly this reason.

But I've mentioned elsewhere - if it wasn't for all the AI-assistance, I would've put-off documenting everything that I did and not even get to the writing part.

But yeah, I'll be working on the workflow to make the next write-up better, more humanized.

brcmthrowaway1mo ago· 1 in thread

So do they run the lodge or what?

asennaOP1mo ago

Hi. I wrote this article - yes, I do run a safari lodge in Maasai Mara, Kenya. It's amazing. Ask me anything if you're interested in knowing more.

(Also email is in my profile).

asennaOP1mo ago

UPDATE: Quickly created a repo for this - https://github.com/Simbastack-hq/framedex (MIT License)

It's not tested properly after I genericized it. Will try to go through it properly and add more updates.

Two big things on my TODO: 1) Make use of this indexing and using Claude's help, make video editing faster with Davinci Resolve (now that I have a good index of all the content)

2) I currently did this for videos, but I want to add more things to this for my thousands of still images of my camera - need to make sense of them. So I'll be working on this as well.

harlanji1mo ago

Interesting. I've been doing similar stuff with my archive on a weak Celeron laptop with 4GB RAM using vanilla ML tech that I'm learning by prompting LLMs (heh). Extract all info from media as sidecar files and all, exploring low power approaches.

I can sell this as a service to people who can't even run an LLM, or don't want to cook their hardware.

Waitlist open:

"Catalog, search, preview, and generate production-ready prompts & scripts from your entire archive — on your existing hardware. Then render in the cloud."

https://harlanji.pythonanywhere.com/assetforge/

benbojangles1mo ago

Gemma4 because presumably it does image analysis right?

-31b It's a dense model

-how many tokens/s is it running at

-What temps are the M1 max GPU/CPU running at

-Is it mlx or gguf

-Why 31b and not 26b which is moe and much more efficient on the m1 max at 50tokens/s & low temps.

I personally use (MLX) qwen3.6-35b-8bit mostly, but use Gemma-4-26b-4bit for image analysis, its mind blowing how fast it is at identifying the scene in a photograph.

moinism1mo ago

> Every AI video editor on the market assumes your footage is already labeled

Shameless plug: I'm the founder of Chat Octopus, an AI media assistant, and it actually 'looks' at the videos to understand them before creating a cut.

edg50001mo ago

Love this article! Had never thought of a use case like this. Had no idea Gemma had a vision encoder. Great use case for local LLM!

ngai_aku1mo ago

I’d like to do something like this for the collection of home videos I have piling up, but I’m still on 16GB M1. Any hope of getting decent results with smaller models? If not, does anyone have tips on GPU rental?

I have a Claude max sub and plenty of OpenRouter credit, but I don’t feel good about uploading my family’s private videos

mujib771mo ago

This is sick. Nice work

coldtea1mo ago

The post is a mix of human and AI writing and the AI-mannerisms get on the nerves. At least it has a clear topic and some actionable insights and code examples.

yardie1mo ago

Now I have another project for this weekend! I also have tons of video and not a lot of time to index them.

j / k navigate · click thread line to collapse

142 comments

79 comments · 27 top-level

egorfine1mo ago· 12 in thread

Thanks for the article! I have a beefy M5 Pro and I'm eagerly looking around for ways to use local models (specifically Gemma4 & Qwen3.6).

This is an excellent thing to do. Especially that LLMs excel at batching thus you can index multiple photos and videos in parallel for no performance penalty.

satvikpendem1mo ago

[0] https://unsloth.ai/docs/models/qwen3.6#mtp-guide

egorfine1mo ago

I have researched for quite a bit and so far the fastest runtime is the oMLX one. But there's a caveat: ttft on MLX on M4 Pro is enormous. On M5 Pro it has been greatly sped up.

1 more reply

mft_1mo ago

1 more reply

asennaOP1mo ago

Thanks! Videos is still kinda new to me. But I have a large collection of amazing photos - tens of thousands of RAW images - just lying there spread across the different trip folders.

You know what I REALLY want? Just point this beast at the folders and it tell me which 150 shots are good to process from these 1,500 images. That's the dream!

But yeah, I think I'll try to do some more of these experiments soon.

endymi0n1mo ago

—-

Interesting: the models wildly disagreed on some shots — one photo ranked NIMA #2 globally but LAION #4313.”

1 more reply

busfahrer1mo ago

ahknight1mo ago

I'm not normally one to share videos as answers, but this particular fellow does a LOT of work with local AIs and Macs and happens to have a nuanced answer. https://youtu.be/XGe7ldwFLSE

embedding-shape1mo ago

1 more reply

egorfine1mo ago

Qwen 3.6 35B running on oMLX 0.3.9rc1: on oMLX I get 86 t/s on Q4 and 74 t/s on Q6.

Bear in mind that ttft on MLX is much much faster on M5 Pro as compared to M4 Pro.

Also bear in mind that those figures are with NO optimizations whatsoever: no MCP, no DFlash. I am waiting for both to be released for the Qwen models.

1 more reply

egorfine1mo ago

Qwen3.6 27B oQ6: 12.5 t/s generation, 340-360 t/s pp.

egorfine1mo ago

Native MCP:

For Qwen 35B enabling native MCP on MLX models slows it down by 10%.

For Qwen 27B enabling native MCP on MLX models speeds token generation up almost exactly 1.5x.

(all tested on M5 pro).

1 more reply

juancn1mo ago

I'm running unsloth/Qwen3.6-35B-A3B-UD-Q8_K_XL on an M3 Max, 64GB at ~57 t/s with llama-server

1 more reply

desro1mo ago· 6 in thread

When your Claude wrote this post they might not have selected the right URL to share, unless your home folder is exposed. Care to share the skill files?

embedding-shape1mo ago

We just got a modern example of the classic message from a friend who just picked up programming, containing: "I just created my own web app, wanna check it out? It's here: http://localhost:8080"

0x38B1mo ago

1: To get an Android app working that has been delisted and requires a 'key' app that you purchase. We did purchase it, but didn't think to make any backups.

m4631mo ago

reminds me of telling a friend:

I hacked your system: file:///etc/passwd

1 more reply

z21mo ago

I've been getting this weekly from colleagues. It's very much an epidemic right now! And the port number is indeed almost always a random number between 8000 and 8100.

2 more replies

asennaOP1mo ago

Oops! My bad. Fixing it now. And yeah, I can share the Skill file. Give me 5 mins.

asennaOP1mo ago

Ok I scrambled to finalize a name for it and create a new repo for it - https://github.com/Simbastack-hq/framedex

PS - I just put this together in the last few mins, removed my personal files and references. So it's not tested properly, please let me know if any issues.

It's still an early hack, but I have thousands of still images as well from my camera which I've not processed and I need to do the same analysis for those.

So I'll continue working on it, but happy to receive any PRs if anyone finds any use for it.

I'm tired of having a backlog of thousands of images and videos, leaving it for later.

2 more replies

theodorewiles1mo ago· 6 in thread

My take is that B2C AI applications are kind of structurally limited by how hard it is to build personalized context.

The idea of capable local models could be a huge unlock here if they are able to do the bottom-up context collection research / tagging / etc. at scale.

michaelbuckbee1mo ago

I made a B2C AI app that's fully local (and free) to do AI based contextual file renaming.

So if you give it a bunch of screenshots it will try and intelligently name them based upon what is in the screenshot. Same for videos, PDFs, etc.

But to your point I haven't even tried charging money as it feels like something Apple is just going to bake in as a feature.

https://finalfinalreallyfinaluntitleddocumentv3.com/

asennaOP1mo ago

This is cool. And yeah love the name!

Are you planning to open source it? Or maintain it in the future?

1 more reply

ntcho1mo ago

absolutely love the domain here. great taste

asennaOP1mo ago

Definitely agree with this. Here, me and Claude brainstorming together did that Research, and some trial-and-error to get to this.

But I can tell it's only a matter of time before agents become smart enough to let my non-tech friends be able to just say "Make sense of all these videos in my folder" and it just does it.

enos_feedler1mo ago

asennaOP1mo ago

A huge thing here was the massive amount of data that was just processed - I went through about 1TB of files over 24 hours.

Using API to analyze even a subset of this would've been painful imo.

1 more reply

carpo1mo ago· 4 in thread

nl1mo ago

Try some of the models on OpenRouter if you are looking to save money. Gemma 4 31B is $0.12/M input, $0.37/M output vs $1/M input, $5/M output for Haiku.

There are other options that are good too. Gemini 3.1 Flash Lite is great for this kind of thing (NOT Gemini 3.5 Flash though - the pricing for that is bad).

https://openrouter.ai/google/gemma-4-31b-it

carpo1mo ago

1 more reply

asennaOP1mo ago

Not one image - 5 frames per clip, sent in a single request with a transcript snippet. So the multi-frame + subtitles in one call part is the same as yours.

But yeah, how it picks the frame is the weak-point here. Scene detection would definitely help - this is #1 on the Roadmap.

Could you share how your scene-detection picks the frames?

---

carpo1mo ago

You could also just use FFmpeg as it can do scene detection too. I tested both but liked the results from the histogram analyzer more.

throwa3562621mo ago· 4 in thread

I ran Gemma on a 2015 thinkpad to do something similar. Fortunately, I could upgrade the memory otherwise it would have been a painful exercise.

Not gonna lie, llama.cpp had the fans spinning at max speed. But it worked and I got the job done.

iMerNibor1mo ago

> the fans spinning at max speed

This always confuses me - don't people want their computations to run as fast as possible and thus inevitably produce more heat that needs to be vented?

I suppose sometimes it is just an analogy for "its utilizing 100% of my resources" (which I'm guessing it is here), but I've definitely had people say it as an actual complaint in different contexts

dist-epoch1mo ago

What people complain is when they visit a blog with two images and the fans are spinning at max speed because the blog has 100 trackers.

overfeed1mo ago

> I've definitely had people say it as an actual complaint in different contexts

I think fan loudness is an outgrowth of conspicuous consumption because a certain OEM decided to make it a marketing bullet-point.

1 more reply

0xbadcafebee1mo ago

1 more reply

dwa35921mo ago· 3 in thread

did you know that this existed and is pretty good and doesn't hog 50GB of swap?

https://github.com/iliashad/edit-mind

iliashad1mo ago

Awesome, thank you for mentioning my project. Much appreciated

dwa35921mo ago

1 more reply

echion1mo ago

Thanks for that link -- deserves more attention.

andai1mo ago· 2 in thread

Awesome. Say, this is very comprehensive.

I was vaguely aware of all these pieces existing (except for running a facial recognition database at home o_o), but it's really neat to put them all together like that.

asennaOP1mo ago

Thanks! I was honestly casually trying it out on the side with Claude's help. And I was actually pleasantly surprised to see how good the result was.

Still blows my mind I can do all this from my 2021 MBP.

I'll try to do a post once I have the next steps working (helping with planning and editing videos with Davinci Resolve).

ahknight1mo ago

Great job. Long live the M1 Max!

1 more reply

gitowiec1mo ago· 2 in thread

Reading this text feels strange, sentences seems to be detached

cataphract1mo ago

I had exactly the same impression, and I recall seeing this style other times recently. First time I thought it was just bad writing skills, now I'm thinking it's AI generated.

asennaOP1mo ago

I'm the author, yes it is AI-assisted.

You can make AI-generated content without it being slop. Slop, to me at least, is content that's wrong, padded, or generic.

I see the cadence / short-sentence issues but if there's something else beyond those, I'd actually want to know what made it feel bad.

If you're looking clues and signs for whether a content has used AI, you're going to be disappointed over the next 12 months.

If it feels jarring right now, I'll work harder on the workflow so it feels more natural next time (someone shared this project with me - https://github.com/blader/humanizer).

But this clearly allows me to make content which I wouldn't have done earlier.

2 more replies

cold_harbor1mo ago· 2 in thread

the reason 50GB swap is even viable here is Apple Silicon's memory bandwidth. on x86 that much swap would make inference unusably slow

throwawaytea1mo ago

Memory bandwidth or storage bandwidth?

bahmboo1mo ago

potato potatoh

zazibar1mo ago· 2 in thread

asennaOP1mo ago

I'll try more. Someone else shared this project which would be really helpful - https://github.com/blader/humanizer

If I've automated a lot of my coding, I feel like engineers like me would naturally progress to also taking agents' help to write useful content.

PS - this comment was 100% hand-typed.

teach1mo ago

For what it's worth, I really enjoyed this read and almost came here to comment "this is the most enjoyable llm-assisted article I've read in a while"

The tells were unmistakable but it still had a human touch, so I for one am glad you published anyway.

1 more reply

oceanus1mo ago· 2 in thread

[flagged]

dang1mo ago

Could you please not post generated comments to HN? It's not allowed here. See https://news.ycombinator.com/newsguidelines.html#generated and https://news.ycombinator.com/item?id=47340079.

We ban accounts that do this and I don't want to ban you, so please write everything that you post to HN by hand.

Of course, it's impossible to know for sure what was LLM processed or not, but we're getting complaints about some of your posts and, upon inspection, the complaints seem justified.

genxy1mo ago

The article itself has many AI tells. Can we update the guidelines on AI generated content ?

1 more reply

Confiks1mo ago· 1 in thread

[1] https://huggingface.co/mlx-community/gemma-4-31b-it-4bit

asennaOP1mo ago

Yeah to be fair, I could've cleaned everything up but this was taken when I was doing other work on my laptop while the screenshot was taken.

Although slightly laggy, I was impressed by the fact that I was still able to work on other things and have a bunch of tabs open on my Brave browser.

herf1mo ago· 1 in thread

Two questions:

1. What is the search index?

asennaOP1mo ago

1) It's just simple plain-text `.description.md` sidecar files, one per clip, sitting next to each video.

There's also a folder root level files that aggregates the text descriptions to make it easier to find.

I've just attached an image in the blog showing an example - https://blog.simbastack.com/_media/gvcycx2n.png

2) No - nothing from DaVinci Resolve. Framedex is a standalone pipeline. Resolve isn't involved.

Tbh, this part I know it's building up in my local DB but I haven't tested how good is it. Will check them out properly soon.

But yeah, on your broader point that's why framedex deliberately does not ask the LLM to handle faces or locations.

----

Faces → insightface / ArcFace embeddings. Deterministic, comparable across clips. The vision model only contributes a rough people_count; it never tries to identify anyone.

Locations → EXIF GPS via exiftool, reverse-geocoded through Nominatim/OpenStreetMap. Hard metadata, not a guess.

The LLM only does what it's good at: scene description, mood, shot type, keywords, keep/review/cull rating (this last part is also debatable though).

egorfine1mo ago· 1 in thread

> generative AI video has no place on a real travel brand

I am pretty sure that the vast majority of Airbnb hosts would not agree with you.

> equals TripAdvisor crucifixion

I have no idea how the Airbnb hosts with fake listings survive, really.

asennaOP1mo ago

Haha. It's honestly something that I've been struggling with myself. I'm running this safari lodge but I don't want to go down that route of slop videos!

But on the other hand, genuine videos do take time and slows down the process.

genxy1mo ago· 1 in thread

Why did you destroy your own voice to have it replaced by AI ?

mainaisakyuhoon1mo ago

I really struggled to read the AI slop in this.

clueless1mo ago· 1 in thread

This sounds like a great capability to be added to immich

asixicle1mo ago

Or Stash lol

pavlov1mo ago· 1 in thread

The content is good, but this LLM writing style gets tiresome. Everything is a revelation:

>“I bought it for Chrome. It's running a model that didn't exist when I bought it.”

Well duh, personal computers run new software. That’s literally the whole point. The Apple II didn’t sell on the strength of the preinstalled apps.

asennaOP1mo ago

Author here. I totally hear you. I wasn't expecting this to do well on HN for exactly this reason.

But I've mentioned elsewhere - if it wasn't for all the AI-assistance, I would've put-off documenting everything that I did and not even get to the writing part.

But yeah, I'll be working on the workflow to make the next write-up better, more humanized.

brcmthrowaway1mo ago· 1 in thread

So do they run the lodge or what?

asennaOP1mo ago

Hi. I wrote this article - yes, I do run a safari lodge in Maasai Mara, Kenya. It's amazing. Ask me anything if you're interested in knowing more.

(Also email is in my profile).

asennaOP1mo ago

UPDATE: Quickly created a repo for this - https://github.com/Simbastack-hq/framedex (MIT License)

It's not tested properly after I genericized it. Will try to go through it properly and add more updates.

Two big things on my TODO: 1) Make use of this indexing and using Claude's help, make video editing faster with Davinci Resolve (now that I have a good index of all the content)

2) I currently did this for videos, but I want to add more things to this for my thousands of still images of my camera - need to make sense of them. So I'll be working on this as well.

harlanji1mo ago

I can sell this as a service to people who can't even run an LLM, or don't want to cook their hardware.

Waitlist open:

"Catalog, search, preview, and generate production-ready prompts & scripts from your entire archive — on your existing hardware. Then render in the cloud."

https://harlanji.pythonanywhere.com/assetforge/

benbojangles1mo ago

Gemma4 because presumably it does image analysis right?

-31b It's a dense model

-how many tokens/s is it running at

-What temps are the M1 max GPU/CPU running at

-Is it mlx or gguf

-Why 31b and not 26b which is moe and much more efficient on the m1 max at 50tokens/s & low temps.

I personally use (MLX) qwen3.6-35b-8bit mostly, but use Gemma-4-26b-4bit for image analysis, its mind blowing how fast it is at identifying the scene in a photograph.

moinism1mo ago

> Every AI video editor on the market assumes your footage is already labeled

Shameless plug: I'm the founder of Chat Octopus, an AI media assistant, and it actually 'looks' at the videos to understand them before creating a cut.

edg50001mo ago

Love this article! Had never thought of a use case like this. Had no idea Gemma had a vision encoder. Great use case for local LLM!

ngai_aku1mo ago

I have a Claude max sub and plenty of OpenRouter credit, but I don’t feel good about uploading my family’s private videos

mujib771mo ago

This is sick. Nice work

coldtea1mo ago

The post is a mix of human and AI writing and the AI-mannerisms get on the nerves. At least it has a clear topic and some actionable insights and code examples.

yardie1mo ago

Now I have another project for this weekend! I also have tons of video and not a lot of time to index them.

j / k navigate · click thread line to collapse