I tried the Google Video Intelligence API and got a $400 bill for analyzing just 4 videos (4K, 5 minutes on average), not including transcription. I used my GCP startup credits to cover it.
So I decided to build my own tool, with three requirements: it has to transcribe videos, analyze video frames, and do everything locally.
I don't want to store my videos in the cloud, for two reasons: privacy and storage cost.
I've been working on it for the last couple of months. There's a source-available version that's free to use (personal use, and commercial use at companies with fewer than 5 people). It's available here (https://github.com/IliasHad/edit-mind), and the project has 1.3k GitHub stars.
Now, I'm building a desktop app with direct NLE integration (Final Cut Pro, DaVinci Resolve, and Adobe Premiere Pro). It includes an editing agent that understands your footage and your editing style. (https://edit-mind.com)
Preview: https://youtu.be/jcctyfVg_34
Happy to answer questions and hear your feedback.
I decided to build Edit Mind. It started as a simple CLI that transcribed videos with OpenAI Whisper and let me search across them by text, nothing fancy.
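For the curious, that first version boiled down to something like this (a minimal sketch, not the repo's actual code; the file name and search term are placeholders):

```python
# Minimal sketch of the original CLI idea: transcribe with Whisper,
# then text-search the timestamped segments.
import whisper

model = whisper.load_model("base")       # small models run fine on CPU
result = model.transcribe("clip.mp4")    # Whisper extracts the audio via ffmpeg

# Each segment carries start/end timestamps, so a text hit maps
# straight back to a moment in the video.
for seg in result["segments"]:
    if "pizza" in seg["text"].lower():
        print(f'{seg["start"]:.1f}s-{seg["end"]:.1f}s: {seg["text"]}')
```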
Then I added frame analysis: a Python script takes the full video, splits it into segments 1 to 2 seconds long, and passes each frame to analyzers that recognize faces, objects, on-screen text, and so on.
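The sampling step looks roughly like this (a sketch assuming OpenCV; the real script's interval and plumbing may differ):

```python
# Grab one frame every ~1-2 seconds and hand it to the analyzers.
import cv2

def sample_frames(path, every_seconds=1.0):
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(1, int(fps * every_seconds))
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            yield idx / fps, frame   # (timestamp in seconds, BGR image)
        idx += 1
    cap.release()

for ts, frame in sample_frames("clip.mp4"):
    ...  # run the face / object / OCR analyzers on `frame` here
```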
After that, I built an Electron desktop app to manage the UI, with search and chat features.
Then I figured: let's open-source it and share it with the community on Reddit. People loved it (https://www.reddit.com/r/selfhosted/comments/1ogis3j/i_built...). Many of them requested Docker support, so I shifted my focus to that, which turned out to be a great suggestion. (https://www.youtube.com/watch?v=YrVaJ33qmtg&t=12s)
Now there are 3 Docker containers: one for the web UI; one for background jobs, media handling, and the local vector database; and one for the ML service (transcription and frame analysis), a Python process communicating via WebSocket.
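The wire format below is illustrative, not the project's actual protocol, but it shows the shape of the ML container: a long-lived WebSocket server (here using the `websockets` package) that accepts JSON jobs from the other containers and sends results back:

```python
import asyncio
import json
import websockets

async def handle(ws):
    async for message in ws:
        job = json.loads(message)   # e.g. {"type": "transcribe", "path": "..."}
        if job["type"] == "transcribe":
            result = {"status": "done", "text": "..."}   # call Whisper here
        elif job["type"] == "analyze_frames":
            result = {"status": "done", "scenes": []}    # run YOLO / faces / FER here
        else:
            result = {"status": "error", "reason": "unknown job type"}
        await ws.send(json.dumps(result))

async def main():
    async with websockets.serve(handle, "0.0.0.0", 8765):
        await asyncio.Future()   # run until the container stops

asyncio.run(main())
```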
After I posted on X and tagged Twelve Labs, who inspired several new features and UI enhancements, I had the opportunity to present the project live on their webinar series (if you want to see Edit Mind in a live demo: https://www.youtube.com/watch?v=k_aesDa3sFw&t=1271s).
Now that the proof of concept covers all the features I was hoping for, it's time to focus on code quality: refactoring and implementing best practices. Keep in mind that I'm working solo, with the help of external contributors.
I would love to get your feedback about the project.
I have 2TB+ of personal video footage, and finding specific moments was impossible. I wanted something like searching PDFs by content, but for video. Google's Video Intelligence API worked, but it cost me $450+ for just a few videos, and I'd have had to upload all my personal footage to their cloud.
So I built Edit Mind to do it locally.
The core problem:
You can search PDFs by their content in Finder (macOS). Why can't we do the same with videos? Every cloud solution either costs a fortune at scale or requires uploading your personal footage, and I don't want my raw videos sitting in someone else's cloud.
How it works:
- Everything runs on your machine (your raw videos never leave your computer)
- Indexes videos once: transcribes audio, detects objects (YOLO), recognizes faces, analyzes emotions
- Stores metadata in ChromaDB, a local vector database (see the sketch below)
- Natural language gets parsed into structured queries, then semantic search finds matches (uses the Gemini API currently, but you can swap in a local LLM via Ollama if you prefer)
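To make the ChromaDB part concrete, here's a sketch of what indexing one analyzed scene can look like (the field names are illustrative, not the project's actual schema):

```python
import chromadb

client = chromadb.PersistentClient(path="./index")   # on-disk, fully local
scenes = client.get_or_create_collection("scenes")

# One document per analyzed scene: the text is embedded for semantic
# search, and the analyzer outputs go into filterable metadata.
scenes.add(
    ids=["clip.mp4#00:42"],
    documents=["Ilias eating a pizza, smiling"],
    metadatas=[{
        "video": "clip.mp4",
        "start": 42.0,
        "faces": "Ilias",
        "emotions": "happy",
        "objects": "pizza,table",
    }],
)
```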
What it actually does:
1. Type: "scenes where @Ilias is looking happy, eating a pizza"
2. Behind the scenes, it converts this to: {"faces": ["Ilias"], "emotions": ["happy"], "objects": ["pizza"]}
3. Then it searches your local vector database and finds matching scenes across 2TB of footage in seconds.
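Continuing the illustrative schema from the indexing sketch above, step 3 is roughly a filtered vector query:

```python
import chromadb

client = chromadb.PersistentClient(path="./index")
scenes = client.get_or_create_collection("scenes")

# Output of step 2 (the LLM's structured parse of the query)
parsed = {"faces": ["Ilias"], "emotions": ["happy"], "objects": ["pizza"]}

hits = scenes.query(
    query_texts=["Ilias looking happy, eating a pizza"],   # semantic match
    where={"$and": [                                       # hard metadata filters
        {"faces": {"$eq": parsed["faces"][0]}},
        {"emotions": {"$eq": parsed["emotions"][0]}},
    ]},
    n_results=10,
)
print(hits["ids"][0])   # scene ids such as "clip.mp4#00:42"
```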
Real cost comparison:
* GCP Video Intelligence API: ~$0.10/minute of video (https://cloud.google.com/video-intelligence/pricing) = $100+ for 200 five-minute videos. I have over 3,000 videos, so analyzing all of them would cost $1,500+.
* Edit Mind: free after the initial setup; it runs on your own hardware (which needs to be reasonably powerful for the heavy video processing).
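For concreteness, the arithmetic behind those numbers (assuming the ~$0.10/minute list price and 5-minute average clips):

```python
PRICE_PER_MINUTE = 0.10   # USD, from the GCP pricing page linked above
AVG_MINUTES = 5           # average clip length in my library

for n_videos in (200, 3000):
    cost = n_videos * AVG_MINUTES * PRICE_PER_MINUTE
    print(f"{n_videos} videos -> ${cost:,.0f}")
# 200 videos -> $100
# 3000 videos -> $1,500
```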
Technical choices:
Built with Electron because I needed real filesystem access and didn't want browser storage limits. A Python backend handles the heavy ML work (face_recognition, YOLOv8, FER for emotions) and communicates via WebSockets. The analysis pipeline is plugin-based; it took me less than a day to add dominant-color detection as a separate plugin.
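The plugin interface amounts to something like this (names are mine for illustration, not the actual classes): each analyzer implements one method, and the pipeline just iterates over whatever is registered:

```python
from abc import ABC, abstractmethod

class FrameAnalyzer(ABC):
    """One analyzer = one plugin; each returns metadata for a frame."""

    @abstractmethod
    def analyze(self, frame) -> dict:
        ...

class DominantColorAnalyzer(FrameAnalyzer):
    def analyze(self, frame) -> dict:
        # `frame` is an HxWx3 BGR array; the mean is a crude stand-in
        # for a real dominant-color computation.
        b, g, r = frame.mean(axis=(0, 1))
        return {"dominant_color": f"#{int(r):02x}{int(g):02x}{int(b):02x}"}

PLUGINS = [DominantColorAnalyzer()]

def analyze_frame(frame) -> dict:
    metadata = {}
    for plugin in PLUGINS:
        metadata.update(plugin.analyze(frame))
    return metadata
```

Adding a new analyzer means writing one class and appending it to the registry; the indexing code never changes.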
Current limitations:
* Needs decent hardware (GPU recommended but not required)
* Face recognition requires some manual training (adding known faces)
* UI is functional but could be prettier
* Query parsing uses the Gemini API by default (but you can configure it to use local alternatives)
Why I'm sharing this:
I can't be the only person with this problem. Videographers, parents with years of family footage, documentary filmmakers, anyone with large video libraries. The code is MIT licensed and designed to be extended.
Would love feedback on:
1. Is $450+ for cloud video analysis a common pain point, or am I an outlier?
2. What other analyzers would be useful? (I'm thinking about camera-movement analysis and scene-type classification like POV/vlog/interview.)
3. Should I prioritize making it 100% offline by default, or is the Gemini API option fine for most use cases?
GitHub: https://github.com/iliashad/edit-mind
Demo: https://youtu.be/Ky9v85Mk6aY
Built this over a few weekends because I was tired of paying Google to search my own videos. Happy to discuss architecture decisions or the ML pipeline!
Some key takeaways from our conversation include:
Course Creation: Wes emphasized the importance of project-based learning, sharing how his teaching style got people to buy his online courses.
AI Integration: We discussed how developers can leverage AI tools to assist their work, and how fast those tools are evolving.
Syntax.fm: The backstory of how Wes and Scott built the Syntax.fm podcast.
I'm curious to hear your thoughts: How have you approached creating educational content for developers? What challenges and successes have you experienced?
For those interested, here's the full conversation: https://youtu.be/wqKk4TsVY8M