Also co-founded Hashnode (a prominent dev publishing platform with 5M MAU).
1. ChatGPT desktop. This is the one I probably use the most without even thinking about it.
2. Gemini. I used 2.5 a lot for content stuff and recently started trying Gemini 3. The results are actually getting better now.
3. Codex in VS Code. Switched from Copilot about three months ago. Codex feels way better for me. The cloud task execution is wild: it can build an entire Next.js feature in one prompt.
4. ChatGPT on mobile. Still my go-to for quick grammar or English fixes. I keep forgetting Apple’s AI even exists.
5. Superhuman's AI. I use it here and there to clean up replies. Nice to have but not something I’d miss if I stopped paying.
What about you?
Teams are smaller, release cycles are tighter, and AI is sneaking into a lot of workflows.
I’m curious what people here are actually relying on these days to keep things from breaking:
- What layers are in your stack? (types/linters, unit, contract, integration, E2E, monitoring, flags, SLOs, etc.)
- Is AI playing a real role for you yet (test gen, self-healing, triage, anomaly detection)?
- Anything you dropped recently because it wasn’t worth the effort? (flaky UI tests, snapshot tests, staging envs…)
- For smaller teams, do you still bother with classic QA, or do you lean more on flags/observability/canaries?
- Anyone tried managed or AI-assisted QA instead of DIY? Curious if it actually worked, esp. around trust/cost/lock-in.
- How do you measure “confidence to release” beyond code coverage?
Would love to hear quick snapshots like:
- team size / release cadence
- stack (web, mobile, regulated or not)
- pre-merge checks
- post-deploy safeguards
- tools you kept vs abandoned
- biggest source of flakiness right now
- what you’d do differently if starting today
Looking for real, on-the-ground stories from folks shipping in 2025. What’s working for you?
I built Mail42 (https://mail42.ai) to make it easier to test email flows without dealing with regex or messy parsing.
The pain point: whenever I tested signup or checkout flows, I had to create throwaway inboxes and then write hacky regex just to grab a 6-digit OTP or a verification link.
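For context, the regex approach described here usually looks something like the sketch below. The email body, the patterns, and the helper names are all made up for illustration; real emails vary enough that patterns like these tend to need per-template tweaks.

```python
import re

# Hypothetical email body, used only for illustration.
SAMPLE_BODY = """Hi there,
Your verification code is 847291. It expires in 10 minutes.
Or confirm directly: https://example.com/verify?token=abc123
"""

# A standalone 6-digit number; breaks if the email contains another one
# (an order number, a date without separators, etc.).
OTP_RE = re.compile(r"\b(\d{6})\b")

# The first http(s) URL in the body; breaks if a logo or footer link comes first.
LINK_RE = re.compile(r"https?://\S+")


def extract_otp(body):
    """Return the first 6-digit code in the body, or None."""
    match = OTP_RE.search(body)
    return match.group(1) if match else None


def extract_link(body):
    """Return the first URL in the body, minus trailing punctuation, or None."""
    match = LINK_RE.search(body)
    return match.group(0).rstrip(".,)") if match else None


if __name__ == "__main__":
    print(extract_otp(SAMPLE_BODY))
    print(extract_link(SAMPLE_BODY))
```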
With Mail42 you can:
- Generate disposable email addresses instantly for QA
- Query emails using natural language like "get the OTP" or "find the verification link"
- Use a simple REST API that works with curl, Postman, or your test suite
Example:
`curl "https://get.mail42.ai/?email=test.123@mail42.ai&prompt=get%20otp"`
Response:
847291
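If you'd rather call it from a test suite than from curl, a rough Python equivalent could look like this. It assumes only what the curl example shows: a GET request with `email` and `prompt` query parameters and a plain-text response body. The function names are mine, and the timeout and error handling are guesses you'd want to adjust.

```python
import urllib.parse
import urllib.request

# Endpoint taken from the curl example; everything else here is an assumption.
BASE_URL = "https://get.mail42.ai/"


def build_query_url(email, prompt):
    """Build the request URL with percent-encoded parameters."""
    query = urllib.parse.urlencode(
        {"email": email, "prompt": prompt},
        quote_via=urllib.parse.quote,  # encode spaces as %20 instead of +
    )
    return f"{BASE_URL}?{query}"


def query_inbox(email, prompt, timeout=10.0):
    """Send the query and return the response body as stripped text.

    Assumes the service answers with plain text, as in the example response.
    """
    url = build_query_url(email, prompt)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8").strip()


if __name__ == "__main__":
    # Performs a real network call.
    print(query_inbox("test.123@mail42.ai", "get otp"))
```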
It’s lightweight and intended for test environments.
I’d love your feedback:
- would you use this in your testing workflow?
- what features or integrations would make it more useful?
- are there cases where regex still feels more reliable?
Thanks for checking it out.
What we've evaluated:
- OpenAI's Evals framework: Works well for benchmarking but challenging for custom use cases. Configuration through YAML files can be complex and extending functionality requires diving deep into their codebase. Primarily designed for batch processing rather than real-time monitoring.
- LangSmith: Strong tracing capabilities but eval features feel secondary to their observability focus. Pricing starts at $0.50 per 1k traces after the free tier, which adds up quickly with high volume. UI can be slow with larger datasets.
- Weights & Biases: Powerful platform but designed primarily for traditional ML experiment tracking. Setup is complex and requires significant ML expertise. Our product team struggles to use it effectively.
- Humanloop: Clean interface focused on prompt versioning with basic evaluation capabilities. Limited eval types available and pricing is steep for the feature set.
- Braintrust: Interesting approach to evaluation but feels like an early-stage product. Documentation is sparse and integration options are limited.
What we actually need:
- Real-time eval monitoring (not just batch)
- Custom eval functions that don't require PhD-level setup
- Human-in-the-loop workflows for subjective tasks
- Cost tracking per model/prompt
- Integration with our existing observability stack
- Something our product team can actually use
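To make the ask concrete, here is a rough sketch of the shape we have in mind: evals as plain functions, run on every call as it happens, with cost rolled up per model/prompt and failures queued for a human. All names here (`Call`, `register_eval`, `Monitor`) are hypothetical and not taken from any of the tools above, and the two checks are placeholders.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Call:
    """One LLM call as seen by the monitor (hypothetical shape)."""
    model: str
    prompt_id: str
    output: str
    cost_usd: float


# Registry of eval functions: name -> function(Call) -> bool (True means pass).
EVALS = {}


def register_eval(name):
    """Decorator that adds a plain function to the eval registry."""
    def decorator(fn):
        EVALS[name] = fn
        return fn
    return decorator


@register_eval("non_empty")
def non_empty(call):
    return bool(call.output.strip())


@register_eval("under_500_chars")
def under_500_chars(call):
    return len(call.output) <= 500


class Monitor:
    """Runs every registered eval per call and tracks cost per (model, prompt)."""

    def __init__(self):
        self.cost = defaultdict(float)
        self.failures = []  # candidates for a human review queue

    def record(self, call):
        self.cost[(call.model, call.prompt_id)] += call.cost_usd
        failed = [name for name, fn in EVALS.items() if not fn(call)]
        if failed:
            self.failures.append((call, failed))
        return failed
```

In practice `record` would be called from wherever the LLM response comes back, and `failures` would feed whatever review workflow the product team already uses.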
Current solution:
Custom scripts + monitoring dashboards for basic metrics. Weekly manual reviews in spreadsheets. It works but doesn't scale and we miss edge cases.
Has anyone found tools that handle production LLM evaluation well? Are we expecting too much or is the tooling genuinely immature? Especially interested in hearing from teams without dedicated ML engineers.
I'll start. Jira. We all use it, we all hate it, nobody admits how much time we waste updating tickets.
Did you move it to the right column? Story points aren't filled out. Link it to the epic.
Meanwhile the actual work takes 2 hours, documenting it takes another hour.
Half the team ignores it, the other half are obsessed with workflows that have 47 different statuses. But try suggesting GitHub issues and suddenly "how will we track velocity??"
What tool is supposed to make you productive but just creates busywork?
Current stack:
- Next.js on Vercel
- Serverless functions for AI/LLM endpoints
- Pinecone for vector storage
Questions for those running AI in production:
1. What's your serverless infrastructure choice? (Vercel/Cloud Run/Lambda)
2. How are you handling state management for long-running agent tasks?
3. What's your approach to cost optimization with LLM API calls?
4. Are you self-hosting any components?
5. How are you handling vector store scaling?
Particularly interested in hearing from teams who've scaled beyond prototype stage. Have you hit any unexpected limitations with serverless for AI workloads?
I’m not sure if this post will stick (we’re competing with some YC-backed startups in this space), but I’ll give it a try!
We built Docs by Hashnode because we saw a gap in current documentation tools. Many platforms are too rigid or lack customization (like ReadMe), or require too much dev time to manage (like Docusaurus). With our docs product, we wanted to create something that’s both flexible and scalable, allowing teams to focus on creating docs without the complexity.
Here’s what we’re solving:
1. Most doc platforms don’t scale well or offer customization.
2. Teams often struggle with rigid templates and version control.
3. Companies need API references and product guides that grow with their product, not against it.
Our solution:
1. Offer both Hosted and Headless mode with GraphQL support for those who need more control.
2. Real-time collaboration with inline commenting, perfect for technical and non-technical teams to work together.
3. AI-powered search for faster, smarter discovery of docs content.
4. Unlimited API references and guides to help you scale your docs as your product evolves.
5. Create API references easily using OpenAPI specs.
6. Blazing-fast performance optimized for SEO and Lighthouse scores.
Our goal is to build documentation that evolves with your product, not something that slows you down. Some early users, including YC startups, are already using it and love the flexibility.
Would love to get your feedback or answer any questions!
Thanks for reading! Excited to hear what the community here thinks.