Are you drowning in AI code review noise? 70% of AI PR comments are useless
https://jetxu-llm.github.io/posts/low-noise-code-review/
Most AI code reviewers only see the diff. They miss the real risks: a function rename that breaks 5 other files, a dependency change that shifts your architecture, a "small" PR that actually rewrites your auth logic.
We built a context retrieval engine that pulls in related code from across the repo before analysis. On complex PRs, it auto-generates Mermaid diagrams showing cross-file impacts.
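To make the diagram step concrete: once cross-file impact edges are known, rendering them as Mermaid source is mechanical. This is a toy sketch, not the actual LlamaPReview code; the `impacts_to_mermaid` name and the (changed file, impacted file) edge format are assumptions:

```python
def impacts_to_mermaid(edges):
    """Render (changed_file, impacted_file) pairs as Mermaid flowchart source.

    Node ids must be Mermaid-safe, so path separators and dots are
    replaced with underscores; the original path is kept as the label.
    """
    def node_id(path):
        return path.replace("/", "_").replace(".", "_")

    lines = ["graph LR"]
    for src, dst in edges:
        lines.append(f'    {node_id(src)}["{src}"] --> {node_id(dst)}["{dst}"]')
    return "\n".join(lines)
```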
Technical challenges:
1. Deciding which files are "related" without analyzing the entire repo every time (we use code graphs + call chains + git history)
2. Fitting it into LLM context limits (we rank by relevance and truncate aggressively)
3. Auto-detecting when to trigger deep analysis vs. fast review (~35% of PRs end up needing it)
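Challenge 2 above (fitting retrieved context into the model's window) can be sketched as a greedy budgeted pack: keep the highest-relevance snippets that fit, skip the rest. A minimal illustration under stated assumptions (the 4-characters-per-token estimate and all names are mine, not the production implementation):

```python
def pack_context(candidates, token_budget, estimate_tokens=lambda s: len(s) // 4):
    """Greedily keep the highest-scoring snippets that fit the token budget.

    `candidates` is a list of (path, snippet, relevance_score) tuples.
    Returns the packed (path, snippet) pairs in descending relevance order.
    """
    packed, used = [], 0
    for path, snippet, score in sorted(candidates, key=lambda c: c[2], reverse=True):
        cost = estimate_tokens(snippet)
        if used + cost > token_budget:
            continue  # truncate aggressively: drop anything that would overflow
        packed.append((path, snippet))
        used += cost
    return packed
```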
It's live now, free for all public repos. For private repos, we're trying to figure out a sustainable model: first 3 PRs get deep analysis free, then you choose between our always-free basic tier or a paid tier with persistent code knowledge graphs.
The controversial part: to do deep analysis well, we need to build a knowledge graph of your codebase (classes, methods, call chains). For private repos, this means storing structural metadata. Some teams are fine with this. Others want zero-knowledge architecture where even metadata is encrypted client-side.
Questions for HN:
1. What signals would YOU use to detect "this PR is complex enough to need deep analysis"?
2. Would you pay for code review tooling, and if so, what's your threshold?
3. Is storing structural metadata (no code content) acceptable, or is zero-knowledge storage the only way?
Site: https://jetxu-llm.github.io/LlamaPReview-site/
I'm here for the next few hours to answer technical questions or get roasted for my monetization strategy.
Test setup:
Models tested: Mistral-Large-2411, Gemini 2.0 Flash (thinking-exp-1219), Mistral-Nemo-12B, DeepSeek V3, and a ReAct agent implementation (also based on Mistral-Nemo-12B)
Consistent testing environment: Same prompts, temperature settings, and max_tokens (8192 for DeepSeek)
Test data: Real-world PRs from various open-source projects (all in English)
Evaluation: Results were assessed by Claude 3.5 Sonnet V2 for consistency
Results (in order of performance):
1. Gemini 2.0 Flash (thinking-exp-1219) - the deepest code review results, but its output didn't follow the required format as reliably as the Mistral models
2. Mistral-Large-2411
3. Mistral-Nemo-12B + ReAct AI Agent
4. DeepSeek V3
5. Mistral-Nemo-12B
Key findings:
- Despite recent marketing claims, DeepSeek V3 only marginally outperformed a 12B model released in July
- The price-performance ratio is concerning, especially after their February 8th pricing changes
- Larger parameter count (671B) didn't translate to better PR review quality
For transparency: I developed LlamaPReview (<https://jetxu-llm.github.io/LlamaPReview-site/>), a GitHub App for automated PR reviews, whose core code I used as the testing framework. The app is free and can help you reproduce these review results.

Questions for the community:
1. Has anyone else noticed similar performance gaps with DeepSeek V3?
2. What metrics should we standardize for comparing LLM performance in specific tasks like code review?
3. How much should marketing claims influence our technical evaluations?
Would love to hear your experiences and thoughts, especially from those who've tested multiple models in production environments.

--------------------------------

Here's the issue: DynamoDB Triggers can only point to the `$LATEST` version of a Lambda function. Yes, you read that right: there's no built-in way to target a specific version or alias through the console. This means any change to your Lambda function's `$LATEST` version immediately affects your production triggers, whether you intended it or not.
Consider this scenario:
1. You have a critical DynamoDB table with a Lambda trigger handling important business logic
2. A developer pushes changes to the Lambda's `$LATEST` version for testing
3. Surprise! Those changes are now processing your production data
The workarounds are all suboptimal:
- Create triggers through CloudFormation/CDK (requires delete and recreate)
- Maintain separate tables for different environments
- Add environment checks in your Lambda code
- Use the Lambda console to configure triggers (unintuitive and error-prone)
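On the CloudFormation/CDK route: the underlying API does accept a qualified function, since `create_event_source_mapping` takes `FunctionName` as a `name:alias` form or alias ARN, which lets the mapping pin an alias instead of `$LATEST`. A hedged boto3 sketch (the function and alias names are illustrative; boto3 is imported lazily so the pure helper has no AWS dependency):

```python
def qualified_name(function_name, alias):
    """Build the name:alias form the Lambda API accepts for FunctionName."""
    return f"{function_name}:{alias}"

def attach_stream_trigger(stream_arn, function_name, alias="prod"):
    """Create a DynamoDB-stream trigger pinned to an alias rather than $LATEST."""
    import boto3  # imported here so qualified_name stays usable without AWS

    client = boto3.client("lambda")
    return client.create_event_source_mapping(
        EventSourceArn=stream_arn,
        FunctionName=qualified_name(function_name, alias),
        StartingPosition="LATEST",  # stream read position; unrelated to $LATEST versioning
        BatchSize=100,
    )
```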
This design choice seems to violate several fundamental principles:
- Separation of concerns
- Safe deployment practices
- The principle of least surprise
- AWS's own best practices for production workloads
What's particularly puzzling is that other AWS services (API Gateway, EventBridge, etc.) handle versioning and aliases perfectly well. Why is DynamoDB different?
Some questions for the community:
1. Has anyone else encountered production issues because of this?
2. What workarounds have you found effective?
3. Is there a technical limitation I'm missing that explains this design choice?
4. Should we push AWS to change this behavior?
For now, my team has implemented a multi-layer safety net:

```python
def lambda_handler(event, context):
    # Layer 1: refuse to run business logic from a non-production version
    if not is_production_alias():
        log_and_alert("Non-production version processing production data!")
        return

    # Layer 2: sanity-check the deployment state before touching data
    if not validate_deployment_state():
        return

    # Actual business logic here
    ...
```

But this feels like we're working around a problem that shouldn't exist in the first place.
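One way the `is_production_alias()` guard could be implemented (an assumption on my part, not necessarily the author's version): when a caller invokes a qualified ARN, the alias is the eighth colon-separated segment of `context.invoked_function_arn`; an unqualified invocation, which is what a `$LATEST` trigger produces, has no such segment:

```python
def alias_from_arn(invoked_function_arn):
    """Return the alias/version qualifier from an invoked Lambda ARN,
    or None when the unqualified ($LATEST) function was invoked."""
    # Qualified:   arn:aws:lambda:us-east-1:123456789012:function:my-fn:prod
    # Unqualified: arn:aws:lambda:us-east-1:123456789012:function:my-fn
    parts = invoked_function_arn.split(":")
    return parts[7] if len(parts) > 7 else None

def is_production_alias(context, expected="prod"):
    """True only when the invocation came through the expected alias."""
    return alias_from_arn(context.invoked_function_arn) == expected
```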
Curious to hear others' experiences and thoughts on this. Have you encountered similar "gotchas" in AWS services that seem to go against cloud deployment best practices?
Key challenges we usually face:
- Are we wasting senior developers' time? How do we balance thorough reviews with efficient use of expertise?
- Ensuring consistent review quality across large teams and time zones
- Conducting comprehensive reviews in small teams without slowing development
- Focusing on the right aspects: architecture, logic, or style
- Using PR reviews for knowledge sharing and mentorship effectively
--------------------------------
I've been experimenting with AI-assisted PR reviews using Graph RAG for context understanding. The idea was to build a knowledge graph of our codebase so the LLM could understand code relationships and dependencies. Early results show promise in reducing initial review time and catching consistent patterns, but it raises new questions about the balance between AI assistance and human expertise.
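To make the "knowledge graph of the codebase" idea concrete, here's a toy sketch of extracting the kind of structural metadata involved (function definitions and call edges, no code bodies) with Python's `ast` module. The real Graph RAG pipeline is assumed to be far more elaborate; `extract_call_graph` is an illustrative name:

```python
import ast

def extract_call_graph(source, module="mod"):
    """Map each function defined in `source` to the names it calls.

    Only direct-name calls (e.g. `b()`) are captured; attribute calls
    like `obj.b()` are ignored in this simplified version.
    """
    tree = ast.parse(source)
    graph = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            calls = {
                c.func.id
                for c in ast.walk(node)
                if isinstance(c, ast.Call) and isinstance(c.func, ast.Name)
            }
            graph[f"{module}.{node.name}"] = sorted(calls)
    return graph
```

Feeding such edges into retrieval is what lets a review of one file pull in the functions it actually touches elsewhere.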
I'd love to hear your thoughts:
1. What's your team's current review process? What works well, and what doesn't?
2. How do you measure the effectiveness of your review process?
3. What kind of support do you think an ideal AI PR review tool should provide? Comprehensive analysis, focused insights, or specialized in certain aspects?
If you're curious about AI-powered PR review tools, I've published one called LlamaPReview that we've been using. It's still in beta, but feel free to check it out: https://github.com/marketplace/llamapreview/

Let's discuss how we can make code reviews more effective, efficient, and valuable for our teams!