randomwalker on Hacker News

1

Did Google's AI agents build an operating system for $916?

(normaltech.ai)

4 points · randomwalker · 1mo ago · 0 comments

2

Open-world evaluations for measuring frontier AI capabilities [pdf]

(cruxevals.com)

2 points · randomwalker · 2mo ago · 0 comments

3

Towards a science of AI agent reliability

(normaltech.ai)

1 point · randomwalker · 4mo ago · 0 comments

4

When AI Builds AI – Findings from a Workshop on Automation of AI R&D [pdf]

(cset.georgetown.edu)

1 point · randomwalker · 4mo ago · 0 comments

5

The Longitudinal Expert AI Panel: Understanding Expert Views on AI [pdf]

(static1.squarespace.com)

1 point · randomwalker · 7mo ago · 0 comments

6

Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation

(arxiv.org)

1 point · randomwalker · 8mo ago · 0 comments

7

America's AI Action Plan [pdf]

(whitehouse.gov)

11 points · randomwalker · 11mo ago · 0 comments

8

Could AI slow science? Confronting the production-progress paradox

(aisnakeoil.com)

2 points · randomwalker · 11mo ago · 0 comments

9

AI as Normal Technology

(knightcolumbia.org)

239 points · randomwalker · 1y ago · 92 comments

10

Why an overreliance on AI-driven modelling is bad for science

(nature.com)

1 point · randomwalker · 1y ago · 0 comments

11

Is AI progress slowing down?

(aisnakeoil.com)

5 points · randomwalker · 1y ago · 1 comment

12

We Looked at 78 Election Deepfakes. Political Misinformation Isn't an AI Problem

(knightcolumbia.org)

5 points · randomwalker · 1y ago · 0 comments

13

Inference Scaling FLaws: The Limits of LLM Resampling with Imperfect Verifiers

(arxiv.org)

3 points · randomwalker · 1y ago · 0 comments

14

Is the UK's liver transplant matching algorithm biased against younger patients?

(aisnakeoil.com)

93 points · randomwalker · 1y ago · 62 comments

15

Core-Bench: Computational Reproducibility Agent Benchmark

(arxiv.org)

1 point · randomwalker · 1y ago · 0 comments

