1. Insights from Multilingual Curation for a 20T-Token Dataset | datologyai.com | 1 | hurrycane | 2mo ago | 0
2. DatBench fixes VLM evals: 70% blindly solvable, 42% mislabeled, 35% prod gap | datologyai.com | 5 | hurrycane | 4mo ago | 0
3. DatBench: Cut VLM eval compute by >10× while INCREASING signal | datologyai.com | 4 | hurrycane | 4mo ago | 0
4. Luxical: Lexical-Dense Embeddings for Web-Scale Data Curation (3×–100× Faster) | datologyai.com | 3 | hurrycane | 5mo ago | 0
6. BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-Scale Pretraining | blog.datologyai.com | 1 | hurrycane | 8mo ago | 0
7. BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-Scale Pretraining | blog.datologyai.com | 3 | hurrycane | 8mo ago | 0
8. Image-Text Curation for 1B+ Data: Faster, Better, Smaller Clip Models | datologyai.com | 12 | hurrycane | 1y ago | 0
10. Augmenting Segment customer data with behavioral signals using the Moonsense SDK | moonsense.io | 1 | hurrycane | 4y ago | 0
11. Moonsense Recorder – Build live prototypes using mobile device sensor data | moonsense.io | 2 | hurrycane | 4y ago | 0
12. From the Gym to a Jupyter Notebook – Building a Squats Counter App in a Day | urimerhav.medium.com | 7 | hurrycane | 4y ago | 0
14. Reducing indexing latency of Twitter Search to one second | blog.twitter.com | 3 | hurrycane | 5y ago | 0