1. Insights from Multilingual Curation for a 20T-Token Dataset (datologyai.com) | 1 | hurrycane | 1mo ago | 0
2. DatBench fixes VLM evals: 70% blindly solvable, 42% mislabeled, 35% prod gap (datologyai.com) | 5 | hurrycane | 2mo ago | 0
3. DatBench: Cut VLM eval compute by >10× while INCREASING signal (datologyai.com) | 4 | hurrycane | 2mo ago | 0
4. Luxical: Lexical-Dense Embeddings for Web-Scale Data Curation (3×–100× Faster) (datologyai.com) | 3 | hurrycane | 3mo ago | 0
6. BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-Scale Pretraining (blog.datologyai.com) | 1 | hurrycane | 7mo ago | 0
7. BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-Scale Pretraining (blog.datologyai.com) | 3 | hurrycane | 7mo ago | 0
8. Image-Text Curation for 1B+ Data: Faster, Better, Smaller Clip Models (datologyai.com) | 12 | hurrycane | 1y ago | 0
10. Augmenting Segment customer data with behavioral signals using the Moonsense SDK (moonsense.io) | 1 | hurrycane | 3y ago | 0
11. Moonsense Recorder – Build live prototypes using mobile device sensor data (moonsense.io) | 2 | hurrycane | 4y ago | 0
12. From the Gym to a Jupyter Notebook – Building a Squats Counter App in a Day (urimerhav.medium.com) | 7 | hurrycane | 4y ago | 0
14. Reducing indexing latency of Twitter Search to one second (blog.twitter.com) | 3 | hurrycane | 5y ago | 0