But how does someone use CatBoost across a cluster of 10 machines? All the help documents are heavily single-machine. Is there any kind of infra framework that will distribute the jobs across all the machines running CatBoost, etc.?
Interesting tidbit (to me anyway): This was 2006, before you could use something like AWS for the purpose and we were trying to keep costs to a minimum. (IBM had their "public one" grid but it was unusable.) The Core 2 Duo processor had just come out so I hired an intern from Cal—now a PhD and brilliant engineer at Netflix—to figure out the optimal overclocking rig and we built 32 chassis consisting of just motherboard, RAM, NIC, custom cooling/heatsink, and power supply. The problem was then how to deploy these in a colo. At the time there were some low end providers at 200 Paul willing to get creative with a cabinet so I found a machinist (metal worker?) able to cut some custom aluminum shelving on top of which we could stack the ATX cases. Rigged the boxes up to network boot off one of the nodes, compiled our application with Intel's C++ compiler to take advantage of the SIMD/SSE3 instruction set, and away we went running billions of simulations on a startup budget.
CatBoost team here.
CatBoost is currently single-host; the version with training distributed across a cluster will be open-sourced later.
1. Yandex is announcing a new ML library and that makes it news because Yandex is well established
2. Gradient boosting is quite effective and popular
3. Not everything has to be about deep learning
That's some strange definition.
But these points all belong in some section entitled "why use gradient boosting instead of another ML method?", not in a definition of gradient boosting.
https://arxiv.org/abs/1706.04964 "Learning Deep ResNet Blocks Sequentially using Boosting Theory"
(As for the layman's description: I thought boosting performed better out-of-the-box on dense data than on sparse data, because most feature sub-selections for bagging land on zeroed features)
So that's awesome...
The benchmarks at the bottom of https://catboost.yandex/ are somewhat useful though. I do remember that when LightGBM came out, the benchmarks vs XGB were... very selective.