LightGBM – A fast, distributed, gradient boosting framework

(github.com)

104 points · gwulf · 9y ago · 11 comments

11 comments · 6 top-level

tadkar · 9y ago · 2 in thread

This looks like an interesting project. I'd take the accuracy results with a pinch of salt, because growing deeper trees often improves accuracy, and in the test scenario xgboost is handicapped by limited depth. As the author says on Reddit, it's difficult to do an apples-to-apples comparison of the two methods directly, because their approaches to growing trees are very different (a bit like DFS vs. BFS).

The thing that is more relevant for 'real-world' data is whether this library supports categorical features at all. The answer seems to be that it doesn't (then again neither does xgboost).

The text in the Parallel experiments section [1] suggests that the result on the Criteo dataset was achieved by replacing the categorical features with their CTR and count.

[1] From https://github.com/Microsoft/LightGBM/wiki/Experiments#paral...: "This data contains 13 integer features and 26 category features of 24 days click log. We statistic the CTR and count for these 26 category features from first ten days, then use next ten days’ data, which had been replaced the category features by the corresponding CTR and count, as training data. The processed training data has total 1.7 billions records and 67 features."
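For anyone unfamiliar with that kind of preprocessing, here is a minimal sketch of CTR-and-count encoding as the quoted text describes it: statistics are computed on one window of data and then used to replace the categorical column in a later window. The field names (`site`, `click`), the helper names, and the default for unseen categories are assumptions for illustration, not taken from the LightGBM experiments.

```python
from collections import defaultdict

def fit_ctr_count(rows, feature, label="click"):
    """Compute per-category (CTR, count) from a statistics window."""
    clicks = defaultdict(int)
    counts = defaultdict(int)
    for row in rows:
        counts[row[feature]] += 1
        clicks[row[feature]] += row[label]
    return {cat: (clicks[cat] / counts[cat], counts[cat]) for cat in counts}

def encode(rows, feature, stats, default=(0.0, 0)):
    """Replace the categorical column with two numeric columns: CTR and count.

    Categories never seen in the statistics window get `default`
    (an assumption; the wiki does not say how unseen values were handled).
    """
    encoded = []
    for row in rows:
        ctr, count = stats.get(row[feature], default)
        new_row = {k: v for k, v in row.items() if k != feature}
        new_row[feature + "_ctr"] = ctr
        new_row[feature + "_count"] = count
        encoded.append(new_row)
    return encoded

# Toy data: the first window is used only for statistics,
# the second window becomes the training data.
stats_window = [
    {"site": "a", "click": 1},
    {"site": "a", "click": 0},
    {"site": "b", "click": 0},
]
train_window = [{"site": "a", "click": 1}, {"site": "c", "click": 0}]

stats = fit_ctr_count(stats_window, "site")
train_encoded = encode(train_window, "site", stats)
```

Computing the statistics on an earlier window than the training rows keeps a row's own label from leaking into its features.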

nerdponx · 9y ago

At the risk of outing myself as behind the cutting-edge, what is CTR?

tadkar · 9y ago

Sorry, CTR = click-through rate. The Criteo dataset is a real-world ad-click prediction task.

nl · 9y ago · 2 in thread

https://github.com/Microsoft/LightGBB/wiki/Experiments#compa...

At least 3 times faster than XGBoost AND more accurate. Wow.

I'm off to Kaggle now.

dswalter · 9y ago

I'm guessing you meant to link this instead: https://github.com/Microsoft/LightGBM/wiki/Experiments#compa...

nl · 9y ago

Indeed. Pretty bad when I can't even cut and paste. No idea how I managed that, but too late to edit it now.

TheGuyWhoCodes · 9y ago · 1 in thread

Looks fantastic!

I'd love to have a Python interface for this: just drop in a pandas DataFrame, maybe a scikit-learn interface with fit/predict, saving/loading models... This will definitely boost adoption.

minimaxir · 9y ago

That is explicitly in the future plans: https://github.com/Microsoft/LightGBM/wiki/Features

nerdponx · 9y ago

Very interesting. Growing trees "leaf-wise" is more intuitive in my opinion.

That said, I don't see a single equation on that page. Is there an arXiv paper or something behind this?
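To make the "leaf-wise" vs. level-wise distinction concrete, here is a toy sketch of the two growth orders under the same split budget. It is not LightGBM's or xgboost's actual implementation: the gain table in `toy_gain` is made up, and nodes are simply numbered like a binary heap (children of `i` are `2i+1` and `2i+2`).

```python
import heapq
from collections import deque

def toy_gain(node_id):
    """Hypothetical split gain per node, invented for illustration only."""
    gains = {0: 10.0, 1: 2.0, 2: 8.0, 5: 6.0, 6: 1.0, 11: 5.0}
    return gains.get(node_id, 0.5)

def grow_level_wise(max_splits):
    """Split nodes in breadth-first order, one level at a time."""
    order, queue = [], deque([0])
    while queue and len(order) < max_splits:
        node = queue.popleft()
        order.append(node)
        queue.extend([2 * node + 1, 2 * node + 2])
    return order

def grow_leaf_wise(max_splits):
    """Always split the current leaf with the largest gain (best-first)."""
    order, heap = [], [(-toy_gain(0), 0)]  # max-heap via negated gain
    while heap and len(order) < max_splits:
        _, node = heapq.heappop(heap)
        order.append(node)
        for child in (2 * node + 1, 2 * node + 2):
            heapq.heappush(heap, (-toy_gain(child), child))
    return order

level_order = grow_level_wise(4)
leaf_order = grow_leaf_wise(4)
level_gain = sum(toy_gain(n) for n in level_order)
leaf_gain = sum(toy_gain(n) for n in leaf_order)
```

With this made-up gain table, the leaf-wise order follows one promising branch several levels down while the level-wise order spends its budget filling the upper levels, which is why a fixed depth limit does not treat the two strategies equally.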

itschekkers · 9y ago

Looks nice, especially in reducing memory use. It would have been great if they had built in k-fold cross-validation by default too.

botexpert · 9y ago

10-20 year old methods implemented properly. Amazing.
