
0 points · Vaslo · 2y ago · 0 comments

My job is now primarily time series forecasting, and we’ve spent so much time improving our feature selection and engineering. When I started, I thought: “run correlations against target variables, find the best bunch, and as long as we can explain them and their relation to the target, we are good.”

I was wrong.


10 comments · 2 top-level

bigger_cheese · 2y ago · 8 in thread

I work mostly with regressions and often it is almost more informative when something you expected to be a significant term isn't. Can help track down interesting behavior.

More recently, machine learning has really enhanced what you can do with regression, for example in multivariate regressions where there are nonlinear (or partially linear) relationships between feature and target variables.

For example, a recent regression problem involved a chemical reaction. It was suspected that a particular feature began to display nonlinear behavior above a threshold, but it was difficult to pinpoint exactly where it began departing from linearity. ML was very helpful in analyzing this.
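One simple way to look for that kind of departure point is a hinge ("broken-stick") regression with a grid search over the breakpoint. This is only a sketch on synthetic data, not the method the commenter used; the function name `fit_hinge`, the candidate grid, and the data are all assumptions made for illustration.

```python
import numpy as np

def fit_hinge(x, y, candidates):
    """Grid-search one breakpoint c for y ~ b0 + b1*x + b2*max(x - c, 0)."""
    best = None
    for c in candidates:
        # Design matrix: intercept, linear term, and a hinge that is zero below c.
        X = np.column_stack([np.ones_like(x), x, np.maximum(x - c, 0.0)])
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        sse = float(np.sum((y - X @ coef) ** 2))
        if best is None or sse < best[0]:
            best = (sse, float(c), coef)
    return best[1], best[2]

# Synthetic data (assumed): linear up to x = 6, then the slope increases by 1.5.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 200)
y = 2.0 + 0.5 * x + 1.5 * np.maximum(x - 6.0, 0.0) + rng.normal(0.0, 0.05, x.size)

breakpoint_est, coef = fit_hinge(x, y, candidates=np.linspace(1.0, 9.0, 81))
print(f"estimated breakpoint: {breakpoint_est:.2f}, slope change: {coef[2]:.2f}")
```

The grid search keeps the fit linear for each candidate, so ordinary least squares does all the work. It assumes a single breakpoint and a straight line on both sides, which may well be too rigid for a real reaction.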

Other than regressions and time series forecasting, I think it's worth knowing about K-means clustering and PCA (Principal Component Analysis) / PLS (Projection to Latent Structures) as well.

I've found PCA to be pretty unknown but very useful. I've had success using it to explain the relationship not just between the data features and the target variable but also how the features relate to each other.
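To make the "how the features relate to each other" point concrete, here is a minimal PCA sketch using an SVD of the standardized data. The three sensor names and the synthetic data are hypothetical; features that load together on the same component tend to move together.

```python
import numpy as np

def pca(X, n_components=2):
    """PCA via SVD on standardized columns; returns loadings and explained variance ratio."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Rows of Vt are the principal directions; squared singular values give the variances.
    _, S, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = S**2 / np.sum(S**2)
    return Vt[:n_components].T, explained[:n_components]

# Hypothetical sensors: two move together, the third is independent.
rng = np.random.default_rng(1)
load = rng.normal(size=500)
X = np.column_stack([
    load + rng.normal(scale=0.1, size=500),        # "flow"
    2.0 * load + rng.normal(scale=0.2, size=500),  # "pressure"
    rng.normal(size=500),                          # "ambient"
])

loadings, explained = pca(X)
for name, row in zip(["flow", "pressure", "ambient"], loadings):
    print(f"{name:9s} PC1={row[0]:+.2f} PC2={row[1]:+.2f}")
print("explained variance ratio:", np.round(explained, 2))
```

In this toy setup the first component should be dominated by "flow" and "pressure" with matching signs, while "ambient" sits mostly on the second. The sign of each component is arbitrary, so only relative signs within a column are meaningful.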

applied_heat · 2y ago

I’m just about to start digging into 8 years of data from a few power plants with 16 turbines in total, to see if I can identify some problems we might have before the sensor measurements exceed the alarm threshold.

Taking bearing temperature as an example, I think I will identify periods where the machine has already been generating for an hour, so temperatures have stabilized. Then I will have bearing oil inlet temperature and machine load as independent variables, and bearing oil outlet and bearing metal temperatures as dependent variables. It seems like it should be straightforward to find any anomalies, but I only started googling how to do this yesterday. There are lots of vendors hawking predictive maintenance software, but I can’t imagine that I couldn’t get similar results with a few weeks of effort, armed with Python and all of the associated libraries.
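The plan described above could be sketched roughly as follows: keep only samples where the unit has been loaded for at least an hour, fit a linear baseline on a period assumed to be healthy, and flag large residuals. Everything here is an assumption for illustration: one-minute sampling, the thresholds, the helper names, and the synthetic data with an injected 3-degree offset.

```python
import numpy as np

def steady_state_mask(load, min_load, warmup):
    """True where load has been above min_load for at least `warmup` consecutive samples."""
    run = 0
    mask = np.zeros(load.size, dtype=bool)
    for i, on in enumerate(load > min_load):
        run = run + 1 if on else 0
        mask[i] = run >= warmup
    return mask

def residual_zscores(X, y, train):
    """Fit y ~ X by least squares on the `train` rows; return scaled residuals for all rows."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A[train], y[train], rcond=None)
    resid = y - A @ coef
    return resid / resid[train].std(ddof=A.shape[1])

# Synthetic one-minute data (assumed): unit starts at sample 100, fault injected at 1800.
rng = np.random.default_rng(2)
n = 2000
idx = np.arange(n)
load = np.where(idx < 100, 0.0, 50.0 + rng.normal(0.0, 5.0, n))
inlet = 40.0 + rng.normal(0.0, 2.0, n)
metal = 20.0 + 0.8 * inlet + 0.3 * load + rng.normal(0.0, 0.3, n)
metal[1800:] += 3.0

steady = steady_state_mask(load, min_load=10.0, warmup=60)
train = steady & (idx < 1000)  # baseline period assumed healthy
z = residual_zscores(np.column_stack([inlet, load]), metal, train)
flagged = np.flatnonzero(steady & (np.abs(z) > 4.0))
print(f"{flagged.size} steady-state samples flagged")
```

Real bearing data will be messier than this: the linear baseline, the independence of samples, and the choice of a "healthy" training window are all things to check against the plots before trusting the flags.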

bigger_cheese · 2y ago

Maybe try slopes and second derivatives (change in temperature over time and so forth). You could also try introducing various lag windows into the time series data.
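A minimal sketch of those derived features, assuming an evenly sampled series at one-minute intervals. The helper names `lag` and `rolling_mean`, the window sizes, and the synthetic temperature curve are made up for illustration.

```python
import numpy as np

def lag(x, k):
    """Value of x from k samples earlier (k >= 1); the first k entries are NaN."""
    out = np.full(x.shape, np.nan)
    out[k:] = x[:-k]
    return out

def rolling_mean(x, window):
    """Trailing mean over `window` samples; NaN until the window is full."""
    out = np.full(x.shape, np.nan)
    csum = np.cumsum(np.insert(x, 0, 0.0))
    out[window - 1:] = (csum[window:] - csum[:-window]) / window
    return out

# Hypothetical bearing temperature sampled once per minute.
minutes = np.arange(120.0)
temp = 60.0 + 0.05 * minutes + 0.002 * minutes**2

slope = np.gradient(temp, minutes)  # degrees per minute
features = {
    "slope": slope,
    "curvature": np.gradient(slope, minutes),  # second derivative
    "lag_5": lag(temp, 5),                     # value 5 minutes earlier
    "mean_15": rolling_mean(temp, 15),         # trailing 15-minute mean
}
print({name: round(float(values[60]), 4) for name, values in features.items()})
```

Differencing amplifies sensor noise, so on real data it usually helps to smooth the series (for example with the trailing mean) before taking slopes.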

Edit: I've also seen a lot of pitches about predictive maintenance / automated anomaly detection. I think the appeal lies in having a one-size-fits-all solution you can apply to multiple pieces of equipment (fans, conveyor belt drives, pumps, etc.) without needing to develop, deploy, and maintain bespoke models.

A lot of manufacturing sites won't have a data person on tap (or even people who can write Python). There are also challenges with deployment, especially at remote sites where access is difficult and data connectivity is bad (think oil/gas pipelines). Most of the pitches seem to combine running ML models with some kind of IoT device and something like LoRaWAN for connectivity.


mr_toad · 2y ago

A good first step would be scatterplots, time series plots, and Mark 1 eyeballs. It helps to understand the shape of the data before you start trying to fit models.


nerdponx · 2y ago

If you have any kind of functional physical model of part wear and eventual failure, you have a huge head start.

That said, there can be a pretty big gap between detecting individual sensor anomalies (undergrad homework) and predicting component failure (build an entire business around it). I have never regretted starting a data project with a small, easy task and ramping up from there, whereas I have definitely regretted starting a data project with big goals and/or fancy techniques at the beginning. Set clear incremental goals, and use the early prototyping phases to explore the data and develop a good understanding of what might or might not be possible to accomplish with it.

afrnz · 2y ago

Having worked in the same problem space, I strongly recommend getting expert input when evaluating which features to use. Ideally, this is a person who knows the internals of the machinery and/or operations and can help you remove spurious features. As a data scientist, one sometimes tends to think that the data explains everything and no expert domain knowledge is needed ("Modern machine translation does work without any knowledge of grammar or language!"). Good luck!

levocardia · 2y ago

You should look into using generalized additive models (GAMs). They are regression models that allow you to model nonlinear relationships, and even smooth nonlinear interactions between variables, while retaining the benefits of classical regression models like statistically valid confidence intervals and the ability to control for repeated measures. You can also explicitly model periodic behavior, like 24-hour or annual cycles in a predictor variable, and even account for auto-correlation explicitly.

In your example, you could not only pinpoint departure from linearity, but you could get a 95% confidence interval for it.

The best implementation is mgcv in R; pyGAM in Python is OK but lacks many of the more advanced features in mgcv. There's even a more ML-flavored implementation in mboost.
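As a rough illustration of modeling a 24-hour cycle explicitly, here is a plain harmonic regression in NumPy. This is not a GAM: it fixes the cycle to a single sinusoid, whereas a cyclic smooth would learn the shape from the data. The function name and the synthetic signal (peaking at hour 15) are assumptions.

```python
import numpy as np

def fit_daily_cycle(hours, y):
    """Least-squares fit of y ~ a + b*sin(2*pi*h/24) + c*cos(2*pi*h/24)."""
    w = 2.0 * np.pi * hours / 24.0
    A = np.column_stack([np.ones_like(w), np.sin(w), np.cos(w)])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    amplitude = float(np.hypot(coef[1], coef[2]))
    # b*sin(w) + c*cos(w) = R*cos(w - phi) with phi = atan2(b, c), so the peak is at w = phi.
    peak_hour = float((np.arctan2(coef[1], coef[2]) * 24.0 / (2.0 * np.pi)) % 24.0)
    return coef, amplitude, peak_hour

# Two weeks of synthetic half-hourly data with a daily peak at 15:00 (assumed).
rng = np.random.default_rng(3)
hours = np.arange(0.0, 24.0 * 14, 0.5)
y = 20.0 + 3.0 * np.cos(2.0 * np.pi * (hours - 15.0) / 24.0) + rng.normal(0.0, 0.5, hours.size)

coef, amplitude, peak_hour = fit_daily_cycle(hours, y)
print(f"mean={coef[0]:.2f} amplitude={amplitude:.2f} peak_hour={peak_hour:.1f}")
```

Encoding the hour as sine and cosine terms keeps 23:59 and 00:00 adjacent, which a raw hour-of-day feature does not.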

pbowyer · 2y ago

This is a case where I have a hard time getting my head around how and why machine learning helps. What models are available, and what training data do you use? Any background would be appreciated. I use ML for feature detection and image classification, but not yet for regressions.

bigger_cheese · 2y ago

I'm not a data scientist (my background is engineering); I use Azure ML Studio. The regression feature uses an ensemble of different algorithms. There is an explanation here:

https://learn.microsoft.com/en-us/azure/machine-learning/com...

Once the model has run, it uses something called a mimic to generate model explainability, which lets you explore things like feature importance in the final model. As far as the user interface goes, I mostly used SAS in the past and it feels quite similar.

eyegor · 2y ago

Wait I still do this, what are your secrets?
