The term "regression" is something of a misnomer anyway, and often when people say "regression" they mean "linear model". By "linear model", we mean specifically a model in which the outputs/predictions are some fixed linear combination of the inputs.
It is, however, possible to interpret the Kalman filter as a kind of dynamic regression model. Check out this thread if you want a good math workout on that topic: https://stats.stackexchange.com/q/330696
(Another somewhat distinct meaning of the term "regression" is any model with a "continuous" outcome variable. This is usually in contrast to "classification", which is any model that has a "categorical" or discrete outcome variable.)
Suppose I have exogenous variables that vary over time, X(t). X has about 100 features. What are some methods I can apply to X(t) to automatically engineer features that may be useful for predicting some noisy y(t)?
I want to simultaneously capture interactions/interdependence between the columns of X, as well as the autocorrelation structure of X.
If I treat X as merely tabular data, throwing it into a traditional regression model (e.g. XGBoost), it can capture the interdependence structure in X, but it will neglect the autocorrelation structure... Unless I manually engineer features that capture the autocorrelation structure in X (e.g. rolling/shifted/differenced features), but I want to explore methods that do that automatically.
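For concreteness, the manual approach mentioned above might look something like this sketch in pandas. The helper name `add_temporal_features`, the lag and window sizes, and the random toy data are all assumptions made up for illustration; they are not from any real setup.

```python
import numpy as np
import pandas as pd

def add_temporal_features(X: pd.DataFrame, lags=(1, 2), windows=(5,)) -> pd.DataFrame:
    """Append shifted, differenced, and rolling-mean copies of every column.

    Hypothetical helper: the lag and window choices are illustrative only.
    """
    parts = [X]
    for k in lags:
        parts.append(X.shift(k).add_suffix(f"_lag{k}"))    # value k steps ago
        parts.append(X.diff(k).add_suffix(f"_diff{k}"))    # change over k steps
    for w in windows:
        parts.append(X.rolling(w).mean().add_suffix(f"_rollmean{w}"))  # trailing mean
    return pd.concat(parts, axis=1)

# Toy stand-in for X(t): 200 time steps, 3 features (the real X has ~100).
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["x0", "x1", "x2"])
features = add_temporal_features(X)
```

The first few rows of `features` contain NaN wherever a lag or window reaches back before the start of the series, so they would need to be dropped or imputed before fitting a model that cannot handle missing values.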
Usually our models are doing something like "Y = f(X) + E" where E is some unknown random noise and f() is the relationship that we are trying to infer from the data. We usually take X as "given" or "known", so in that case we are looking at Y conditional on some specific value of X.
If we are just trying to make good predictions, then we don't necessarily care about the structure among the components of X unless that structure tells us something about how Y is affected by X.
Imagine the following "true" relationships in the data, where E and H are unmeasurable random noise:
Y(t) = b0 + b1 * X(t) + b2 * X(t-1) + E(t)
X(t) = c * X(t-1) + H(t)
Knowing b0, b1, and b2 is sufficient to predict "Y minus random noise". Knowing c doesn't help us at all. If you're interested in obtaining good-quality estimates of b1 and b2, then you'll have a problem. That's because the direct effect of X(t-1) on Y is conflated with the indirect effect of X(t-1) on Y via X(t). But if you're just trying to make good predictions for Y, then you don't care as much about confidently distinguishing between b1 and b2.
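Here is a rough simulation of the two equations above, in case it helps to see the prediction point numerically. The coefficient values, the sample size, and the unit-variance Gaussian noise are arbitrary assumptions chosen for illustration. The fit uses plain least squares on X(t) and X(t-1) and never estimates c.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000
b0, b1, b2, c = 1.0, 2.0, -1.5, 0.9  # arbitrary illustrative values

# X(t) = c * X(t-1) + H(t)
H = rng.normal(size=n)
X = np.zeros(n)
for t in range(1, n):
    X[t] = c * X[t - 1] + H[t]

# Y(t) = b0 + b1 * X(t) + b2 * X(t-1) + E(t), defined for t >= 1
E = rng.normal(size=n - 1)
X_now, X_prev = X[1:], X[:-1]
Y = b0 + b1 * X_now + b2 * X_prev + E

# Least squares on the columns [1, X(t), X(t-1)]; c plays no role in the fit.
design = np.column_stack([np.ones(n - 1), X_now, X_prev])
coef, *_ = np.linalg.lstsq(design, Y, rcond=None)

# In-sample prediction error; it should sit near the noise s.d. of 1.
residual_rmse = np.sqrt(np.mean((Y - design @ coef) ** 2))
print("estimated b0, b1, b2:", coef)
print("residual RMSE:", residual_rmse)
```

The residual error cannot drop much below the standard deviation of E, since that part of Y is unpredictable by construction.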