One of the fundamental problems with configuration files is that they are written in ad-hoc external DSLs. This means that the config files behave nothing like actual software--instead, they have their own special rules to follow and their own semantics to understand. These languages also tend to have very little abstractive power, which leads to all sort of incidental complexity and redundancy. Good software engineering and programming language practices are simply thrown away in config files.
Some of these issues are mitigated in part by using a well-understood format like XML or even JSON. This is great, but it does not go far enough: config files still stand alone and still have significant problems with abstraction and redundancy.
I think a particularly illuminating example is CSS. When you get down to it, CSS is just a configuration file for how your website looks. IT certainly isn't a programming language in the normal sense of the word. And look at all the problems CSS has: rules are easy to mess up, you end up copying and pasting the same snippets all over the place and css files quickly become tangled and messy.
These problems are addressed to a first degree by preprocessors like Sass and Less. But they wouldn't have existed in the first place if CSS was an embedded DSL instead of a standalone language.
At the very least, being an embedded language would give it access to many of the features preprocessors provide for free. You would be able to pack rules into variables, return them from functions and organize your code in more natural ways. Moreover, you could also take advantage of your language's type system to ensure each rule only had acceptable values. Instead of magically not working at runtime, your mistakes would be caught by compile time.
This is particularly underscored by how woefully complex CSS3 is getting. A whole bunch of the new features look like function calls or normal code; however, since CSS is not a programming language, it has to get brand new syntax for this. This is both confusing and unnecessary: if CSS was just embedded in another language, we could just use functions in the host language.
I think this approach is promising for configuration files in general. Instead of arbitrary custom languages, we could have our config files be DSL shallowly embedded in whatever host language we're using. This would simultaneously give us more flexibility and make the configuration files neater. If done correctly, it would also make it very easy to verify and process these configuration files programmatically.
There has already been some software which has taken this approach. Emacs has all of its configuration in elisp, and it's probably the most customized software in existence. My .emacs file is gigantic, and yet I find it much easier to manage than configuration for a whole bunch of other programs like Apache.
Another great example is XMonad. This is really more along the lines of what I really want because it lets you take advantage of Haskell's type system when writing or editing your configuration file. It catches mistakes before you can run them. Since your config file is just a normal .hs file, this also makes it very easy to add new features: for example, it would be trivial to support "user profiles" controlled by an environment variable. You would just read the environment variable inside your xmonad.hs file and load the configuration as appropriate.
Specifying configuration information inside a general-purpose language--probably as an embedded DSL--is quite a step forward from the ad-hoc config files in use today. It certainly won't solve all the problems with configuration in all fields, but I think it would make for a nice improvement across the board.
I have to disagree that this is the critical problem with system configuration. The typical config has orders of magnitude less complexity than the typical software project. The problem with config files is not the complexity of the configuration itself (except perhaps in Java), but the complexity of the global effects they produce. It doesn't matter how clean and DRY you make your config, and how well-tested, if the software supports too many options that can interact in unpredictable ways, or worse, produce subtle outward facing changes that ripple out to systems that interact with yours.
It's more a comment on config formats in general.
- Automatically scanning for insecure configurations (eg. OpenSCAP)
- Parsing the configuration in other programs.
- Modifying the configuration programmatically (cf. Puppet et al)
Also, http://augeas.net/
Similarly, this would make modifying configuration programmatically better. Instead of dealing with serializing and deserializing to random text formats, you would just provide a Config -> Config function. This would also make tools modifying configuration parameters more composable because they wouldn't be overwriting each other's changes unnecessarily.
> - Automatically scanning for insecure configurations (eg. OpenSCAP)
Since the language can't access the outside world, the worst it can do is use unbounded space or time. Just verify that it halts in a couple ms.
> - Parsing the configuration in other programs.
You don't parse, you embed an interpreter and execute.
> - Modifying the configuration programmatically (cf. Puppet et al)
Code generation isn't that hard.
The program (written in go) could simply launch a new process on startup, e.g.
python -c "import imp; c = imp.load_source('c', 'abc.conf'); print '\n'.join('%s = %s' % (n, repr(getattr(c, n))) for n in ['option1', 'option2', 'option3'])"
and read it's output (which could also be another well-known format, like JSON) and use it for configuration. Simple, elegant, and enables very powerful abstractions.Nevertheless I agree that standardization is a problem. Not for the "linux community", but in general. Some people do not require Turing-completeness and (reasonably) do not want it. Others already have a scripting language (Lua,TCL,Python,etc) embedded, so reusing it is a good idea.
They're chaotic systems[1]. Parameters which are below the threshold of detection can drastically affect their trajectory in phase space. Accidental complexity has zoomed off to live a life all its own, and that takeoff has been continuous for decades [2].
Everything we build around software that's not directly bearing on the problem domain is chipping away at this or that aspect of the ungoverned bits. As we corral each level of new unpredictability, we merely set the scene of the next level of complexity.
There is no universal solution. There is no final boss which, once defeated, leads to the scrolling names and 8-bit themesong. It never stops, ever, because the value of systems that frequently fail unpredictably is lower than the cost of flawless systems and higher than the value of not building such systems.
Edit: sounds very unhelpful, even defeatist. I know. I'm deeply ambivalent about what can be done versus what we should assume is possible. I'm a fan of serious software engineering and tool support and all that jazz ... yet thoroughly pessimistic about the limits of the possible. Deep sighs for every occasion!
[1] http://chester.id.au/2012/06/27/a-not-sobrief-aside-on-reign... (warning, self linkage)
[2] http://www.cs.nott.ac.uk/~cah/G51ISS/Documents/NoSilverBulle...
The example of FDs from the original article seems like a good one; a properly developed expert system might have been able to notice a lack of changes in the current software environment and suggest looking elsewhere. (Problem is, if I got that response from my automated problem-solver I'd just ignore it and keep looking . . .)
So, any insightful points to crush my youthful idealism? :)
First problem: Need some 'experts'. So, only know about problems the experts do. Their knowledge is supposed to be mostly just empirical, from their 'expert experience'. Even if they have 'deep knowledge', that is, how the systems work internally (e.g., the engine connects to the torque converter connects to the transmission connects to the rear differential connects to the rear wheels), the specific problems they know about are usually just the ones they have encountered. So, essentially can only address problems seen before and are well understood.
But for the problems of the OP, they were seen for the first time. Bummer. Actually, with irony, if the problems have been seen before and are well understood, then why the heck have they not already been solved?
Second, with expert systems, we're working just intuitively from just experience. So, we have little idea if what we are doing is the best possible in any reasonable sense or even at all good. In particular, in monitoring, the main, first operational problem on the 'bridge' or in the 'NOC' (network operations center) is the false alarm rate too high with no good way to lower it (except just to ignore some alarms, but we should be able to do much better).
Net, for a serious attack on system monitoring and management, I can't recommend taking expert systems very seriously.
We passed the point quite a while back where "applied CS" -- what it's like to build and maintain scalable systems in the real world -- looks a lot more like anthropology than like mathematics.
We have tools like sampling profilers, loggers that do stateful packet inspection on a small percentage of traffic, and sophisticated A/B testing frameworks that reason about user populations rather than individuals. But I think we could probably build a whole 'nother generation of these tools by thinking about this problem of understanding complex systems from some new perspectives. I nominate anthropology as a discipline that has many years under its belt wrestling with the impossibility of complete description. (And wrestling with other tough issues, sometimes not well, but that's another conversation.)
I'm not talking about the kind of "anthropology" work that companies like Intel have done for a long time, which are firmly in the social sciences tradition and are really user studies. (Though that work is valuable.) I'm talking about porting ideas such as "thick description" to our technical toolkit.
http://en.wikipedia.org/wiki/Thick_descriptionWhich of course is not true. There are however various attempts to make a "proper" "modern" OS but all of these have failed due to the lack of commercial backing. There is apparently no financial interest into having a secure, scalable, fast, deterministic, simple OS right now (which, oh, requires rewriting all your apps, obviously)
Even Microsoft tried. Yes, Microsoft did an excellent such OS.
The Mindori project is still secret, so who knows what they are doing there, while some of the team is nowadays playing with the concept of pico-hipervisors and running the full OS as a library instead.
What do you mean by "safe" though? How is kernel-space safer then user-space?
The modern oses do have kernels that are much safer than what we use day to day, however. They're indeed not in C. In general the core is typed ASM and it runs a a managed language right on top. The kernel is generally written with that and uses various concepts that separate privileges as much as possible for security, reliability and predictability, basically. That includes kernel components, and obviously drivers. Messages between components are also generally declared ("contracted") with well defined types.
It reminds of the scene from the film Election, where a character intones, after you've seen him ski down a double-black diamond slope and crash, "Last winter when I broke my leg, I was so mad at God."
We're spending money to squeeze the last drops of gold out of the research done in the 60's and 70's, while research starves, and then complaining that there's nothing new. I don't see how we can expect anything different.
systems are about interactions, and he's identified the challenges in analyzing systems.
this isn't so much about work from the 60s and 70s, but about modern architectures and the new things they enable. we're still working with models form the 60s and 70s and ignoring various aspects of new system architectures and wondering why we still suck.
https://news.ycombinator.com/item?id=5717481
in this thread.
And you can be sure that a company like Google will have done configuration using flat files, key-value stores, DSLs, general purpose languages that generate config files, general purpose languages that actually execute stuff, and probably many other ways :-) If there were simple solutions like that, they'd already be in use.
Here's a small list of desirable properties:
- Configurations must be reusable, composable and parametrizable. When running a system with multiple components, you need to be able to take base configurations of those two components and merge them into a useful combination, rather than writing the configurations from scratch. You must also be able to parametrize those base configurations in useful ways -- which external services to connect to, which data center and on how many machines to run on, etc. And note -- this can't be a one time thing. If the base configurations change, there needs to be some way of eventually propagating those changes to all users.
- Configurations must be reproducible. If the configuration is a program, it shouldn't depend on the environment where it's run on (nor should it be able to have side-effects on the environment). Why? Because when somebody else needs to fix your system at 3am, the last thing you want them to do is need to worry about exactly replicating the previous person's setup.
- Tied to the previous point, configurations also need to be static or snap-shottable. If a server crashes and you need to restart a task on another machine, it'd be absolutely inexcusable for it to end up using a different configuration than the original task due to e.g. time dependencies in programmatic configuration.
- It must be possible to update the configuration of running services without restarting them, and it must be possible to roll out config changes gradually. During such a gradual rollout you need to have options for manual and automatic rollback if the new configuration is problematic on the canaries.
- Configurations need to be debuggable. Once your systems grow to having tens of thousands of lines of configuration or hundreds of thousands of programatically generated key-value pairs, the mere act of figuring out "why does this key have this value" can be a challenge.
- It'd be good if configurations can be validated in advance, rather than only when actually starting the program reading the configuration. At a minimum, this would probably mean that people writing configurations that other people use as a base be able to manually add assertions or constraints on whatever is parametrized. But you might also want constraints and validation against the actual runtime environment. "Don't allow stopping this service unless that service is also down", or something.
Some of these desirable properties are of course conflicting. And some are going to be things that aren't universally agreed on. For example there is a real tension between not being able to run arbitrary code from a configuration (important for several of these properties) vs. reuse of data across different services.
My thinking on this is already a few years out of date, it's probable that configuration has again gotten an order of magnitude more complex since the last time I thought about this stuff seriously, and there are already new and annoying properties to think about. The main point is just that this really is a bit more complex than your .emacs file is.
Recently I've been working on a simple microkernel (called Sydi) that should actually make this possible. It's a distributed system straight down to the kernel: when two instances of the kernel detect each other running on a network, they freely swap processes and resources between each other as required to balance the load. This is a nightmare from a security perspective, of course, and I haven't figured out the authentication bits yet -- too busy working on an AML interpreter because I don't like ACPICA -- but that's not the issue here. (It's far from finished, but I'm pretty confident that I've worked out the first 85% of the details, with just the remaining 115% left, plus the next 95% of the programming . . .)
Due to some aspects of its design (asynchronous system calls via message-passing, transparent message routing between nodes in the network, etc) I feel that it will be entirely possible to take a snapshot of an entire cluster of machines running the kernel. It would be expensive -- requiring a freeze of all running processes while the snapshot is taking place to maintain any level of precision -- but I'm confident I could code that in during the space of a week or two . . .
I haven't thought much about what kind of analysis one could do on the instrumentation/snapshot results, though. I'm sadly too inexperienced with `real-world' systems stuff to be able to say. Anyone have any suggestions for possible analysis avenues to explore?
Have you thought about distributed algorithms like the Chandy-Lamport Snapshot algorithm (http://en.wikipedia.org/wiki/Snapshot_algorithm) that take a consistent snapshot and do not require a system freeze?
Well, only because you're trusting your hardware. Most people take that on faith anyway.
If you're willing to trust the hardware, I'm fairly sure software-based isolation and security is relatively doable.
> An escape from configuration hell
debconf. chef, puppet. git. augeas. gsettings. In the Windows world, the registry.
There's tons of work being done in this area. You may say that more can be done, or that it isn't good enough, but I think you're wrong to claim that nobody is working on it.
> Understanding interactions in a large, production system
How about the Google Prediction API? Send the data in, and look for changes when you have a problem that you're trying to fix, such as in your example.
> Pushing the envelope on new computing platforms
This is incredibly vague. I find it difficult to understand your definition of "new computing platforms". Based on my interpretation of what you mean, I think a modern smartphone might qualify: a new form factor, more ubiquitous than before, novel UX (capacitive touchscreen), and an array of interesting sensors (accelerometers, compasses, GPS).
I think your post highlights the lamentation of AI researchers: we don't have a good, static measure of success. Once we've achieved something, our definition changes to exclude it.
> The "bug" here was not a software bug, or even a bad configuration: it was the unexpected interaction between two very different (and independently-maintained) software systems leading to a new mode of resource exhaustion.
can be viewed as pointing to the solution:
Broadly in computing, we have 'resource limits' from the largest integer that can be stored in 2's complement in 32 bits to disk partition size, maximum single file size, number of IP ports, etc. Mostly we don't pay really careful attention all of these resource limits.
Well, for 'systems' such as are being considered in the OP, do consider the resource limits. So, for each such limit, have some software track its usage and then report the usage and especially when usage of that resource is close to be exhausted. And when the resource is exhausted, then report that the problem was the resource was exhausted and who the users of that resource were.
For a little more, can make some progress by doing simulations of 'networks of queues' with whatever stochastic processes want to assume for 'work arrival time' and 'work service times'.
For more, do some real time system monitoring. For problems that are well understood and/or have been seen before, e.g., running out of a critical resource, just monitor for the well understood aspects of the problem.
Then there are problems never seen before and not yet well understood. Here monitoring is essentially forced to be a continually applied 'statistical hypothesis test' for the 'health and wellness' of the system complete with rates of 'false positives' (false alarms) and rates of 'false negatives' (missed detections of real problems).
For the operational definition of 'healthy and well', collect some operational data, let it 'age' until other evidence indicates that the system was 'healthy and well' during that time, and, then, use that data as the definition. Then decree and declare that any data 'too far' from the healthy and well data indicates 'sick'.
Then as usual in statistical hypothesis testing, want to know the false alarm rate in advance and want to be able to adjust it. And if get a 'detection', that is, reject the 'null hypothesis' that the system is healthy and well, then want to know the lowest false alarm rate at which the particular data observed in real time would still be a 'detection' and use that false alarm rate as a measure of the 'seriousness' of the detection.
If make some relatively weak probabilistic assumptions about the data, then, in calculating the false alarm rate, can get some probabilistically 'exchangeable' data. Then get to apply some group theory (from abstract algebra) and borrow a little from classic ergodic theory to calculate false alarm rate. And can use a classic result in measure theory by S. Ulam, that the French probabilist LeCam nicely called 'tightness', to show that the hypothesis test is not 'trivial'. See Billingsly 'Convergence of Probability Measures'.
Of course, the 'most powerful' test for each given false alarm rate is from the Neyman-Pearson lemma (the elementary proofs are not much fun, but there is a nice proof starting with the Hahn decomposition, a fast result from the Radon-Nikodym theorem), but for problems never seen before we do not have data enough to use that result.
For the statistical hypothesis test, we need two special properties: (1) Especially for monitoring systems, especially complex or distributed systems, we should have statistical hypothesis tests that are multi-dimensional (multi-variate); that is, if we are monitoring once a second, then each second we should get data on each of a certain n variables for some positive integer n. So, we are working with n variables. (2) As U. Grenander at Brown U. observed long ago, operational data from computer systems is probabilistically, certainly in distribution, quite different from data from most of the history of statistics. He was correct. Especially since we are interested in data on several variables, we have no real hope of knowing the distribution of the data even when the system is 'healthy and well' and still less hope when it is sick with a problem we have never seen before. So, we need hypothesis tests that are 'distribution-free', that is, make no assumptions about probability distributions. So, we are faced with calculating false alarm rates for multi-dimensional data where we know nothing about the probability distribution.
There is a long history of hypothesis tests for from a single variable, including many tests that are distribution-free. See old books by S. Siegel, 'Nonparametric Statistics for the Behavioral Sciences' or E. Lehmann, 'Nonparametrics'.
For multi-dimensional hypothesis tests, there's not much and still less when also distribution-free.
Why multi-dimensional? Because for the two interacting systems in the example in the OP, we guess that to detect the problem we would want data from both systems. More generally in a large server farm or network, we are awash in variables on which we can collect data at rates from thousands of points per second down to a point each few seconds. So, for n dimensions, n can easily be dozens, hundreds, ....
For such tests, look at "anomaly detector" in 'Information Sciences' in, as I recall, 1999.
If want to implement such a test, might want to read about k-D trees in, say, Sedgewick. Then think about backtracking in the tree, depth-first, and accumulating some cutting planes.
For monitoring, there was long some work by Profs Patterson and Fox at Stanford and Berkelely, in their RAD Lab, funded by Google, Microsoft, and Sun. The paper in 'Information Sciences' seems to be ahead.
More can be done. E.g., consider the Rocky Mountains and assume that they are porous to water. Let the mountains be the probability distribution of two-variable data when the system is healthy and well. Now pour in water until, say, the lowest 1% of the probability mass has been covered. Now the test is, observe a point in the water and raise an alarm. Now the false alarm rate is 1%. And the area of false alarms is the largest we can make it for being 1% (a rough surrogate for the best detection rate -- with a fast use of Fubini's theorem in measure theory, can say more here). As we know, lakes can have fractal boundaries, and the techniques in the paper will, indeed, approximate those. Also the techniques do not require that all the lakes be connnected. And for the dry land, it need not all be connected either and might be in islands. So, it is quite general.
But may want to assume that the region of healthy and well performance is a convex set and try again. If this assumption is correct, then for a given false alarm rate will get a higher detection rate, that is, a more 'powerful' test.
Still more can be done. But, really, the work is essentially some applied math with, at times, some moderately advanced prerequisites from pure and applied math. Theme: The crucial content of the most powerful future of computing is in math, not computer science. Sorry 'bout that!
The problem is that inevitably the next problem is exhaustion of some resource that nobody even thought of as a resource. In my time at Google, some of the resource limits I've faced have included (besides the obvious RAM/disk/CPU): binary size, size of linker inputs, average screen size of websearch users, time required for an AJAX request, byte-size of the search results page, visual space on the monitoring UI, size of a Javascript alert box, maximum length of a URL in Internet Explorer, maximum length of a request in Google's load-balancers, time until someone submits a CL touching the same config file as you, and time before a cosmic ray causes a single-bit error in memory.
File descriptors are a comparatively obvious resource, but nobody thought of them because in previous normal operation they never even came close to running out. Should someone set an alert on "wind speed running through the Columbia River Gorge" (which would've been responsible for an actual outage had there not been significant redundancy in the system, BTW)?
Really, "All"?
First question is, 'parameters'. Usually, and likely at Google, they are monitoring just one parameter at a time. This is a really big bummer. E.g., even if do this for several parameters, are forced into assuming that the 'healthy and well' geometry of the parameters is just a rectangle. Bummer that hurts the combination of false alarm rate and detection rate.
Next, the "normal range" is a bummer because it assumes that what is healthy and well is just a 'range', that is, an interval, and this is not always the case. The result, again, is a poor combination of false alarm rate and detection rate.
Again, yet again, please read again, just for you, as I wrote very clearly, to do well just must be multi-dimensional. I doubt that there is so much as a single large server farm or network in the world that is doing anything at all serious working with multidimensional data for monitoring. Not one.
Next, your remark about false alarm rate points to a problem I point out is solved: The method, with meager assumptions, permits knowing false alarm rate in advance and setting it, actually setting it exactly.
For how many and what variables to monitor, yes, that is a question that can need some overview of the server farm or network and some judgment, but there is some analytical work that should help.
For "rigorous" statistics, the point is not 'rigor' but useful power. Being multidimensional, knowing false alarm rate and being able to adjust it, etc. are powerful.
Consider package management. You have an OS level package manager, you have OS level application management (on android, for example), you have node package management, you have ruby gems, you have perl and python packages, you have wordpress themes and plugins, etc. Obviously it doesn't make sense to have a single global package manager, but what about standards for package management? Competing package management infrastructures which serve as foundations for all these lower level systems? A mechanism for aggregating information and control for all of the package management sub-systems?
As the article points out a lot of systems folks and OS folks are still retreading the same paths. They want to pave those paths, then build moving walkways, and so forth. That's all fine and good, but there's a lot of wilderness out there, and a lot of people are out in that wilderness hacking away with machetes and pickaxes building settlements, but for some reason most OS folks don't even imagine that roads can go there too.
Anybody near the field has their own list of "taboo" research topics that are very interesting but just don't get funding.
Then you go to the library and realize they have 120 shelf-feet of conference proceedings about Internet QoS that have gone precisely nowhere.
Hence there is no incentive for academic systems researchers to focus on these (it's no fun having paper after paper shot down).