- a browser must reject any invalid HTML in order to force the developers to fix their HTML
- a browser must try hard to make sense of messed up HTML, otherwise users will switch to a competing browser that renders the mess for them
Theoretically all browser vendors could coordinate so that everyone rejects invalid HTML, but there is probably no good way to avoid defectors.

Why did this not happen for other technologies? My first thought was that there is no compilation step that would allow forcing the developer to fix things without giving the end user any power through their choice of browser. But that seems not quite right: why do Bash, Python, or your C++ compiler not make a best guess at what your code is supposed to do? Because there is or was only one dominant implementation and therefore no competition? Because document markup is much more robust against small errors and probably remains readable, while your code likely just crashes? That last one is probably among the most important, I think. What role did browser-specific features, evolving standards, and incomplete implementations play?

What is the end result? Nothing for the end user; they do not care whether the browser has to deal with nice HTML or a mess. Developers writing HTML get to be more sloppy, at the price of a lot of additional complexity and pain wherever code has to deal with HTML. This might actually have some negative impact on end users because of bugs or security issues stemming from the additional complexity. Maybe it made HTML somewhat more accessible to the casual user, as they could get away with some mistakes. But was this worth it? Could better tooling not have achieved the same, with good error messages helping to fix errors?
At the end of the day, HTML's flexibility as a markup language is what made it popular and usable by anyone, and ambiguity is the price we pay for it.
These days the DOM semantics are even less important, as everything is done in JS anyway for all but the simplest documents.
<div>Hello <b>world</b>!</div>
why ever you would want this. But despite all the flexibility offered by SGML, and probably because nobody implemented HTML parsing by using an SGML parser, we ended up with malformed HTML documents whose interpretation was essentially defined by whatever the ad hoc HTML parser implementations in browsers did.
HTML5 finally abandoned the idea of basing HTML on SGML and a DTD and instead essentially formalized the status quo of HTML parsing and put it into the specification. At least this is my understanding as a non-web developer who gets to work with HTML only occasionally.
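To see what "tolerating the mess" looks like in code, here is a minimal sketch using Python's stdlib `html.parser`. Note this module is lenient in the same spirit as browsers but does not implement the full HTML5 tree-construction (error recovery) algorithm, so it only illustrates that malformed markup is tokenized without any error being raised.

```python
from html.parser import HTMLParser

# A minimal event logger: records every start/end tag the parser reports.
class TagLogger(HTMLParser):
    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

# Misnested "tag soup": </b> and </i> close in the wrong order,
# and <p> is never closed at all -- yet no exception is raised.
logger = TagLogger()
logger.feed("<p><b><i>tag soup</b></i>")
print(logger.events)
```

A strict XML parser would reject the same input at the first misnested end tag; here the parser just reports the tags in the order it saw them and leaves making sense of them to the consumer.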
> Some authors find it helpful to be in the practice of always quoting all attributes and always including all optional tags, preferring the consistency derived from such custom over the minor benefits of terseness afforded by making use of the flexibility of the HTML syntax. To aid such authors, conformance checkers can provide modes of operation wherein such conventions are enforced.
In other words, it recognizes the benefit of explicit tags, but also recognizes the benefit of optional tags. So both styles are equally conforming.
Exactly. The parser is designed to parse documents, not code. The document has a structure (like sections, paragraphs, tables, etc). When the structure doesn’t quite make sense, the parser still displays the content (the blog, story, words, etc).
> But was this worth it, could better tooling not have achieved the same with good error messages helping to fix errors?
You need to think about compatibility, especially backwards compatibility. If the standard were so strict that any error resulted in the browser refusing to parse the document, then as the specification evolved every website would need to be updated. The lack of constraints around standards also means that different browsers can evolve and implement different features at different cadences.
Developers?
In the early days of the web it wasn't developers writing HTML. It was anybody who wanted to publish anything on the web. Real programmers didn't touch it. That is why browsers had to be tolerant of bad code.
Now that HTML5 parsing is well specified, I've come to think that either you want to be strict and have the browser tell you something is wrong, and you use XHTML for this, or all these optional tags are just useless.
I want to optimize readability first, then file size. I believe closing all tags you opened and quoting all attributes helps readability, and also that all these <head>, <body>, <html> tags just get in the way: they make your eyes go through useless boilerplate and make your fingers type useless things too if you don't use templates.
You still need to specify the charset so characters are interpreted correctly, so for me, if you are not going to use application/xhtml+xml anyway, this works well:
<!DOCTYPE html>
<meta charset="utf-8" />
<title> My title </title>
<p> Lorem ipsum... </p>
Both quicker to read and write, while not raising maintenance costs.

Though just yesterday I edited my resume written in XHTML and the browser actually spotted a dumb mistake, so I still like the strictness of the XML parsing mode.
One counterpoint to dropping the optional tags is for pedagogy: if I had to teach HTML to someone, I would make them use all the tags, or the result of having html and body in the DOM and CSS working on them will be very confusing. Only when they understand the DOM, what nodes are in an HTML page, I'd make them drop the tags if they want. Which is an important step so they can understand that nodes that are present in the DOM are not necessarily in the source code.
<ahahah>mmm... <ohoho>nope.</ahaha></ohoho>
Fully joking :-P

I use optional -- therefore abbreviated -- HTML syntax as an alternative to Markdown for writing.
We need to go deeper. •`_´•
For example, emitters need to know what void elements are, because `<br></br>` is actually equivalent to `<br><br>`. But `<script src=foo.js/>` is only an opening tag, so the rest of your document will be executed as JavaScript. So you can't just write an emitter for arbitrary elements; you need to emit different things for `br` and `script`. Plus `script` has special escaping rules that are often forgotten about. Plus you had better keep that list up to date!
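A toy emitter makes the maintenance burden concrete. The sketch below hard-codes the void-element list (the hypothetical `VOID_ELEMENTS` set mirrors the HTML spec's list but must be kept in sync with it by hand, which is exactly the complaint):

```python
# Void elements take no end tag; this set must track the HTML spec.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "source", "track", "wbr"}

def emit(tag, children=""):
    """Serialize one element. Void elements get no end tag at all:
    emitting "</br>" would be parsed as a second <br>, and the XML
    self-closing shorthand ("<script .../>" ) is just an opening tag
    in HTML, so neither escape hatch is available."""
    if tag in VOID_ELEMENTS:
        return f"<{tag}>"
    return f"<{tag}>{children}</{tag}>"

print(emit("br"))      # no end tag
print(emit("script"))  # explicit end tag, even when empty
```

(Real emitters additionally need the special raw-text escaping rules for `script` and `style`, which this sketch ignores.)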
With XHTML it is very easy to write a parser that will construct a tree and can reserialize it with no issues. I have no issue with consistent changes such as empty attributes and unquoted attribute values, but I think that implied element insertion, auto-closing, void elements, and non-replaceable character data were a mistake, because you need to maintain an up-to-date dataset of these custom rules or you get an incorrect result.
(Is still a thing, but W3C recommends against it)