However, if you have a variety of ways to get at your data by this key for this situation, or this key for that one, or if you have large amounts of redundant data that you'd like to be updated "all at once", such as what salesperson is assigned to the Northeast region for all customers, then you may find a normalized form is more efficient.
The big learning is YMMV: for years, relational was a religion. Snowflake and other designs showed that there was a middle ground... and nosql has shown that non-relational designs can work well also, esp in many online and large-data situations. The trick is to pick the right one for the expected workload your app will be facing, not to adhere to any religious dogma (3nf, nosql, or anything inbetween)
But for me the exciting part is when the fervor dies down a bit, and there is a clear understanding of "opposing" tools' strengths and weaknesses. That is when interesting tech and techniques are developed to handle both based on the strengths.
I think we are starting to get to that point with relational and nosql. Tools are starting to emerge on the relational DB side to allow more nosql-like structures within traditional DBMS systems. I think in the coming year or two we'll start seeing a lot more interesting tools to mix relational and non-relational datastores. Currently the common way of mixing is to use kv or document stores as a caching layer for relational data, or to have relational systems separate from nosql systems, with custom code to select from each. My hope tho is that there will be more uniform access methods developing, allowing the relational and non-relation data to have a common, well understood access API layer.
For example I have a project where the core data consists of users, certain user objects, and relations between users, between users and their or others objects, and between objects, basically shaped like a graph. Using neo4j + gremlin makes dealing with this portion of the app trivial - the graphdb literall solves problems for me. But another aspect of the app is relational data on objects related to user records, basically it is shaped as rows where n objects' rows need to be merged, filtered and then operated on (aka JOINed). SQL makes this set of operations trivial, in this case a DBMS literally solves problems for me. But, as it stands, I need to query one DB, get the object record pointers I need, then connect to a different DB, and query the relevant tables if I want to let the relevant DBs do the work. This involves different APIs, weird connection management, and sysadminning two different database environments.
Essentially I'm hoping someone smarter then me will come up with a way for me to:
SELECT MY_COLLATE(d1.a, d2.b) FROM
GREMLIN{find object o1}.data as d1 JOIN
GREMLIN{find object o2}.data as d2 ON
d1.col == d2.col
WHERE conditionsNormalized data is one of the important elements of relational database modeling, and almost all relational databases expect the data to be normalized, so they have optimizations in place to make doing those lookups fast. It will usually create some automatic special indices for the foreign keys or something.
The first thing to do is use a normalized data model and profile it. If it's too slow add some indices and profile it again. If it's still too slow, update the table to store redundant information and profile it again.
"Normalize till hell, denormalize till scale"
Or some thing like that :).
Every other aspect of software development using RDMSs will attempt to challenge decisions made for correctness' sake.
I probably live in my own little bubble, but only lately have I realized that NoSQL has two audiences: (1) People for whom normalization can't work because of their application's characteristics and the limitations of current hardware. (2) People who just don't understand basic relational concepts, or why they were invented in the first place.
It's kinda sad. I've consulted on projects where people implemented sharding before adding any indices to MySQL.
The thing about being in group (1) is that you can also recognize when the ground shifts beneath your feet. Artur Bergman is one of those guys.
The catch is that very few people end up in group 3 and still have the cost of running/administering the rdbms over the 'nosql' one actually matter.
So the group (0), which are mostly db developers of the client/server architecture; when they attack the web2.0 problems, they fail because they stick to dogmatic notions as if they are true. Though (2) people are ignorant of relational concepts, (0) people are stubborn, uneducatable people who end up creating all types of scalibility problems. They overuse the notions of normalization, but forget they attack the wrong problem with the wrong tools.
PS: I am not yet using NoSQL in production and have a solid past in Oracle/Db2/Ms SQL/Sybase, and now now doing startups in MySQL/Postgres and Mongo.
Having said this, I like working with MongoDB. I like schemaless design and flexibility.
For this subject, I usually send people to Bill Kent's Guide to the Five Normal Forms: http://www.bkent.net/Doc/simple5.htm
- The normalized schema assumes that the location of a tournament will never change. If the 2014 Australian Open were to be held in Sydney, then looking up the location of the 2012 Australian Open would thereafter yield "Sydney".
- Population has the same problem. It's not likely that the population of a city will remain constant over time. The original database would give you the population at the time the tournament was held. The normalized version always gives you the current population of the city.
A good example of this is US ZIP codes. In theory (I think), a ZIP uniquely identifies a city and state. But in practice no one stores it this way because the maintenance of that data (by all parties: some people don't know zips, or type them incorrectly) is more expensive than the savings from normalization.
Uh, relational databases are a well established field. The different normal forms, among other useful things, provide a guide to way to prevent specific kinds of logical errors that can otherwise appear during inserts, updates and deletes. There's a lot here.
DRY - "don't repeat yourself" - is simply a slogan that doesn't provide specific procedures for making things compact, how you do it or when its appropriate. What you're saying is a lot like "abstract algebra is just a special case of thinking logically" - sort of true but mostly deceptive if it lets you dismiss a lot of accumulated knowledge with "just think logically".
Incidentally, did you cover 4NF and 5NF in the course? IMO, you'll almost never need 6NF.
Formalisation of your normalisation! Applied Mathematics for Database Professionals is on the reading list, looking forward to it.
The problem is that both population and tournament city could be dependent on year.
Population (obviously) changes from year-to-year, so either his original data is incorrect or he's recording a city's latest population along with all historic tennis championship winners -- neither interpretation makes a lot of sense. Why not use something unlikely to change such as either elevation or country?
Likewise, tournament city is not a fixed value for a tournament. For example, the Australian Open, which he uses as one of his examples, has also been set in Sydney, Adelaide, Brisbane, Perth, Christchurch (NZ), and Hastings (NZ).
To resolve this, he would either have to introduce a two column primary key (tournament, year), or pick some simpler data. I suggest the latter.
Even with all of this criticism, I think it's one of the cleanest introductions of the normal forms for beginners.
I've replaced city and population with country and area to reduce confusion.
I clicked on your CV just because the link was available on the side, and I was shocked to find out that you're still in high school. There are professionals that would have taken 10x as long to explain it only half as clearly. For context, I've been in the business for ~25 years. Keep up the great work.