The data follows the relational model, but just updating one field after dumping it into PostgreSQL takes quite some time (doing an UPDATE on a join), and I'm not sure this is the most effective tool/way for this kind of work. The only queries that will be run are updates and inserts appending new data to existing tables (e.g. older files).
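For context, the update-on-join pattern I mean looks roughly like this. A minimal, self-contained sketch using SQLite purely so it runs standalone (table and column names `records`, `staging`, `id`, `price` are invented for illustration; in PostgreSQL the same update is usually written as `UPDATE records SET price = s.price FROM staging s WHERE records.id = s.id`):

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL; the pattern is the same.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, price REAL)")
cur.execute("CREATE TABLE staging (id INTEGER PRIMARY KEY, price REAL)")
cur.executemany("INSERT INTO records VALUES (?, ?)",
                [(1, 10.0), (2, 20.0), (3, 30.0)])
cur.executemany("INSERT INTO staging VALUES (?, ?)",
                [(1, 11.0), (3, 33.0)])

# Update-on-join via a correlated subquery (works on any SQLite version);
# only rows that have a match in the staging table are touched.
cur.execute("""
    UPDATE records
    SET price = (SELECT s.price FROM staging s WHERE s.id = records.id)
    WHERE id IN (SELECT id FROM staging)
""")
conn.commit()

rows = cur.execute("SELECT id, price FROM records ORDER BY id").fetchall()
print(rows)  # -> [(1, 11.0), (2, 20.0), (3, 33.0)]
```

On tables of this size the cost is usually dominated by rewriting rows and index maintenance, which is why a single one-field update can take so long.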
Do you have any suggestions on what to look into for a workload like this?
The workflow is pretty simple: mass-update existing records or append new records once in a while (this will be done in a controlled fashion, invoked by someone - no services) -> do some ETL work -> rebuild existing tables from the source after doing some join magic -> export the tables for delivery to customers.
Tables have around 10 to 100 million rows and 10 to 300 fields (I will probably normalize a bit). The data is relational, and the current size is ~400 GB, which I expect to grow by around 3% each time. There won't be many reads, as the tables will be exported for further processing, and the writes will happen in a controlled fashion as stated above.
I'm looking at MySQL and PostgreSQL - which one would you choose? Or something else?
Here is my ordered list:

6.0001 Introduction to Computer Science and Programming in Python
6.0002 Introduction to Computational Thinking and Data Science
6.S096 Introduction to C and C++
6.001 Structure and Interpretation of Computer Programs
6.005 Software Construction
6.042J Mathematics for Computer Science
6.006 Introduction to Algorithms (Fall 2011)
6.046J Design and Analysis of Algorithms
---
6.824 Distributed Computer Systems Engineering
6.828 Operating System Engineering
Obviously it is a bit light on mathematics, and I will probably add more courses (or not), so the list might change.
Any comments/remarks on this list?
Currently I know the basics of C, Python, Java, and SQL, but I feel like I'm missing some more formal education on these subjects.