1. The universe is composed of things.
2. We can use the computer to store information about those things (this is the data).
3. In order to gain useful insights about those things, we want to do operations on the data, i.e. to compute. Computation is done by algorithms.
4. Data structures are the bridge between the data about things and the algorithm. They hold the data such that the algorithm will have an easier time computing.
Basically everything you work with in programming is a value (the number one-hundred-seventy-five, for example: "175") or the address of a value, meaning its location in computer memory. If this post were the layout of data in some RAM, you might record the address of that value above as its count of characters from the beginning of the post, just as you'd number houses on a street. Add the concept of data "width"—how long the value is, in terms of how many characters represent it (3, in this case)—and you've got basically all there is in terms of primitive stuff that computer programs operate on.
Observe that the value stored at a location in memory, say, might itself be an address—location—of some other thing stored in memory.
The width+value concept can get you pretty far, in that you can store a bunch of stuff and find it again, given the address of the beginning and some convention that the first so-many bits of the value describe the width of the rest of the value, or some other means of knowing the size (width) of the value, as long as all its parts are stored right next to each other in memory, and in the correct order. That's called an array, in fact, which is a data structure! One problem with arrays is that if you want to make them longer, you might not have more memory available at the end of them—something else may be using that location already, and if you just overwrite it that'll likely break something. So you'll have to copy your entire array to a larger piece of empty memory to add on to the end of it.
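To make the "copy the whole array to grow it" cost concrete, here's a minimal sketch in Python (the function name and the use of a Python list to stand in for a block of memory are my own invention for illustration; real dynamic arrays also over-allocate spare capacity so they don't copy on every append):

```python
# A sketch of why appending to a full array costs a copy: the locations
# just past the array's end may already be in use, so we claim a larger
# empty block elsewhere and move everything over.
def append_by_copying(array, item):
    bigger = [None] * (len(array) + 1)   # a fresh, larger block of "memory"
    for i in range(len(array)):          # copy every existing item over...
        bigger[i] = array[i]
    bigger[len(array)] = item            # ...then add the new item at the end
    return bigger

nums = [1, 7, 5]
nums = append_by_copying(nums, 9)
print(nums)  # -> [1, 7, 5, 9]
```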
EXAMPLE!
Say we have some RAM large enough to store nine things, and we know that we have something stored at position 4. (Programmers like to number items starting from zero, but I'll count from one here, since zero-indexing isn't really important to the idea and would make it more confusing.) The RAM contains the stuff we're looking for, plus some other stuff that we don't care about right now:
[429317501]
We look at position 4, and know (by convention, or whatever) that the "3" we find tells us how many more locations to read past that to get our entire value, and extract "175" as our value, by proceeding to do just that. That's an array! One of the most basic data structures. Of course all this is binary under the hood, and those binary numbers can also represent letters or color values of a pixel in an image or whatever, but I'm using simple base-10 numbers to keep things easier to follow.
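The same lookup, sketched in Python. A string of digits stands in for the nine RAM locations, and the positions are 1-based to match the walkthrough above (the function name is made up for illustration):

```python
# A toy model of the RAM from the example: nine one-digit "locations",
# addressed starting at 1 to match the text.
ram = "429317501"

def read_value(ram, position):
    """Read a length-prefixed value: the digit at `position` says how
    many of the following locations make up the value."""
    width = int(ram[position - 1])       # the "3" stored at position 4
    return ram[position:position + width]  # the next `width` locations

print(read_value(ram, 4))  # -> "175"
```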
You can use these basic pieces to build up more complex data structures than that, of course. You can have a series of places in memory, not necessarily next to each other, each containing a value and then the address of another value+address pair. Given the address of the first of these, one could write a program to read each in order, following the addresses to hop from one to the next until it finds one without an address provided. Ta-da, it's (one kind of) a linked list! That's another kind of data structure. Now if you want to add to the end of your list, you just pick an empty spot in memory, add its address to the last piece of the existing list (first "walking" the list to find it by starting at the beginning and reading each piece in turn, if you don't already know where the last part is located), then fill it in with the value you want. This takes up more space than an array, though, and may take a little longer to "read" (get the value[s] from). It's very common for many types of data structure to be able to represent the same thing, just with different trade-offs in terms of space used, or time to locate a given part of the structure, or how much space or time it takes to modify it (recall, having to copy an entire array in order to add on to it) and so on.
EXAMPLE!
[950818972]
If we know (somehow) that our linked list starts at position (address) 5, and know that we can expect a value then an address at that location, next to one another, we see "18", so our value starts with 1 and we should look at address 8 next, where we find "72"—value is 7, and now look at address 2, finding "50". Zero in our little make-believe addressing system here conventionally means there are no further addresses to look up (there is no address 0) so, without knowing how long the list would be when we started, we now know we're done, that there were three values stored in the list, and that they are, in order, "1", "7", and "5".
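Here's that walk in Python, again with a digit string standing in for the nine RAM locations and 1-based addresses as in the text (the function name is invented for the sketch):

```python
# The same toy RAM, this time holding a linked list: each node is a
# (value, next-address) pair of adjacent locations, and a next-address
# of 0 conventionally means "end of list".
ram = "950818972"

def walk_list(ram, start):
    """Follow the chain of addresses, collecting one value per node."""
    values = []
    address = start
    while address != 0:
        values.append(ram[address - 1])  # the value digit at this node
        address = int(ram[address])      # the next-address digit beside it
    return values

print(walk_list(ram, 5))  # -> ['1', '7', '5']
```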
If you've followed this far, you may be able to see how one could make "trees" (pairs of addresses instead of just one, suggesting a "left" and "right" path) and other things from these fundamental parts, and may be able to think of reasons why one might do this. A linked-list where all the "values" are addresses to parts of some other structure (with the "address" portion of the linked list item used as normal)? That's a sort of index, right? A list where the "value" at each position is the address of the beginning of another list? We call that a multidimensional list (or multidimensional array, if it's laid out as an array) and it's one way a person might represent a grid of, for example, colors in an image (this is basically what a bitmap is). And so on.
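The multidimensional-list idea maps directly onto Python, where the outer list holds references (addresses, in effect) to the row lists. A toy "bitmap" under that assumption:

```python
# A grid built from the "list whose values are addresses of other lists"
# idea: each entry in `image` refers to a row list stored elsewhere.
image = [
    ["red",  "red",  "blue"],
    ["blue", "red",  "blue"],
    ["blue", "blue", "red"],
]

# image[1] follows the second reference to get a row; [2] then indexes
# into that row, just like following an address and then an offset.
print(image[1][2])  # -> "blue"
```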
Disk storage uses the same fundamental building blocks. FAT, as in FAT32 or FAT16, old DOS and Windows file systems? File Allocation Table is what FAT stands for. There's a table (kinda like conjoined lists, much as above), starting at a conventional spot (address) on a FAT-formatted disk partition, that describes where all the files are on the rest of the disk, along with some other info about the filesystem. It's basically exactly what it sounds like, and uses precisely the same concepts as above, just applied to locations of files on a disk rather than locations of values in RAM.
One last insight: program code—the instructions for the CPU—is also stored in memory, and that works basically the same way as the stuff above. This unified, undifferentiated storage system is called Von Neumann architecture, and it's what pretty much all computers you're likely to encounter use. Point being, those addresses pointing to things stored in memory? They can also point to places where code is stored, which, once located, one might direct the CPU to execute. A little thought on that, combining it with the above notions, should suggest some cool things this would enable.
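One small taste of what that enables, sketched in Python: a function name is, in effect, a reference to code stored in memory, so you can put code into a data structure and look up which piece to run, the same way you'd look up a value (the function and table names here are made up for illustration):

```python
# Because code lives in memory too, an "address" can refer to code.
# Storing those references in a table and calling whichever one we look
# up is the same trick as following an address to a value.
def double(x):
    return x * 2

def square(x):
    return x * x

operations = {"double": double, "square": square}  # a table of code references
print(operations["square"](5))  # -> 25
```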
And that's about it. That's data structures, and indeed much of programming. It's all values, addresses, widths, and more than a little bit of convention.
Any other clear descriptions of CS concepts, especially for data science, to share? Links to them?