Also, how do CNNs draw a box around the target in the image?
You can, of course, tell the network to output whatever you want: all of the guesses, best guess, top five guesses, all guesses over a threshold, etc.
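As a rough sketch of what "best guess" vs. "all guesses over a threshold" might look like on a classifier's raw outputs (the class names and scores here are made up for illustration):

```python
import numpy as np

def softmax(logits):
    # Stable softmax: subtract the max before exponentiating
    z = np.exp(logits - np.max(logits))
    return z / z.sum()

# Hypothetical raw scores for four classes
labels = ["dog", "cat", "boat", "bird"]
probs = softmax(np.array([2.0, 1.0, 0.1, -1.0]))

# Best guess: the single most probable class
best = labels[int(np.argmax(probs))]

# All guesses over a threshold, sorted most-probable first
threshold = 0.10
over = sorted(
    ((labels[i], float(p)) for i, p in enumerate(probs) if p > threshold),
    key=lambda pair: -pair[1],
)
```

The network itself just emits the scores; which of these post-processing choices you make is entirely up to the system around it.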
Note, this is a gross oversimplification, but it gets the general concept across.
In the past, some people did this by just sliding a window across the image and running the same classifier you'd use for the first problem at every position. This is inefficient, and handling multiple window sizes makes it worse. The better solution is to use an "object detection" network; look into SSD or YOLO for examples of this.
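To see why the sliding-window approach gets expensive, here is a minimal sketch (the `classify` function is a stand-in for a full CNN classifier, which in the real sliding-window scheme would do a complete forward pass per window position and per window size):

```python
import numpy as np

def sliding_window_detect(image, classify, win=32, stride=16):
    # Run the classifier at every window position; each call is a
    # full forward pass in the real sliding-window approach.
    boxes = []
    h, w = image.shape
    for y in range(0, h - win + 1, stride):
        for x in range(0, w - win + 1, stride):
            score = classify(image[y:y + win, x:x + win])
            if score > 0.5:
                boxes.append((x, y, win, win, score))
    return boxes
```

Detection networks like SSD and YOLO instead predict all boxes in a single forward pass, which is why they replaced this scheme.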
Effectively. That's what a DNN is: an ANN with multiple layers, where the output layer gives a "highest probability" for each class, and the client/system sets the probability threshold for what is returned.
Is this just the article's over-simplification or are these values really just randomly selected?
The values in the filter matrices, and the weights and biases of the fully connected layers, really are random. They are often initialized with Gaussian random values; sometimes they are just initialized as all 1's or 0's. Again, there's no single "right" answer (there is research out there that recommends one initialization approach over another). These are the values that then get trained using gradient descent.
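A minimal sketch of the kind of initialization described above (the shapes and the 0.01 scale are arbitrary choices for illustration; schemes like He or Xavier initialization instead scale the standard deviation by the layer's fan-in):

```python
import numpy as np

rng = np.random.default_rng(0)

# A 3x3 convolution filter initialized with Gaussian random values
filt = rng.normal(loc=0.0, scale=0.01, size=(3, 3))

# A fully connected layer: Gaussian weights (He-style scaling here),
# biases simply initialized to zero
fan_in, fan_out = 128, 10
W = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))
b = np.zeros(fan_out)
```

Gradient descent then moves all of these values away from their random starting points during training.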
To this point, the article is certainly NOT intuitive if you don't already understand image convolution. The explanation is also very long and rambling. While I understand the author has made an effort, I don't think the article really presents the subject matter in a new way: I can learn all of this elsewhere. This is a common problem when people write about complex subject matter without fully understanding the knowledge gap between teacher and audience.
If I were the author, I might try to read up on technical communication and spend some time figuring out how to correctly simplify something. As it stands, this article uses the typical strategy of information hiding to simplify the subject matter. The problem is that information hiding doesn't work very well unless it is expertly done. I do like the animation, but again, it only serves to show how image convolution works, and doesn't actually teach us anything about a CNN.
I would suggest the author break the document into three separate sections, the first being very simple (maybe start with the part that says 'images are just matrices') and then add more details in each section. The final section would have a lot of detail. That way you counteract the information blindness that occurs from simplification by providing the information later.
Otherwise, this article is really more of a data dump than an intuitive explanation, and since it doesn't really teach us anything we can't learn elsewhere, I don't see what it contributes.
A cleaner explanation, expertly prepared, could really elevate the effort that went into this.
I think it is a good article/blog post (thanks, dude, whoever you are that wrote it).
You on the other hand didn't give any better alternatives in your "rant".
I stand by my comments.
> You on the other hand didn't give any better alternatives in your "rant".
I don't have to provide better alternatives. Note that my response did provide suggestions on how to improve the article.
you on the other hand haven't said literally anything except vague criticism. look, i'll show you how it's done:
>The explanation is also very long and rambling. While I understand the author has made an effort, I don't think the article really presents the subject matter in a new way: I can learn all of this elsewhere. This is a common problem when people write about complex subject matter without fully understanding the knowledge gap between teacher and audience.
these two sentences have nothing to do with each other: that the explanation isn't novel has nothing to do with elided gaps between expositors and readers (and usually the problem there is that the exposition is too complex, not too simple, as you've got it backwards).
>If I were the author, I might try to read up on technical communication and spend some time figuring out how to correctly simplify something.
vague. read from where? which chapters? simplify which parts?
>I do like the animation, but again, it only serves to show how image convolution works, and doesn't actually teach us anything about a CNN.
it's like you think that one animation should explain the entire CNN. did you actually read the post? that image explains convolutions and is the absolute standard explanation for convolving with a filter/kernel.
>I would suggest the author break the document into three separate sections
better in that at least it's concrete advice. i suggest you include more points like this.
>images are just matrices
are you suggesting the author goes into CCDs? ADCs? now that would be a rambling post.
>That way you counteract the information blindness that occurs from simplification by providing the information later.
that's terrible advice. detail should be evenly distributed through the article. look at any journal article: except for the appendices all of the meat is in the body not in the conclusion.
>Otherwise, this article is really more of a data dump than an intuitive explanation,
a data dump would be just code. this is in fact an intuitive explanation that uses the classification of dogs/cats/boats/birds as the framework, so there's a structure, terms are defined, there's context (lenet etc.), and there are references.
>and since it doesn't really teach us anything we can't learn elsewhere, I don't see what it contributes.
blog articles don't need to be novel.
https://stats.stackexchange.com/questions/154798/difference-...
The kernel of a filter would be its impulse response, which is what you convolve by to get the filter response. That's where the sloppy terminology comes from. A kernel, though, does not need to be a filter.
A kernel is a function whose product maps a point in one domain onto another domain. For example, the Fourier transform has a kernel of e^jwt. The integral (or sum, if discrete) of these products over the function is the transform, because it maps the entire function into its new space. A filter is a function typically defined as having product behavior in the frequency (transformed) domain, which is equivalent to convolution in the time (original) domain. A window is a function that has product behavior in the time (original) domain, and thus convolution behavior in the frequency (transformed) domain.
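The product-in-one-domain / convolution-in-the-other duality is easy to check numerically; zero-padding both signals to the full linear-convolution length makes the FFT product match the time-domain convolution exactly:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])   # signal
h = np.array([0.25, 0.5, 0.25])      # filter impulse response (its kernel)

# Pad to the full linear-convolution length so the circular
# convolution implied by the FFT matches the linear one
n = len(x) + len(h) - 1

direct = np.convolve(x, h)                                        # time domain
via_fft = np.fft.ifft(np.fft.fft(x, n) * np.fft.fft(h, n)).real  # frequency domain
```

`direct` and `via_fft` agree to floating-point precision.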
Particularly in linear algebra (matrix math), if something is a kernel function, there are certain mathematical implications.
Another confusing bit here is that the convolution they are performing to project the original function (the larger image matrix) onto the smaller one isn't a proper convolution: there is a hidden window function in the way the operation is performed that restricts the output to only the fully overlapped area of an otherwise linear 2D convolution. In image processing this is typically called a cropped (or "valid") convolution.
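The cropping is easy to see from the output shapes. A sketch of the restricted operation in plain NumPy (strictly this is cross-correlation, i.e. the kernel is not flipped, which is what CNN layers compute anyway):

```python
import numpy as np

def conv2d_valid(image, kernel):
    # Slide the kernel only over fully overlapped positions
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

image = np.arange(25.0).reshape(5, 5)
kernel = np.ones((3, 3)) / 9.0   # 3x3 mean filter

out = conv2d_valid(image, kernel)
# A full linear 2D convolution of 5x5 with 3x3 would be 7x7;
# restricting to the overlap leaves (5-3+1) x (5-3+1) = 3x3.
```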
i hit up google scholar occasionally looking for references, but literally everything seems to be applying them to 2D images.
What's even the difference between 1D inputs and 2D inputs? It's all a bunch of numbers anyway. I don't think it really matters if the pixels are arranged (as you see them) in a neat rectangle vs in a straight line. You could take a 2D matrix and enumerate it as a linear string of numbers and it would still be the same matrix, just represented differently. I don't think the CNN cares either way.
I would go as far as saying that the 1D-ness of the input is just "in your head".
If you arbitrarily represent a signal as a 2D matrix, then abrupt changes in the gradient on the vertical axis are meaningless. But the same is not true in an image, which is naturally represented as a 2D matrix. Here, a sudden change on the vertical axis usually corresponds to an edge in the image.
If you represent an image as a 1D array, you throw away spatial information. So I'm not sure about the 1D-ness just being in one's head.
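A quick way to see the lost adjacency: after flattening, horizontal neighbors are still next to each other, but vertical neighbors end up a full row-width apart:

```python
import numpy as np

H, W = 4, 6
img = np.arange(H * W).reshape(H, W)
flat = img.flatten()           # row-major: index of (r, c) is r*W + c

r, c = 1, 2
# Horizontal neighbors stay 1 apart in the flat view...
assert (r * W + (c + 1)) - (r * W + c) == 1
# ...but vertical neighbors are now W apart, so nothing in the 1D
# representation itself says they were adjacent in the image.
assert ((r + 1) * W + c) - (r * W + c) == W
```

A CNN's small 2D filters rely on exactly that vertical adjacency, which is why the layout is not arbitrary.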
edit: I'm not sure if you're asking specifically for examples of CNNs applied to linear image sensor data, or if you're asking whether CNNs have been applied to any 1D input data.
Of course, Andrej Karpathy's Stanford lecture on the subject is as well.