"Convolution with a kernel K" describes a system whose impulse response is K. In discrete time, suppose you have K=[1,2] and convolve [0,1,2,0] with it: you wind up with [0,1,4,4,0].
Correlation with a kernel K is convolution with K time-reversed (i.e. [2,1]): you'd get [0,2,5,2,0]. Note that 5: right there, the input signal "lines up just right" with the kernel (2×2 + 1×1 = 5). That's why it's called correlation: its output is big when the input looks like the kernel.
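For instance (a minimal numpy check; note numpy's `correlate` is the sliding dot product without the kernel flip):

```python
import numpy as np

x = np.array([0, 1, 2, 0])
K = np.array([1, 2])

# Convolution flips the kernel before the sliding dot product.
conv = np.convolve(x, K)                  # full-length output
# Correlation slides the kernel as-is.
corr = np.correlate(x, K, mode="full")

print(conv)   # [0 1 4 4 0]
print(corr)   # [0 2 5 2 0]
```

The correlation output peaks (at 5) exactly where the input patch [1,2] matches the kernel [1,2].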
It's a binary operator on functions that yields a third function. It has a lot of useful properties and equivalences; for instance, it can be computed as the inverse Fourier transform of the product of the two functions' Fourier transforms (although that's a roundabout way to define it).
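That's the convolution theorem: convolution in time is elementwise multiplication in frequency. A quick numpy sketch, zero-padding to the full output length so the circular FFT convolution matches the linear one:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 0.0])
K = np.array([1.0, 2.0])

n = len(x) + len(K) - 1                 # full linear-convolution length
via_fft = np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(K, n), n)

print(np.allclose(via_fft, np.convolve(x, K)))   # True
```

For long signals this roundabout route is actually the fast one: FFT-based convolution is O(n log n) versus O(n·m) for the direct sliding sum.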
You're actually introduced to convolution in middle school when you're taught how to multiply binomials to build a polynomial (at my middle school they called it "FOIL"): the coefficients of the product are the convolution of the two coefficient sequences.
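Concretely, multiplying two polynomials convolves their coefficient sequences, which is exactly what `np.convolve` computes:

```python
import numpy as np

# (x + 1)(x + 2) = x^2 + 3x + 2
a = [1, 1]                  # coefficients of x + 1 (highest degree first)
b = [1, 2]                  # coefficients of x + 2
product = np.convolve(a, b)
print(product)              # [1 3 2] -> x^2 + 3x + 2
```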
That doesn't help one understand what it is at all. Convolution in DL is simply a set of dot products between patches of the input and a bunch of filters. Each resulting dot product is a measure of similarity between a patch and a filter. That's all there is to it.
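A minimal sketch of that view (a hypothetical helper, 1-D for brevity): extract every patch, then take its dot product with each filter. Note that DL frameworks actually compute correlation (no kernel flip) under the name "convolution":

```python
import numpy as np

def conv1d_valid(signal, filters):
    # One row per patch of the input, one column per filter:
    # each entry is a patch-filter dot product (a similarity score).
    k = filters.shape[1]
    patches = np.stack([signal[i:i + k] for i in range(len(signal) - k + 1)])
    return patches @ filters.T

signal = np.array([0.0, 1.0, 2.0, 0.0])
filters = np.array([[2.0, 1.0],
                    [1.0, 2.0]])
out = conv1d_valid(signal, filters)   # shape (3 patches, 2 filters)
# The second filter [1, 2] scores highest (5.0) on the patch [1, 2],
# the patch that looks most like it.
```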
The idea behind convolution in deep learning is that, if a particular pattern of pixels is meaningful, then it is probably also meaningful if you shift the whole thing in some direction. So you can force some layers of the network to share weights across positions, making their response the same under translation, and they'll be faster to pick up some sorts of patterns.
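That translation property (equivariance) is easy to check numerically: shift the input, and the convolution's response shifts by the same amount. A small sketch:

```python
import numpy as np

pattern = np.array([0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0])
shifted = np.roll(pattern, 2)        # same pattern, two steps to the right
kernel = np.array([1.0, 2.0, 1.0])

a = np.convolve(pattern, kernel)
b = np.convolve(shifted, kernel)

# The response to the shifted pattern is the shifted response.
print(np.allclose(b[2:], a[:-2]))    # True
```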
It's faster because it reduces the dimensionality of the inputs down to something manageable (hundreds or low thousands). You can replace convolutions with most other types of dimensionality reduction (including other types of layers), and outside of image tasks you'll get very similar or even better performance.
(Even before that it was used in signal processing.)