Long answer: Colour is a deep rabbit hole of a topic, but Captain Disillusion has a good summary of it (https://youtu.be/FTKP0Y9MVus) and Technology Connections has a discussion of it (https://youtu.be/uYbdx4I7STg).
We can observe that when people say they perceive "yellow", the spectral intensity graph shows certain patterns. This is the physical phenomenon that produces the sensation of "yellow."
Humans are not good at judging reality introspectively. We experience everything heavily filtered through a variety of lenses. Our feeling that color is "concrete" is not predictive or explanatory... we cannot build mechanisms based on it. The idea that our perception of color is a result of interactions between certain wavelengths of light and certain photosensitive tissues in our eyes is both predictive and explanatory. We can design systems that have similar types of wavelength intensity sensitivity components and measure the physical response of those systems. That's how cameras work.
We can reverse the process and take those measured wavelength intensities and re-emit them from variable-wavelength light sources and produce images. That's how you're reading what I've typed right now - the images produced by the display you're looking at were generated in this fashion.
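To make the camera half of that concrete, here's a toy sketch of how a single sensor channel responds to light. The curves below are invented purely for illustration (real sensors use measured sensitivity curves), but the principle is the same: a channel's response is the incoming spectral power weighted by the channel's sensitivity, summed over wavelengths.

    # Toy model of one camera channel (Python). All curves are made up.
    wavelengths = range(400, 701, 10)  # visible band, in nm

    def bell_curve(center, width):
        return {w: 2.718281828 ** (-((w - center) / width) ** 2)
                for w in wavelengths}

    red_sensitivity = bell_curve(600, 40)  # fake "red" channel sensitivity
    spectrum = bell_curve(580, 15)         # fake yellowish light source

    # Response = spectral power x sensitivity, summed over the band.
    response = sum(spectrum[w] * red_sensitivity[w] for w in wavelengths)
    print(response)  # a single number: what this channel "measured"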
I'm not sure what you mean by the “wavelength theory” of color perception.
Of course we can. We can capture the signal sent through the optic nerve and then reproduce it as a stimulus which will make the brain "see" the color yellow.
Besides, humans are capable of distinguishing literally millions of colors, of which just a tiny fraction can be attributed to measuring particular wavelengths (or, more accurately, particular energies of the incident photons). In that way the eye is different from the ear (which performs a kind of Fourier analysis of the sound wave).
The human eye has two basic photoreceptor cell types, rod cells and cone cells, and there are three subtypes of cones: short, medium, and long. The three subtypes of cone cells sense blue, green, and red light more or less directly. The response curves of medium and long cone cells, which detect green and red light, almost entirely overlap. [0] It is more accurate to say that long cone cells detect yellow light than it is to say they detect red light. There is a brain system which measures the difference in response between the long (red) and medium (green) cells and uses the difference to say "aha! this must be red!"
The ratios of short (blue), medium (green), and long (red, well, yellow) cone cells are roughly 2%, 2/3, and 1/3. The cells in your eye which detect blue light are more or less a rounding error. The cells which detect green light are roughly twice as numerous as the cells which detect red (well, yellow) light. If you see a thing and think, "man, that's awfully blue," it's not because your eyes are telling you "hey, this thing is awfully blue." The "blue" signal is barely noticeable in the overall signal, but your brain jacks up its responsiveness to the minuscule blue signal.
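Here's a toy sketch of those opponent signals. The formulas and numbers are a simplification for illustration, not a physiological model:

    # Toy opponent-process encoding (Python). The brain works with
    # differences between cone responses, not the raw responses.
    def opponent_channels(s, m, l):
        red_green = l - m          # positive ~ "red", negative ~ "green"
        blue_yellow = s - (l + m)  # positive ~ "blue", negative ~ "yellow"
        luminance = l + m          # brightness is mostly the M+L sum
        return red_green, blue_yellow, luminance

    # L and M firing almost equally reads as yellow, not red, even though
    # the L ("red") cone is strongly excited. And note how a tiny S value
    # barely moves anything unless the brain amplifies it:
    print(opponent_channels(s=0.05, m=0.90, l=0.95))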
One of the side effects of the completely fucked ratios between the three types of cones is that your perception of the overall brightness of a thing is mostly down to how green it is. This shows up in lots of standards: NTSC, JPEG, the whole nine yards. If you've ever implemented a conversion between RGB and any luminosity-chroma colorspace (YUV, YCbCr, YIQ, NTSC, any of them) there's a moment where you'll go "wait a minute, this doesn't make any fucking sense". You look at the numbers and the luminosity channel is just... green, and you know that the other two chroma channels are quartered in resolution. And you'll think that makes no sense. But that's how it works.
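If you want to see the green dominance in actual numbers, here's the BT.601 RGB-to-YCbCr conversion (one common standard; BT.709 shifts the coefficients a bit, but green still dominates the luma channel):

    # RGB -> YCbCr, BT.601 full-range coefficients.
    # The luma (Y) channel is mostly green.
    def rgb_to_ycbcr(r, g, b):
        y = 0.299 * r + 0.587 * g + 0.114 * b  # ~59% of luma is green
        cb = -0.169 * r - 0.331 * g + 0.500 * b + 128
        cr = 0.500 * r - 0.419 * g - 0.081 * b + 128
        return y, cb, cr

    print(rgb_to_ycbcr(0, 255, 0)[0])  # pure green: luma ~150
    print(rgb_to_ycbcr(0, 0, 255)[0])  # pure blue:  luma ~29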
Then you'll remember that color sensors have their pixels arranged in groups of four, with two green, one red, and one blue. There must be some green conspiracy.
And there is. It's your brain. It's your eyeballs, with 2/3 of their cone cells being green-sensitive ones.
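That group-of-four arrangement is the classic RGGB Bayer pattern; here's a sketch of the layout (most sensors use this or a close variant):

    import numpy as np

    # Each 2x2 block of sensor pixels: two green sites, one red, one blue,
    # mirroring the eye's green-heavy cone distribution.
    def bayer_pattern(height, width):
        pattern = np.empty((height, width), dtype="<U1")
        pattern[0::2, 0::2] = "R"
        pattern[0::2, 1::2] = "G"
        pattern[1::2, 0::2] = "G"
        pattern[1::2, 1::2] = "B"
        return pattern

    print(bayer_pattern(4, 4))
    # [['R' 'G' 'R' 'G']
    #  ['G' 'B' 'G' 'B']
    #  ['R' 'G' 'R' 'G']
    #  ['G' 'B' 'G' 'B']]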
Those are your cone cells. Rod cells are entirely different. It's easy to say, well, cone cells see color, rod cells see black and white, but it's more complicated than that. Rod cells are excellent in low light conditions, cone cells not so much. Cone cells see motion very well, rod cells not so much. Cone cells can discern fine detail, rod cells cannot. Rods and cones are not evenly distributed across the retina either; cone cells are densely packed in the center, rod cells are more common in peripheral vision.
Look at a colorful thing directly; take note of how colorful it is. Now look away from it, so it's only in your peripheral vision; take note of how colorful it is. Does it seem just as colorful? The signal from your eye isn't. That's your brain fucking with you. Your brain knows the thing is in your peripheral vision and all the colors are muted out there, so your brain exaggerates the colorfulness. Cone cells are 30 times as dense in the center of your vision as they are just outside the center. [1] That's why you can read a word you're looking at directly, but it's very difficult to read one anywhere else in your field of view.
The reality is that your retinas give a fucking mess of bullshit to your brain, and the brain is the most incredible image processing system conceivable. It takes bullshit that makes no damn sense and -- holy shit I forgot to talk about blind spots.
Ok, so your rods and cones have a light sensitive thing, with a wire in the back, and all the wires get bundled up in the optic nerve that goes to the brain. Here's the thing: they're fucking plugged in backwards. The wires go forward, and are bundled up between your retinas and the stuff you're looking at. The spot where the big fat optic nerve burrows back through your retina is therefore a chunk of your vision where you can't see anything. Your brain just... invents stuff to fill in the hole.
Other weird stuff. If it's bright, the rods and cones send no signal; if it's dark, they send a strong signal. It's inverted. There's apparently a very good reason for this but I don't remember what it is. Also, the rods continuously produce a light-sensitive substance (rhodopsin) that amplifies their light sensitivity but is destroyed in the process, and it takes a long time to build up a reserve. This is why it takes time to "build up" your dark vision, and why it's so easily destroyed by lighting a cigarette. The physiology of "ow, it's bright" as opposed to "it's bright" isn't just in your retinas, it's also in your eyelids and your iris, but more importantly, it's shared between your two eyes. This is why closing one eye makes it less painful when you go from a dark place to a bright place.
The point is, the study of human vision is not the study of the human eye. The study of human vision is the study of the human brain.
Much of what we do with color spaces and image compression is dictated by our stupid smart eyeballs and our stupid smart brains. Video codecs compress with 4:2:0 chroma subsampling (sketched below) because the brain's gonna decompress that shit better than a computer can anyway. Cameras have twice as many green-sensitive pixels as blue or red pixels because the eye's resolution is much sharper in green than in other colors. More advanced image and video compression schemes try even harder to account for human eye-brain physiology.
[0] https://upload.wikimedia.org/wikipedia/commons/0/04/Cone-fun...
[1] https://upload.wikimedia.org/wikipedia/commons/3/3c/Human_ph...
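That 4:2:0 chroma subsampling mentioned above is nothing fancy, by the way. Here's a minimal sketch, assuming even image dimensions and plain 2x2 averaging (which is one of several ways encoders do it):

    import numpy as np

    # Keep luma at full resolution; average each 2x2 block of a chroma
    # plane down to one sample. The decoder (and your brain) fills the
    # detail back in.
    def subsample_420(chroma):
        h, w = chroma.shape
        return chroma.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

    cb = np.arange(16, dtype=float).reshape(4, 4)
    print(subsample_420(cb).shape)  # (2, 2): a quarter of the samples survive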
The reason is to prevent light fatigue in the eyes. The ears and nose fatigue quickly when exposed to the same stimulus for a long time. With the inverted arrangement in the eyes, you get inhibition from constant stimulation rather than inhibition from fatigue.
After you get done exploring how we perceive the colors associated with different wavelengths of light, and how nobody really knows whether these perceptions are somehow common to all of us or unique to each of us, that sentence should bring you both a chuckle and some wonder about perception.
I am inclined to believe it is, but we do not really know.