Not sure how much that matters. I'm not an AI expert, but I took some intro courses where we had to train a classifier to recognize digits. Basically, we fed each pixel of the image's 2D grid into its own input of the network, essentially flattening it in a similar fashion.
It worked just fine, and that was a tiny network.
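For anyone who wants to see what that flattening looks like, here is a rough sketch. It assumes a 28x28 grayscale image and a single fully connected layer with random, untrained weights; the sizes and the helper names `flatten` and `dense_layer` are made up for illustration and are not taken from the course.

```python
import random

def flatten(image):
    """Turn a 2D grid (a list of rows) into one flat list, row by row."""
    return [pixel for row in image for pixel in row]

def dense_layer(inputs, weights, biases):
    """One fully connected layer: every output is a weighted sum of all inputs."""
    return [
        sum(w * x for w, x in zip(weight_row, inputs)) + b
        for weight_row, b in zip(weights, biases)
    ]

random.seed(0)
HEIGHT, WIDTH, NUM_CLASSES = 28, 28, 10  # assumed sizes, for illustration only

# A stand-in "image" with random pixel intensities in [0, 1).
image = [[random.random() for _ in range(WIDTH)] for _ in range(HEIGHT)]

# Each pixel becomes one network input; row r, column c lands at index r * WIDTH + c.
inputs = flatten(image)

# Untrained weights: one row of HEIGHT * WIDTH weights per output class.
weights = [
    [random.uniform(-0.1, 0.1) for _ in range(HEIGHT * WIDTH)]
    for _ in range(NUM_CLASSES)
]
biases = [0.0] * NUM_CLASSES

scores = dense_layer(inputs, weights, biases)  # one raw score per digit class
```

Note that a layer like this treats the input as an unordered bag of 784 numbers: nothing in it knows that indices 0 and 28 are vertical neighbors in the original image.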
The classifier was likely a convolutional network, so the assumption that the image is a 2D grid was baked into the architecture itself. It didn't have to be represented via the shape of the input for the network to use it.
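To make the "baked into the architecture" point concrete, here is a small sketch of a 2D convolution that takes a flat list as input. The function name `conv2d_on_flat` and the toy 4x4 image are hypothetical, and, like most deep learning libraries, it actually computes a cross-correlation (the kernel is not flipped).

```python
def conv2d_on_flat(flat, height, width, kernel):
    """Apply a square 2D kernel (no padding, stride 1) to a flattened image.

    The grid is recovered through the index arithmetic row * width + col,
    so the 2D assumption lives in this function, not in the input's shape.
    """
    k = len(kernel)
    output = []
    for r in range(height - k + 1):
        out_row = []
        for c in range(width - k + 1):
            total = 0
            for dr in range(k):
                for dc in range(k):
                    total += kernel[dr][dc] * flat[(r + dr) * width + (c + dc)]
            out_row.append(total)
        output.append(out_row)
    return output

# A 4x4 image stored flat: left half dark (0), right half bright (1).
flat = [0, 0, 1, 1,
        0, 0, 1, 1,
        0, 0, 1, 1,
        0, 0, 1, 1]

# Right pixel minus left pixel, summed over two rows: responds to vertical edges.
kernel = [[-1, 1],
          [-1, 1]]

response = conv2d_on_flat(flat, 4, 4, kernel)
# response == [[0, 2, 0], [0, 2, 0], [0, 2, 0]]
```

The input here is just as flat as in the digit classifier example, but the layer only ever combines pixels that are neighbors in the grid, which is the structural assumption a fully connected layer does not have.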