I work on the address-formatting project, one small piece of the many used here. We currently have formatting rules for 93% of the world's 249 territories (as defined by ISO 3166-1 alpha-2 codes), but we need help to finish things out, especially from people with local knowledge and native speakers. Even for the countries we've "finished", more tests are always useful.
Here's the repo if you'd like to get involved: https://github.com/OpenCageData/address-formatting
Here's a post I wrote a week ago on the regions we need help with, though since then we've started making good progress on Arabic-speaking countries. http://blog.opencagedata.com/post/138991962708/an-update-on-...
Feel free to ping me if you'd like to get involved. Thanks.
I'm pretty excited that this is the first time someone has been able to release into the open an address analysis engine like the one that underpins Google's geocoder. You'd be surprised just how hand-tuned most of the other address parsing engines are (regex and case statements all the way down). This feels like a huge leap forward.
Feel free to help out on address-formatting. Where we really need help is East Asian countries with double-byte character scripts, specifically CN, HK, JP, KP, KR, MO, and TW. Those countries obviously represent a significant chunk of the world's population, so it would be great if we could get help on them from folks with local knowledge.
It uses the Google dataset, which is in the public domain. Might make sense to compare formats?
Btw, I love how your worldwide.yaml handles both the deduplication and the redirection for subterritories (such as Vatican City).
And thanks @bojanz for asking Google about their data's license! :)
Haven't looked in detail at your addressing PHP module or even Libpostal, but I feel like there should be some way to deduplicate efforts and converge all the datasets, both for testing and for i18n/l10n.
In the meantime, OpenCageData's address-formatting language-neutral YAML structure seems quite nice.
The big challenge, I think, is the conflict between two use cases: the "official" postal format of a country, and representing an address in a way that makes sense to users, especially when you only have limited data available (for example when using a data source like OpenStreetMap, where you are at the whim of what the mapper decided to add). Our project isn't about forming perfect postal addresses for things like printing labels; it's about taking the real-world data in OSM and making it look reasonable. As an example, one of the next things I want to add is basic rules about postal codes, so we can catch the garbage that comes in when mappers mistakenly put the town name in the postcode tag and such.
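To make the postal code idea concrete, here is a rough Python sketch of what such a sanity check could look like. It is not how address-formatting implements (or plans to implement) this. The function names, the `POSTCODE_PATTERNS` table, the `postcode` and `country_code` component keys, and the "must contain a digit" fallback are all assumptions for illustration.

```python
import re

# Hypothetical per-country rules, for illustration only (not taken from the project).
POSTCODE_PATTERNS = {
    "DE": re.compile(r"\d{5}"),            # Germany: five digits
    "US": re.compile(r"\d{5}(-\d{4})?"),   # United States: ZIP or ZIP+4
    "NL": re.compile(r"\d{4} ?[A-Z]{2}"),  # Netherlands: four digits, two letters
}

MAX_POSTCODE_LENGTH = 20  # assumed upper bound for the generic fallback


def looks_like_postcode(value, country_code=None):
    """Return True if value is plausibly a postal code."""
    value = value.strip()
    pattern = POSTCODE_PATTERNS.get((country_code or "").upper())
    if pattern is not None:
        # A country-specific rule exists: the whole value must match it.
        return pattern.fullmatch(value.upper()) is not None
    # Crude fallback (an assumption): short and contains at least one digit,
    # which is enough to reject a bare town name.
    return 0 < len(value) <= MAX_POSTCODE_LENGTH and any(c.isdigit() for c in value)


def clean_components(components):
    """Return a copy of the address components without an implausible postcode."""
    cleaned = dict(components)
    postcode = cleaned.get("postcode")
    if postcode is not None and not looks_like_postcode(postcode, cleaned.get("country_code")):
        del cleaned["postcode"]
    return cleaned
```

For example, `clean_components({"city": "Berlin", "postcode": "Berlin", "country_code": "de"})` drops the bogus postcode and keeps everything else, so the formatted address simply omits it rather than printing the town name twice.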
Will definitely take you up on the offer of comparing formats. Any further feedback you have would be really useful; you've obviously spent a lot of time thinking about this space.