Because the project is not written in Java, we needed a way to communicate with CoreNLP without paying the cost of starting Java for every job.
In the end, the solution was to fork an existing Java project that wraps CoreNLP and exposes it over HTTP.
The fork adds:
- JSON support (the original was XML only)
- Debian packaging
- a default of "today" for relative dates
- a utility class, 'RegexNERValidator', for testing/quoting mapping files for CoreNLP's RegexNER, so you can check that a file is usable by RegexNER before restarting the main CoreNLP process.
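To give a rough idea of what checking a mapping file before a restart involves, here is a minimal Python sketch. It is not the fork's RegexNERValidator. It assumes the usual RegexNER layout of 2 to 4 tab-separated columns per line (token patterns, NER class, optional overwritable types, optional numeric priority), it assumes blank lines can be skipped, and it uses Python's `re` as a stand-in for Java's regex syntax, which differs in places. The function name `validate_regexner_mapping` is made up for illustration.

```python
import re


def validate_regexner_mapping(lines):
    """Return a list of (line_number, message) problems; an empty list means none were found.

    Assumed layout per line (tab-separated):
    token patterns, NER class, optional overwritable types, optional priority.
    """
    problems = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # assumption: blank lines are ignored

        columns = line.split("\t")
        if not 2 <= len(columns) <= 4:
            problems.append(
                (number, f"expected 2 to 4 tab-separated columns, found {len(columns)}")
            )
            continue

        pattern, ner_class = columns[0], columns[1]
        if not pattern.strip():
            problems.append((number, "empty pattern"))
        if not ner_class.strip():
            problems.append((number, "empty NER class"))

        # The pattern column holds one regex per token, separated by whitespace.
        # Python's re is only an approximation of Java's regex dialect here.
        for token_pattern in pattern.split():
            try:
                re.compile(token_pattern)
            except re.error as exc:
                problems.append((number, f"bad token pattern {token_pattern!r}: {exc}"))

        # The optional fourth column is expected to be a number.
        if len(columns) == 4:
            try:
                float(columns[3])
            except ValueError:
                problems.append((number, f"priority is not a number: {columns[3]!r}"))
    return problems
```

You would run something like this over `open("mapping.tsv")` and only restart the main process if the returned list is empty.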
The result is at https://github.com/Koalephant/StanfordCoreNLPHTTPServer
Please note: I do not usually work in Java, so I'm well aware there are likely better ways to achieve some/many/all of the things this project does. If you feel inclined to improve it, send a PR (preferably with some indication of why it's an improvement, if it's not an obvious bug/feature improvement).
There are a couple of courses I've heard are fantastic. The first, which I'm going through right now, is by Michael Collins:
https://www.coursera.org/course/nlangp
The other is by Dan Jurafsky:
It tries to find relevant multilingual information from Wikipedia and Wiktionary when you click on words in a submitted text.
Eventually I want to expand it to other target languages beyond English.