Second trial: you use an alphanumeric whitelist and split on anything else, but what about umlauts? What about Hebrew or Cyrillic?
Third trial: split on control characters (code points below 32), whitespace, and punctuation characters; this more or less works, but it is ugly. What would you do to get keywords from a string?
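For illustration, here is a minimal sketch of a Unicode-aware variation on the third trial. It assumes Python 3, since the post names no language, and the function name `extract_keywords` is made up for this example. Instead of listing the characters to split on, it keeps runs of letters and digits from any script, which covers umlauts, Hebrew, and Cyrillic. It is not a complete tokenizer: combining marks that have no precomposed form (for example Hebrew vowel points) would still split a word.

```python
import re
import unicodedata

# In Python 3, \w is Unicode-aware for str patterns, so it matches letters
# and digits from any script. [^\W_] is "\w without the underscore".
_WORD = re.compile(r"[^\W_]+")

def extract_keywords(text):
    """Return the runs of letters/digits in text, lowercased, in order.

    Hypothetical helper for illustration only.
    """
    # NFC composes sequences such as "u" + combining diaeresis into "ü",
    # so decomposed umlauts do not break a word apart.
    normalized = unicodedata.normalize("NFC", text)
    return [match.group(0).lower() for match in _WORD.finditer(normalized)]

if __name__ == "__main__":
    print(extract_keywords("Größe, שלום! привет_мир 42"))
    # prints ['größe', 'שלום', 'привет', 'мир', '42']
```

Control characters, whitespace, and punctuation all fall outside the letter/digit class, so they act as separators without being enumerated one by one.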