I'm working on an open source tax filing web app at https://ustaxes.org/ and https://github.com/thegrims/UsTaxes
Any ideas on best practices for extracting tax data from a W-2 form? I've looked at Microsoft Form Recognizer and AWS Textract, but I haven't been able to get good results so far. (Caveat: I haven't tried either with custom training data.)
Is it still the case that W-2s are usually only provided in paper form? If they would just e-mail a (non-scanned) PDF, you could extract the data easily without having to deal with OCR.
Yeah, one solution I was thinking about is using something like Tabula to parse the PDF text. It's still kind of tricky to work out which text goes with which form label, but it's definitely easier than OCR.
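To make the label-matching problem concrete, here is a rough Python sketch of one possible heuristic, not anything UsTaxes actually does. It assumes a layout-aware extractor has already produced text fragments with page coordinates (the `Fragment` type, the `LABELS` table, and the tolerance values are all made up for illustration), and that on the form each value is printed just below its box label in roughly the same column.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Fragment:
    """One piece of extracted text and its position (hypothetical shape)."""
    text: str
    x: float  # left edge, in points
    y: float  # top edge, in points, increasing down the page


# Hypothetical mapping from our own field names to label text on the form.
LABELS = {
    "wages": "Wages, tips, other compensation",
    "federal_withheld": "Federal income tax withheld",
}


def match_fields(
    fragments: List[Fragment],
    labels: Dict[str, str] = LABELS,
    x_tolerance: float = 40.0,  # assumed max horizontal offset from the label
    max_drop: float = 30.0,     # assumed max vertical distance below the label
) -> Dict[str, Optional[str]]:
    """For each label, return the closest non-label fragment below it."""
    label_texts = list(labels.values())
    result: Dict[str, Optional[str]] = {}
    for key, label in labels.items():
        # Find the fragment that contains the label text, if any.
        anchor = next((f for f in fragments if label in f.text), None)
        if anchor is None:
            result[key] = None
            continue
        # Keep fragments in the same column, slightly below the label,
        # that are not themselves one of the known labels.
        candidates = [
            f for f in fragments
            if not any(t in f.text for t in label_texts)
            and abs(f.x - anchor.x) <= x_tolerance
            and 0 < f.y - anchor.y <= max_drop
        ]
        best = min(candidates, key=lambda f: f.y - anchor.y, default=None)
        result[key] = best.text if best else None
    return result


# Made-up fragments standing in for real extractor output.
sample = [
    Fragment("1 Wages, tips, other compensation", 300, 100),
    Fragment("2 Federal income tax withheld", 450, 100),
    Fragment("52000.00", 310, 115),
    Fragment("6400.00", 460, 115),
    Fragment("3 Social security wages", 300, 140),
    Fragment("52000.00", 310, 155),
]
print(match_fields(sample))
```

The tolerances would need tuning per payroll provider, since W-2 layouts differ between issuers, and a real version would probably want to validate that the matched text actually parses as a dollar amount.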