Ask HN: What are the best tools for extracting tax data from a W2 form?

2 points · aidangrimshaw · 5y ago · 2 comments

I'm working on an open source tax filing web app at https://ustaxes.org/ and https://github.com/thegrims/UsTaxes

Any ideas on best practices for extracting tax data from a W-2 form? I've looked at Microsoft Form Recognizer and AWS Textract, but I haven't been able to get good results so far. (Caveat: I haven't tried either with custom training data.)


2 comments · 1 top-level

tgflynn · 5y ago · 1 in thread

Is it still the case that W-2s are usually only provided in paper form? If they would just e-mail a (non-scanned) PDF, you could extract the data easily without having to deal with OCR.

aidangrimshaw (OP) · 5y ago

Yeah, one solution I was thinking about is using something like Tabula to parse the PDF text. It's still kind of tricky to work out which text matches up to which form label, but it's definitely easier than OCR.
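
To make the label-matching problem concrete, here is a rough sketch of one way it could be approached, assuming the PDF text has already been extracted as words with page coordinates (most PDF text extractors can report these). Everything named here is hypothetical: the `Word` type, the `match_values` helper, the distance thresholds, and the label positions, which are made up for illustration and are not the real W-2 layout.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Word:
    text: str
    x0: float   # left edge of the word, in PDF points
    top: float  # distance from the top of the page, in PDF points


# Loose pattern for a dollar amount such as "52,000.00" or "$6,500".
AMOUNT = re.compile(r"^\$?\d[\d,]*(\.\d{2})?$")


def match_values(
    labels: dict[str, tuple[float, float]],
    words: list[Word],
    max_dx: float = 120.0,
    max_dy: float = 25.0,
) -> dict[str, Optional[str]]:
    """For each label position, pick the closest amount-like word that sits
    at or just below the label and roughly in the same column."""
    result: dict[str, Optional[str]] = {}
    for name, (lx, ly) in labels.items():
        best, best_dist = None, float("inf")
        for w in words:
            if not AMOUNT.match(w.text):
                continue  # skip label text, EINs, and other non-amounts
            dx, dy = w.x0 - lx, w.top - ly
            if not (0 <= dy <= max_dy and -5 <= dx <= max_dx):
                continue  # outside the box we assume belongs to this label
            dist = abs(dx) + abs(dy)  # Manhattan distance to the label
            if dist < best_dist:
                best, best_dist = w.text, dist
        result[name] = best
    return result


# Hypothetical label positions (NOT the real W-2 layout) and extracted words.
labels = {"box1_wages": (300.0, 60.0), "box2_federal_tax": (450.0, 60.0)}
words = [
    Word("Wages,", 300.0, 60.0),
    Word("52,000.00", 305.0, 72.0),
    Word("6,500.00", 455.0, 72.0),
    Word("12-3456789", 60.0, 72.0),
]
print(match_values(labels, words))
```

A fixed search window like this only holds up when every W-2 shares the same layout, which is not guaranteed since payroll providers print the form differently, so the label positions would probably need to be found per document (for example by locating the label text first) rather than hard-coded.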
