Show HN: Gogosseract, a Go Lib for CGo-Free Tesseract OCR via Wazero

(github.com)

120 points · dlock17 · 2y ago · 24 comments

Tesseract is one of the largest Open Source OCR (Optical Character Recognition) projects. There is already a Go library for using Tesseract from Go with CGo, called Gosseract.

However, if you are interested in OCR from Go without C complicating building and cross-compiling, there aren't any other options.

Wazero is a Go WASM runtime that doesn't have any CGo dependencies. With Emscripten, Tesseract has been compiled to WASM and run within Wazero.

Gogosseract provides a simple API on top of this. This project has been an interesting delve into the world of WASM.

24 comments · 10 top-level

iampims · 2y ago · 3 in thread

To me, this is the real value of Wasm: platform independent libraries with a standard interface that doesn’t require C.

slimsag · 2y ago

WASM runtimes miss out on a _lot_ of optimizations that a battle-tested C compiler will perform, and sometimes require machine emulation (e.g. Go compiled to WASM results in a virtual machine/emulation layer to run Go code).

It can work, but it's not the fastest thing in the world.

I think languages that make working with C/C++ code much more seamless, e.g. as nice as working with Go code can be, are a better approach. Zig does this well and feels quite natural coming from Go. It can also be used to make CGO cross compilation 'just work' and alleviate many of those pains.

dmos62 · 2y ago

I feel like inefficient but convenient has been the default trade-off in so many places during the last couple of decades. WASM is opening the doors for all kinds of new solutions. I wonder what kind of cultures will develop around it, as regards efficiency.

iampims · 2y ago

Yes, Zig is best in class for C-interoperability.

Go’s FFI support is alright, but I find using WASM/WASI more pleasant.

richieartoul · 2y ago · 2 in thread

This is awesome and one of the things I’m really excited about with WASM, and specifically Wazero. The Wazero team is top notch. Now someone just needs to do this with zstd and make it fast…

mappu · 2y ago

There's a pure-go zstd at https://github.com/klauspost/compress - it's likely faster than running the upstream zstd under Wazero.

anuraaga · 2y ago

Just for reference I did give it a try

https://github.com/wasilibs/go-zstd

Mostly since I hadn't realized `compress` supports zstd. Wazero performed reasonably well against the cgo library but was indeed much slower than this proper pure Go port.

mappu · 2y ago · 2 in thread

Another really interesting way to approach this problem would be to adapt wasm2c to emit Go output. It should result in better performance than wazero.

dlock17 (OP) · 2y ago

You mean this? https://github.com/WebAssembly/wabt/blob/main/wasm2c/README....

That seems like quite an undertaking. But at that point, it would make sense to cut out WASM entirely like https://datastation.multiprocess.io/blog/2022-05-12-sqlite-i...

ncruces · 2y ago

Disclosure: I'm working on alternative Cgo-less bindings for SQLite, using wazero.

https://github.com/ncruces/go-sqlite3

One of the problems of the modernc approach (IMO) is that they're not just transpiling CPU/compute stuff, but entirely OS/platform stuff.

Each Go file of theirs is a xxx_os_arch.go that starts with 100s of OS-#defines-as-consts, and goes on to transpile fully #ifdefed code.

It also implements antithetical (in Go) stuff like goroutine local storage, because libc pthreads can't live without it.

And all IO is via direct syscalls that will never play nice with the Go scheduler, because again, this is OS level stuff.

WASM defines a cross platform CPU and an ABI, and using that for compute and the bottom OS layer in Go you get (IMO) a nicer end result.

Given the hard task of generating decent code from WASM at load time (wazero's compiler is pretty naive, a better one is being developed, but it will take seconds to generate good code for anything non trivial like SQLite) I wouldn't mind having a solution that translated to Go, or Go ASM, at build time.

tommiegannert · 2y ago · 2 in thread

Thanks for sharing!

Since OCR is a somewhat slow process, how does the WASM approach compare to running libtesseract in a subprocess and using some IPC layer to talk to Go? It would require a separate C++ compiler, but not CGo.

> one of the largest Open Source OCR

Tangential, but are there others as large as Tesseract? It seems to pop up anywhere I look.

layer8 · 2y ago

> Tangential, but are there others as large as Tesseract?

The one serious competitor is PaddleOCR, which is faster on GPU, and also works better for Chinese and other non-Western scripts.

There are some newer ML-based projects like DocTR that have been catching up, at least for some use cases.

dlock17 (OP) · 2y ago

My intention was a "pure Go" approach, but that is probably more performant.

I imagine just calling the Tesseract CLI from Go would be simplest if that's all you wanted.
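
For reference, a minimal sketch of that CLI route using only the Go standard library. It assumes a `tesseract` binary is installed and on PATH; the helper names `tesseractArgs` and `ocrFile`, the 30-second timeout, and the `scan.png` path are made up for illustration and are not part of Gogosseract or Gosseract.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"os"
	"os/exec"
	"time"
)

// tesseractArgs builds the argument list for the tesseract CLI.
// Passing "stdout" as the output base makes tesseract print the
// recognized text instead of writing it to a file.
// (Hypothetical helper, named here for illustration only.)
func tesseractArgs(imagePath, lang string) []string {
	return []string{imagePath, "stdout", "-l", lang}
}

// ocrFile runs the tesseract binary found on PATH and returns the
// recognized text. Anything tesseract writes to stderr is included
// in the returned error to make failures easier to diagnose.
func ocrFile(ctx context.Context, imagePath, lang string) (string, error) {
	cmd := exec.CommandContext(ctx, "tesseract", tesseractArgs(imagePath, lang)...)
	var stdout, stderr bytes.Buffer
	cmd.Stdout = &stdout
	cmd.Stderr = &stderr
	if err := cmd.Run(); err != nil {
		return "", fmt.Errorf("tesseract failed: %w: %s", err, stderr.String())
	}
	return stdout.String(), nil
}

func main() {
	// OCR can be slow, so bound the subprocess with a timeout.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// "scan.png" is a placeholder path, assumed for this example.
	text, err := ocrFile(ctx, "scan.png", "eng")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Print(text)
}
```

This avoids CGo at build time but still needs Tesseract and its language data installed on the target machine, which is the trade-off against the WASM approach.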

abdullahkhalids · 2y ago · 2 in thread

Is Tesseract currently the best open source OCR library? Best in terms of accuracy.

How much difference is there between Tesseract and the best proprietary solutions?

ianhawes · 2y ago

Tesseract is the current best open source OCR library.

When looking at the “best” proprietary solution, there are a few worth mentioning:

- If you are looking for the best OCR to DOCX solution, ABBYY OCR SDK is the front runner. Their OCR engine is not AS accurate as others I’ll mention, but their output engine (i.e. taking data beyond just the character, like bold or underlined or font name) is probably the best in the market.

- Google Document AI/Cloud Vision is probably the best all-around OCR. The 2 flavors determine whether you want to handle scanned PDFs/images (DocAI) or generalized photos (Cloud Vision). I believe they also have some level of training capabilities via Vertex but I haven’t checked it out.

- IRIS OCR.. Meh

- AWS Textract and Azure Vision are worth mentioning as contenders, but just like Google Document AI, they’re cloud based and that may factor into your decision.

- I haven’t tried DocTR or Paddle OCR

abdullahkhalids · 2y ago

Thanks for the detailed answer.

donatj · 2y ago · 1 in thread

Oh awesome. I was really hoping a native OCR would pop up but this really is the next best thing and a more realistic avenue.

dlock17 (OP) · 2y ago

Exactly, I expected to find one but couldn't, so I put together my own. It's not the fastest, but it'll do for my purposes.

honkotime · 2y ago · 1 in thread

It mentions that this is a rewrite of gosseract, however it is not a drop-in replacement, so it's more of a separate library in my opinion.

dlock17 (OP) · 2y ago

Technically I said reimplementation. But you are right in that it's not supposed to be a drop in replacement at all.

The only feature missing right now is Bounding Box detection, which I plan to add in the future.

technics256 · 2y ago · 1 in thread

Off topic, but in general how does something like this compare to cloud-hosted OCR solutions?

layer8 · 2y ago

Tesseract is worse than most commercial solutions, and/or requires more pre- and postprocessing.

yklcs · 2y ago

I wrote a short blog post[1] on this method a while ago. I do think running WASM in embedded runtimes is a pretty good option, but overhead remains high, and WASI remains somewhat fragmented between compilers and runtimes.

I think this method really shines in Go as not having CGo simplifies a lot of things, and as a decently performant JITed runtime exists in the form of wazero.

[1]: https://yklcs.com/blog/universal-libs-with-wasm

breadchris · 2y ago

this is sick
