undefined | Better HN

0 pointsGroxx12y ago0 comments

Well sure, but then why don't we just call it "lossy GZIP"? OCR is a pretty specific subset, and produces characters - this does not produce computer-readable characters, therefore not OCR.

0 comments

6 comments · 1 top-level

ToothlessJake12y ago· 5 in thread

What are you on about? What does it produce if not computer-readable characters? Computer illegible characters? Are you saying it cannot read from the dictionary it creates? Or from the characters it is later optically recognizing off that dictionary?

Again from the JBIG2 wiki[1]:

"Textual regions are compressed as follows: the foreground pixels in the regions are grouped into symbols. A dictionary of symbols is then created and encoded.."

It seems not only is JBIG2 being deployed as OCR by Xerox for whatever reason, its implementation in this case is an absolute failure.

[1] http://en.wikipedia.org/wiki/JBIG2

GroxxOP12y ago

Does it produce ASCII? UTF? If no, it's not OCR.

edit: by the definition you seem to be going on, any facial recognition is also OCR, since you could consider a face a 'glyph' (edit: 'symbol'). The only 'text' thing here that I can see is that it is intended to be used on text, which lends some optimizations, nothing that it's actually text-based in any way.

Dylan1680712y ago

If you make a font out of faces and use them as repeated glyphs then yes it's OCR. If you're not using identical symbols over and over than I don't think you have a sane definition of 'glyph'.

eCa12y ago

It produces symbols, not characters.

Say that the scanner internally splits the scan into regions of 10x10 pixels that it saves in memory. If another region differs on less than (say) 10% of the pixels it is assumed that the two zones are identical and the first one is used in the second place too. The regions have no semantic meaning.

OCR translates the scan into a character set.

Dylan1680712y ago

The only thing that's missing is a mapping from 'symbol #28' into 'ascii #63'. Internally it's storing instances of symbols plus font data for those symbols.

Also, something to think about: an EBCDIC document accidentally printed as ASCII/8859-1 would have equally zero semantic meaning when fed into an OCR program. But I don't think anyone would argue it wasn't OCR.

1 more reply

bestham12y ago

OCR per definition gives out text. Not binary data that resemble the bitmap of the input image.

j / k navigate · click thread line to collapse

0 comments

6 comments · 1 top-level

ToothlessJake12y ago· 5 in thread

Again from the JBIG2 wiki[1]:

"Textual regions are compressed as follows: the foreground pixels in the regions are grouped into symbols. A dictionary of symbols is then created and encoded.."

It seems not only is JBIG2 being deployed as OCR by Xerox for whatever reason, its implementation in this case is an absolute failure.

[1] http://en.wikipedia.org/wiki/JBIG2

GroxxOP12y ago

Does it produce ASCII? UTF? If no, it's not OCR.

Dylan1680712y ago

If you make a font out of faces and use them as repeated glyphs then yes it's OCR. If you're not using identical symbols over and over than I don't think you have a sane definition of 'glyph'.

eCa12y ago

It produces symbols, not characters.

OCR translates the scan into a character set.

Dylan1680712y ago

The only thing that's missing is a mapping from 'symbol #28' into 'ascii #63'. Internally it's storing instances of symbols plus font data for those symbols.

1 more reply

bestham12y ago

OCR per definition gives out text. Not binary data that resemble the bitmap of the input image.

j / k navigate · click thread line to collapse