SciELO - Scientific Electronic Library Online

 

South African Computer Journal

On-line version ISSN 2313-7835
Print version ISSN 1015-7999

Abstract

KOTZE, Gideon and WOLFF, Friedel. Exchanging image processing and OCR components in a Setswana digitisation pipeline. SACJ [online]. 2020, vol.32, n.2, pp.218-231. ISSN 2313-7835. http://dx.doi.org/10.18489/sacj.v32i2.707.

As more natural language processing (NLP) applications benefit from neural-network-based approaches, it makes sense to re-evaluate existing work in NLP. A complete digitisation pipeline includes several components that handle the material in sequence. Image processing after scanning the document has been shown to be an important factor in final quality. Here we compare two approaches for visually enhancing documents before Optical Character Recognition (OCR): (1) a combination of ImageMagick and Unpaper and (2) OCRopus. We also compare Calamari, a new line-based OCR package using neural networks, with the well-known Tesseract 3 as the OCR component. Our evaluation on a set of Setswana documents reveals that the combination of ImageMagick/Unpaper and Calamari improves on a current baseline based on Tesseract 3 and ImageMagick/Unpaper by over 30%, achieving a mean character error rate of 1.69 across all combined test data.

CATEGORIES: Applied computing ~ Optical character recognition; Computing methodologies ~ Image processing

Keywords: digitisation; optical character recognition; image processing; neural networks.
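
The abstract reports its result as a mean character error rate (CER) and as an improvement on a baseline. The abstract does not define either quantity, so the following is only a minimal sketch under two assumptions: that CER is the Levenshtein edit distance between the OCR output and the ground-truth text, divided by the length of the ground truth and expressed as a percentage, and that the improvement is a relative reduction in CER. The function names, the sample strings and the sample numbers are hypothetical and are not taken from the paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER in percent, assuming CER = edit distance / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return 100.0 * levenshtein(reference, hypothesis) / len(reference)


def relative_improvement(baseline_cer: float, new_cer: float) -> float:
    """Relative reduction of the error rate in percent (assumed definition)."""
    return 100.0 * (baseline_cer - new_cer) / baseline_cer


if __name__ == "__main__":
    # Hypothetical ground truth and OCR output, for illustration only.
    truth = "Dumela lefatshe"
    ocr_output = "Dumela lefatshc"  # one wrong character out of 15
    print(f"CER: {character_error_rate(truth, ocr_output):.2f}%")  # CER: 6.67%

    # Hypothetical error rates, not the figures from the paper.
    print(f"Improvement: {relative_improvement(4.0, 3.0):.1f}%")  # Improvement: 25.0%
```

A mean CER over a test set could then be taken either per document or over the pooled characters of all documents. The two can differ noticeably when document lengths vary, and the abstract does not say which one is used.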


 

All content of this journal, except where otherwise noted, is under a Creative Commons License.