Eliminating manual data entry: Using OCR to convert images to text (Tesseract.js + React)

Shane Kins · Panya Studios · May 2, 2019

We’re all about setting high standards for our processes, constantly building new features, and refining older ones. We love experimenting with new tech, so if we’re tearing it up with our sprint tasks and get some time to spare, we think about how we can improve the rest of our stack.

The Challenge

We were faced with an interesting obstacle: payout processing. A user sends specific data (e.g. codes, the total amount of money requested, etc.) to us to withdraw money from their Panya account, and because real money is involved, we need to cross-check this data against our database as accurately as possible to ensure these requests are valid.

But some of these requests were unique, as they were submitted via a third party rather than via our app. We’ve built a robust custom admin system to handle things like this, and it is optimised to process payout requests easily and efficiently as long as they are submitted via the app as intended, but not via other means. How could we improve this?

The Problem

When requests are submitted via this third party, we are left with very little useful information to process them efficiently. We would receive these requests in the form of scanned documents with zero textual data, alongside a spreadsheet with zero information we could actually use to query the data.

So we resorted to manual data entry to organise these requests. This is, as you can imagine, far from ideal…

A Possible Solution!

So we have all of these scanned documents with information we need. We can’t copy and paste, we can’t write a script to fetch the data we need, all we can do is manually enter the information from the scanned documents into a spreadsheet. Or, we can use…

OCR!

What is OCR?

OCR stands for Optical Character Recognition. It’s a process of converting an image to machine-encoded text. For example, we could get a scanned image of a book, and use OCR tech to read the image and output text in a format we can use on a machine.

Tesseract.js example: https://tesseract.projectnaptha.com/

It’s a neat concept! The possibilities here are huge! For us, it could drastically improve how we process some of this data.

What are we going to create?

We’re going to build a single-page React app that will accept multiple image uploads, process them via Tesseract.js, and produce an output showing both the extracted text content and the pattern-matching results.

Step 1: Setup our app

We’re going to bootstrap our app using create-react-app, which will give us a solid foundation to get up and running quickly. You can set up a create-react-app project instantly by running:

npx create-react-app my-ocr-app
cd my-ocr-app
npm start

You now have a bare-bones React app up and running! You can view the default create-react-app welcome page at http://localhost:3000.

Step 2: Install Tesseract.js

Within our project directory, add Tesseract to our dependencies by running the following command:

npm i tesseract.js

Step 3: Setting up our template

Let’s break down what we want to show on the page:

  • Image uploader
  • Thumbnail preview of uploads before processing
  • Table of results after processing

The first thing we’ll do is set up a simple template to start with. In your src folder you’ll see App.js and App.css (alongside various other files). Go to each file and replace all of its contents with the following:
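
A minimal starting point might look something like this (the exact markup, class names, and styling here are just a sketch, with placeholder sections for the uploader, previews, and results):

// App.js
import React, { Component } from 'react';
import './App.css';

class App extends Component {
  render() {
    return (
      <div className="App">
        <h1>My OCR App</h1>

        {/* File uploader */}
        <section className="uploader">
          <input type="file" id="fileUploader" multiple />
        </section>

        {/* Thumbnail previews will go here */}
        <section className="previews" />

        {/* Results will go here */}
        <section className="results">
          <button>Generate</button>
        </section>
      </div>
    );
  }
}

export default App;

/* App.css */
.App {
  max-width: 800px;
  margin: 0 auto;
  font-family: sans-serif;
  text-align: center;
}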

This is the base template without any logic applied. We will update this further as we progress.

Save both files, and you’ll notice the page has automatically updated to show our neat little template.

Step 4: Creating our logic

NOTE: This how-to guide is focused on output, rather than on objectively best React coding practices. This may or may not be the best way to structure a feature like this, so if there are any improvements that can be made, please share in the comments!

We’ll be taking advantage of React’s easy to use state handling to manage our flow. So let's set the foundations of our component state first. Go to App.js, and add our constructor just above our render method:
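
A minimal sketch of that constructor, with one state key for each of the items described just below:

constructor(props) {
  super(props);

  // Initial component state
  this.state = {
    uploads: [],   // preview URLs for images added via the uploader
    patterns: [],  // words matching our regex pattern (defined later)
    documents: [], // full output data returned from tesseract.js
  };
}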

  • uploads will store images we upload via our uploader.
  • patterns will store output based on a pattern (which we will define later).
  • documents will store the full output data returned from tesseract.js.

Now let’s tell our file uploader how to handle the files we throw at it. Below our constructor, add the following method:
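
Something along these lines (we use URL.createObjectURL to turn each selected file into a previewable URL; treat this as a sketch rather than the only way to do it):

handleChange = (e) => {
  // Store a preview URL for every file selected via #fileUploader
  const uploads = Array.from(e.target.files).map((file) =>
    URL.createObjectURL(file)
  );

  this.setState({ uploads });
};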

Then, go to our render method where our JSX is, and update the file uploader section with the following:
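
Roughly like so (the surrounding markup follows the sketch template from earlier):

{/* File uploader */}
<section className="uploader">
  <input
    type="file"
    id="fileUploader"
    multiple
    onChange={this.handleChange}
  />
</section>

{/* Thumbnail preview of uploads before processing */}
<section className="previews">
  {this.state.uploads.map((upload, i) => (
    <img key={i} src={upload} alt="upload preview" width="150" />
  ))}
</section>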

So let’s break this down:

  • Our new handleChange method takes in all uploads via our #fileUploader input, and stores each upload in our uploads state.
  • In our JSX render, we’ve added an onChange parameter to our #fileUploader input field. This will trigger the handleChange method every time files are selected via the uploader.
  • Also in our JSX render, we’ve added a map function to return a preview image for every file we upload.

Save your files, go to your homepage and upload one or more images. You should see some previews appear!

Now let's set up our conversion logic. Our app is accepting uploads, so now we want to put these uploads through tesseract.js and generate some output.

Below the handleChange method we just created, add this new method:
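
A sketch of that method, using the 1.x Tesseract.recognize API (remember to import Tesseract from 'tesseract.js' at the top of App.js; the result fields used here match the breakdown below):

generateText = () => {
  // Our regex pattern: words that are exactly 10 characters long
  const pattern = /\b\w{10}\b/g;

  this.state.uploads.forEach((upload) => {
    Tesseract.recognize(upload).then((result) => {
      const matches = result.text.match(pattern) || [];

      this.setState((prev) => ({
        patterns: [...prev.patterns, ...matches],
        documents: [
          ...prev.documents,
          {
            image: upload,
            text: result.text,
            confidence: result.confidence,
            patterns: matches,
          },
        ],
      }));
    });
  });
};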

And let’s update our results JSX to show our results. Add an onClick parameter to the generate button to call the function, and replace the results section we set up earlier with this:
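
Something like this (the column layout is a sketch, and it assumes the document shape from generateText above):

{/* Results */}
<section className="results">
  <button onClick={this.generateText}>Generate</button>

  <table>
    <thead>
      <tr>
        <th>Image</th>
        <th>Text</th>
        <th>Confidence</th>
        <th>Patterns</th>
      </tr>
    </thead>
    <tbody>
      {this.state.documents.map((doc, i) => (
        <tr key={i}>
          <td><img src={doc.image} alt="processed upload" width="100" /></td>
          <td>{doc.text}</td>
          <td>{doc.confidence}</td>
          <td>{doc.patterns.join(', ')}</td>
        </tr>
      ))}
    </tbody>
  </table>
</section>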

Okay! We should have enough to produce some results! Let’s go through the above code a bit first:

  • Our generateText method will take the uploads currently stored in our uploads react state, loop through each upload, and run it through tesseract.
  • Tesseract will process each image, and return a confidence score, text result, and pattern result.
  • Confidence is a 0–100 score of how accurate the conversion was.
  • We also defined a regex pattern so we can pick out words matching a specific pattern. This would help us find needles in a haystack if needed. The pattern here will take all words that are exactly 10 characters in length.
  • At the end of each loop, the result is stored in our documents state.
  • In our JSX, add an onClick event to our generate button to call the generateText method.
  • Replace the JSX results section with the code above — we call a map function to loop through our documents, and return the result on the frontend.

Step 5: Test!

Upload some image documents with some text, hit that generate button, and watch the results roll in!

NOTE: If tesseract errors are being returned after clicking the generate button, try adding tesseract via CDN inside the document head at public/index.html:
<script src="https://cdn.rawgit.com/naptha/tesseract.js/1.0.10/dist/tesseract.js"></script>

Conclusion

This is a bare bones example of how tesseract works. This type of app can be expanded upon by taking full advantage of tesseract’s API (e.g. loading bars, character whitelisting, different languages, etc).

Check out the API docs here: https://github.com/naptha/tesseract.js#tesseractjs
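
For example, the 1.x API lets you pass recognition options and hook into progress events; something along these lines (the exact option names and message fields are worth double-checking against the docs):

// Sketch: Thai language data plus progress updates for a loading bar
Tesseract.recognize(image, { lang: 'tha' })
  .progress((message) => {
    // message.status / message.progress can drive a progress indicator
    console.log(message.status, message.progress);
  })
  .then((result) => console.log(result.text));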

So, did it work out for us?

Unfortunately, the output was not accurate enough for us to use in this scenario. This was due to various issues:

  • Some documents given to us were not in good condition
  • Fonts used in some documents would confuse tesseract.js
  • It couldn’t handle both English and Thai at the same time.
  • At the time of writing this, tesseract.js cannot be trained to improve accuracy.

While it didn’t solve the particular issue we were having, it was still a super fun micro project to work on. Tesseract.js has its limitations, but it is just a port of the more sophisticated Tesseract OCR Engine, and we like to think it will only get better from here!

P.S. We’re hiring! Panya Studios is always on the lookout for talented and passionate individuals to join our growing team in Bangkok, Thailand. Explore our current openings at https://studios.panya.me/
