In this post I’ll show you how to use Google Cloud Platform’s Vision API to extract text from images in R.
I’m mostly interested in the case where the text being extracted is handwritten; for typewritten or printed text I use Tesseract.
Is Vision API feasible for this task? Tesseract works very well with typewritten or printed text but does not seem to handle handwriting nearly as well.[^1] Google’s Vision API, on the other hand, is able to extract handwritten text from images.
Preparation
Before doing anything in R you must create a project on Google Cloud Platform (GCP) and enable Vision API for your project.
In the appendix are some brief notes about creating a Vision-enabled project on GCP. However, if you are using GCP for the first time, I recommend following a tutorial like this one instead.
After setting up things on GCP you should have an access token.
To run any of the code below you must tell R what the access token is. But you have to be careful not to hard-code the access token in your source code. The access token bestows on its owner the power to make API requests which can incur costs for the owner of the GCP project.
To avoid putting the access token in your source code you can define an environment variable GCLOUD_ACCESS_TOKEN equal to the access token. Then you can access its value anywhere in your code with Sys.getenv("GCLOUD_ACCESS_TOKEN").
While it is possible to use Sys.setenv to define an environment variable, a better method is to define it in an .Renviron file. One benefit of this approach is that the environment variable then persists between sessions.
The {usethis} (Wickham and Bryan (2021)) package makes it easy to edit .Renviron, at project or user scope.

usethis::edit_r_environ(scope = "user")

opens the user .Renviron in RStudio. Edit the file so that it contains a line like this:
GCLOUD_ACCESS_TOKEN=<your access token here>
Be careful not to commit this file to a Git repository. Anyone with the token can potentially make requests of Vision API through your project and force you to incur costs.
After defining your access token in .Renviron you must restart R. After restarting you should be able to use Sys.getenv to check that the environment variable is now equal to the access token.
Sys.getenv("GCLOUD_ACCESS_TOKEN")
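If you want to fail fast when the token is missing, a small helper along these lines can be useful. This is just a sketch; the name get_access_token is my own and not part of any package.

```r
# Fetch the access token from the environment, stopping with a clear
# message if it has not been set (e.g. .Renviron was not edited, or R
# was not restarted afterwards).
get_access_token <- function() {
  token <- Sys.getenv("GCLOUD_ACCESS_TOKEN")
  if (!nzchar(token)) {
    stop("GCLOUD_ACCESS_TOKEN is not set; add it to .Renviron and restart R")
  }
  token
}
```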
Now you are ready to begin extracting text from images in R.
In the next section I’ll show you how to extract text from a single image file.
Later I’ll show you how to extract text from multiple images.
Extracting handwriting from a single image
In this section and subsequent sections I will make use of several images of my own handwriting.
So that you can compare the output of Vision API against a known text, I have handwritten several pages of James Oliver Curwood’s novel Baree, Son of Kazan. Later I’ll show you how to compare the results of extracting text from these handwritten pages with the text of the novel from Project Gutenberg.
Creating document text detection requests
To use Vision API to extract text from images we must send the images, in JSON format, to the URL https://vision.googleapis.com/v1/images:annotate. Putting images into JSON means encoding them as text. Vision API requires images to be encoded using the Base64 encoding scheme (see https://en.wikipedia.org/wiki/Base64 for more details).
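To get a feel for what Base64 looks like, here is a tiny sketch: the temporary text file is just a stand-in for a real image file.

```r
# Write a few bytes to a throwaway file and Base64-encode it, exactly
# as we will later do with image files.
tmp <- tempfile(fileext = ".txt")
writeBin(charToRaw("hello"), tmp)
base64enc::base64encode(tmp)
#> [1] "aGVsbG8="
```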
The JSON payload of a text detection request to Vision API should look something like this:
{
"requests": [
{
"image": {
"content": <encoded-image>
},
"features": [
{
"type": "DOCUMENT_TEXT_DETECTION"
}
]
}
]
}
where <encoded-image> is a Base64 encoding of the image we want to extract text from.
Fortunately, you don’t have to figure out how to do the encoding yourself. The R package {base64enc} (Urbanek (2015)) can do that for you.
The document_text_detection function below uses base64enc::base64encode to compute a Base64 encoding of the image specified by the input path. It then packs the resulting encoding into a list of the right format for conversion to JSON before being posted to the URL above.
document_text_detection <- function(image_path) {
  list(
    image = list(
      content = base64enc::base64encode(image_path)
    ),
    features = list(
      type = "DOCUMENT_TEXT_DETECTION"
    )
  )
}
As you will be using {httr} (Wickham (2020)) you don’t even have to worry about doing the conversion to JSON yourself. {httr} will do it automatically, using {jsonlite} (Ooms (2014)) behind the scenes, when you make your request.
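If you are curious what {httr} will send, you can preview the serialisation yourself with {jsonlite}. The "abc123" content below is a stand-in for a real Base64 string.

```r
# Preview how the request list serialises to JSON. httr uses jsonlite
# behind the scenes with auto_unbox = TRUE, so scalar values become
# JSON strings rather than one-element arrays.
req <- list(
  image = list(content = "abc123"),  # stand-in for a real encoding
  features = list(type = "DOCUMENT_TEXT_DETECTION")
)
jsonlite::toJSON(list(requests = list(req)), auto_unbox = TRUE, pretty = TRUE)
```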
Now you are ready to use document_text_detection to create a request based on the image located at ~/workspace/baree-handwriting-scans/001.png.

dtd_001 <- document_text_detection("~/workspace/baree-handwriting-scans/001.png")
Be wary of inspecting the return value of this function call. It contains a huge Base64 string and R might take a very long time to print it to the screen.
You can use substr to inspect part of it, if you really want.
substr(dtd_001, 1, 200)
#> [1] "list(content = \"iVBORw0KGgoAAAANSUhEUgAAE2EAABtoCAIAAAA8pmPCAAAAA3NCSVQICAjb4U/gAAAACXBIWXMAAFxGAABcRgEUlENBAAAgAElEQVR4nOzdwW7aQBRA0dB/9E8yH+kuUC1qCJiUxFx6zgIJMRo/GxaJzBWH4/E4xvj4+Dg9nkzTNE3T+dPVgu3O"
#> [2] "list(type = \"DOCUMENT_TEXT_DETECTION\")"
Another pitfall to be wary of is accidentally encoding the path to a file instead of the contents of the file itself. If the response from Vision API contains error messages that include file paths, this is a likely cause.
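To see the difference between encoding a file and encoding its path, compare the two calls below (a sketch using a throwaway file):

```r
# base64encode() reads and encodes the *contents* of a file when given
# a path; wrapping the path in charToRaw() encodes the path *string*
# itself, which is exactly the mistake described above.
path <- tempfile(fileext = ".bin")
writeBin(as.raw(c(1, 2, 3)), path)
right <- base64enc::base64encode(path)             # encodes the file bytes
wrong <- base64enc::base64encode(charToRaw(path))  # encodes the path text
identical(right, wrong)
#> [1] FALSE
```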
Posting requests to Vision API
Now you can use httr::POST from {httr} to post your request to the Vision API.
httr::POST requires at least url and body arguments.
- url is the URL to post to. In this case "https://vision.googleapis.com/v1/images:annotate".
- body is the request payload. In this case a named list with one element named requests whose value is a list of request objects.
As well as those required arguments you also have to tell the server that the content of your request is JSON. You do this by adding “Content-Type: application/json” to the header of your request, which means passing a call to httr::content_type_json() as one of the optional arguments to httr::POST.
The Vision API documentation says that for a request to be accepted it must have an Authorization = "Bearer <GCLOUD_ACCESS_TOKEN>" header. We can add this header using the httr::add_headers function.

httr::add_headers(
  "Authorization" = paste("Bearer", Sys.getenv("GCLOUD_ACCESS_TOKEN"))
)
Here we use Sys.getenv("GCLOUD_ACCESS_TOKEN") to obtain the value of our access token.
The post_vision function below takes a requests list as input and returns the response from Vision API’s annotation endpoint.

post_vision <- function(requests) {
  httr::POST(
    url = "https://vision.googleapis.com/v1/images:annotate",
    body = requests,
    encode = "json",
    httr::content_type_json(),
    httr::add_headers(
      "Authorization" = paste("Bearer", Sys.getenv("GCLOUD_ACCESS_TOKEN"))
    )
  )
}
Finally, you are ready to use post_vision to make your first request.
post_vision expects a list of requests as input. In this case you only have one request, dtd_001, which you created earlier by calling document_text_detection with the path to your image. Nevertheless, you still have to pack your single request inside a list.

l_r_001 <- list(requests = dtd_001)
Calling post_vision now sends your image to Vision API and returns the response.

r_001 <- post_vision(l_r_001)
You can use httr::status_code to check that a valid response was received.

httr::status_code(r_001)
#> [1] 200

The value should be 200.
If you get a 401 instead then inspect the value of r_001. If you see something like this:
Response [https://vision.googleapis.com/v1/images:annotate]
Date: 2021-10-29 09:29
Status: 401
Content-Type: application/json; charset=UTF-8
Size: 634 B
{
"error": {
"code": 401,
"message": "Request had invalid authentication credentials. Expected OAuth 2 acc...
"status": "UNAUTHENTICATED",
"details": [
{
"@type": "type.googleapis.com/google.rpc.ErrorInfo",
"reason": "ACCESS_TOKEN_EXPIRED",
"domain": "googleapis.com",
then you might need to rerun the gcloud tool (see section 4 of the appendix) to generate a new access token.
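It can be convenient to wrap this status check in a small helper that surfaces the API’s own error message. This is a sketch: check_vision_response is my own name, not an {httr} function, and it assumes the error layout shown in the 401 response above.

```r
# Stop with the code and message from the response body when the
# request was not successful, otherwise return the response invisibly.
check_vision_response <- function(response) {
  if (httr::status_code(response) != 200) {
    err <- httr::content(response)$error
    stop("Vision API error ", err$code, ": ", err$message)
  }
  invisible(response)
}
```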
In the next section I’ll explain how to get text out of a valid response.
Getting text from the response
If the response is valid then you can use httr::content to extract the content of the response.

content_001 <- httr::content(r_001)
The content contains a lot of information. The text is contained inside responses[[1]] as fullTextAnnotation$text.

baree_hw_001 <- content_001$responses[[1]]$fullTextAnnotation$text
cat(baree_hw_001)
#> a
#> vast
#> fear
#> To Parce, for many days after he was
#> born, the world was a
#> gloony cavern.
#> During these first days of his life his
#> home was in the heart of a great windfall
#> where aray wolf, his blind mother, had found
#> a a safe nest for his hahy hood, and to which
#> Kazan, her mate, came only now and then ,
#> his eyes gleaming like strange balls of greenish
#> fire in the darknen. It was kazan's eyes that
#> gave
#> do Barce his first impression of something
#> existing away from his mother's side, and they
#> brought to him also his discovery of vision. He
#> could feel, he could smell, he could hear - but
#> in that black pirt under the fallen timher he
#> had never seen until the
#> eyes
#> came. At first
#> they frightened nin; then they puzzled him , and
#> bis Heer changed to an immense ceniosity, the world
#> be looking foreight at them when all at once
#> they world disappear. This was when Kazan turned
#> his head. And then they would flash hach at him
#> again wt of the darknen with such startling
#> Suddenness that Baree world involuntanty Shrink
#> closer to his mother who always treunded and
#> Shivered in a strenge way when Kazan came in.
#> Barce, of course, would never know their story. He
#> world never know that Gray Wolf, his mother, was
#> a full-hlooded wolf, and that Kazan, his father,
#> was a dog. In hin nature was already
#> nature was already beginning
#> its wonderful work, but it world never go beyind
#> cerria limitations. It wald tell him, in time, ,
#> that his heavtiful wolf - mother was blind, hur
#> he world never know of that terrible hattle between
#> Gray Wolf and the lynx in which his mother's sight
#> had been destroyed Nature could tell hin gatting
#> nothing
#> +
#> а
#> (
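One caveat before moving on: fullTextAnnotation can be absent from a response element when Vision API detects no text in an image, in which case indexing into it yields NULL. A defensive sketch (extract_text is my own helper name, not part of any package):

```r
# Pull the detected text out of one element of the parsed response
# content, falling back to an empty string when no text was found.
extract_text <- function(content, i = 1) {
  txt <- content$responses[[i]]$fullTextAnnotation$text
  if (is.null(txt)) "" else txt
}
```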
A lot of the text is readable but there is a lot of work to do to fix all of the errors. And this is only one page! Maybe things would go better for someone with neater handwriting than mine.
In the next section I’ll try to quantify how close the text extracted by Vision API from the handwritten pages is to the original text by measuring edit distance to the equivalent page of text from Project Gutenberg.
How well does it work?
To download the original text from Project Gutenberg use the {gutenbergr} (Robinson (2020)) package.

baree <- gutenbergr::gutenberg_download(4748)
If you get an error here, try a different mirror.
From the text of the entire novel we can extract just the portion that corresponds to our scanned pages. In this case that means lines 96 to 148.

baree_tx <- paste(baree$text[96:148], collapse = " ")
With a bit more close inspection we can find the substring of the downloaded text corresponding to the first handwritten page. In this case it is the substring beginning at the 1st character and ending at the 1597th.

cat(baree_tx_001 <- stringr::str_sub(baree_tx, 1, 1597))
#> To Baree, for many days after he was born, the world was a vast gloomy cavern. During these first days of his life his home was in the heart of a great windfall where Gray Wolf, his blind mother, had found a safe nest for his babyhood, and to which Kazan, her mate, came only now and then, his eyes gleaming like strange balls of greenish fire in the darkness. It was Kazan's eyes that gave to Baree his first impression of something existing away from his mother's side, and they brought to him also his discovery of vision. He could feel, he could smell, he could hear--but in that black pit under the fallen timber he had never seen until the eyes came. At first they frightened him; then they puzzled him, and his fear changed to an immense curiosity. He would be looking straight at them, when all at once they would disappear. This was when Kazan turned his head. And then they would flash back at him again out of the darkness with such startling suddenness that Baree would involuntarily shrink closer to his mother, who always trembled and shivered in a strange sort of way when Kazan came in. Baree, of course, would never know their story. He would never know that Gray Wolf, his mother, was a full-blooded wolf, and that Kazan, his father, was a dog. In him nature was already beginning its wonderful work, but it would never go beyond certain limitations. It would tell him, in time, that his beautiful wolf mother was blind, but he would never know of that terrible battle between Gray Wolf and the lynx in which his mother's sight had been destroyed. Nature could tell him nothing
Now, using the {stringdist} (van der Loo (2014)) package, calculate the edit distance between the text extracted from the handwritten page by Vision API and the substring extracted from the Project Gutenberg text.

stringdist::stringdist(baree_hw_001, baree_tx_001)
#> [1] 174
Apparently we could change the handwriting based text to match the text from Project Gutenberg with 174 changes. Quite a lot for one page! However, bear in mind that edit distance is not quite the same as the number of changes you need to make to fix the document yourself. You could use spell check tools to fix some errors quickly and use search-and-replace to remove systematic errors like non-alphabetic characters (assuming your text is entirely alphabetic) or additional spacing.
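As a sketch of the kind of systematic clean-up mentioned above (assuming, as here, that the target text is plain alphabetic prose), a normalisation pass might look like this. The function name and the allowed character set are my own choices, not anything prescribed by the packages used in this post.

```r
# Replace characters outside a small allowed set with spaces, then
# collapse runs of whitespace. This knocks out stray symbols and the
# extra line breaks introduced in the OCR output.
normalise <- function(x) {
  x <- gsub("[^A-Za-z' .,;-]", " ", x)
  gsub("[[:space:]]+", " ", trimws(x))
}
normalise("a  vast\nfear +\n(")
#> [1] "a vast fear"
```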
In calculating edit distance we used the default optimal string alignment method from {stringdist}. But the result is the same if we use different methods like Levenshtein distance (method = "lv") or full Damerau-Levenshtein distance (method = "dl").
Handling multiple pages
It is possible to send a multi page document to Vision API. A PDF, for example. A different approach to multi page documents is to send one request with multiple images, one image per page.
Building a request based on multiple images
Begin by putting images of all pages to be converted into the same folder. Here I’ve used the folder ~/workspace/baree-handwriting-scans.

scans_folder <- "~/workspace/baree-handwriting-scans"
Now use list.files with full.names = TRUE and pattern = '*.png' (assuming your images are in PNG format) to get a list of all the images.

scans <- list.files(scans_folder, pattern = '*.png', full.names = TRUE)
Next, iterate over scans with purrr::map and document_text_detection to create a list of JSON request objects, one for each page.

scans_dtd <- purrr::map(scans, document_text_detection)
As before, wrap this list of requests in another list.

l_r <- list(requests = scans_dtd)

Then call post_vision to send all the images to Vision API.

response <- post_vision(l_r)
Notice that even though there are multiple images we still only make one request. The JSON payload contains encodings of all images.
Depending on the number of images in the requests list, the above call to post_vision may take a few seconds to return a response.
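One caveat: the annotate endpoint limits how many images a single batch request may contain (16 per request, as I understand the current Vision API quotas), so a large folder is better posted in chunks. A sketch, where post_in_batches is my own name and post_vision is the function defined earlier:

```r
# Split a list of request objects into batches of at most batch_size
# requests and post each batch separately, returning a list of
# responses (one per batch).
post_in_batches <- function(requests, batch_size = 16) {
  batches <- split(requests, ceiling(seq_along(requests) / batch_size))
  purrr::map(batches, function(batch) post_vision(list(requests = batch)))
}
```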
As before, check that a valid response was received before opening it up and looking at what is inside.
httr::status_code(response)
#> [1] 200
If the response is valid you can iterate through the responses list inside the response content, extracting fullTextAnnotation from each element. This can be done with purrr::map by passing the name "fullTextAnnotation" as the function argument.
responses_annotations <- purrr::map(httr::content(response)$responses, "fullTextAnnotation")

purrr::map converts "fullTextAnnotation" into an extractor function which pulls out elements of the list argument having that name.
Using the same feature of purrr::map you can reach inside the annotations and pull out the text elements. This time use purrr::map_chr instead of purrr::map because the output should be a character vector.
purrr::map_chr(responses_annotations, "text")
#> [1] "a\nvast\nfear\nTo Parce, for many days after he was\nborn, the world was a\ngloony cavern.\nDuring these first days of his life his\nhome was in the heart of a great windfall\nwhere aray wolf, his blind mother, had found\na a safe nest for his hahy hood, and to which\nKazan, her mate, came only now and then ,\nhis eyes gleaming like strange balls of greenish\nfire in the darknen. It was kazan's eyes that\ngave\ndo Barce his first impression of something\nexisting away from his mother's side, and they\nbrought to him also his discovery of vision. He\ncould feel, he could smell, he could hear - but\nin that black pirt under the fallen timher he\nhad never seen until the\neyes\ncame. At first\nthey frightened nin; then they puzzled him , and\nbis Heer changed to an immense ceniosity, the world\nbe looking foreight at them when all at once\nthey world disappear. This was when Kazan turned\nhis head. And then they would flash hach at him\nagain wt of the darknen with such startling\nSuddenness that Baree world involuntanty Shrink\ncloser to his mother who always treunded and\nShivered in a strenge way when Kazan came in.\nBarce, of course, would never know their story. He\nworld never know that Gray Wolf, his mother, was\na full-hlooded wolf, and that Kazan, his father,\nwas a dog. In hin nature was already\nnature was already beginning\nits wonderful work, but it world never go beyind\ncerria limitations. It wald tell him, in time, ,\nthat his heavtiful wolf - mother was blind, hur\nhe world never know of that terrible hattle between\nGray Wolf and the lynx in which his mother's sight\nhad been destroyed Nature could tell hin gatting\nnothing\n+\nа\n(\n"
#> [2] "7\n9\n49\n7\nof Kazan's merciless vengeance 1 of the wonderful\nyears of their matehood of their loyalty, their\nShenge adventures in the great Canadian wilderness\ncit'culd make him arby a son of hazar.\nBut at first, and for many days, it was all\nMother. Even after his eyes opened wide and he\nhad pund his legs so that he could shonhce around\na little in the darkness, nothing existed ar buree\nfor\nhut his mother. When he was old enough to he\n.\nplaying with Shicks and mess art in the sunlight,\nhe still did not know what she looked like. But\nto him she was big and soft and warm, and she\nhicked his face with her tongue, and talked to him\nin a gentle, whimpening way that at lost made\nhim find his own voice in a faint, squeaky yap.\nAnd then came that wonderful day when the\ngreenish balls of fire that were kažan's eyes cancie\nnearer and nearer, a little at a tine, ,\ncarbiesky. Hereto pore Gray Wolf had warned hin\nhach. To he alone was the first law of her wild\nbreed during mothering time. A low snart from her\n. A\nthroat, ånd Kazan' had always stopped. But\nănd\non this day the snart did not come in aray\nWolf's throat it died away in a low, whimpering\nscond. A note of loneliness, of\ni 아\ngreat yearniny _“It's all night law,\" she was\nť keys, of a\nnow\nsaying to kázan; and katan\npowsing for a moment\nreplied with an answłni\nwswering\ndeep in his throat.\nStill slowly, as it not quite sure of what he\nЕ\nwould find, Kazan came to them, and Baree\nsnuggled closer to his mother\nas he dropped down heavily on his belly close to\naray Wolf. He was unafraid\nand nightily\nand\nvery\n.\nC\nto make ure -\nnote\nHe heard kazan\n"
The folder_to_chr function below puts all the above steps together.
The input to folder_to_chr is a path to a folder of images whose text we want to extract. The return value is a character vector containing all the text extracted from those images.
folder_to_chr <- function(scans_folder) {
  scans <- list.files(scans_folder, pattern = "*.png", full.names = TRUE)
  response <- post_vision(list(requests = purrr::map(scans, document_text_detection)))
  responses_annotations <- purrr::map(httr::content(response)$responses, "fullTextAnnotation")
  purrr::map_chr(responses_annotations, "text")
}
For example, to convert all the pages in scans_folder and combine the results into a single string do something like this:

baree_hw <- paste(folder_to_chr(scans_folder), collapse = " ")
How well does it work?
As before you can compute the edit distance between baree_hw and the equivalent pages of the original text. Here, that text has already been extracted into baree_tx.

stringdist::stringdist(baree_hw, baree_tx)
#> [1] 446
Is this high? It’s hard to know without something to compare against.
For example, how does it compare to the edit distance between a random string of the same length as the target text and the target text itself?
random_text <- paste(sample(c(letters, " "), stringr::str_length(baree_tx), replace = TRUE), collapse = "")
stringdist::stringdist(random_text, baree_tx)
#> [1] 2882
It’s hardly surprising that this is a much higher number.
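To make the raw distances easier to interpret you can also express each one relative to the length of the reference text, giving a rough per-character error rate. This is my own convenience function, not part of {stringdist}:

```r
# Fraction of the reference text that would need editing: edit
# distance divided by the number of characters in the reference.
error_rate <- function(distance, reference) {
  distance / nchar(reference)
}
# e.g. error_rate(446, baree_tx) for the Vision API result above
```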
Another comparison we could make is against the result of using Tesseract to extract text from the same handwritten pages.
Fortunately, the {tesseract} (Ooms (2021)) package makes this very easy.
folder_to_chr_ts <- function(scans_folder) {
  eng <- tesseract::tesseract("eng")
  scans <- list.files(scans_folder, pattern = "*.png", full.names = TRUE)
  paste(purrr::map(scans, tesseract::ocr, engine = eng), collapse = " ")
}

baree_hw_ts <- folder_to_chr_ts(scans_folder)

stringdist::stringdist(baree_hw_ts, baree_tx)
#> [1] 1502
So Vision API does much better than Tesseract according to this particular little test. But this is probably not a fair comparison. Tesseract doesn’t claim to be able to extract text from images of handwriting. Furthermore, it may be possible to configure Tesseract in a way that improves the results or to modify the input images to make them better suited to Tesseract. Also, I’m only looking at my own handwriting. It might be that another person’s handwriting works better with Tesseract than mine.
My objective was just to see whether Vision API made it possible to convert my own handwritten documents into text. While the results are promising, it seems that I might spend longer fixing errors in the extracted text than it would take to type the pages out from scratch.
Appendix (Google Cloud Platform)
If you haven’t worked with Vision API before then read the following tutorial first: https://cloud.google.com/vision/docs/setup
- You should only need to do steps 0 - 2 once.
- Step 3 should only need to be done once unless the private key has expired.
- If you have already created a project and service account jump to step 3.
- If you already have the private key as well jump to step 4.
0. Install the Cloud SDK
You need to install the gcloud tool to configure authentication.
The installation instructions are here: https://cloud.google.com/sdk/docs/install
1. Create project
First you need to create a GCP project and enable Vision API for that project.
You will have to set up billing when creating the project if you haven’t done so before.
- Go to the Project Selector: https://console.cloud.google.com/projectselector2/home/dashboard and select Create Project
- Give your project a name and click Create.
- Eventually the Project Dashboard appears. At the time of writing somewhere in the bottom-left of the dashboard is a Getting Started panel containing a link named Explore and enable APIs. Click on it.
- You will be transported to the API dashboard.
- At the top you should see + ENABLE APIs AND SERVICES. Click on that.
- Now you get a search dialog.
- Type vision and press enter.
- Select Cloud Vision API and on the page that appears click the Enable button.
2. Create service account
Then you need to create a service account.
Strictly speaking this isn’t necessary, but it does seem to be the most straightforward way of enabling authentication for your project.
- Click the hamburger in the navigation bar to open the sidebar menu.
- Scroll down to IAM & Admin and select Service Accounts.
- Click on Create Service Account.
- Give your account a name.
- Click Create and continue
3. Download private key
- In the Service accounts dashboard find the service account you created and click on the three dots below Action.
- Select Manage Keys from the drop-down menu.
- On the page that opens click on the ADD KEY button.
- Choose Create New Key from the drop-down menu.
- Click Create on the modal dialog that opens.
- You will be prompted to save your key.
- Download your private key and remember where you save it (the save location is referenced below as <PATH_TO_PRIVATE_KEY>).
4. Use gcloud tool to get access token
Now you can use gcloud to get the access token required in the tutorial above.
First, define an environment variable GOOGLE_APPLICATION_CREDENTIALS
pointing to the location where you saved the private key.
$ export GOOGLE_APPLICATION_CREDENTIALS=<PATH_TO_PRIVATE_KEY>
Then run
$ gcloud auth application-default print-access-token
The output will be your access token followed by lots of dots.