In this post I’ll show you how to use Google Cloud Platform’s Vision API to extract text from images in R.
I’m mostly interested in the case where the text being extracted is handwritten; for typewritten or printed text I use Tesseract.
Is Vision API feasible for this task? Tesseract works very well with typewritten or printed text but does not seem to handle handwriting nearly as well.[^1] Google’s Vision API, on the other hand, is able to extract handwritten text from images.
Preparation
Before doing anything in R you must create a project on Google Cloud Platform (GCP) and enable Vision API for your project.
In the appendix are some brief notes about creating a Vision-enabled project on GCP. However, if you are using GCP for the first time, I recommend following a tutorial like this one instead.
After setting up things on GCP you should have an access token.
To run any of the code below you must tell R what the access token is. But you have to be careful not to hard-code the access token in your source code. The access token bestows on its owner the power to make API requests which can incur costs for the owner of the GCP project.
To avoid putting the access token in your source code you can define an environment variable GCLOUD_ACCESS_TOKEN equal to the access token. Then you can access its value anywhere in your code with Sys.getenv("GCLOUD_ACCESS_TOKEN").
While it is possible to use Sys.setenv to define an environment variable, a better method is to define it in an .Renviron file. One benefit of this approach is that the environment variable then persists between sessions.
The {usethis} (Wickham and Bryan (2021)) package makes it easy to edit .Renviron, at project or user scope.

usethis::edit_r_environ(scope = "user")

opens the user .Renviron in RStudio. Edit the file so that it contains a line like this:
GCLOUD_ACCESS_TOKEN=<your access token here>
Be careful not to commit this file to a Git repository. Anyone with the token can potentially make requests of Vision API through your project and force you to incur costs.
After defining your access token in .Renviron you must restart R. After restarting you should be able to use Sys.getenv to check that the environment variable is now equal to the access token.
Sys.getenv("GCLOUD_ACCESS_TOKEN")
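If you want to fail fast when the token is missing, a small helper along these lines can be useful. This is just a sketch; the name get_access_token is my own and not part of any package.

```r
# Fetch the access token from the environment, stopping with a clear
# message if it has not been set (e.g. .Renviron was not edited, or R
# was not restarted afterwards).
get_access_token <- function() {
  token <- Sys.getenv("GCLOUD_ACCESS_TOKEN")
  if (!nzchar(token)) {
    stop("GCLOUD_ACCESS_TOKEN is not set; add it to .Renviron and restart R")
  }
  token
}
```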
Now you are ready to begin extracting text from images in R.
In the next section I’ll show you how to extract text from a single image file.
Later I’ll show you how to extract text from multiple images.
Extracting handwriting from a single image
In this section and subsequent sections I will make use of several images of my own handwriting.
So that you can compare the output of Vision API against a known text, I have handwritten several pages of James Oliver Curwood’s novel Baree, Son of Kazan. Later I’ll show you how to compare the results of extracting text from these handwritten pages with the text of the novel from Project Gutenberg.
Creating document text detection requests
To use Vision API to extract text from images we must send the images, in JSON format, to the URL https://vision.googleapis.com/v1/images:annotate. Putting images into JSON means encoding them as text. Vision API requires images to be encoded using the Base64 encoding scheme (see https://en.wikipedia.org/wiki/Base64 for more details).
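To get a feel for what Base64 looks like, here is a tiny sketch: the temporary text file is just a stand-in for a real image file.

```r
# Write a few bytes to a throwaway file and Base64-encode it, exactly
# as we will later do with image files.
tmp <- tempfile(fileext = ".txt")
writeBin(charToRaw("hello"), tmp)
base64enc::base64encode(tmp)
#> [1] "aGVsbG8="
```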
The JSON payload of a text detection request to Vision API should look something like this:
{
"requests": [
{
"image": {
"content": <encoded-image>
},
"features": [
{
"type": "DOCUMENT_TEXT_DETECTION"
}
]
}
]
}
where <encoded-image> is a Base64 encoding of the image we want to extract text from.
Fortunately, you don’t have to figure out how to do the encoding yourself. The R package {base64enc} (Urbanek (2015)) can do that for you.
The document_text_detection function below uses base64enc::base64encode to compute a Base64 encoding of the image specified by the input path. It then packs the resulting encoding into a list of the right format for conversion to JSON before being posted to the URL above.
document_text_detection <- function(image_path) {
  list(
    image = list(
      content = base64enc::base64encode(image_path)
    ),
    features = list(
      type = "DOCUMENT_TEXT_DETECTION"
    )
  )
}
As you will be using {httr} (Wickham (2020)) you don’t even have to worry about doing the conversion to JSON yourself. {httr} will do it automatically, using {jsonlite} (Ooms (2014)) behind the scenes, when you make your request.
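If you are curious what {httr} will send, you can preview the serialisation yourself with {jsonlite}. The "abc123" content below is a stand-in for a real Base64 string.

```r
# Preview how the request list serialises to JSON. httr uses jsonlite
# behind the scenes with auto_unbox = TRUE, so scalar values become
# JSON strings rather than one-element arrays.
req <- list(
  image = list(content = "abc123"),  # stand-in for a real encoding
  features = list(type = "DOCUMENT_TEXT_DETECTION")
)
jsonlite::toJSON(list(requests = list(req)), auto_unbox = TRUE, pretty = TRUE)
```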
Now you are ready to use document_text_detection to create a request based on the image located at ~/workspace/baree-handwriting-scans/001.png.

dtd_001 <- document_text_detection("~/workspace/baree-handwriting-scans/001.png")
Be wary of inspecting the return value of this function call. It contains a huge Base64 string and R might take a very long time to print it to the screen.
You can use substr to inspect part of it, if you really want.
substr(dtd_001, 1, 200)
#> [1] "list(content = \"iVBORw0KGgoAAAANSUhEUgAAE2EAABtoCAIAAAA8pmPCAAAAA3NCSVQICAjb4U/gAAAACXBIWXMAAFxGAABcRgEUlENBAAAgAElEQVR4nOzdwW7aQBRA0dB/9E8yH+kuUC1qCJiUxFx6zgIJMRo/GxaJzBWH4/E4xvj4+Dg9nkzTNE3T+dPVgu3O"
#> [2] "list(type = \"DOCUMENT_TEXT_DETECTION\")"
Another pitfall to be wary of is accidentally encoding the path to a file instead of the contents of the file itself. If the response from Vision API contains error messages that include file paths, this is a likely cause.
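To see the difference between encoding a file and encoding its path, compare the two calls below (a sketch using a throwaway file):

```r
# base64encode() reads and encodes the *contents* of a file when given
# a path; wrapping the path in charToRaw() encodes the path *string*
# itself, which is exactly the mistake described above.
path <- tempfile(fileext = ".bin")
writeBin(as.raw(c(1, 2, 3)), path)
right <- base64enc::base64encode(path)             # encodes the file bytes
wrong <- base64enc::base64encode(charToRaw(path))  # encodes the path text
identical(right, wrong)
#> [1] FALSE
```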
Posting requests to Vision API
Now you can use httr::POST from {httr} to post your request to the Vision API.
httr::POST requires at least url and body arguments.
- url is the URL to post to. In this case "https://vision.googleapis.com/v1/images:annotate".
- body is the request payload. In this case a named list with one element named requests whose value is a list of request objects.
As well as those required arguments you also have to tell the server that the content of your request is JSON. You do this by adding “Content-Type: application/json” to the header of your request, which means passing a call to httr::content_type_json() as one of the optional arguments to httr::POST.
The Vision API documentation says that for a request to be accepted it must have an Authorization = "Bearer <GCLOUD_ACCESS_TOKEN>" header. We can add this header using the httr::add_headers function.

httr::add_headers(
  "Authorization" = paste("Bearer", Sys.getenv("GCLOUD_ACCESS_TOKEN"))
)
Here we use Sys.getenv("GCLOUD_ACCESS_TOKEN") to obtain the value of our access token.
The post_vision function below takes a requests list as input and returns the response from Vision API’s annotation endpoint.

post_vision <- function(requests) {
  httr::POST(
    url = "https://vision.googleapis.com/v1/images:annotate",
    body = requests,
    encode = "json",
    httr::content_type_json(),
    httr::add_headers(
      "Authorization" = paste("Bearer", Sys.getenv("GCLOUD_ACCESS_TOKEN"))
    )
  )
}
Finally, you are ready to use post_vision to make your first request.
post_vision expects a list of requests as input. In this case you only have one request, dtd_001, which you created earlier by calling document_text_detection with the path to your image. Nevertheless, you still have to pack your single request inside a list.

l_r_001 <- list(requests = dtd_001)
Calling post_vision now sends your image to Vision API and returns the response.

r_001 <- post_vision(l_r_001)
You can use httr::status_code to check that a valid response was received.

httr::status_code(r_001)
#> [1] 200

The value should be 200.
If you get a 401 instead then inspect the value of r_001. If you see something like this:
Response [https://vision.googleapis.com/v1/images:annotate]
Date: 2021-10-29 09:29
Status: 401
Content-Type: application/json; charset=UTF-8
Size: 634 B
{
"error": {
"code": 401,
"message": "Request had invalid authentication credentials. Expected OAuth 2 acc...
"status": "UNAUTHENTICATED",
"details": [
{
"@type": "type.googleapis.com/google.rpc.ErrorInfo",
"reason": "ACCESS_TOKEN_EXPIRED",
"domain": "googleapis.com",
then you might need to rerun the gcloud tool (see section 4 of the appendix) to generate a new access token.
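It can be convenient to wrap this status check in a small helper that surfaces the API’s own error message. This is a sketch: check_vision_response is my own name, not an {httr} function, and it assumes the error layout shown in the 401 response above.

```r
# Stop with the code and message from the response body when the
# request was not successful, otherwise return the response invisibly.
check_vision_response <- function(response) {
  if (httr::status_code(response) != 200) {
    err <- httr::content(response)$error
    stop("Vision API error ", err$code, ": ", err$message)
  }
  invisible(response)
}
```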
In the next section I’ll explain how to get text out of a valid response.
Getting text from the response
If the response is valid then you can use httr::content to extract the content of the response.

content_001 <- httr::content(r_001)
The content contains a lot of information. The text is contained inside responses[[1]] as fullTextAnnotation$text.

baree_hw_001 <- content_001$responses[[1]]$fullTextAnnotation$text
cat(baree_hw_001)
#> a
#> vast
#> fear
#> To Parce, for many days after he was
#> born, the world was a
#> gloony cavern.
#> During these first days of his life his
#> home was in the heart of a great windfall
#> where aray wolf, his blind mother, had found
#> a a safe nest for his hahy hood, and to which
#> Kazan, her mate, came only now and then ,
#> his eyes gleaming like strange balls of greenish
#> fire in the darknen. It was kazan's eyes that
#> gave
#> do Barce his first impression of something
#> existing away from his mother's side, and they
#> brought to him also his discovery of vision. He
#> could feel, he could smell, he could hear - but
#> in that black pirt under the fallen timher he
#> had never seen until the
#> eyes
#> came. At first
#> they frightened nin; then they puzzled him , and
#> bis Heer changed to an immense ceniosity, the world
#> be looking foreight at them when all at once
#> they world disappear. This was when Kazan turned
#> his head. And then they would flash hach at him
#> again wt of the darknen with such startling
#> Suddenness that Baree world involuntanty Shrink
#> closer to his mother who always treunded and
#> Shivered in a strenge way when Kazan came in.
#> Barce, of course, would never know their story. He
#> world never know that Gray Wolf, his mother, was
#> a full-hlooded wolf, and that Kazan, his father,
#> was a dog. In hin nature was already
#> nature was already beginning
#> its wonderful work, but it world never go beyind
#> cerria limitations. It wald tell him, in time, ,
#> that his heavtiful wolf - mother was blind, hur
#> he world never know of that terrible hattle between
#> Gray Wolf and the lynx in which his mother's sight
#> had been destroyed Nature could tell hin gatting
#> nothing
#> +
#> а
#> (
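One caveat before moving on: fullTextAnnotation can be absent from a response element when Vision API detects no text in an image, in which case indexing into it yields NULL. A defensive sketch (extract_text is my own helper name, not part of any package):

```r
# Pull the detected text out of one element of the parsed response
# content, falling back to an empty string when no text was found.
extract_text <- function(content, i = 1) {
  txt <- content$responses[[i]]$fullTextAnnotation$text
  if (is.null(txt)) "" else txt
}
```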
A lot of the text is readable but there is a lot of work to do to fix all of the errors. And this is only one page! Maybe things would go better for someone with neater handwriting than mine.
In the next section I’ll try to quantify how close the text extracted by Vision API from the handwritten pages is to the original text by measuring edit distance to the equivalent page of text from Project Gutenberg.
How well does it work?
To download the original text from Project Gutenberg use the {gutenbergr} (Robinson (2020)) package.

baree <- gutenbergr::gutenberg_download(4748)
If you get an error here, try a different mirror.
From the text of the entire novel we can extract just the portion that corresponds to our scanned pages. In this case that means lines 96 to 148.

baree_tx <- paste(baree$text[96:148], collapse = " ")
With a bit more close inspection we can find the substring of the downloaded text corresponding to the first handwritten page. In this case it is the substring beginning at the 1st character and ending at the 1597th.

cat(baree_tx_001 <- stringr::str_sub(baree_tx, 1, 1597))
#> To Baree, for many days after he was born, the world was a vast gloomy cavern. During these first days of his life his home was in the heart of a great windfall where Gray Wolf, his blind mother, had found a safe nest for his babyhood, and to which Kazan, her mate, came only now and then, his eyes gleaming like strange balls of greenish fire in the darkness. It was Kazan's eyes that gave to Baree his first impression of something existing away from his mother's side, and they brought to him also his discovery of vision. He could feel, he could smell, he could hear--but in that black pit under the fallen timber he had never seen until the eyes came. At first they frightened him; then they puzzled him, and his fear changed to an immense curiosity. He would be looking straight at them, when all at once they would disappear. This was when Kazan turned his head. And then they would flash back at him again out of the darkness with such startling suddenness that Baree would involuntarily shrink closer to his mother, who always trembled and shivered in a strange sort of way when Kazan came in. Baree, of course, would never know their story. He would never know that Gray Wolf, his mother, was a full-blooded wolf, and that Kazan, his father, was a dog. In him nature was already beginning its wonderful work, but it would never go beyond certain limitations. It would tell him, in time, that his beautiful wolf mother was blind, but he would never know of that terrible battle between Gray Wolf and the lynx in which his mother's sight had been destroyed. Nature could tell him nothing
Now, using the {stringdist} (van der Loo (2014)) package, calculate the edit distance between the text extracted from the handwritten page by Vision API and the substring extracted from the Project Gutenberg text.

stringdist::stringdist(baree_hw_001, baree_tx_001)
#> [1] 174
Apparently we could change the handwriting based text to match the text from Project Gutenberg with 174 changes. Quite a lot for one page! However, bear in mind that edit distance is not quite the same as the number of changes you need to make to fix the document yourself. You could use spell check tools to fix some errors quickly and use search-and-replace to remove systematic errors like non-alphabetic characters (assuming your text is entirely alphabetic) or additional spacing.
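As a sketch of the kind of systematic clean-up mentioned above (assuming, as here, that the target text is plain alphabetic prose), a normalisation pass might look like this. The function name and the allowed character set are my own choices, not anything prescribed by the packages used in this post.

```r
# Replace characters outside a small allowed set with spaces, then
# collapse runs of whitespace. This knocks out stray symbols and the
# extra line breaks introduced in the OCR output.
normalise <- function(x) {
  x <- gsub("[^A-Za-z' .,;-]", " ", x)
  gsub("[[:space:]]+", " ", trimws(x))
}
normalise("a  vast\nfear +\n(")
#> [1] "a vast fear"
```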
In calculating edit distance we used the default optimal string alignment method from {stringdist}. But the result is the same if we use different methods like Levenshtein distance (method = "lv") or full Damerau-Levenshtein distance (method = "dl").
Handling multiple pages
It is possible to send a multi page document to Vision API. A PDF, for example. A different approach to multi page documents is to send one request with multiple images, one image per page.
Building a request based on multiple images
Begin by putting images of all pages to be converted into the same folder. Here I’ve used the folder ~/workspace/baree-handwriting-scans.

scans_folder <- "~/workspace/baree-handwriting-scans"
Now use list.files with full.names = TRUE and pattern = '*.png' (assuming your images are in PNG format) to get a list of all the images.

scans <- list.files(scans_folder, pattern = '*.png', full.names = TRUE)
Next, iterate over scans with purrr::map and document_text_detection to create a list of JSON request objects, one for each page.

scans_dtd <- purrr::map(scans, document_text_detection)
As before, wrap this list of requests in another list.

l_r <- list(requests = scans_dtd)

Then call post_vision to send all the images to Vision API.

response <- post_vision(l_r)
Notice that even though there are multiple images we still only make one request. The JSON payload contains encodings of all images.
Depending on the number of images in the requests list, the above call to post_vision may take a few seconds to return a response.
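One caveat: the annotate endpoint limits how many images a single batch request may contain (16 per request, as I understand the current Vision API quotas), so a large folder is better posted in chunks. A sketch, where post_in_batches is my own name and post_vision is the function defined earlier:

```r
# Split a list of request objects into batches of at most batch_size
# requests and post each batch separately, returning a list of
# responses (one per batch).
post_in_batches <- function(requests, batch_size = 16) {
  batches <- split(requests, ceiling(seq_along(requests) / batch_size))
  purrr::map(batches, function(batch) post_vision(list(requests = batch)))
}
```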
As before, check that a valid response was received before opening it up and looking at what is inside.
httr::status_code(response)
#> [1] 200
If the response is valid you can iterate through the responses list inside the response content, extracting fullTextAnnotation from each element. This can be done with purrr::map by passing the name "fullTextAnnotation" as the function argument.
responses_annotations <- purrr::map(httr::content(response)$responses, "fullTextAnnotation")

purrr::map converts "fullTextAnnotation" into an extractor function which pulls out elements of the list argument having that name.
Using the same feature of purrr::map you can reach inside the annotations and pull out the text elements. This time use purrr::map_chr instead of purrr::map because the output should be a character vector.
purrr::map_chr(responses_annotations, "text")
#> [1] "a\nvast\nfear\nTo Parce, for many days after he was\nborn, the world was a\ngloony cavern.\nDuring these first days of his life his\nhome was in the heart of a great windfall\nwhere aray wolf, his blind mother, had found\na a safe nest for his hahy hood, and to which\nKazan, her mate, came only now and then ,\nhis eyes gleaming like strange balls of greenish\nfire in the darknen. It was kazan's eyes that\ngave\ndo Barce his first impression of something\nexisting away from his mother's side, and they\nbrought to him also his discovery of vision. He\ncould feel, he could smell, he could hear - but\nin that black pirt under the fallen timher he\nhad never seen until the\neyes\ncame. At first\nthey frightened nin; then they puzzled him , and\nbis Heer changed to an immense ceniosity, the world\nbe looking foreight at them when all at once\nthey world disappear. This was when Kazan turned\nhis head. And then they would flash hach at him\nagain wt of the darknen with such startling\nSuddenness that Baree world involuntanty Shrink\ncloser to his mother who always treunded and\nShivered in a strenge way when Kazan came in.\nBarce, of course, would never know their story. He\nworld never know that Gray Wolf, his mother, was\na full-hlooded wolf, and that Kazan, his father,\nwas a dog. In hin nature was already\nnature was already beginning\nits wonderful work, but it world never go beyind\ncerria limitations. It wald tell him, in time, ,\nthat his heavtiful wolf - mother was blind, hur\nhe world never know of that terrible hattle between\nGray Wolf and the lynx in which his mother's sight\nhad been destroyed Nature could tell hin gatting\nnothing\n+\nа\n(\n"
#> [2] "7\n9\n49\n7\nof Kazan's merciless vengeance 1 of the wonderful\nyears of their matehood of their loyalty, their\nShenge adventures in the great Canadian wilderness\ncit'culd make him arby a son of hazar.\nBut at first, and for many days, it was all\nMother. Even after his eyes opened wide and he\nhad pund his legs so that he could shonhce around\na little in the darkness, nothing existed ar buree\nfor\nhut his mother. When he was old enough to he\n.\nplaying with Shicks and mess art in the sunlight,\nhe still did not know what she looked like. But\nto him she was big and soft and warm, and she\nhicked his face with her tongue, and talked to him\nin a gentle, whimpening way that at lost made\nhim find his own voice in a faint, squeaky yap.\nAnd then came that wonderful day when the\ngreenish balls of fire that were kažan's eyes cancie\nnearer and nearer, a little at a tine, ,\ncarbiesky. Hereto pore Gray Wolf had warned hin\nhach. To he alone was the first law of her wild\nbreed during mothering time. A low snart from her\n. A\nthroat, ånd Kazan' had always stopped. But\nănd\non this day the snart did not come in aray\nWolf's throat it died away in a low, whimpering\nscond. A note of loneliness, of\ni 아\ngreat yearniny _“It's all night law,\" she was\nť keys, of a\nnow\nsaying to kázan; and katan\npowsing for a moment\nreplied with an answłni\nwswering\ndeep in his throat.\nStill slowly, as it not quite sure of what he\nЕ\nwould find, Kazan came to them, and Baree\nsnuggled closer to his mother\nas he dropped down heavily on his belly close to\naray Wolf. He was unafraid\nand nightily\nand\nvery\n.\nC\nto make ure -\nnote\nHe heard kazan\n"
The folder_to_chr function below puts all the above steps together.
The input to folder_to_chr is a path to a folder of images whose text we want to extract. The return value is a character vector containing all the text extracted from those images.
folder_to_chr <- function(scans_folder) {
  scans <- list.files(scans_folder, pattern = "*.png", full.names = TRUE)
  response <- post_vision(list(requests = purrr::map(scans, document_text_detection)))
  responses_annotations <- purrr::map(httr::content(response)$responses, "fullTextAnnotation")
  purrr::map_chr(responses_annotations, "text")
}
For example, to convert all the pages in scans_folder and combine the results into a single string do something like this:

baree_hw <- paste(folder_to_chr(scans_folder), collapse = " ")
How well does it work?
As before you can compute the edit distance between baree_hw and the equivalent pages of the original text. Here, that text has already been extracted into baree_tx.

stringdist::stringdist(baree_hw, baree_tx)
#> [1] 446
Is this high? It’s hard to know without something to compare against.
For example, how does it compare to the edit distance between a random string of the same length as the target text and the target text itself?
random_text <- paste(sample(c(letters, " "), stringr::str_length(baree_tx), replace = TRUE), collapse = "")
stringdist::stringdist(random_text, baree_tx)
#> [1] 2882
It’s hardly surprising that this is a much higher number.
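To make the raw distances easier to interpret you can also express each one relative to the length of the reference text, giving a rough per-character error rate. This is my own convenience function, not part of {stringdist}:

```r
# Fraction of the reference text that would need editing: edit
# distance divided by the number of characters in the reference.
error_rate <- function(distance, reference) {
  distance / nchar(reference)
}
# e.g. error_rate(446, baree_tx) for the Vision API result above
```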
Another comparison we could make is against the result of using Tesseract to extract text from the same handwritten pages.
Fortunately, the {tesseract} (Ooms (2021)) package makes this very easy.
folder_to_chr_ts <- function(scans_folder) {
  eng <- tesseract::tesseract("eng")
  scans <- list.files(scans_folder, pattern = "*.png", full.names = TRUE)
  paste(purrr::map(scans, tesseract::ocr, engine = eng), collapse = " ")
}

baree_hw_ts <- folder_to_chr_ts(scans_folder)

stringdist::stringdist(baree_hw_ts, baree_tx)
#> [1] 1502
So Vision API does much better than Tesseract according to this particular little test. But this is probably not a fair comparison. Tesseract doesn’t claim to be able to extract text from images of handwriting. Furthermore, it may be possible to configure Tesseract in a way that improves the results or to modify the input images to make them better suited to Tesseract. Also, I’m only looking at my own handwriting. It might be that another person’s handwriting works better with Tesseract than mine.
My objective was just to see whether Vision API made it possible to convert my own handwritten documents into text. While the results are promising, it seems that I might spend longer fixing errors in the extracted text than it would take to type the pages out from scratch.
Appendix (Google Cloud Platform)
If you haven’t worked with Vision API before then read the following tutorial first: https://cloud.google.com/vision/docs/setup
- You should only need to do steps 0 - 2 once.
- Step 3 should only need to be done once unless the private key has expired.
- If you have already created a project and service account jump to step 3.
- If you already have the private key as well jump to step 4.
0. Install the Cloud SDK
You need to install the gcloud tool to configure authentication.
The installation instructions are here: https://cloud.google.com/sdk/docs/install
1. Create project
First you need to create a GCP project and enable Vision API for that project.
You will have to set up billing when creating the project if you haven’t done so before.
- Go to the Project Selector: https://console.cloud.google.com/projectselector2/home/dashboard and select Create Project
- Give your project a name and click Create.
- Eventually the Project Dashboard appears. At the time of writing somewhere in the bottom-left of the dashboard is a Getting Started panel containing a link named Explore and enable APIs. Click on it.
- You will be transported to the API dashboard.
- At the top you should see + ENABLE APIs AND SERVICES. Click on that.
- Now you get a search dialog.
- Type vision and press enter.
- Select Cloud Vision API and on the page that appears click the Enable button.
2. Create service account
Then you need to create a service account.
Strictly speaking this isn’t necessary, but it does seem to be the most straightforward way of enabling authentication for your project.
- Click the hamburger in the navigation bar to open the sidebar menu.
- Scroll down to IAM & Admin and select Service Accounts.
- Click on Create Service Account.
- Give your account a name.
- Click Create and continue
3. Download private key
- In the Service accounts dashboard find the service account you created and click on the three dots below Action.
- Select Manage Keys from the drop-down menu.
- On the page that opens click on the ADD KEY button.
- Choose Create New Key from the drop-down menu.
- Click Create on the modal dialog that opens.
- You will be prompted to save your key.
- Download your private key and remember where you save it (the save location is referenced below as <PATH_TO_PRIVATE_KEY>).
4. Use gcloud tool to get access token
Now you can use gcloud to get the access token required in the tutorial above.
First, define an environment variable GOOGLE_APPLICATION_CREDENTIALS
pointing to the location where you saved the private key.
$ export GOOGLE_APPLICATION_CREDENTIALS=<PATH_TO_PRIVATE_KEY>
Then run
$ gcloud auth application-default print-access-token
The output will be your access token followed by lots of dots.