Extracting known entities from text

January 05, 2017, at 06:40 AM

I have the OCR output of an image that contains a block of text. The text includes the name of a brand and a location along with other text. I already have a list of all the brands and a list of all the locations. How do I determine which brand and location the text is talking about? The text is usually 30 words or so at most.

Also, since it's OCR output, a letter or two may be off. It should still match.

Answer 1

There is a powerful tool that almost all programming languages support called "Regular Expressions" (a.k.a. RegEx). Regular expressions allow you to search for and extract content from a large amount of text.

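Python ships with a built-in module for this called re, which the example below uses, so nothing needs to be installed for it. There is also a separate third-party module called regex, which you can install by using

pip install regex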

Regular Expressions are like a mini programming language within your programming language. They provide you with powerful and specific wildcard characters.

Here is an example of using regular expressions in Python to extract an email address and a website from a block of text:

import re

# Example string
x = "Lorem ipsum www.stackoverflow.com dolor sit amet, example@website.com consectetur adipiscing elit."

# Extract website (raw string so the backslashes reach the regex engine;
# \. matches a literal dot instead of "any character")
website = re.findall(r".*(www\.\S*)", x)
print(website[0])

# Extract email
email = re.findall(r".*\s(\S*@\S*)", x)
print(email[0])

If you were to run this, it would output www.stackoverflow.com and example@website.com. The characters . * \s and \S are special pattern characters. Here is what each piece does:

.*     Matches any character (except a newline) 0 or more times
\s     Matches any single whitespace character
\S*    Matches any non-whitespace character 0 or more times

The ( and ) mark the part of the match to extract.
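
To see how these pieces behave on their own, here is a small sketch using the built-in re module. The sample string is made up for illustration, and the lazy form .*? is an extra that the example above does not use:

```python
import re

text = "call 555-1234 now"  # made-up sample text

# .* is greedy: it consumes as much as possible, so the group only gets
# the last digit that still lets the pattern match.
greedy = re.search(r".*(\d+)", text).group(1)

# .*? is lazy: it consumes as little as possible, so the group starts
# at the first digit and takes the whole first run of digits.
lazy = re.search(r".*?(\d+)", text).group(1)

# \s requires one whitespace character, then \S* captures the run of
# non-whitespace characters that follows it.
word = re.search(r"\s(\S*)", text).group(1)

print(greedy, lazy, word)
```

This is why a leading .* in front of a group can give surprising results: it decides how much text is left over for the group to capture.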

By using these wildcards, you can search for and extract any information from a block of text, text document, etc. In your case, you will be looking for the brand and location.
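
Since the OCR text may have a letter or two wrong, an exact pattern can miss. One possible way to handle that is sketched below. It is not a plain regular expression approach: it uses re only to split the text into words and then uses difflib from the standard library to score how similar each run of words is to each known name. The lists BRANDS and LOCATIONS, the function name find_entity, the sample OCR text and the 0.8 cutoff are all assumptions made up for this illustration, so tune them to your data:

```python
import difflib
import re

# Hypothetical known lists; replace these with your own data.
BRANDS = ["Starbucks", "Subway", "Burger King"]
LOCATIONS = ["Boston", "New York", "Chicago"]


def find_entity(text, candidates, cutoff=0.8):
    """Return the candidate most similar to some run of words in text, or None."""
    words = re.findall(r"\w+", text.lower())
    best, best_score = None, cutoff
    for cand in candidates:
        n = len(cand.split())  # compare against windows with the same word count
        for i in range(len(words) - n + 1):
            window = " ".join(words[i:i + n])
            score = difflib.SequenceMatcher(None, cand.lower(), window).ratio()
            if score >= best_score:
                best, best_score = cand, score
    return best


# Made-up OCR output with one wrong letter in each name.
ocr_text = "Welcome to Starbuoks coffee, New Yark branch"
print(find_entity(ocr_text, BRANDS))
print(find_entity(ocr_text, LOCATIONS))
```

With this sample text the sketch picks Starbucks and New York. A lower cutoff tolerates more OCR errors but raises the risk of false matches, especially for short names.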

You can learn more about how to use regular expressions on: https://docs.python.org/3.4/howto/regex.html

Here is a helpful reference page for writing regular expressions: https://s-media-cache-ak0.pinimg.com/originals/54/83/08/5483089e3aa56bbee7c194370d1a22d7.png

And (just to make you smile) here's a funny comic about regular expressions: https://xkcd.com/208/
