Tuesday, March 17, 2009

Understanding Natural Language Processing for SEO


Not long ago we got word that a new search engine will launch in May that will rely heavily on Natural Language Processing (NLP). And we have even heard Eric Schmidt, CEO of Google, hint that the search giant would be implementing a greater emphasis on NLP:

Wouldn’t it be nice if Google understood the meaning of your phrase rather than just the words that are in that phrase? We have a lot of discoveries in that area that are going to roll out in the next little while. (via)

As search marketers, it’s important that we stay on top of search technologies and trends. I therefore thought it would be a good idea to learn the basics of NLP, and for this I turned to a true expert in the field. Marie-Claire Jenkins (a.k.a. CJ) is completing a PhD in Natural Language Processing and Artificial Intelligence at the University of East Anglia. She is an extremely seasoned SEO who has worked with corporate clients on all things related to search. You can learn more about CJ by visiting her web site, Science for SEO.

What follows is my interview with CJ about Natural Language Processing:

Joe: Can you define Natural Language Processing for the layman?

CJ: It is the area of computing that deals with words. Words form texts, and texts form collections of many texts, leading to an awful lot of words. This “bag of words” needs to be analyzed so that a machine can make sense of it. If it can’t, there is no way of retrieving information from the texts, and so they are in effect useless.

Linguistic analysis is used to analyze and represent the texts. Once the computer can represent the language in some way (patterns, recurring words, synonymy…), the information in the texts can be manipulated. This mimics the human understanding of language. NLP is a subfield of artificial intelligence and computational linguistics.
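To make the “bag of words” idea concrete, here is a minimal Python sketch. It is an illustration only, not something taken from CJ's research: the `bag_of_words` helper and the sample sentence are hypothetical, and real systems use far more careful tokenization. It simply lowercases a text, splits it into words, and counts how often each word recurs, which is one of the simplest ways a machine can represent a text.

```python
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text, split it into word tokens, and count each token."""
    # Assumption: a "word" is any run of letters, digits, or apostrophes.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tokens)


# Hypothetical sample text, loosely echoing the description above.
sample = "Words form texts, and texts form collections of many texts."
counts = bag_of_words(sample)

# The two most frequent words and their counts.
print(counts.most_common(2))  # [('texts', 3), ('form', 2)]
```

Word order is thrown away in this representation, which is why it is called a "bag": only the recurring words and their frequencies remain.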

A little history…

NLP was first used in 1948 in a look-up dictionary at Birkbeck College in London. The code-breakers from the Second World War were highly interested, as this was a new thing to work on now that the war was over. Machine translation was actually the first research area involving NLP, in the 1950s, and it is a problem that has still not been solved today. NLP draws on a wealth of research by linguists such as Noam Chomsky and Charles Fillmore, who revolutionized the way we search for patterns in language, which is what NLP is all about. Research has continued since, and improvements have been made.

It is a very old area of research, and an extremely difficult one. The double helix has been discovered, quantum mechanics has been developed, radar and kidney dialysis machines have been invented, and man has walked on the moon… but the problem of NLP has not yet been solved, although there have been some breakthroughs. This is how hard it is.

Joe: Are there any examples of NLP in our daily life, online or off, that can help us understand this technology a bit more?

CJ: NLP is used in a great many areas of our lives. A simple example is the spell checker and grammar checker in your word processing software. When you call the bank and get an auto responder, that also uses NLP. So do all search engines, mobile phones (predictive text and speech recognition, for example), digital encyclopedias, dictionaries, washing machines, sign language software for the deaf, databases and computer games, and there are many, many more applications of the technology.

Joe: We have heard rumors for some time now that Google and some smaller search engines are developing NLP in their search platforms. Is this a credible theory? If so, will NLP become a trend in search development?

CJ: All search engines have been using NLP since the 1950s. A search engine is not necessarily Google; it can also be part of machine translation technology, for example, and of many other systems. If we take the example of a web search engine, in order to make sense of the words on web pages it needs to process them so it can find patterns that allow it to classify everything. NLP also enables the engine to analyze your query and feed the results into the search engine, which then uses them to provide you with an answer.
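As a rough sketch of that two-sided process (process the pages, then process the query the same way and match them up), the Python below builds a tiny inverted index. Everything here is an assumption made for illustration: the page names, the page texts, and the `tokenize`, `build_index` and `search` helpers are hypothetical, and no real web search engine is this simple.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages: dict[str, str]) -> dict[str, set[str]]:
    """Map every word to the set of pages that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for name, text in pages.items():
        for token in tokenize(text):
            index[token].add(name)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return the pages that contain every word of the query."""
    tokens = tokenize(query)  # the query gets the same processing as the pages
    if not tokens:
        return set()
    results = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        results &= index.get(token, set())
    return results


# Hypothetical pages, invented for this example.
pages = {
    "page-a": "Machine translation converts text between languages.",
    "page-b": "Spell checkers correct text in word processors.",
    "page-c": "Search engines classify web pages.",
}
index = build_index(pages)
print(sorted(search(index, "text translation")))  # ['page-a']
```

The point of the sketch is only that pages and queries pass through the same word-level processing before anything can be matched.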

It is not just a credible theory; it is a certainty. NLP is definitely set to develop in the future. There are major issues with it at the moment. Finding patterns between the words and analyzing which grammatical group they belong to, for example, does not tell us much about the topic of the text. To achieve this we need to do some context analysis, which also uses NLP. All computer programs which deal with language use NLP.

A small list of examples:

* Machine translation
* Natural language understanding
* Automatic summarization
* Information retrieval
* Natural language generation
* Topic detection
* Optical character recognition (OCR, as used for Google Books)

Joe: Is there a relationship between SEO and NLP? If so, what is it?

CJ: There is indeed a relationship. The search engines use words to assess what a web page is about, using NLP amongst other techniques. The content on a web page will help determine what the topic of the page is.
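One simple, well-known way to guess what a page is about from its words is TF-IDF weighting: a word scores highly if it occurs often on this page but on few other pages. The Python sketch below uses it purely as an illustration. It is not a claim about how any particular search engine works, and the documents and the `tokenize` and `top_terms` helpers are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def top_terms(docs: dict[str, str], name: str, k: int = 3) -> list[str]:
    """Return the k highest-scoring TF-IDF terms for the document `name`."""
    tokenized = {n: tokenize(t) for n, t in docs.items()}
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each word appear?
    df: Counter = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    # Term frequency in the chosen document, weighted by rarity elsewhere.
    tf = Counter(tokenized[name])
    scores = {
        term: count * math.log(n_docs / df[term]) for term, count in tf.items()
    }
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda term: (-scores[term], term))[:k]


# Hypothetical mini "web" of three pages, invented for this example.
docs = {
    "coffee": "Coffee beans are roasted and the roasted coffee is ground",
    "tea": "Tea leaves are dried and the tea is brewed",
    "bread": "Bread dough is kneaded and the bread is baked",
}
print(top_terms(docs, "coffee", k=2))  # ['coffee', 'roasted']
```

Words such as "and", "the" and "is" appear on every page, so they score zero and say nothing about the topic, while the words that recur on one page only rise to the top.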

Understanding the techniques used in NLP allows us to provide the best format and patterns for the search engine. In fact, I think the entire site is affected, because analysing a whole site, page by page, helps to determine exactly what the site is about. Seeing as NLP seeks to mimic human language understanding, using common sense is a good idea. This is why search engines always recommend writing good, relevant content.

NLP is a complex area of research, requiring a solid understanding of grammars (not just grammar) and a good grounding in computational linguistics (in order to apply the techniques to machines, which is not always easy). For the purposes of SEO, learning all about this would be overkill and a waste of time, although very interesting! I recommend understanding the basic way the technology functions; that, along with a good idea of how to write clearly, using the correct vocabulary in a focused context, is enough.

Joe: What area of NLP are you working in? And, can you share with us any interesting aspects of your research?

CJ: My early research was on machine translation systems and, naturally, search engines and natural language processing. After a few years working in that area, I took a detour.

My current research is in natural language understanding and generation. My current test application is on conversational agents for customer service providers. It cuts down on costs and allows the customers to always have access to information and data easily. They ask natural language questions, make statements, ask for advice. It uses natural language processing techniques all over.

I also use a lot of artificial intelligence methods, information retrieval, web 3.0 things like the OWL language for example. I use a huge amount of linguistic research, information extraction, different programming languages, cluster analysis and all sorts of other things. The challenges are monumental but it is fascinating.