(Originally posted August 28, 2013)
I’ve always been interested in Natural Language Processing (NLP), so I wanted to try my hand at a simple article summarizer. The basic idea is that we want to boil down the article to only its most important sentences. Disregard the fluff, and return the ones with the most information. It sounds simple, but how to make that determination wasn’t obvious off the bat. Even trying to rank sentences on my own was tough. After a few hours of research, I came across this post, which had a very clever way of determining the important sentences. The important sentences in the article should be those that share the most words with other sentences. To get an idea of this, consider that an important sentence should carry the core information, and supporting sentences should explain parts of the main one.
To calculate this, we create a connected graph between sentences where each link is the number of words in common between the sentences, normalized by length. We represent this graph as a matrix and simply loop through the sentences and compare the words. This is the naive approach the post’s author takes, but he also gives a few suggestions for improvement, such as stemming the words and removing stopwords. Stemming removes pluralizations and other non-root endings from words; for example, stemming turns “roots” into “root”. Stopwords are common words such as ‘and’ or ‘or’, which don’t provide much information about the topic of a sentence. For these techniques, Python’s Natural Language Toolkit is fantastic and provides both out of the box.
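To make the scoring concrete, here is a minimal pure-Python sketch of the idea, not the original post’s code. It uses a tiny hardcoded stopword list in place of NLTK’s and skips stemming entirely; the function names are my own.

```python
# A small, illustrative stopword set (NLTK's corpus.stopwords is far larger).
STOPWORDS = {"a", "an", "and", "or", "the", "of", "to", "in", "is", "that", "we"}

def tokenize(sentence):
    """Lowercase, split on whitespace, strip punctuation, drop stopwords."""
    words = [w.strip(".,;:!?").lower() for w in sentence.split()]
    return {w for w in words if w and w not in STOPWORDS}

def similarity(s1, s2):
    """Words the two sentences share, normalized by their average length."""
    w1, w2 = tokenize(s1), tokenize(s2)
    if not w1 or not w2:
        return 0.0
    return len(w1 & w2) / ((len(w1) + len(w2)) / 2)

def rank(sentences):
    """Score each sentence by its total similarity to every other sentence.

    This is the row-sum of the similarity matrix described above.
    """
    scores = []
    for i, s in enumerate(sentences):
        scores.append(sum(similarity(s, t)
                          for j, t in enumerate(sentences) if i != j))
    return scores
```

Swapping in NLTK’s stopword corpus and a real stemmer (e.g. its Porter stemmer) would be the improvements the post’s author suggests.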
After ranking all the sentences, the final step is to determine how to display the shortened article. The way the post’s author did this was by picking the best sentences from each paragraph. I wanted to be able to shorten the length arbitrarily, so I decided, at least for now, to display the X most informative sentences in the order they were written, where X is arbitrary.
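That selection step can be sketched in a few lines. This assumes each sentence already has a relevance score (however it was computed); the function name is my own.

```python
def summarize(sentences, scores, x):
    """Return the x highest-scoring sentences, in their original order.

    sentences -- the article's sentences, in order
    scores    -- one relevance score per sentence
    x         -- how many sentences to keep
    """
    # Indices of the x best sentences by score...
    top = sorted(range(len(sentences)),
                 key=lambda i: scores[i], reverse=True)[:x]
    # ...re-sorted into document order so the summary reads naturally.
    return [sentences[i] for i in sorted(top)]
```

Sorting the chosen indices back into document order is what keeps the summary coherent rather than a score-ordered jumble.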
At the moment, there are still many improvements to be made. The algorithm does well on those “stock” articles that are mostly straight information. Opinion pieces are a little tougher to boil down to just their main points. By tweaking some of the components or the ranking algorithm, it should be able to perform well regardless of the content.
Edit:
After running the above article through the algorithm, I got the following 8 sentences:
Sent 1: The basic idea is that we want to boil down the article to only its most important sentences.
Sent 5: After a few hours of research, I came across this post which had a very clever way of determining the important sentences.
Sent 6: The important sentences in the article should be those that share the most words with other sentences.
Sent 7: To get an idea of this, consider that an important sentence should carry the core information, and supporting sentences should explain parts of the main one.
Sent 8: To calculate this, we create a connected graph between sentences where each link is the number of words in common between the sentences, normalized by length.
Sent 9: We represent this graph as a matrix and simply loop through the sentences and compare the words.
Sent 15: After ranking all the sentences, the final step is to determine how to display the shortened article.
Sent 16: The way the post’s author did this was by picking the best sentences from each paragraph.
Not bad, but it could probably do a little better. We’ll see how it goes.