Stop words are words that are filtered out of text before or during natural language processing.

Which words are stop words?

There is no single list of stop words. The stop words you use will vary depending on the specific project you are working on.

In Python, there is a popular natural language processing library called the Natural Language Toolkit (NLTK). Below is a list of words NLTK uses as English stop words. The exact contents of the list vary between NLTK versions.

[“i”, “me”, “my”, “myself”, “we”, “our”, “ours”, “ourselves”, “you”, “your”, “yours”, “yourself”, “yourselves”, “he”, “him”, “his”, “himself”, “she”, “her”, “hers”, “herself”, “it”, “its”, “itself”, “they”, “them”, “their”, “theirs”, “themselves”, “what”, “which”, “who”, “whom”, “this”, “that”, “these”, “those”, “am”, “is”, “are”, “was”, “were”, “be”, “been”, “being”, “have”, “has”, “had”, “having”, “do”, “does”, “did”, “doing”, “a”, “an”, “the”, “and”, “but”, “if”, “or”, “because”, “as”, “until”, “while”, “of”, “at”, “by”, “for”, “with”, “about”, “against”, “between”, “into”, “through”, “during”, “before”, “after”, “above”, “below”, “to”, “from”, “up”, “down”, “in”, “out”, “on”, “off”, “over”, “under”, “again”, “further”, “then”, “once”, “here”, “there”, “when”, “where”, “why”, “how”, “all”, “any”, “both”, “each”, “few”, “more”, “most”, “other”, “some”, “such”, “no”, “nor”, “not”, “only”, “own”, “same”, “so”, “than”, “too”, “very”, “s”, “t”, “can”, “will”, “just”, “don”, “should”, “now”]

As you can see, the words in the list are very common words. For the most part, if you remove these words from a sentence, you can still get an idea of what the intent of the sentence is.

For example, take the sentence “Come over to my house”. You can remove the stop words (“over”, “to”, “my”) and end up with “Come house”. You could still interpret this as “come over to my house”, but you’ve done it with only two words.
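
The example above can be sketched in a few lines of Python. This is a minimal illustration, not how NLTK does it: the `STOP_WORDS` set is a small hand-picked subset of the list above, and `remove_stop_words` is a hypothetical helper that splits on whitespace only.

```python
# A small, hand-picked subset of the list above (an assumption for this sketch).
STOP_WORDS = {"over", "to", "my", "the", "a", "an", "is", "of"}

def remove_stop_words(sentence, stop_words=STOP_WORDS):
    """Return the words of `sentence` that are not in `stop_words`."""
    # Lowercase each word for the lookup so "To" and "to" both match.
    return [word for word in sentence.split() if word.lower() not in stop_words]

print(remove_stop_words("Come over to my house"))  # ['Come', 'house']
```

Real text also needs punctuation handling and a proper tokenizer, which this sketch leaves out.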

Why are stop words not always good?

Any time you start removing words from a sentence, there is a chance you lose some of the meaning.

In our previous example where “Come over to my house” changed to “Come house”, it is no longer as clear what the person is trying to say.

Is the person asking the house to follow them? Or are they telling a person to head over to their house?

This is why removing stop words can be problematic, or at least why it can be dangerous to use someone else’s static list of stop words.
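
One way to see this risk is with negation. The word “not” appears in the list above, so two sentences with opposite meanings can collapse to the same result. The sketch below assumes a tiny stop word set and a hypothetical `strip_stop_words` helper, purely for illustration.

```python
# All three of these words appear in the list shown earlier in this article.
STOP_WORDS = {"this", "is", "not"}

def strip_stop_words(sentence, stop_words=STOP_WORDS):
    """Lowercase the sentence and drop any word found in `stop_words`."""
    return [word for word in sentence.lower().split() if word not in stop_words]

positive = strip_stop_words("This is good")
negative = strip_stop_words("This is not good")

# Both sentences reduce to the same single word, so the negation is lost.
print(positive, negative)  # ['good'] ['good']
```

If your project depends on that kind of distinction, for example sentiment analysis, you would likely want to keep “not” out of your stop word list.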

Why use stop words?

Stop words are your opportunity to better optimize your natural language processing.

When you start reviewing the text you are processing, you will find some words that are used very often and may not add much to the meaning of the sentences.

However, every word you leave in the sentence increases the time it takes to process the text and the disk space required to store the result.

How to select which stop words to use

A good strategy for selecting stop words is to use collection frequency.

You count the total number of times each term appears across your text, then start removing frequent terms that don’t add much value. Deciding which terms to remove is best done manually rather than programmatically.
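
The counting step is easy to automate, even if the final decision is not. Below is a rough sketch using Python’s standard library. The `collection_frequency` function, the regular expression used to split words, and the sample documents are all assumptions made for this illustration.

```python
import re
from collections import Counter

def collection_frequency(documents):
    """Count how many times each term appears across all documents."""
    counts = Counter()
    for text in documents:
        # Treat runs of letters and apostrophes as terms, ignoring case.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts

docs = [
    "Go to the store",
    "The store is open",
    "Come over to my house",
]

freq = collection_frequency(docs)

# Print the most frequent terms as candidates for manual review.
for term, count in freq.most_common(5):
    print(term, count)
```

In this tiny sample, “to”, “the”, and “store” each appear twice. Frequency alone would flag all three, but “store” clearly carries meaning, which is why the list of candidates still needs a human to review it.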

Example use case for stop words

Assume for a moment you are building a search engine application. You have tens of thousands of documents that you need to index.

When you create an index, you extract the relevant terms from each document to make it easier to locate documents that contain specific text.

As you build your index, every term you add to the index is duplicating content from the documents. With a small data set, this might not have much of an impact. However, as your data set grows this can have a significant effect on the size of your index.

As your index grows, two things happen: it takes up more space on your hard disk, and it takes more time to iterate through the index to locate the documents you care about.

If you start removing common words like those listed in the first section of this article, you can significantly reduce the time it takes to build the index and the space needed to store it. You can also search the index more quickly and improve the quality of your search results.

If you search your documents for the phrase “go to the store”, the terms you probably care about are “go” and “store”. Most likely, you do not want to return every page containing the words “to” and “the”, because that would include nearly every document in your data set.
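
To make the search engine scenario concrete, here is a simplified sketch of an index built with and without stop words. It is not how a production search engine works. The `build_index` and `search` functions, the stop word set, and the three sample documents are all hypothetical.

```python
from collections import defaultdict

# Illustrative subset of stop words (an assumption for this sketch).
STOP_WORDS = {"to", "the", "is", "a", "my", "over"}

def build_index(documents, stop_words=frozenset()):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            if term not in stop_words:
                index[term].add(doc_id)
    return index

def search(index, query, stop_words=frozenset()):
    """Return ids of documents containing every non-stop-word query term."""
    terms = [t for t in query.lower().split() if t not in stop_words]
    if not terms:
        return set()
    return set.intersection(*(index.get(t, set()) for t in terms))

documents = {
    1: "Go to the store",
    2: "The store is closed",
    3: "Come over to my house",
}

full_index = build_index(documents)
small_index = build_index(documents, STOP_WORDS)

print(len(full_index), len(small_index))  # 10 5
print(search(small_index, "go to the store", STOP_WORDS))  # {1}
```

Even with three short documents, dropping the stop words halves the number of terms in the index, and the query is matched only on “go” and “store”.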

Summary

Stop words are a useful tool for optimizing your natural language processing project. When used correctly, they can make your application run better, faster, or cheaper.

The main thing to remember is that you need to put some thought into which stop words you use, so that you do not hurt the quality of your results.