
Data Scientists need a whole toolbox of skills to be able to analyze and manipulate data on the job. One of those skills is Natural Language Processing (NLP), which helps machines understand and classify human speech and writing. To learn more, we asked Data Scientist Lauren Washington, who also mentors Thinkful data science students, to explain what NLP actually is, how companies like Google and Facebook use it, and why it’s a useful skill for data scientists. Plus find resources to help you get started with Natural Language Processing!


What is natural language processing or NLP?

NLP is basically feature engineering. This isn't a machine learning algorithm. It’s a way of taking natural text and turning it into something that an algorithm can use. As a data scientist, 80% of your job is being a data janitor and trying to clean things up and turn them into data and features that can be worked with. So natural language processing is a way to process that textual data and turn it into numerical values or categorical values that you can use to actually model text.

A good example is a spam filter for email. A machine can predict whether a message is spam or important based on word-frequency counts derived from bodies of text.
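The frequency-count idea can be sketched in a few lines of plain Python. The word counts below are invented toy data standing in for counts you'd derive from real labeled messages:

```python
from collections import Counter

# Hypothetical toy counts: words from messages already labeled spam or not spam.
spam_words = Counter("win free prize click free offer win".split())
ham_words = Counter("meeting notes attached see agenda for tomorrow".split())

def spam_score(message):
    """Score a message by how often its words appear in the spam
    counts versus the non-spam (ham) counts."""
    words = message.lower().split()
    spam_hits = sum(spam_words[w] for w in words)
    ham_hits = sum(ham_words[w] for w in words)
    return spam_hits - ham_hits

print(spam_score("click to win a free prize"))  # positive: looks like spam
print(spam_score("agenda for the meeting"))     # negative: looks legitimate
```

A real filter weighs these counts probabilistically rather than just subtracting them, but the input is the same: text turned into numbers.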

What are other examples of natural language processing?

Virtual Assistants: I use my virtual assistant to play Spotify, but if I word my commands differently, it reacts differently – that’s natural language processing. It’s using my words to try to analyze what I'm asking it to do. When you put keywords in your request, it will pick those up and throw out the rest, which is how natural language processing works.

Chatbots: Chatbots use natural language processing to map against a database to say, "Okay, usually when somebody asks a question like this, this is the type of answer that they want."


Recommendation engines: This is software that analyzes data to make suggestions to a user based on their interests or browsing habits. Once you start learning more about natural language processing and machine learning, it will be really funny to go to different websites and think, "Oh, they probably just have this one algorithm underneath that's looking at what I'm typing, and that's why this website is able to recommend these things to me."

Natural Language Processing can also do things like sentiment analysis to see the polarity of a piece of text – whether what someone is saying is positive, negative, or neutral. Another example is predictive text, which we see on our cell phones every day.
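The simplest form of sentiment analysis is lexicon-based: count positive words against negative ones. This is a minimal sketch with invented word lists; production systems use much larger lexicons or trained models:

```python
# Tiny invented sentiment lexicons for illustration.
POSITIVE = {"great", "love", "excellent", "happy", "good"}
NEGATIVE = {"terrible", "hate", "awful", "bad", "broken"}

def polarity(text):
    """Classify text as positive, negative, or neutral by lexicon hits."""
    words = [w.strip(".,!?") for w in text.lower().split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(polarity("I love this, it works great"))  # positive
print(polarity("this is terrible and broken"))  # negative
```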

How does natural language processing work?

As long as you know how to do supervised and unsupervised learning, natural language processing should be pretty straightforward.

In a supervised learning model, you already have something that's defined, e.g. important versus spam in your email inbox. If you create a dataset with important and spam messages, you can determine the words associated with each category, so that when a new email comes in, you can find the similarity between it and the previous emails. A huge part of natural language processing is calculating the similarity between different words and datasets.
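Here is a minimal sketch of that supervised setup: a Naive-Bayes-style classifier built from scratch on a toy labeled inbox (the messages are invented, and class priors are dropped for brevity since the classes are balanced):

```python
import math
from collections import Counter

# Invented toy inbox; in practice you'd have many more labeled messages.
training = [
    ("win a free prize now", "spam"),
    ("free offer click here", "spam"),
    ("meeting agenda attached", "important"),
    ("see you at the meeting tomorrow", "important"),
]

# Count word occurrences per label.
counts = {"spam": Counter(), "important": Counter()}
for text, label in training:
    counts[label].update(text.split())

def classify(message):
    """Pick the label whose word distribution best explains the message,
    using log-probabilities with add-one (Laplace) smoothing."""
    words = message.lower().split()
    best_label, best_score = None, float("-inf")
    for label, c in counts.items():
        total = sum(c.values())
        vocab = len(set(c))
        score = sum(math.log((c[w] + 1) / (total + vocab)) for w in words)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

print(classify("free prize inside"))   # spam
print(classify("meeting tomorrow"))    # important
```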

Another option is to use an unsupervised learning model and do something like topic modeling. Take a textbook, for example. It has a table of contents, which gives a good indication of what the topics are. When you run topic modeling on the textbook, it will group the text into pretty much the same topics you saw in the table of contents, because each chapter shares so many similar words. For most NLP tasks, you take a corpus (the full collection of documents), plus whatever you want to analyze, and then try to find the similarities between the different subjects in it.

Another method of NLP is tokenization of text. Let's say we have the words "natural language processing is." If we take unigrams, the first column would be the word "natural", the second column the word "language", the third one the word "processing". If we took bigrams, the first column would be "natural-language", the second "language-processing", the third "processing-is", and so forth. It's a way of turning the sentence into counts of how many times those different terms (unigrams, bigrams, trigrams) or chunks of terms actually occur.
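That n-gram splitting can be written as a one-line helper; here is a minimal version using whitespace tokenization (real tokenizers also handle punctuation and casing):

```python
def ngrams(text, n):
    """Return the list of n-grams (joined with hyphens) from whitespace tokens."""
    tokens = text.split()
    return ["-".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "natural language processing is"
print(ngrams(sentence, 1))  # ['natural', 'language', 'processing', 'is']
print(ngrams(sentence, 2))  # ['natural-language', 'language-processing', 'processing-is']
```

Feeding these lists into a frequency counter gives exactly the term counts described above.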

When you’re just starting out, you’ll use supervised and unsupervised learning. But when you want to get a little bit deeper, you can use something like a neural network, and use recurrent neural networks to be able to do predictive text, or use convolutional neural networks to be able to figure out the context of the text.

What technologies or programming languages are used to build natural language processing pipelines?

Python and R are the most common.

In Python, you can use the Natural Language Toolkit (NLTK) for a lot of things: tokenization, part-of-speech tagging, etc. Beautiful Soup, spaCy, and Gensim are great for summarizing text, like news articles, or for a bibliography.

In R, there are OpenNLP, RTextTools, and Tokenizers. qdap, the Quantitative Discourse Analysis Package, is a helpful library as well. You can also use Keras or TensorFlow to work with recurrent and convolutional neural networks.

What are examples of companies that use natural language processing?

Google uses NLP for their search engine. Once you start using natural language processing, you get a lot more efficient at Googling because you start thinking about the keywords that are most important to their algorithm to get the best results.

Facebook and other companies use NLP for their chatbots. If you forgot something in your cart, or if you have a question to ask a business, they have a bot that will immediately answer questions.

At customer service companies, you can use NLP to prioritize tickets by sentiment. Does this person sound really negative? Do they sound happy? Are they using positive keywords? Negative keywords? You can do a summarization of their tickets.

Why is natural language processing useful for beginner data scientists to learn?

Natural Language Processing is really helpful in all contexts because most companies have free text data that they don't necessarily know what to do with yet.

I've noticed that a lot of companies are actually looking for that natural language processing skillset now. A lot of companies have probably had humans spending hours reading through their free text data to try to gauge what's going on. And now we can do it in a more efficient way with data scientists who do natural language processing. That could be for automating a customer service queue and putting the most egregious cases at the top, or for analyzing customers and whether or not they like your products.

What makes someone great at Natural Language Processing?

NLP is definitely a new skillset that's in-demand, but it requires a lot of creativity. There are a lot of modules that are built on plain text, but you must be able to build on top of that, build it out for your particular use case, and be familiar with your business domain.

We offer NLP as a specialization at Thinkful and I get really excited when people choose it. But sometimes I hear people say, "I don't want anything to do with NLP," thinking that it’s going to be really hard. If you really like creativity, you like things that are constantly changing, and still require a lot of research and whitepapers, then natural language processing is a great skill set to have.

How is natural language processing covered in the Thinkful curriculum?

Thinkful data science students have a few opportunities to be exposed to natural language processing in our curriculum:

  1. When we introduce Naive Bayes classification, we do a project about positive or negative reviews, and being able to see how certain words define positivity vs negativity, and how that can be used to classify them for Amazon, IMDb, etc.
  2. In the unsupervised learning module, we start really getting into the depths of natural language processing. Students are thrown into it and learn how to do tokenization. We also look at TF-IDF (term frequency-inverse document frequency), a term you'll hear a lot: it weights how often a term appears in a document against how many documents in the corpus contain it, so common words count less and distinctive words count more. You'll also learn how to do different things with spaCy, one of the modules I mentioned, and work through a guided example of processing text and implementing it in an unsupervised learning model.
  3. We also have a mandatory capstone where students have to find data with at least 10 different authors, and need to be able to classify which author belongs to which block of text. I've had people in the past who have taken song lyrics, and been able to predict the artist who is attached to the song lyric using natural language processing. I thought that was pretty cool.
  4. Finally, we offer an NLP specialization. Students can choose between a number of specializations including network analysis, big data, NLP, and time series. So no matter what, you're going to be exposed to NLP, but then you can also choose to specialize in it later.
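The TF-IDF weighting mentioned in step 2 can be computed from scratch in a few lines. This sketch uses a tiny invented corpus and the plain tf × idf formulation (libraries like scikit-learn apply smoothing variants):

```python
import math

# Invented toy corpus of three short documents.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]
tokenized = [doc.split() for doc in corpus]
N = len(tokenized)

def tf_idf(term, doc_tokens):
    """tf: how often the term appears in this document (normalized).
    idf: log of (total documents / documents containing the term),
    so terms found everywhere are down-weighted."""
    tf = doc_tokens.count(term) / len(doc_tokens)
    df = sum(term in doc for doc in tokenized)
    idf = math.log(N / df)
    return tf * idf

# "the" appears in most documents, so it scores low; "cat" is distinctive.
print(tf_idf("the", tokenized[0]))
print(tf_idf("cat", tokenized[0]))
```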

How can I get started learning natural language processing?

  1. Understand supervised learning and unsupervised learning. You're doing feature engineering on text, so you have to feed them into these supervised and unsupervised models.
  2. Understand distance metrics. There are a lot of similarity calculations that say, "This batch of text is very similar to this other batch of text." So just understand the basics – you don't have to go really deep into the math, but know how different similarity metrics are derived and how they relate to blocks of text.
  3. Read blogs and read books.
  4. View code online. Start researching on Kaggle and look for places where people have done natural language processing. On Kaggle, you can download data to play with it yourself, look at people's kernels, and see the code that they've already done on that data set. That helps you learn and see the different tactics and methods that people are using.
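The distance metrics in step 2 are easy to try by hand. Here is a minimal sketch of two common ones, cosine similarity and Jaccard similarity, on invented bag-of-words documents:

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words Counters:
    dot product of the count vectors over the product of their lengths."""
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def jaccard_similarity(a, b):
    """Vocabulary overlap: |A intersect B| / |A union B|."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

doc1 = Counter("the cat sat on the mat".split())
doc2 = Counter("the cat lay on the rug".split())
doc3 = Counter("stock prices rose sharply today".split())

print(cosine_similarity(doc1, doc2))  # high: many shared words
print(cosine_similarity(doc1, doc3))  # 0.0: no overlap at all
```

Real pipelines usually apply these metrics to TF-IDF or embedding vectors rather than raw counts, but the intuition is the same.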

What advice do you have for people interested in learning natural language processing?

Lauren's advice: Don’t be scared to get into NLP. At my first data science job, I wasn't tasked to work on natural language processing – I was doing a lot of experimentation and A/B testing. I knew that I wanted to get into NLP, so I started my own side projects, started going to marketing teams, asking to analyze their headlines, and things like that. Explore it, keep working at it, and look at different resources online where people actually show you how they work through a problem.

Remember: NLP is just a way of doing feature engineering. If you know how to do machine learning, you can pick up natural language processing, so please don't be discouraged.

Find out more and read Thinkful reviews on Course Report. Check out the Thinkful data science page.

About The Author

Imogen is a writer and content producer who loves writing about technology and education. Her background is in journalism, writing for newspapers and news websites. She grew up in England, Dubai and New Zealand, and now lives in Brooklyn, NY.
