Data Scientists need a whole toolbox of skills to be able to analyze and manipulate data on the job. One of those skills is Natural Language Processing (NLP), which helps machines understand and classify human speech and writing. To learn more, we asked Data Scientist Lauren Washington, who also mentors Thinkful data science students, to explain what NLP actually is, how companies like Google and Facebook use it, and why it’s a useful skill for data scientists. Plus find resources to help you get started with Natural Language Processing!
- Lauren is a Technical Expert and Mentor at Thinkful. She’s also the Lead Data Scientist and Machine Learning Engineer for smartQED in the Bay Area, where she specializes in natural language processing (NLP), predicts when IT systems are going to fail, and recommends solutions.
- Previously, Lauren worked in analytics and data science roles at Google, Nielsen, and the National Opinion Research Center.
- Lauren has a Bachelor’s Degree in Economics from Spelman College, GA, and a Master’s in Applied Data Science from Columbia University.
What is natural language processing or NLP?
- Natural Language Processing is a method for pre-processing text to turn it into numerical data. That data can then be modeled using Machine Learning algorithms.
- NLP is closely related to the field of computational linguistics.
NLP is basically feature engineering. This isn't a machine learning algorithm. It’s a way of taking natural text and turning it into something that an algorithm can use. As a data scientist, 80% of your job is being a data janitor and trying to clean things up and turn them into data and features that can be worked with. So natural language processing is a way to process that textual data and turn it into numerical values or categorical values that you can use to actually model text.
A good example is a spam filter for email. A machine can classify a message as spam or important based on word-frequency counts derived from bodies of text it has already seen.
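As a sketch of that idea, here is a minimal word-frequency spam classifier in plain Python. The toy messages and the add-one smoothing scheme are illustrative assumptions, not a production filter:

```python
from collections import Counter

# Hypothetical labeled messages.
spam = ["win cash now", "free cash prize now", "claim your free prize"]
ham = ["meeting moved to noon", "see you at the meeting", "project update attached"]

spam_counts = Counter(word for msg in spam for word in msg.split())
ham_counts = Counter(word for msg in ham for word in msg.split())
vocab = set(spam_counts) | set(ham_counts)

def score(message, counts, total):
    # Multiply per-word probabilities, with add-one smoothing
    # so unseen words don't zero out the whole score.
    prob = 1.0
    for word in message.split():
        prob *= (counts[word] + 1) / (total + len(vocab))
    return prob

def classify(message):
    s = score(message, spam_counts, sum(spam_counts.values()))
    h = score(message, ham_counts, sum(ham_counts.values()))
    return "spam" if s > h else "important"

print(classify("free cash now"))   # -> spam
print(classify("meeting update"))  # -> important
```

This is essentially a bare-bones Naive Bayes classifier: the "features" are just how often each word appeared in each class of the training text.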
What are other examples of natural language processing?
- Virtual assistants like Siri, Alexa, Google Home
- Recommendation Engines
- Sentiment analysis
- Predictive Text
Virtual Assistants: I use my virtual assistant to play Spotify, but if I word my commands differently, it reacts differently – that’s natural language processing. It’s using my words to try to analyze what I'm asking it to do. When you put keywords in your request, it will pick those up and throw out the rest, which is how natural language processing works.
Chatbots: Chatbots use natural language processing to map against a database to say, "Okay, usually when somebody asks a question like this, this is the type of answer that they want."
Recommendation engines: This is software that analyzes data to make suggestions to a user based on their interests or browsing habits. Once you start learning more about natural language processing and machine learning, it will be really funny to go to different websites and think, "Oh, they probably just have this one algorithm underneath that's looking at what I'm typing, and that's why this website is able to recommend these things to me."
Natural Language Processing can also do things like sentiment analysis to see the polarity of something – if something someone is saying is really positive, negative, or neutral. Another example is predictive text, which we see on our cell phones every day.
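A minimal sketch of lexicon-based sentiment analysis, assuming a tiny hand-built word list (real tools such as NLTK's VADER ship much larger, weighted lexicons):

```python
# Hypothetical mini-lexicon for illustration only.
POSITIVE = {"great", "love", "excellent", "happy", "good"}
NEGATIVE = {"bad", "terrible", "hate", "awful", "poor"}

def polarity(text):
    # Lowercase, split on whitespace, and strip common punctuation
    # so "terrible," still matches the lexicon entry "terrible".
    words = [w.strip(".,!?") for w in text.lower().split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(polarity("I love this great product"))    # -> positive
print(polarity("terrible, awful support"))      # -> negative
print(polarity("the package arrived on monday"))  # -> neutral
```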
How does natural language processing work?
As long as you know how to do supervised and unsupervised learning, natural language processing should be pretty straightforward.
In a supervised learning model, you already have labels defined, e.g., important versus spam in your email inbox. If you create a dataset with important and spam messages, then you can determine the words that are associated with each label, so that when a new email comes in, you can find the similarity between it and the previous emails. A huge part of natural language processing is calculating the similarity between different words and datasets.
Another option is to use an unsupervised learning model and do something like topic modeling. Take a textbook, for example. It has a table of contents, which gives a good indication of what the topics are. When you run topic modeling on the textbook, you'll see that it groups the text into pretty much the same topics you saw in the table of contents, because so many similar words occur together in this particular document. For almost all NLP, you take a corpus (every document in the collection), plus whatever you want to analyze, and then you try to find the similarities between different subjects in it.
Another method of NLP is tokenization of texts. Let's say that we have the words "natural language processing is." If we do a unigram (1-gram), the first column would be the word "natural", the second column the word "language", the third the word "processing". If we did a bigram, the first column would be "natural language", the second "language processing", the third "processing is", and so forth. So it's the way you would turn the sentence into something that counts how many times those different terms (or bigrams, or trigrams) or chunks of those terms actually occur.
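The n-gram idea above can be sketched in a few lines of Python. Whitespace tokenization is a simplifying assumption here; real tokenizers also handle punctuation and casing:

```python
from collections import Counter

def ngrams(text, n):
    """Return the list of n-grams (as joined strings) for a whitespace-tokenized text."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "natural language processing is"
print(ngrams(sentence, 1))  # ['natural', 'language', 'processing', 'is']
print(ngrams(sentence, 2))  # ['natural language', 'language processing', 'processing is']

# Counting n-gram frequencies turns text into numerical features:
counts = Counter(ngrams("to be or not to be", 2))
print(counts["to be"])  # 2
```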
When you’re just starting out, you’ll use supervised and unsupervised learning. But when you want to get a little bit deeper, you can use something like a neural network, and use recurrent neural networks to be able to do predictive text, or use convolutional neural networks to be able to figure out the context of the text.
What technologies or programming languages are used to build natural language processing pipelines?
Python and R are the most common.
In Python, you can use the Natural Language Toolkit for a lot of things: tokenization, part-of-speech tagging, etc. Beautiful Soup is great for scraping and parsing web text, and spaCy and Gensim are great for summarizing text, like news articles, or for a bibliography.
In R, there is OpenNLP, RTextTools, and Tokenizers. qdap, the quantitative discourse analysis package, is helpful as well. You can use Keras or TensorFlow to do things with recurrent neural networks and convolutional neural networks, too.
What are examples of companies which use natural language processing?
Google uses NLP for their search engine. Once you start using natural language processing, you get a lot more efficient at Googling because you start thinking about the keywords that are most important to their algorithm to get the best results.
Facebook and other companies use NLP for their chatbots. If you forgot something in your cart, or if you have a question to ask a business, they have a bot that will immediately answer questions.
At customer service companies, you can use NLP to prioritize tickets by sentiment. Does this person sound really negative? Do they sound happy? Are they using positive keywords? Negative keywords? You can do a summarization of their tickets.
Why is natural language processing useful for beginner data scientists to learn?
Natural Language Processing is really helpful in all contexts because most companies have free text data that they don't necessarily know what to do with yet.
I've noticed that a lot of companies are actually looking for that natural language processing skillset now. A lot of companies have probably had humans spending hours reading through their free text data to try to gauge what's going on. And now we can do it in a more efficient way with data scientists who do natural language processing. That could be for automating a customer service queue and putting the most egregious cases at the top, or for analyzing customers and whether or not they like your products.
What makes someone great at Natural Language Processing?
NLP is definitely a new skillset that's in-demand, but it requires a lot of creativity. There are a lot of modules that are built on plain text, but you must be able to build on top of that, build it out for your particular use case, and be familiar with your business domain.
We offer NLP as a specialization at Thinkful and I get really excited when people choose it. But sometimes I hear people say, "I don't want anything to do with NLP," thinking that it’s going to be really hard. If you really like creativity, you like things that are constantly changing, and still require a lot of research and whitepapers, then natural language processing is a great skill set to have.
How is natural language processing covered in the Thinkful curriculum?
Thinkful data science students have a few opportunities to be exposed to natural language processing in our curriculum:
- When we introduce Naive Bayes classification, we do a project about positive or negative reviews, and being able to see how certain words define positivity vs negativity, and how that can be used to classify them for Amazon, IMDb, etc.
- In the unsupervised learning module, we really get into the depths of natural language processing. Students are thrown into it and learn how to do tokenization. We also look at TF-IDF, "term frequency-inverse document frequency," a term you'll hear a lot; it measures how important a word is to a document, weighted by how rarely that word appears across the entire corpus. And you'll learn how to do different things with spaCy, one of the modules I mentioned. You'll see a guided example of how to process text and implement it in an unsupervised learning model as well.
- We also have a mandatory capstone where students have to find data with at least 10 different authors, and need to be able to classify which author belongs to which block of text. I've had people in the past who have taken song lyrics, and been able to predict the artist who is attached to the song lyric using natural language processing. I thought that was pretty cool.
- Finally, we offer an NLP specialization. Students can choose between a number of specializations including network analysis, big data, NLP, and time series. So no matter what, you're going to be exposed to NLP, but then you can also choose to specialize in it later.
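The TF-IDF weighting mentioned above can be sketched by hand in plain Python; the three-document corpus here is a hypothetical example:

```python
import math

# Hypothetical toy corpus of three pre-tokenized documents.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs".split(),
]

def tf_idf(term, doc, corpus):
    # Term frequency: how often the term appears in this document.
    tf = doc.count(term) / len(doc)
    # Document frequency: how many documents contain the term
    # (assumed > 0, i.e. the term occurs somewhere in the corpus).
    df = sum(term in d for d in corpus)
    # Inverse document frequency: rarer terms get larger weights.
    idf = math.log(len(corpus) / df)
    return tf * idf

# "the" appears in 2 of 3 documents, so it is down-weighted;
# "cat" appears in only 1, so it scores higher despite fewer occurrences.
print(tf_idf("the", docs[0], docs))
print(tf_idf("cat", docs[0], docs))
```

Libraries like scikit-learn (`TfidfVectorizer`) compute this for a whole corpus at once, with smoothing variants on the same formula.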
How can I get started learning natural language processing?
- Understand supervised learning and unsupervised learning. You're doing feature engineering on text, so you have to feed them into these supervised and unsupervised models.
- Understand distance metrics. There are a lot of similarity calculations that say, "This batch of text is very similar to this other batch of text." So just understand the basics – you don't have to go really deep into the math, but know about the derivation of different similarity metrics and how they relate to blocks of text.
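As an illustration of one common similarity metric, here is cosine similarity between bag-of-words vectors in plain Python (whitespace tokenization is a simplifying assumption):

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity between the bag-of-words vectors of two texts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Dot product over the words of a (words absent from b count as 0).
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

print(cosine_similarity("the cat sat", "the cat slept"))    # ≈ 0.67, mostly overlapping words
print(cosine_similarity("the cat sat", "stock prices rose"))  # 0.0, no shared words
```

A score near 1 means the texts use nearly the same words; 0 means they share none.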
Read blogs and read books.
- Natural Language Processing with Python is key
- If you're an R programmer, RPubs is great
- I love towardsdatascience.com – they have a ton of tutorials on natural language processing, and really great feature engineering tutorials that get straight to the point on things like topic modeling, similarity, and tokenization.
- There's a great PDF out there called NLP for Hackers.
- View code online. Start researching on Kaggle and look for places where people have done natural language processing. On Kaggle, you can download data to play with it yourself, look at people's kernels, and see the code that they've already done on that data set. That helps you learn and see the different tactics and methods that people are using.
What advice do you have for people interested in learning natural language processing?
Lauren's advice: Don’t be scared to get into NLP. At my first data science job, I wasn't tasked to work on natural language processing – I was doing a lot of experimentation and A/B testing. I knew that I wanted to get into NLP, so I started my own side projects, started going to marketing teams, asking to analyze their headlines, and things like that. Explore it, keep working at it, and look at different resources online where people actually show you how they work through a problem.
Remember: NLP is just a way of doing feature engineering. If you know how to do machine learning, you can pick up natural language processing, so please don't be discouraged.