Before We Begin: A Nod to Turing
Long before smart assistants and language models, one man asked a bold question: "Can machines think?" Alan Turing, a pioneer of modern computing, proposed what is now known as the Turing Test in 1950: a challenge to see whether a machine could exhibit intelligent behavior indistinguishable from a human's. At the heart of this test lies natural language understanding: the ability to recognize, comprehend, and respond to human language. In many ways, the field of Natural Language Processing (NLP) grew out of this question. Today, we explore how machines begin to decode language, step by step, just as Turing once imagined.
NLP Basics - Scenario 1
Decoding Language: A Journey Through NLP Concepts with Python
Imagine you're working at a tech startup and your team is building a smart assistant that understands news
articles and extracts meaningful insights from them.
One day, the assistant stumbles upon a headline: "Apple is looking at buying U.K. startup for $1
billion."
Your task? Break this down into a structured form so the assistant knows:
- Who is involved?
- What is the action?
- What are the entities?
Let the journey begin. We'll walk through the steps of natural language understanding using this very sentence,
with a Python-powered tour using NLTK and spaCy.
Step 1: Tokenization - Breaking It Down
Before any NLP model can understand or analyze text, it must first break the sentence into individual components,
or "tokens." Tokenization is like splitting a paragraph into words and punctuation, enabling us to work with
language programmatically.
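import nltk
from nltk.tokenize import word_tokenize
# Fetch the tokenizer models once; newer NLTK releases may ask for 'punkt_tab' instead
nltk.download('punkt')
sentence = "Apple is looking at buying U.K. startup for $1 billion."
# Split the sentence into word, number, and punctuation tokens
tokens = word_tokenize(sentence)
print(tokens)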
Output:
['Apple', 'is', 'looking', 'at', 'buying', 'U.K.', 'startup', 'for', '$', '1', 'billion', '.']
Each word, punctuation mark, and number is now an element in a list — ready for further analysis.
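To see what a tokenizer does under the hood, here is a minimal regex-based sketch that uses only Python's standard library. It is not NLTK's algorithm: the pattern and the simple_tokenize name are illustrative assumptions tuned to this one headline, and real tokenizers handle many more cases (contractions, URLs, decimals).

```python
import re

# Illustrative pattern (an assumption, not NLTK's rules).
# Alternatives are tried left to right at each position.
TOKEN_PATTERN = re.compile(
    r"(?:[A-Z]\.)+"   # dotted abbreviations such as "U.K."
    r"|\$"            # a currency symbol as its own token
    r"|\d+"           # runs of digits
    r"|\w+"           # ordinary words
    r"|[^\w\s]"       # any other single punctuation mark
)


def simple_tokenize(text):
    """Split text into word, number, and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


sentence = "Apple is looking at buying U.K. startup for $1 billion."
tokens = simple_tokenize(sentence)
print(tokens)
```

For this headline the sketch produces the same twelve tokens as the word_tokenize output above. The abbreviation rule comes first so that "U.K." is not split at its periods.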
Step 2: POS Tagging - Understanding Grammar
Once we have our tokens, the next step is to understand the grammatical role each one plays. POS (Part-of-Speech)
tagging assigns a label to each token — such as noun, verb, adjective — to help models understand syntax and
structure.
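# Fetch the tagger model once; newer NLTK releases may ask for 'averaged_perceptron_tagger_eng' instead
nltk.download('averaged_perceptron_tagger')
# Pair each token with its Penn Treebank part-of-speech tag
pos_tags = nltk.pos_tag(tokens)
print(pos_tags)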
Output:
[('Apple', 'NNP'), ('is', 'VBZ'), ('looking', 'VBG'), ('at', 'IN'), ('buying', 'VBG'), ('U.K.', 'NNP'), ('startup', 'NN'), ('for', 'IN'), ('$', '$'), ('1', 'CD'), ('billion', 'CD'), ('.', '.')]
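This gives our sentence a grammatical structure. The labels are Penn Treebank tags: NNP is a proper noun, NN a common noun, VBZ and VBG are verb forms, IN is a preposition, and CD is a cardinal number. Knowing that "Apple" is a proper noun and "looking" is a verb
helps models and humans alike understand the sentence context.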
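Once tokens carry tags, simple filters can already start answering the "who" and "what action" questions. The sketch below works on the tagged list copied from the output above; the tokens_with_prefix helper is a hypothetical name introduced for illustration, not part of NLTK.

```python
# Tagged tokens copied from the POS tagging output above.
pos_tags = [
    ("Apple", "NNP"), ("is", "VBZ"), ("looking", "VBG"), ("at", "IN"),
    ("buying", "VBG"), ("U.K.", "NNP"), ("startup", "NN"), ("for", "IN"),
    ("$", "$"), ("1", "CD"), ("billion", "CD"), (".", "."),
]


def tokens_with_prefix(tagged, prefix):
    """Return tokens whose Penn Treebank tag starts with the given prefix."""
    return [token for token, tag in tagged if tag.startswith(prefix)]


proper_nouns = tokens_with_prefix(pos_tags, "NNP")  # candidate actors
verbs = tokens_with_prefix(pos_tags, "VB")          # candidate actions
numbers = tokens_with_prefix(pos_tags, "CD")        # quantities
print(proper_nouns)
print(verbs)
print(numbers)
```

Matching on a tag prefix rather than an exact tag is a design choice: "VB" covers VBZ, VBG, and the other verb forms in one pass.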
Step 3: Stopword Removal - Filtering the Noise
Natural language is full of high-frequency words like "is", "at", and "for" that are important in conversation
but less useful in information extraction. These are called stopwords.
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
print(filtered_tokens)
Output:
['Apple', 'looking', 'buying', 'U.K.', 'startup', '$', '1', 'billion', '.']
By removing stopwords, we keep mostly content-carrying words, which simplifies downstream tasks like
classification or summarization. Punctuation such as "$" and "." is not in the stopword list, so it remains in the output.
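The filtered list above still contains "$" and ".". If those are unwanted, punctuation can be dropped in the same pass. This is a minimal sketch under stated assumptions: STOP_WORDS is a tiny hand-picked set standing in for NLTK's much longer English list, and the helper names are hypothetical.

```python
import string

# A tiny hand-picked stopword set for illustration only;
# NLTK's English list is much longer.
STOP_WORDS = {"is", "at", "for", "the", "a", "an", "of", "to"}


def is_punctuation(token):
    """True when every character of the token is ASCII punctuation."""
    return all(ch in string.punctuation for ch in token)


def keep_content_tokens(tokens):
    """Drop stopwords (case-insensitively) and pure-punctuation tokens."""
    return [
        t for t in tokens
        if t.lower() not in STOP_WORDS and not is_punctuation(t)
    ]


tokens = ['Apple', 'is', 'looking', 'at', 'buying', 'U.K.', 'startup',
          'for', '$', '1', 'billion', '.']
content_tokens = keep_content_tokens(tokens)
print(content_tokens)
```

Whether to drop "$" depends on the task: for money extraction it is a useful signal, so this filter is best applied only where punctuation carries no meaning.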
Step 4: Stemming vs Lemmatization - Finding the Root
Languages are full of variations: "buy", "buying", and "bought" are all forms of the same root word. To normalize
these, we use stemming and lemmatization.
Stemming:
Stemming crudely chops off word endings to reduce words to their base form.
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
stemmed = [stemmer.stem(word) for word in filtered_tokens]
print(stemmed)
Output:
['appl', 'look', 'buy', 'u.k.', 'startup', '$', '1', 'billion', '.']
Notice how "Apple" becomes "appl" and "U.K." is lowercased to "u.k.": a real downside when working with brand names or proper nouns.
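One possible way around this is to skip tokens that the POS tagger marked as proper nouns. The sketch below is an assumption-laden illustration: crude_stem is a toy suffix stripper, not the Porter algorithm, and stem_except_proper_nouns is a hypothetical helper that accepts any stemming function.

```python
def crude_stem(word):
    """Toy suffix stripper (NOT the Porter algorithm): lowercase, then drop one common ending."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stem_except_proper_nouns(tagged, stem):
    """Apply `stem` to every token except those tagged NNP or NNPS."""
    return [
        token if tag in ("NNP", "NNPS") else stem(token)
        for token, tag in tagged
    ]


# Stopword-filtered tokens with the tags assigned in Step 2.
tagged = [
    ("Apple", "NNP"), ("looking", "VBG"), ("buying", "VBG"), ("U.K.", "NNP"),
    ("startup", "NN"), ("$", "$"), ("1", "CD"), ("billion", "CD"), (".", "."),
]
stemmed_safe = stem_except_proper_nouns(tagged, crude_stem)
print(stemmed_safe)
```

In practice the stem argument could be PorterStemmer().stem from NLTK; the toy function is used here only to keep the example self-contained.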
Lemmatization:
Lemmatization uses vocabulary and context to bring words to their proper root form.
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(word) for word in filtered_tokens]
print(lemmatized)
Output:
['Apple', 'looking', 'buying', 'U.K.', 'startup', '$', '1', 'billion', '.']
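Lemmatization produces dictionary forms, which makes it a cleaner solution for most NLP tasks, but it needs to know each word's part of speech. WordNetLemmatizer treats every word as a noun unless told otherwise, which is why "looking" and "buying" are unchanged above. Passing the part of speech, for example lemmatizer.lemmatize("buying", pos="v"), returns "buy".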
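The tags from Step 2 can supply the part of speech the lemmatizer needs. WordNet uses one-letter codes ("n", "v", "a", "r") rather than Penn Treebank tags, so a small mapping is required. The penn_to_wordnet helper below is a hypothetical name and a simplified mapping, offered as a sketch rather than a complete conversion table.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to the one-letter POS codes WordNet uses."""
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # noun; also the lemmatizer's default


pos_tags = [("Apple", "NNP"), ("looking", "VBG"), ("buying", "VBG"), ("startup", "NN")]
wordnet_pos = [(token, penn_to_wordnet(tag)) for token, tag in pos_tags]
print(wordnet_pos)
```

Each pair can then be passed along as lemmatizer.lemmatize(token, pos=code), so that verbs such as "looking" are reduced to "look" instead of being left untouched.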
Step 5: Chunking - Finding Phrases
Chunking, or shallow parsing, takes POS-tagged words and groups them into higher-level units like noun phrases
(NP) or verb phrases (VP). This helps us capture meaning in chunks rather than isolated words.
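nltk.download('maxent_ne_chunker')  # NE chunker model used by ne_chunk
nltk.download('words')              # word list the chunker depends on
grams = nltk.ne_chunk(pos_tags)
grams.pprint()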
Output (tree):
(S
  (GPE Apple/NNP)
  is/VBZ
  looking/VBG
  at/IN
  buying/VBG
  (GPE U.K./NNP)
  startup/NN
  for/IN
  $/$
  1/CD
  billion/CD
  ./.)
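Note that nltk.ne_chunk is a named-entity chunker: it groups tokens into entity spans rather than general noun or verb phrases. Here, both "Apple" and "U.K." are labeled GPE (geopolitical entity). That is correct for "U.K." but a mislabel for "Apple", which is a company in this headline. Even so, we can begin to see
structure emerging from our flat sentence.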
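The noun-phrase grouping described at the start of this step can be sketched in plain Python. This is a simplified stand-in for grammar-based chunkers such as nltk.RegexpParser: the NP_TAGS set and the noun_phrase_chunks function are assumptions made for illustration, and the rule (any run of determiners, adjectives, and nouns) is deliberately loose.

```python
# Tags allowed inside a noun phrase in this simplified rule.
NP_TAGS = {"DT", "JJ", "NN", "NNS", "NNP", "NNPS"}


def noun_phrase_chunks(tagged):
    """Group consecutive tokens whose tags are in NP_TAGS into phrases."""
    chunks, current = [], []
    for token, tag in tagged:
        if tag in NP_TAGS:
            current.append(token)
        elif current:
            # A non-NP tag ends the phrase collected so far.
            chunks.append(" ".join(current))
            current = []
    if current:  # flush a phrase that runs to the end of the sentence
        chunks.append(" ".join(current))
    return chunks


pos_tags = [
    ("Apple", "NNP"), ("is", "VBZ"), ("looking", "VBG"), ("at", "IN"),
    ("buying", "VBG"), ("U.K.", "NNP"), ("startup", "NN"), ("for", "IN"),
    ("$", "$"), ("1", "CD"), ("billion", "CD"), (".", "."),
]
noun_phrases = noun_phrase_chunks(pos_tags)
print(noun_phrases)
```

Unlike the entity tree above, this keeps "U.K. startup" together as one phrase, which is closer to what the headline is actually about.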
Step 6: NER (Named Entity Recognition) - Extracting Real-World Things
Named Entity Recognition takes chunking a step further by labeling known entities like organizations, places,
people, dates, and monetary amounts. spaCy is excellent for this task.
import spacy
# Requires the model to be installed first: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp(sentence)
for ent in doc.ents:
    print(ent.text, "->", ent.label_)
Output:
Apple -> ORG
U.K. -> GPE
$1 billion -> MONEY
Now we know "Apple" is an organization, "U.K." is a location, and "$1 billion"
is a monetary value. These labels answer the "who is involved" and "what are the entities" questions from the start of our scenario.
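To hand these entities to the assistant in a structured form, the (text, label) pairs can be grouped by label. The sketch below hard-codes the pairs from the output above so that it runs without spaCy; group_by_label is a hypothetical helper name.

```python
from collections import defaultdict

# (text, label) pairs copied from the NER output above.
entities = [("Apple", "ORG"), ("U.K.", "GPE"), ("$1 billion", "MONEY")]


def group_by_label(pairs):
    """Collect entity texts under their label, preserving order of appearance."""
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)


record = group_by_label(entities)
print(record)
```

With a real spaCy document, the pairs could be built as [(ent.text, ent.label_) for ent in doc.ents] and passed to the same function.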
Step 7: Word Embeddings - Understanding Meaning
To compare words semantically, we convert them into vectors using word embeddings. These vectors capture how
similar or related words are, based on usage in large corpora.
print(doc[0].text, "\nVector:\n", doc[0].vector[:5]) # Truncated for display
Output:
Apple
Vector:
[0.122, -0.334, ..., 0.238]
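This dense numeric representation is what deep learning models use to reason about context and relationships between words. Note that en_core_web_sm does not ship with static word vectors, so the values above come from the pipeline's
context-sensitive tensors; for pretrained word vectors, load a larger package such as en_core_web_md.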
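Similarity between two embeddings is commonly measured with cosine similarity, the cosine of the angle between the vectors. Here is a minimal standard-library sketch. The three vectors are made-up three-dimensional toys, not real embeddings, and returning 0.0 for a zero vector is a convention chosen for this example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


# Made-up 3-dimensional vectors for illustration, NOT real embeddings.
apple = [0.9, 0.1, 0.3]
microsoft = [0.8, 0.2, 0.4]
banana = [0.1, 0.9, 0.2]

sim_company = cosine_similarity(apple, microsoft)
sim_fruit = cosine_similarity(apple, banana)
print(round(sim_company, 3), round(sim_fruit, 3))
```

Values near 1 mean the vectors point in almost the same direction, and values near 0 mean they are unrelated. spaCy exposes a similarity method on tokens and documents that computes a comparable score from its own vectors.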
Conclusion: Turning Words into Insight
What started as a raw sentence has now been:
- Tokenized into words
- Tagged with grammatical meaning
- Cleaned of stopwords
- Normalized through stemming and lemmatization
- Chunked into phrases
- Analyzed for real-world entities
- Represented as vectors
Through this journey, your smart assistant now knows that Apple (an organization) is considering a financial move
involving a U.K. startup, and the deal size is $1 billion.
Further Reading
For more in-depth information on NLP and related topics, consider exploring:
- NLP Glove, Bert, TF-IDF, LSTM Explained - A comprehensive guide to NLP techniques using GloVe, BERT, TF-IDF, and LSTM.
- NLTK - A leading platform for building Python programs to work with human language data.
- spaCy - An open-source library for natural language processing in Python.
- spaCy 101 - A tutorial for getting started with spaCy.
- NLTK Book - A comprehensive guide to natural language processing with NLTK.
- NLTK Corpora - NLTK has built-in support for dozens of corpora and trained models.
- English Corpora - A collection of corpora for the English language.
- Full-Text Corpus Data - Downloadable corpora of English and other languages.
More scenarios will keep coming as we move along.