Open AGI Codes | Your Codes Reflect! | Transforming Tomorrow, One Algorithm at a Time: The AI Revolution | NLP

Before We Begin: A Nod to Turing

Long before smart assistants and language models, one man asked a bold question: "Can machines think?" Alan Turing, a pioneer of modern computing, introduced what is now called the Turing Test in 1950: a challenge to see whether a machine could exhibit intelligent behavior indistinguishable from a human's. At the heart of this test lies natural language understanding: the ability to recognize, comprehend, and respond to human language. In many ways, the field of Natural Language Processing (NLP) grew out of this question. Today, we explore, step by step, how machines begin to decode language, just as Turing once imagined.

NLP Basics - Scenario 1

Decoding Language: A Journey Through NLP Concepts with Python

Imagine you're working at a tech startup and your team is building a smart assistant that understands news articles and extracts meaningful insights from them.

One day, the assistant stumbles upon a headline: "Apple is looking at buying U.K. startup for $1 billion."

Your task? Break this down into a structured form so the assistant knows:

  • Who is involved?
  • What is the action?
  • What are the entities?

Let the journey begin. We'll walk through the steps of natural language understanding using this very sentence, with a Python-powered tour using NLTK and spaCy.

Step 1: Tokenization - Breaking It Down

Before any NLP model can understand or analyze text, it must first break the sentence into individual components, or "tokens." Tokenization is like splitting a paragraph into words and punctuation, enabling us to work with language programmatically.

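import nltk
from nltk.tokenize import word_tokenize

# Download the Punkt tokenizer data (a one-time step).
# Newer NLTK releases may ask for 'punkt_tab' instead.
nltk.download('punkt')

sentence = "Apple is looking at buying U.K. startup for $1 billion."
tokens = word_tokenize(sentence)
print(tokens)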

Output:

['Apple', 'is', 'looking', 'at', 'buying', 'U.K.', 'startup', 'for', '$', '1', 'billion', '.']

Each word, punctuation mark, and number is now an element in a list — ready for further analysis.
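
To see roughly what a tokenizer does under the hood, here is a minimal regex-based sketch that uses only the standard library. It is not how word_tokenize works internally; the pattern and the name simple_tokenize are assumptions made for illustration, tuned only to handle this headline.

```python
import re

# Illustrative pattern (an assumption, not NLTK's rules), tried left to right:
#   1. dotted abbreviations such as "U.K."
#   2. runs of letters, digits, or underscores
#   3. any single remaining non-space character (punctuation, "$")
TOKEN_PATTERN = re.compile(r"(?:[A-Za-z]\.){2,}|\w+|[^\w\s]")


def simple_tokenize(text):
    """Split text into word, number, and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


headline = "Apple is looking at buying U.K. startup for $1 billion."
print(simple_tokenize(headline))
```

For this headline the sketch produces the same twelve tokens as the NLTK output above. It would mishandle a sentence that ends in an abbreviation, which is one reason to prefer a trained tokenizer in practice.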

Step 2: POS Tagging - Understanding Grammar

Once we have our tokens, the next step is to understand the grammatical role each one plays. POS (Part-of-Speech) tagging assigns a label to each token — such as noun, verb, adjective — to help models understand syntax and structure.

# Tagger model; newer NLTK releases may ask for 'averaged_perceptron_tagger_eng'.
nltk.download('averaged_perceptron_tagger')
pos_tags = nltk.pos_tag(tokens)
print(pos_tags)

Output:

[('Apple', 'NNP'), ('is', 'VBZ'), ('looking', 'VBG'), ('at', 'IN'), ('buying', 'VBG'), ('U.K.', 'NNP'), ('startup', 'NN'), ('for', 'IN'), ('$', '$'), ('1', 'CD'), ('billion', 'CD'), ('.', '.')]

This gives a grammatical structure to our sentence. Knowing "Apple" is a proper noun and "looking" is a verb helps models and humans alike understand the sentence context.
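
As a rough sketch of how the tags can start answering "Who is involved?" and "What is the action?", the snippet below filters the tagged tokens by tag prefix. The tagged list is copied from the output above; the helper name words_with_tag_prefix is hypothetical.

```python
# Tagged tokens copied from the POS tagging output above.
pos_tags = [
    ('Apple', 'NNP'), ('is', 'VBZ'), ('looking', 'VBG'), ('at', 'IN'),
    ('buying', 'VBG'), ('U.K.', 'NNP'), ('startup', 'NN'), ('for', 'IN'),
    ('$', '$'), ('1', 'CD'), ('billion', 'CD'), ('.', '.'),
]


def words_with_tag_prefix(tagged_tokens, prefix):
    """Return the words whose POS tag starts with the given prefix."""
    return [word for word, tag in tagged_tokens if tag.startswith(prefix)]


# Penn Treebank tags: NNP = proper noun, VB* = verb forms.
proper_nouns = words_with_tag_prefix(pos_tags, "NNP")
verbs = words_with_tag_prefix(pos_tags, "VB")

print("Proper nouns:", proper_nouns)
print("Verbs:", verbs)
```

The proper nouns are candidates for "who", and the verbs are candidates for the action. Later steps refine both.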

Step 3: Stopword Removal - Filtering the Noise

Natural language is full of high-frequency words like "is", "at", and "for" that are important in conversation but less useful in information extraction. These are called stopwords.

from nltk.corpus import stopwords
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
print(filtered_tokens)

Output:

['Apple', 'looking', 'buying', 'U.K.', 'startup', '$', '1', 'billion', '.']

By removing stopwords, we focus only on the content-carrying words, simplifying downstream tasks like classification or summarization.
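
Note that the filtered output above still contains "$" and ".", because punctuation is not part of the stopword list. The sketch below shows one way to drop both in a single pass. The tiny stopword set and the function name content_tokens are assumptions for illustration; NLTK's English list is much longer.

```python
# A tiny hand-picked stopword set (an assumption for illustration only).
TOY_STOPWORDS = {"is", "at", "for", "the", "a", "an", "of"}


def content_tokens(tokens, stopwords=TOY_STOPWORDS):
    """Keep tokens that are neither stopwords nor pure punctuation."""
    kept = []
    for token in tokens:
        if token.lower() in stopwords:
            continue  # high-frequency function word
        if not any(ch.isalnum() for ch in token):
            continue  # no letters or digits at all, e.g. "$" or "."
        kept.append(token)
    return kept


tokens = ['Apple', 'is', 'looking', 'at', 'buying', 'U.K.',
          'startup', 'for', '$', '1', 'billion', '.']
print(content_tokens(tokens))
```

Whether to drop "$" depends on the task: for extracting monetary amounts it carries meaning, so a real pipeline might keep it.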

Step 4: Stemming vs Lemmatization - Finding the Root

Languages are full of variations: "buy", "buying", and "bought" are all forms of the same root word. To normalize these, we use stemming and lemmatization.

Stemming:

Stemming crudely chops off word endings to reduce words to their base form.

from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
stemmed = [stemmer.stem(word) for word in filtered_tokens]
print(stemmed)

Output:

['appl', 'look', 'buy', 'u.k.', 'startup', '$', '1', 'billion', '.']

Notice how "Apple" becomes "appl" — a real downside when working with brand names or proper nouns.
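
To make "crudely chops off word endings" concrete, here is a deliberately naive suffix stripper. It is not the Porter algorithm, which applies a longer sequence of rules; the suffix list, the minimum stem length, and the name naive_stem are all assumptions for illustration.

```python
# Suffixes to strip, checked in this order (an illustrative choice).
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word, min_stem_length=3):
    """Lowercase the word and strip the first matching suffix, if any."""
    lowered = word.lower()
    for suffix in SUFFIXES:
        stem = lowered[: len(lowered) - len(suffix)]
        # Only strip when enough of the word remains to be a plausible stem.
        if lowered.endswith(suffix) and len(stem) >= min_stem_length:
            return stem
    return lowered


for word in ["looking", "buying", "running", "Apple", "startup"]:
    print(word, "->", naive_stem(word))
```

"running" comes out as "runn", which is not a word. That kind of artifact is typical of rule-based stemming and is the trade-off for its speed and simplicity.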

Lemmatization:

Lemmatization uses a vocabulary and, when it is provided, each word's part of speech to bring words to their proper dictionary form.

from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(word) for word in filtered_tokens]
print(lemmatized)

Output:

['Apple', 'looking', 'buying', 'U.K.', 'startup', '$', '1', 'billion', '.']

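Lemmatization produces dictionary forms, but WordNetLemmatizer treats every word as a noun unless it is told otherwise, which is why "looking" and "buying" are unchanged above. Passing the part of speech, as in lemmatizer.lemmatize("buying", pos="v"), returns "buy". With POS information supplied, lemmatization is a cleaner solution than stemming for most NLP tasks.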
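
The role of the part of speech can be sketched with a toy dictionary lookup. This is not how WordNet is implemented; the miniature lemma table and the name toy_lemmatize are hypothetical, chosen only to mirror the "noun by default" behavior.

```python
# Hypothetical miniature lemma dictionary keyed by (word, part of speech).
TOY_LEMMAS = {
    ("looking", "v"): "look",
    ("buying", "v"): "buy",
    ("bought", "v"): "buy",
    ("startups", "n"): "startup",
}


def toy_lemmatize(word, pos="n"):
    """Look up the lemma for (word, pos); return the word unchanged if unknown."""
    return TOY_LEMMAS.get((word.lower(), pos), word)


print(toy_lemmatize("buying"))           # treated as a noun: no entry, unchanged
print(toy_lemmatize("buying", pos="v"))  # treated as a verb: dictionary form
print(toy_lemmatize("bought", pos="v"))  # irregular form handled by lookup
```

Unlike a stemmer, a lookup of this kind can map an irregular form such as "bought" to "buy", and it leaves unknown words like "Apple" intact.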

Step 5: Chunking - Finding Phrases

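Chunking, or shallow parsing, takes POS-tagged words and groups them into higher-level units like noun phrases (NP) or verb phrases (VP). This helps us capture meaning in chunks rather than isolated words. The example below uses NLTK's ne_chunk, a pretrained chunker that groups tagged tokens into named-entity chunks specifically.

nltk.download('maxent_ne_chunker')  # pretrained named-entity chunker
nltk.download('words')              # word list the chunker relies on
grams = nltk.ne_chunk(pos_tags)
grams.pprint()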

Output (tree):

(S
  (GPE Apple/NNP)
  is/VBZ
  looking/VBG
  at/IN
  buying/VBG
  (GPE U.K./NNP)
  startup/NN
  for/IN
  $/$
  1/CD
  billion/CD
  ./.)

Here, "Apple" and "U.K." are both labeled GPE (geopolitical entity). That label is right for "U.K." but wrong for "Apple", which is a company. Even so, we can begin to see structure emerging from our flat sentence.
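
For the noun-phrase side of chunking described at the start of this step, here is a minimal standard-library sketch that groups consecutive adjective and noun tags into phrases. The rule is a simplifying assumption and the function names are hypothetical; NLTK's RegexpParser offers a grammar-based version of the same idea.

```python
from itertools import groupby

# Tagged tokens copied from the POS tagging output in Step 2.
pos_tags = [
    ('Apple', 'NNP'), ('is', 'VBZ'), ('looking', 'VBG'), ('at', 'IN'),
    ('buying', 'VBG'), ('U.K.', 'NNP'), ('startup', 'NN'), ('for', 'IN'),
    ('$', '$'), ('1', 'CD'), ('billion', 'CD'), ('.', '.'),
]


def is_nominal(pair):
    """True for tokens tagged as nouns (NN*) or adjectives (JJ*)."""
    _, tag = pair
    return tag.startswith("NN") or tag.startswith("JJ")


def noun_phrases(tagged_tokens):
    """Join each maximal run of adjective/noun tokens that contains a noun."""
    phrases = []
    for nominal, group in groupby(tagged_tokens, key=is_nominal):
        run = list(group)
        if nominal and any(tag.startswith("NN") for _, tag in run):
            phrases.append(" ".join(word for word, _ in run))
    return phrases


print(noun_phrases(pos_tags))
```

On this sentence the sketch groups "U.K." and "startup" into a single phrase, which is closer to what the headline means than two isolated tokens.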

Step 6: NER (Named Entity Recognition) - Extracting Real-World Things

Named Entity Recognition takes chunking a step further by labeling known entities like organizations, places, people, dates, and monetary amounts. spaCy is excellent for this task.

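import spacy

# Assumes the small English model is installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp(sentence)

# Print each detected entity with its label.
for ent in doc.ents:
    print(ent.text, "->", ent.label_)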

Output:

Apple -> ORG
U.K. -> GPE
$1 billion -> MONEY

Now we know "Apple" is an organization, "U.K." is a location, and "$1 billion" is a monetary value — critical insights for any application.
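
To turn these entities into the structured form the assistant needs, one option is to group them by label. The sketch below hard-codes the (text, label) pairs from the output above so it runs without spaCy; the helper name group_by_label is hypothetical.

```python
from collections import defaultdict

# Entity (text, label) pairs copied from the spaCy output above.
entities = [("Apple", "ORG"), ("U.K.", "GPE"), ("$1 billion", "MONEY")]


def group_by_label(entity_pairs):
    """Collect entity texts under their label, preserving order of appearance."""
    grouped = defaultdict(list)
    for text, label in entity_pairs:
        grouped[label].append(text)
    return dict(grouped)


record = group_by_label(entities)
print(record)
```

With a live spaCy doc, the same function could be fed [(ent.text, ent.label_) for ent in doc.ents].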

Step 7: Word Embeddings - Understanding Meaning

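To compare words semantically, we convert them into vectors using word embeddings. These vectors capture how similar or related words are, based on usage in large corpora. Note that spaCy's small model (en_core_web_sm) does not ship with static word vectors, so token.vector there comes from the model's context-sensitive tensors; the larger en_core_web_md and en_core_web_lg models include true word vectors.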

print(doc[0].text, "\nVector:\n", doc[0].vector[:5])  # Truncated for display

Output (illustrative; the actual numbers depend on the model and version):

Apple 
Vector:
[0.122, -0.334, ..., 0.238]

This dense numeric representation is the foundation for deep learning models to understand language in terms of context and relationships.
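
Similarity between embeddings is commonly measured with cosine similarity. The sketch below implements it with the standard library. The three-dimensional vectors are invented toy values, not output from any model, and are there only to show the computation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors made up for illustration; real embeddings have hundreds of dimensions.
toy_vectors = {
    "startup": [0.9, 0.1, 0.3],
    "company": [0.8, 0.2, 0.4],
    "river": [0.1, 0.9, 0.2],
}

print(cosine_similarity(toy_vectors["startup"], toy_vectors["company"]))
print(cosine_similarity(toy_vectors["startup"], toy_vectors["river"]))
```

With these toy values, "startup" scores closer to "company" than to "river". Real embeddings aim to produce that kind of ordering from usage patterns in text.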

Conclusion: Turning Words into Insight

What started as a raw sentence has now been:

  • Tokenized into words
  • Tagged with grammatical meaning
  • Cleaned of stopwords
  • Normalized through stemming and lemmatization
  • Chunked into phrases
  • Analyzed for real-world entities
  • Represented as vectors

Through this journey, your smart assistant now knows that Apple (an organization) is considering a financial move involving a U.K. startup, and the deal size is $1 billion.

Further Reading

For more in-depth information on NLP and related topics, consider exploring:

  • NLP Glove, Bert, TF-IDF, LSTM Explained - A comprehensive guide to NLP techniques using GloVe, BERT, TF-IDF, and LSTM.
  • NLTK - A leading platform for building Python programs to work with human language data.
  • spaCy - An open-source library for natural language processing in Python.
  • spaCy 101 - A tutorial for getting started with spaCy.
  • NLTK Book - A comprehensive guide to natural language processing with NLTK.
  • NLTK Corpora - NLTK has built-in support for dozens of corpora and trained models.
  • English Corpora - A collection of corpora for the English language.
  • Full-Text Corpus Data - Downloadable corpora of English and other languages.

More scenarios will keep coming as we move along.
