PyFriday Tutorial: Text Data Types in NLP
Welcome to PyFriday! If this is your first time exploring Natural Language Processing (NLP), you’re in the right place. NLP is all about teaching computers to understand, interpret, and interact with human language. To do that effectively, we use various text data types to represent and process text. Before we dig into the big NLP topics that will go from data cleaning to building out your own Machine Learning models, we’ll need to take some time to learn about these text data types.
Please keep in mind that while we aim to keep this tutorial very beginner friendly, we do assume some experience with Python. If you’re completely new to Python, we recommend checking out our earlier PyFriday entries!
Normally, we host all code presented on Cody’s GitHub. This time, however, since it’s a bunch of smaller examples, we’ll keep the examples in the article.
Table of Contents
- What is Text Data in NLP?
- Strings: Definition, Use Cases, and Examples
- Lists of Strings: Definition, Use Cases, and Examples
- Dictionaries: Definition, Use Cases, and Examples
- DataFrames: Definition, Use Cases, and Example
- Token Objects: Tokens for NLP, Use Cases, and Examples
- Sparse Matrices: Efficient Data Structures, Use Cases, and Examples
- Vectors and Embeddings: Dense Vector Representations, Use Cases, and Examples
- Sequences: Ordered Collections for Modeling, and Examples
What is Text Data in NLP?
In NLP, text data refers to the words, sentences, or paragraphs we process. Computers can’t understand text like humans do—they need structured representations of text to work with it. These representations are the text data types we’ll discuss.
Imagine you want a computer to analyze this sentence:
“Welcome to PyFriday!”
The computer must:
- Store the text.
- Break it into parts (e.g., words or letters).
- Analyze it (e.g., find the most important words).
- Apply meaning (e.g., recognize “PyFriday” as a name).
Each step calls for different processing, which often means representing the data with different types.
Strings: Definition, Use Cases, and Examples
What Are Strings?
A string is a basic data type that represents text. You can think of it as a single piece of text, like a word, sentence, or paragraph. Computers use strings to store and manipulate text; a string can be as short as a single letter or as long as an entire book.
Why Use Strings in NLP?
Strings are fundamental for:
- Storing raw text data, like a tweet or a paragraph. It’s basically as fundamental a data type as integers or boolean values.
- Doing simple text processing, such as converting to lowercase or removing punctuation.
Example
Here, we assign a sentence (a string) to the variable text and print it. Next, we turn all letters lowercase and print again. Lastly, we split the string into a list of tokens (both terms will be discussed shortly) and print that. These are quick examples of how string data is read in and can be cleaned.
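A minimal sketch of that sequence (the variable names are illustrative):

```python
# Store the raw sentence as a string
text = "Welcome to PyFriday!"
print(text)

# Convert every letter to lowercase
lowered = text.lower()
print(lowered)

# Split the string into a list of tokens (here, split on whitespace)
tokens = lowered.split()
print(tokens)
```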
Lists of Strings: Definition, Use Cases, and Examples
What Are Lists of Strings?
A list is like a container that can hold multiple pieces of text (strings). When you split a sentence into words, you usually end up with a list of strings.
Why Use Lists of Strings in NLP?
- Storing tokenized text (text broken into smaller pieces like words).
- Processing multiple pieces of text in order (e.g., sentences in a paragraph).
- Organizing and categorizing data (e.g., a list of strings gives you an organized way to look at chief complaints by visit).
- These can be nested further into lists of lists of strings, for example if you want to see chief complaints by patient over time, by hospital.
Examples
Here, we take our original sentence, split it by word to form a list, and then rejoin the words from the list back into a single string, with a space between each entry to keep things intelligible.
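A minimal sketch of that round trip (variable names are illustrative):

```python
# Split the original sentence by word to form a list of strings
text = "Welcome to PyFriday!"
words = text.split()
print(words)  # ['Welcome', 'to', 'PyFriday!']

# Rejoin the list back into a single string, with a space between entries
rejoined = " ".join(words)
print(rejoined)  # 'Welcome to PyFriday!'
```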
Dictionaries: Definition, Use Cases, and Examples
What Are Dictionaries?
A dictionary is a data type that pairs values with keys. Think of it like a mini-database where each key has a value. In NLP, dictionaries are often used to store information about words, like their frequency in a document.
Why Use Dictionaries in NLP?
- To count words (important for understanding which words are common or rare).
- To map words to other data, like their synonyms or embeddings (discussed later).
Examples
For this example, we’re taking a sentence and putting it into a variable. Then, we set up an empty dictionary called word_counts. From there, we use a for loop to iterate over every word, and each time a word shows up, we add to its count value in the dictionary.
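A minimal sketch of that loop (the sentence itself is illustrative):

```python
# An illustrative sentence with some repeated words
text = "welcome to pyfriday welcome to nlp"

# Set up an empty dictionary to hold the counts
word_counts = {}

# Iterate over every word; each time a word shows up, add to its count
for word in text.split():
    word_counts[word] = word_counts.get(word, 0) + 1

print(word_counts)  # {'welcome': 2, 'to': 2, 'pyfriday': 1, 'nlp': 1}
```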
DataFrames: Definition, Use Cases, and Example
What Are DataFrames?
A DataFrame is a table-like data structure (think of an Excel sheet). It’s great for handling text data with additional information, such as labels or metadata.
Why Use DataFrames in NLP?
- To organize large datasets, like medical records with lots of subfields.
- To store text and related attributes for machine learning, analysis, etc.
Example
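A minimal sketch using pandas; the column names and rows here are illustrative placeholders, not the original dataset:

```python
import pandas as pd

# An illustrative dataset: each row pairs a piece of text with a label
df = pd.DataFrame({
    "text": ["Welcome to PyFriday!", "NLP is fun", "Fun with PyFriday"],
    "label": ["greeting", "opinion", "opinion"],
})

print(df)

# DataFrames make column-wide text operations easy
print(df["text"].str.lower())
```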
Token Objects: Tokens for NLP, Use Cases, and Examples
A token object is a specialized representation of a word or a segment of text used by NLP libraries such as spaCy or NLTK. Tokens go beyond just being individual words—they include additional information that helps computers understand the structure and meaning of the text. For instance, a token might know whether it’s a noun, verb, or adjective, its base form (lemma), or even its position in a sentence.
When raw text is processed by an NLP library, the library “tokenizes” the text, breaking it into smaller parts (tokens). Each token becomes a rich object containing metadata about that piece of text.
Key Features of Token Objects
- Text Content: The actual word or segment of text represented by the token.
- Lemma: The base form of the word (e.g., “running” becomes “run”).
- Part of Speech (POS): The grammatical category of the word (e.g., noun, verb, adjective).
- Dependencies: Relationships between words in the sentence (e.g., subject, object).
- Offsets: Start and end positions of the token in the original text.
- Attributes: Whether the token is a punctuation mark, stop word (common but unimportant word like “the” or “and”), or other special text.
Why Use Token Objects in NLP?
Token objects are invaluable because they allow detailed linguistic analysis without having to write custom code for parsing and annotating text. These objects make it easy to:
- Analyze grammatical structure: Understand how words function in a sentence.
- Extract metadata: Retrieve information like lemmas and part-of-speech tags for downstream tasks.
- Annotate text efficiently: Use built-in attributes to quickly identify features like stop words or punctuation. Several functions in NLTK and spaCy perform part-of-speech annotation for you automatically, making many tasks far less troublesome.
Examples
For this example, we’re loading in Spacy and from it, loading in a small English pipeline that’s very feature rich. We then use this pipeline to take core data from our sentence using the nlp method. Then for each word, we print out the text, the lemma, and the part of speech the word or symbol belongs to.
Sparse Matrices: Efficient Data Structures, Use Cases, and Examples
What Are Sparse Matrices?
A sparse matrix is a compact way to store data with lots of zeroes. In NLP, it’s used to represent text as numbers, as in the bag-of-words model, where each word’s presence is encoded as 1 (present) or 0 (absent). It should be noted that “compact” does not mean small; while the storage requirements are usually much smaller than those of an equivalent dense representation, for large corpora you can quickly hit a matrix with billions of cells.
Why Use Sparse Matrices in NLP?
- To efficiently store large datasets.
- To prepare text for simpler machine learning models. Dense matrices, which are much less memory efficient, are often used for more complex tasks.
Examples
This one is a bit trickier without some underlying knowledge, but let’s dig into it. First, we import CountVectorizer from scikit-learn, a machine learning package, and assign this tool for converting text into a numerical matrix to an easier-to-use name, vectorizer. Then, we have our documents that we want to fit into a matrix. When we say fit, we mean the vectorizer looks at the word choices in each document, identifies each unique word, and converts this information into a matrix showing whether each unique word actually shows up in each document. In this case, the words “fun”, “nlp”, “is”, “with”, and “pyfriday” are all unique terms. This function also arranges words in alphabetical order, so we need to keep that in mind when examining whether a word is or is not present in the subsequent matrix.
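A minimal sketch of that process; the two documents below are reconstructed from the unique words the paragraph lists, so treat them as illustrative:

```python
from sklearn.feature_extraction.text import CountVectorizer

# The tool for converting text into a numerical matrix
vectorizer = CountVectorizer()

# Illustrative documents, reconstructed from the vocabulary described above
documents = ["NLP is fun", "Fun with PyFriday"]

# fit_transform learns the vocabulary and encodes each document
matrix = vectorizer.fit_transform(documents)

# Feature names come back in alphabetical order: fun, is, nlp, pyfriday, with
print(vectorizer.get_feature_names_out())
print(matrix.toarray())
```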
Vectors and Embeddings: Dense Vector Representations, Use Cases, and Examples
What Are Vectors and Embeddings?
In NLP, vectors are numerical representations of text. They are a way to convert words, phrases, or even entire documents into a format that machines can process: arrays of numbers. Each dimension in a vector represents some aspect of the text, such as its occurrence in a corpus, its relationship to other words, or its position in a semantic space.
Embeddings are a specific type of vector representation that go beyond simple numerical representations like one-hot encoding or term frequency. Embeddings are dense vectors learned from data, and they capture the semantic meaning of words by placing similar words closer together in the vector space.
Why Use Vectors and Embeddings in NLP?
- To represent text numerically for machine learning.
- To capture relationships between words (e.g., “king” and “queen” are related).
Examples
Here, we have a slightly different bit of code than we did with the sparse matrix. Namely, we’re using a vectorizer that takes relative importance into account, so we don’t just have 1’s and 0’s; instead, each cell holds a score between 0 and 1 showing how critical a term is to the text.
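A minimal sketch, assuming the vectorizer in question is scikit-learn’s TfidfVectorizer, and reusing the illustrative documents from the sparse-matrix example:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF weights each term by how distinctive it is to a document
vectorizer = TfidfVectorizer()

documents = ["NLP is fun", "Fun with PyFriday"]

# Each cell is now a weight between 0 and 1 rather than a raw count
matrix = vectorizer.fit_transform(documents)

print(vectorizer.get_feature_names_out())
print(matrix.toarray())
```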
Sequences: Ordered Collections for Modeling, and Examples
What Are Sequences?
A sequence in NLP refers to an ordered list of elements that can represent words, characters, or tokens. Sequences preserve the order in which these elements appear in the text, which is crucial for understanding the context and meaning of the text. Unlike individual words or isolated tokens, sequences allow models to capture the relationships between elements over time or across positions.
Sequences are used in models designed to process and understand text step-by-step, such as Recurrent Neural Networks (RNNs), Long Short-Term Memory networks (LSTMs), and modern architectures like Transformers.
Why Use Sequences in NLP?
- Language inherently depends on order. For example:
  - The sentence “I ate the cake” is vastly different from “The cake ate I.”
  - Words gain meaning from their position and their relationship with surrounding words.
- Sequences help capture:
  - Context: Understanding how earlier elements in the sequence influence later ones (e.g., “not happy” conveys a different meaning than “happy”).
  - Order: The specific arrangement of elements matters (e.g., the difference between “He eats quickly” and “Quickly, he eats”).
  - Dependency: Words or tokens often depend on one another over long distances in a text.
Examples
This is an incredibly simple example, mostly because we’ll cover sequences in much more depth later. For now, though, the most basic of basics will do.
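A minimal sketch of the most basic of basics: encoding a sentence as an ordered sequence of integer IDs (the sentence and toy vocabulary are illustrative):

```python
# A sentence as an ordered sequence of tokens
sentence = "I ate the cake"
tokens = sentence.lower().split()

# Map each unique token to an integer index (a toy vocabulary)
vocab = {token: idx for idx, token in enumerate(sorted(set(tokens)))}

# The integer-encoded sequence preserves the original word order,
# which is exactly what sequence models like RNNs and Transformers consume
sequence = [vocab[token] for token in tokens]

print(tokens)  # ['i', 'ate', 'the', 'cake']
print(sequence)
```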
Conclusion
In this tutorial, we explored the foundational text data types used in Natural Language Processing (NLP), starting with basic strings to represent raw text, followed by lists of strings for tokenized words or sentences. We then delved into dictionaries, which map words to values like frequencies, and DataFrames, which help organize large datasets with metadata. We also introduced token objects from NLP libraries, providing rich linguistic details, and sparse matrices, which efficiently store numerical representations of text. Finally, we covered vectors and embeddings, which capture semantic meanings, and sequences, essential for ordered text processing in advanced models. By understanding these data types, you’ve built a strong foundation to process, analyze, and model textual data effectively.
Happy learning, and see you at the next PyFriday! 🚀
Humanities Moment
The featured image for this PyFriday is Manuscript Book mural in Evolution of the Book series (1896) by John White Alexander (American, 1856-1915). John White Alexander was an American painter and illustrator known for his portraits, decorative works, and murals. Orphaned as a child, he was raised by his grandparents and later mentored by Edward J. Allen, who nurtured his artistic talent. Starting as an illustrator for Harper’s Weekly, he pursued formal training in Europe, gaining influence from artists like Whistler and Duveneck. Returning to New York, he achieved acclaim for his portraits of prominent figures and murals like “Apotheosis of Pittsburgh,” with his works displayed in major museums across the United States and Europe.