Natural Language Processing In 10 Lines Of Code: Part 1

Back in September we led a workshop at PyCon UK titled ‘Natural Language Processing in 10 Lines of Code’. Due to its popularity, we decided to share the tutorial here.

Natural language processing (NLP) is a way for computers to analyse, understand, and derive meaning from human language. At Cytora we use NLP to extract and analyse unstructured web data, such as news articles, and turn them into structured data sets that organisations can use to measure things like economic and social change.

In a previous post, we briefly discussed how NLP can be used to sift through large amounts of written data and extract insights. In this post, we are going to show you how to do it at a basic level using spaCy, a powerful NLP library.

We will analyse the well-known text Pride and Prejudice and extract some interesting insights about the characters in the book. For those who are not familiar with Pride and Prejudice, it is a famous novel by Jane Austen that was first published in 1813. The story revolves around the lives of the five Bennet daughters after two gentlemen move into their neighbourhood: the rich and eligible Mr. Bingley, and his status-conscious friend, the even richer and more eligible Mr. Darcy.

You can find instructions on how to install everything needed for this tutorial in the workshop repository on the Cytora GitHub.

Here are our objectives for this analysis of Pride and Prejudice:

  • Extract the character names from the book (e.g. Elizabeth, Darcy, Bingley)

  • Visualise character occurrences with regard to their relative position in the book (e.g. are specific characters mentioned more at the beginning of the book and others more towards the end?)

STEP 1: PROCESS THE TEXT

Our first task is to parse Pride and Prejudice using the spaCy NLP parser. The following code performs tokenization and sentence identification, giving us access to things like part-of-speech tags, syntactic dependency trees, and named entities.

import spacy

def read_file(file_name):
    with open(file_name, 'r') as file:
        return file.read()

# Process `text` with the spaCy NLP parser
text = read_file('data/pride_and_prejudice.txt')
nlp = spacy.load('en')
processed_text = nlp(text)
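
As a quick sanity check, here is a short illustrative snippet (our own addition, not part of the original ten lines) that peeks at what the parsed document exposes: sentences, part-of-speech tags, syntactic dependencies, and named entities.

# Illustrative only: inspect the parsed document
first_sentence = next(processed_text.sents)
print(first_sentence.text)

# Part-of-speech tag and syntactic dependency of each token in the sentence
for token in first_sentence:
    print(token.text, token.pos_, token.dep_)

# A handful of the named entities spaCy found, with their labels
print([(ent.text, ent.label_) for ent in processed_text.ents[:10]])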

STEP 2: FIND ALL OF THE CHARACTERS' NAMES

Next, we want to extract all of the characters' names from the book. We can do this using spaCy's named entities. Named-entity recognition (NER) is a subtask of information extraction that seeks to identify named entities in text, such as "Mr. Darcy", and classify them into predefined categories, such as the names of people, organisations, locations, and quantities.

Every spaCy named entity has a label, such as PERSON or PRODUCT. We only want to extract the characters' names, so we will only collect entities labelled PERSON.

from collections import Counter

def find_character_occurrences(doc):
    """
    Return a list of characters from `doc` with their
    corresponding number of occurrences.

    :param doc: spaCy NLP parsed document
    :return: list of tuples in the form
        [('elizabeth', 622), ('darcy', 312), ('jane', 286)]
    """
    characters = Counter()
    for ent in doc.ents:
        if ent.label_ == 'PERSON':
            characters[ent.lemma_] += 1

    return characters.most_common()

print(find_character_occurrences(processed_text)[:20])

STEP 3: PLOT THE NAMES AS A TIME SERIES

In step 2, we explained how to extract all the characters from the text. Our second objective is to visualise the characters with regard to their relative position in the book. To achieve this, we can rewrite find_character_occurrences into a new function, get_character_offsets, which stores the token offset of every PERSON entity instead of a simple count.

A token offset is a number that represents the position of a token in the text. To visualise where the characters appear in the book, we can use a plotting function, plot_character_timeseries, which accepts the results of get_character_offsets and a list of the names we want to visualise (a sketch of this function follows the code below).

from collections import defaultdict

def get_character_offsets(doc):
    """
    For every character in `doc`, collect all of the
    occurrence offsets and store them in a list.
    Returns a dictionary with the character lemma as the key
    and the list of occurrence offsets as the value.

    :param doc: spaCy NLP parsed document
    :return: dict object in the form
        {'elizabeth': [123, 543, 4534], 'darcy': [205, 2111]}
    """
    character_offsets = defaultdict(list)
    for ent in doc.ents:
        if ent.label_ == 'PERSON':
            character_offsets[ent.lemma_].append(ent.start)

    return dict(character_offsets)

character_offsets = get_character_offsets(processed_text)
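
The plotting helper itself lives in the workshop repository; below is a minimal sketch of what plot_character_timeseries could look like using matplotlib histograms (the bin count and figure styling here are our own assumptions, not the workshop's exact code).

import matplotlib.pyplot as plt

def plot_character_timeseries(character_offsets, character_names):
    """
    Sketch: plot histograms of token offsets for the given character
    names, so the x-axis shows where in the book each character is
    mentioned.
    """
    offsets = [character_offsets[name] for name in character_names]
    plt.figure(figsize=(12, 4))
    plt.hist(offsets, bins=20, label=character_names)
    plt.xlabel('Token offset (position in the book)')
    plt.ylabel('Number of mentions')
    plt.legend()
    plt.show()

plot_character_timeseries(character_offsets, ['darcy', 'bingley'])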

Here are a few insights that we can derive from this time series:

  • Mr. Bingley is the prominent male character at the beginning of the story. This is because he shows his affection for Jane right at the start of the book.

  • The love story between Elizabeth and Mr. Darcy is more complicated than the one between Mr. Bingley and Jane. This is reflected in the pronounced zig-zag of their time series.

  • Elizabeth is the main character of the book, and this is why her love interest, Mr. Darcy, appears more frequently than Mr. Bingley in the central part of the book.

  • Both Mr. Darcy and Mr. Bingley remain important characters throughout the story, which is reflected in the similar number of mentions they each have towards the end of the book. This is due to the fact that, *spoiler alert*, both couples end up getting married.

If you would like to learn how to describe Mr. Darcy using spaCy parse tree dependencies, you can check out the full workshop on our GitHub repository.

Congratulations, you are now ready to start developing your own natural language processing projects in Python. If you managed to follow along and produce your own literature derived time series, please share it with us on Twitter.

In our next NLP post, we will apply these same techniques to a dataset of real events: the RAND Terrorism Dataset.