Back in September, we led a workshop at PyCon UK titled ‘Natural Language Processing in 10 Lines of Code’. Due to its popularity, we decided to share the tutorial here.
In our previous post, we used some basic techniques to analyse Pride and Prejudice and extract some interesting insights about the characters in the book. In this post, we are going to apply the same analysis techniques to a dataset of real events: the RAND Terrorism dataset, a set of 40,000 news articles, collected from 1968 to 2009, that report on terrorist activity.
Here are some questions we are going to try to answer during this analysis:
Who are the terrorist groups and other persons mentioned in each article?
What locations are mentioned in each article? Hint: a location just has a different label to a person.
With all of this information, is it possible to plot a figure expressing the relationships between locations and terrorists?
You can find instructions on how to install everything needed for this tutorial in the workshop repository on the Cytora GitHub.
For speed, we have preprocessed the dataset for this task, reducing it to 10,033 articles and removing extraneous features.
THINGS TO CONSIDER WHEN USING REAL DATA
We are going to use the exact same approach to analyse this dataset as we used in our last post to analyse Pride and Prejudice. However, when using real data there can be some pitfalls and quirks to consider due to inconsistent data quality.
In the previous task, we only used personal named entities to identify characters. For this task, you will need to expand your selection to groups and organisations using the spaCy “ORG” label. Don’t assume this will only give you terrorist groups, as the UN is mentioned many times in this dataset.
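As a rough illustration of widening the selection, the sketch below groups the entities of a parsed article by label. The helper name `entities_by_label` and the choice of labels are our own assumptions for this example; "PERSON", "ORG", "GPE" and "LOC" are labels used by spaCy's English models, but you may want a different set.

```python
from collections import defaultdict

# Entity labels from spaCy's English models: PERSON (people),
# ORG (organisations), GPE (countries, cities, states), LOC (other locations).
# Which labels to keep is an assumption made for this sketch.
WANTED_LABELS = frozenset({"PERSON", "ORG", "GPE", "LOC"})


def entities_by_label(doc, labels=WANTED_LABELS):
    """Group the entity texts of a spaCy Doc by entity label.

    `doc` is anything with an `.ents` sequence whose items expose
    `.text` and `.label_`, as a spaCy Doc does.
    """
    grouped = defaultdict(set)
    for ent in doc.ents:
        if ent.label_ in labels:
            grouped[ent.label_].add(ent.text)
    return dict(grouped)
```

With spaCy and an English model installed (for example `en_core_web_sm`, which is an assumption here rather than the model used in the workshop), you could call `entities_by_label(nlp(article_text))` after `nlp = spacy.load("en_core_web_sm")`. Remember that the "ORG" set will still contain bodies such as the UN, so it needs filtering before you treat it as a list of terrorist groups.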
Another factor to consider is that a terrorist group might be named differently depending on the author of the article. ‘Al-Qaeda’, ‘Al Qaeda’ and ‘Alqaeda’ all appear in this dataset. We know that these names all refer to the same entity, but spaCy does not. If you want to group together similar names for more consistent data, you could do so using pattern replacement methods.
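One possible way to do this is with a regular expression, sketched below. The alias table and the function name `normalise_names` are hypothetical, and the pattern only covers the three spellings mentioned above; a real alias list would need to be built by inspecting the data.

```python
import re

# Hypothetical alias table: each pattern maps spelling variants to one
# canonical name. Only the variants named in this post are covered.
ALIASES = [
    (re.compile(r"\bal[\s-]?qaeda\b", re.IGNORECASE), "Al-Qaeda"),
]


def normalise_names(text):
    """Replace known spelling variants with their canonical form."""
    for pattern, canonical in ALIASES:
        text = pattern.sub(canonical, text)
    return text
```

You could run this over each article before passing it to spaCy, or over the extracted entity names afterwards; which works better will depend on how the variants affect entity recognition in your data.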
By extracting the mentions and locations of particular terrorist groups from each article, we can examine terrorist activity by location to understand the risks posed by a certain group across the regions we are interested in.
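A minimal sketch of this counting step is shown below. It assumes you already have, for each article, the set of group names and the set of location names; the function name `count_cooccurrences` is our own. It counts each group and location pair once per article, which is one reasonable choice but not necessarily how the figures later in this post were produced.

```python
from collections import Counter
from itertools import product


def count_cooccurrences(articles):
    """Count how many articles mention each (group, location) pair.

    `articles` is an iterable of (groups, locations) pairs, where each
    element is a collection of names found in one article.
    """
    counts = Counter()
    for groups, locations in articles:
        # set() ensures a pair is counted at most once per article.
        counts.update(product(set(groups), set(locations)))
    return counts
```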
GLOBAL INCIDENTS BY TERRORIST GROUP
Using Seaborn, we can create this visualisation of the output of our analysis:
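A heat map of this kind could be drawn roughly as follows. This is a sketch, not the exact code behind our figure: it assumes `counts` is a mapping from (group, location) pairs to mention counts, and the helper names `to_matrix` and `plot_heatmap` are our own. Groups are placed in columns and locations in rows.

```python
def to_matrix(counts, groups, locations):
    """Build a rows-by-columns grid: one row per location, one column per group.

    Pairs missing from `counts` are treated as zero mentions.
    """
    return [[counts.get((group, loc), 0) for group in groups] for loc in locations]


def plot_heatmap(counts, groups, locations):
    """Draw an annotated heat map of mention counts with Seaborn."""
    # Imported here so to_matrix can be used without the plotting libraries.
    import matplotlib.pyplot as plt
    import seaborn as sns

    ax = sns.heatmap(
        to_matrix(counts, groups, locations),
        annot=True,   # write the count in each cell
        fmt="d",      # counts are integers
        xticklabels=groups,
        yticklabels=locations,
    )
    ax.set_xlabel("Group")
    ax.set_ylabel("Location")
    plt.tight_layout()
    plt.show()
```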
From this visualisation we can extract some key insights:
The Taliban is mentioned in relation to Afghanistan over 1000 times, and in relation to its capital city, Kabul, 155 times
Hamas is most closely linked to Gaza and Israel
Hamas and Palestine are frequently mentioned together, but less frequently than Hamas and Israel
As this dataset only covers 1968 to 2009, Islamic State (ISIS) is not mentioned
If we look at the first column compared to all others, we can see that Al-Qaeda have the widest spread of mentions, being identified in 12 of the 13 areas that we chose to inspect
Despite Al-Qaeda originating in Afghanistan, the three locations mentioned most often in relation to it are all in Iraq, or are Iraq itself
As this dataset is spread across a large chunk of time (1968 to 2009), we could infer that Al-Qaeda were not reported as a terrorist group, at least in this dataset, until their activity in Iraq in the last 15 years
It is important to keep in mind that the curators of this dataset could have introduced bias. For example, a US-based nonprofit group may have a vested interest in a particular inference from this data.
In this analysis, we did not consider adding the subgroups and offshoots of each terrorist group. Doing so might yield a more accurate representation of activity. We could also use the raw unprocessed data, which includes the date for each report, to slice the articles by decade, creating a series of heat maps which analyse terrorist group mentions over time.
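Slicing by decade could look something like the sketch below. It assumes each report's date has already been reduced to an integer year; we have not shown the raw date format here, so that parsing step is left out, and the function name `group_by_decade` is hypothetical.

```python
from collections import defaultdict


def group_by_decade(articles):
    """Bucket articles by the decade in which they were reported.

    `articles` is an iterable of (year, text) pairs, with `year` an int.
    Returns a dict mapping the first year of each decade to its articles.
    """
    by_decade = defaultdict(list)
    for year, text in articles:
        decade = year // 10 * 10  # e.g. 1968 -> 1960
        by_decade[decade].append(text)
    return dict(by_decade)
```

Each bucket could then be run through the same entity extraction and counting steps to produce one heat map per decade.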