At Cytora, we use the web to measure what is changing in the world. We do this by extracting important events - such as fires or cyber attacks - from web data - such as news articles.
This is no easy task; the internet is awash with authors of varying quality and style. Poor quality content, such as fake news, can make it difficult for machine and human readers to efficiently extract real insights about the world.
How do we filter out the junk?
Basic natural language processing (NLP) techniques allow us to set up rules which filter out most of the irrelevant content, but in order to create an accurate representation of world events, we need to plough through millions of pieces of data every day. This requires a more robust approach.
The challenge of algorithmically extracting real-world events from a deluge of textual data is simplified by concentrating on information-dense words, such as verbs and nouns, as they often describe ‘actions’ and ‘agents’.
One possible solution is to reduce sentences into triples.
A triples algorithm is a natural language processing technique that reduces a sentence to its three most important words. For researchers and analysts who deal with web data, this is a powerful tool.
How do triples work?
Triples distil a sentence down to three vital words: a subject, a predicate, and an object. As an example, we’ll take this snippet of breaking news:
“Technology giant Disruption Inc. has announced the layoff of over 10,000 of its staff.”
Almost all sentences contain a predicate and a subject. When the predicate is a verb, its subject is the entity that performs the action. In this example, we can identify the subject as Disruption Inc., which is performing an action - to announce.
The object of the sentence is the entity that the action is done to. Here it is the layoff. Taking these three items, we can reduce our sentence down to its triple:
Disruption Inc., announced, layoff
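In code, a triple is just a three-part record. A minimal sketch in Python (the names here are illustrative, not part of any particular library):

```python
from collections import namedtuple

# A triple captures the subject, predicate, and object of a sentence.
Triple = namedtuple("Triple", ["subject", "predicate", "object"])

layoff_event = Triple(subject="Disruption Inc.",
                      predicate="announced",
                      object="layoff")

print(layoff_event.subject)  # Disruption Inc.
```

Storing events in this uniform shape is what makes them easy to index and query later on.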
In order to do this on a large scale, we must be able to determine this triple automatically. Using standard NLP techniques, we split a sentence into its constituent words and describe the relationships between them.
Each word or punctuation mark is a token, and all of the tokens are linked via dependencies. The dependency graph for our example might look like this:
A simple algorithm would begin by locating a relevant noun subject and following the dependency links to locate the predicate and object. Now we’ve got the theory, let’s take a look at a recent example, which appeared in the Financial Times:
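The algorithm just described can be sketched in a few lines of Python. This is a toy version: it hand-builds the dependency links for our example sentence rather than running a real parser (in practice you would get them from a dependency parser such as spaCy), and the `nsubj`/`dobj` labels follow common dependency-grammar conventions:

```python
from dataclasses import dataclass

# Each token records its text, its dependency label, and the index
# of its head token within the sentence.
@dataclass
class Token:
    text: str
    dep: str    # dependency label, e.g. "nsubj", "dobj", "ROOT"
    head: int   # index of this token's head in the sentence

def extract_triple(tokens):
    """Locate a nominal subject, take its head as the predicate,
    then look for a direct object attached to the same predicate."""
    for tok in tokens:
        if tok.dep == "nsubj":
            predicate = tokens[tok.head]
            for obj in tokens:
                if obj.dep == "dobj" and obj.head == tok.head:
                    return (tok.text, predicate.text, obj.text)
    return None

# A simplified, hand-built dependency parse of the example sentence.
sentence = [
    Token("Disruption Inc.", "nsubj", 1),
    Token("announced", "ROOT", 1),
    Token("layoff", "dobj", 1),
]

print(extract_triple(sentence))
# ('Disruption Inc.', 'announced', 'layoff')
```

Real sentences need more care (passive voice, clausal objects, conjunctions), but the core walk over the dependency graph is the same.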
“The move by Rio came after it emerged on Wednesday that lawyers working for the miner had uncovered internal emails about the questionable $10.5m payment more than a year ago.”
This short snippet includes some potentially useful temporal information (Wednesday, after) and details ($10.5m). The sentence also contains words that require additional context (the move), which we'll ignore here. We want to distil this sentence down to an object, a subject, and a predicate.
The event described in the sentence is summarised by the following triple:
Lawyers, uncovered, emails
Triples dramatically improve text mining and analysis
Triples provide a structured way to search through large corpora of documents and ask some interesting questions, such as:
What did Apple do? - e.g. announced, closed, hired, fired
What do lawyers do? - e.g. draft, approve, uncover
What can happen to earnings? - e.g. release, disappoint, cheer
Who has hired? - e.g. Apple, Microsoft, Netflix
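Once events are stored as triples, each of these questions becomes a simple filter over one slot of the triple. A short sketch, using a handful of made-up triples for illustration:

```python
# Hypothetical extracted triples (illustrative data, not real extractions).
triples = [
    ("Apple", "announced", "iPhone"),
    ("Apple", "hired", "engineers"),
    ("lawyers", "uncovered", "emails"),
    ("Microsoft", "hired", "designers"),
]

# "What did Apple do?" - collect predicates where Apple is the subject.
apple_actions = {pred for subj, pred, obj in triples if subj == "Apple"}
print(sorted(apple_actions))  # ['announced', 'hired']

# "Who has hired?" - collect subjects where the predicate is "hired".
hirers = {subj for subj, pred, obj in triples if pred == "hired"}
print(sorted(hirers))  # ['Apple', 'Microsoft']
```

At scale the same queries would run against an indexed store rather than a Python list, but the pattern of fixing two slots and asking for the third is unchanged.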
Using triples, we are able to uncover things that a simple keyword search cannot reveal, and mine huge quantities of data in order to construct knowledge.
There are many use cases for this: a venture capital fund could use the technique to search the web and identify companies that have completed a funding round, or a risk analyst within an asset manager could use it to identify specific companies that have experienced layoffs.
When it comes to text analysis, triples can dramatically improve efficiency and in turn add value to many business processes.