Using The Internet As A Dataset To Price Risk

Risks today are either prominent and well-managed or emerging and misunderstood. The ‘middle’ of risk is disappearing, and gathering comprehensive data from new sources is now central to an insurer's ability to price effectively. Adjuster notes, police reports, and medical records are already being used for this purpose, but the largest dataset ever created remains severely underutilised: the internet.

Web data offers distinctive value. Thanks to the proliferation of technology and the dwindling cost of computation, almost every significant loss event in the world is now recorded and published online. The volume of observations per insurable risk is skyrocketing.

Consider this: close to 100% of all product recalls and changes in business exposure are reported online, and, if you know where to look, you can find granular information about almost every high-profile cyber attack of the past five years.


The commercial opportunity should be seen in the context of building analytic sophistication in the underexplored ‘middle’ of risk [1]. Currently, a disjuncture exists whereby capital is seeking to enter new, under-capacity risk verticals but insurers lack the analytic assurance to price with confidence.

Using data available on the web, insurers can significantly increase their observation power across the insurance value chain, from risk pricing and underwriting to accumulation management. The benefits of web data are asymmetric, however: for some insurance lines, web-based datasets offer huge observation power, while for others, owing to the nature of the risk, they are less useful.

The key to generating value from the web is to locate the risk verticals where it offers a significant advantage relative to the internal data that insurers hold. Further value comes from finding opportunities to supplement internal data with external data and create supermodularity between the two, so that the combination is worth more than either alone.

Companies like Swiss Re are already using public data to improve underwriting results and reduce the number of questions the insurer has to ask consumers in order to underwrite them. Riccardo Baron, big data and smart analytics lead for Americas at Swiss Re, says currently available data opportunities were "inconceivable" only a few years ago [2].


Cytora uses AI to sift through news data, layering it with publicly available information from the UK Fire and Rescue Service. As a result, we know the total number of commercial fires reported in the UK, along with the geolocation and wider property attributes of each incident. For an insurer with a minority share of the commercial lines market, this provides a significant increase in the number of loss observations, enabling improved micro-segmentation and more surgical pricing.
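As an illustration of that layering step, here is a minimal sketch of matching a news-derived incident to the nearest official record by geolocation. The `Incident` structure, its field names, and the 0.5 km matching threshold are assumptions made for the example, not Cytora's actual schema or pipeline:

```python
from dataclasses import dataclass
from math import radians, sin, cos, asin, sqrt

# Hypothetical incident record: one instance extracted from news text,
# another from Fire and Rescue Service open data. Fields are illustrative.
@dataclass
class Incident:
    source: str          # e.g. "news" or "frs"
    lat: float
    lon: float
    property_type: str   # attribute carried by the official record

def haversine_km(a: Incident, b: Incident) -> float:
    """Great-circle distance between two incidents in kilometres."""
    dlat = radians(b.lat - a.lat)
    dlon = radians(b.lon - a.lon)
    h = sin(dlat / 2) ** 2 + cos(radians(a.lat)) * cos(radians(b.lat)) * sin(dlon / 2) ** 2
    return 2 * 6371 * asin(sqrt(h))

def link_incidents(news, official, max_km=0.5):
    """Pair each news-derived incident with the nearest official record
    within max_km, enriching the loss event with property attributes."""
    pairs = []
    for n in news:
        nearest = min(official, key=lambda o: haversine_km(n, o))
        if haversine_km(n, nearest) <= max_km:
            pairs.append((n, nearest))
    return pairs
```

Once linked, each news-reported fire inherits the property attributes of its official counterpart, turning a headline into a usable loss observation.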

Equally, there are thousands of product recalls reported by companies every year, naming the companies, sectors, products, units recalled and components involved. By capturing, connecting and applying machine learning to this data, we are able to quantify the expected number of recalls per sector and company, and to make predictions about the severity and cost of each recall.

Why are some risks worth observing via unstructured web data while others are not?

There are two driving factors here: publishing incentive and public visibility.

Publishing incentive:

The internet is an efficient, convenient way to promote, share and store data. When there is a durable incentive (regulatory, governmental, social or commercial) to report and share information, it usually ends up online. For example, company lawsuits are reported in the US due to long-standing regulations about transparency, and Companies House data in the UK, a web dataset that holds crucial information about financial ratings for commercial entities, was published as an innovation initiative by the UK government.

Public visibility:

Data is prevalent where losses are public, witnessed by people and communicated on social networks or news sites (e.g. commercial fires).

A good comparison is product recalls versus satellite failures. The web captures close to 100% of product recalls because companies are obligated to publish and share this information due to public safety and commercial considerations. Satellite failures, by contrast, happen less frequently and occur in the realm of private companies with little to no incentive to share the information publicly.
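These two factors can be made concrete in a toy scoring rule. The function below classifies expected web coverage from subjective incentive and visibility scores; since either factor alone is enough to push data online, it takes their maximum. The scores and thresholds are illustrative assumptions, not a calibrated model:

```python
def web_coverage(publishing_incentive: float, public_visibility: float) -> str:
    """Classify the expected web coverage of a risk vertical.

    Either factor alone is enough to push data online, so we take the
    maximum of the two scores. Inputs are subjective scores in [0, 1];
    the 0.7 and 0.4 thresholds are illustrative, not calibrated.
    """
    score = max(publishing_incentive, public_visibility)
    if score >= 0.7:
        return "High"
    if score >= 0.4:
        return "Moderate"
    return "Limited"
```

Under this rule, a product recall (high incentive) and a commercial fire (high visibility) both score "High", while a satellite failure (low on both) scores "Limited" — matching the comparison above.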


Different types of information can be captured relating to specific points within the insurer workflow. We have spoken so far about loss information, but in reality this operates on a continuum, connecting other areas including exposure changes and predictive rating factors.

  • Loss Information

Aggregated loss events describing where losses happen, to whom they happen, and inferences about the severity/quantum of each event. This enables an assessment of the loss frequency and severity of different risks based on combinations of attributes.

  • Exposure Shifts

Information about dynamic changes in exposure as properties and companies change their attributes and behaviours. This enables pricing and rating to be updated rapidly, including at the point of renewal.

  • Rating Factors

Factors which predict or correlate with the severity and occurrence of loss. This enables risks to be selected and priced in a segmented way. 
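The three information types above feed a standard frequency-severity calculation. A minimal sketch, with illustrative data structures (observed loss amounts for a segment, plus the risk-years of exposure behind them):

```python
from statistics import mean

def pure_premium(loss_amounts, exposure_years):
    """Frequency-severity sketch: expected annual loss cost per risk.

    loss_amounts: observed loss amounts for a segment (the 'loss
    information' above); exposure_years: total risk-years observed
    for that segment. Both structures are illustrative.
    """
    frequency = len(loss_amounts) / exposure_years          # losses per risk-year
    severity = mean(loss_amounts) if loss_amounts else 0.0  # average loss size
    return frequency * severity
```

More loss observations per segment — the gain web data offers — tighten both the frequency and severity estimates, which is what makes the finer micro-segmentation described earlier statistically viable.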

The nature of the risk and the incentives of the entity involved dictate the extent to which high-resolution data is produced. The matrix below shows which types of events are captured in web data based on the parameters outlined above. Incidents such as worker injuries and absenteeism are unlikely to be reported, so their web coverage is low relative to a product recall, where public reporting is mandatory.

Key: Green: High Coverage, Orange: Moderate Coverage, Red: Limited Coverage

Web data combined with machine learning provides a powerful means of improving insurers' understanding of risk, and its granularity, in emerging risk areas. This type of data will provide the analytic foundation for new models built in this space.


  1. Insurance Risk Study, Aon Benfield, 2015