Geography

Greater Seattle Area, West Coast

Partnership period

2007 – Ongoing

Industry

Finance, Investments, Technology

Solution

Crawling and aggregating data to detect future trends

Overview

PitchBook has been named Best Financial & Market Data Solution by SIIA for eight consecutive years and is a GeekWire Deal of the Year winner. Having maintained a 95% CAGR over 10 years, PitchBook is now part of Morningstar, a global financial services firm with a market cap of over $6B.

Goal

Develop a new microservice that helps PitchBook clients discover new investment opportunities. Extending PitchBook’s functionality with a News Trend Detection Service will help acquire new clients and increase the loyalty of existing ones.

Challenges

The volume of data is large: around 50,000 articles a day, or roughly 18 million a year, have to be processed. This required the right set of Machine Learning algorithms and a lot of processing power.

Solution

Using a clustering method and tuning its parameters, we created a model that groups news results by topic. As a topic becomes more popular, news on that topic is grouped, keywords are extracted, and a description for a new “Trend” is created from those keywords.
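The grouping and keyword steps can be illustrated with a minimal Python sketch. This is not the production model: it swaps in a simple greedy grouping over Jaccard similarity of word sets, and the stop-word list, the threshold, the function names, and the sample headlines are all assumptions made up for illustration.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real pipeline would use a much larger one.
STOP_WORDS = {"a", "an", "the", "of", "in", "for", "to", "and", "on", "with", "is", "as"}

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def jaccard(a, b):
    """Share of distinct words that two token sets have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def group_articles(headlines, threshold=0.15):
    """Greedy grouping: add each headline to the first group whose accumulated
    vocabulary is similar enough, otherwise start a new group."""
    groups = []  # each group: {"vocab": set of words, "items": list of headlines}
    for headline in headlines:
        tokens = set(tokenize(headline))
        for group in groups:
            if jaccard(tokens, group["vocab"]) >= threshold:
                group["items"].append(headline)
                group["vocab"] |= tokens
                break
        else:
            groups.append({"vocab": set(tokens), "items": [headline]})
    return groups

def top_keywords(items, k=2):
    """Most frequent non-stop-word tokens across a group's headlines."""
    counts = Counter(w for item in items for w in tokenize(item))
    return [word for word, _ in counts.most_common(k)]

# Invented sample headlines, not real PitchBook data.
headlines = [
    "Insect protein startup raises seed round",
    "Insect protein farms attract venture funding",
    "Blockchain real estate platform launches",
    "Real estate deals move to blockchain platform",
]
groups = group_articles(headlines)
trends = [top_keywords(g["items"]) for g in groups]
```

With these sample headlines the sketch forms two groups of two articles each, and the keywords of each group could seed the description of a candidate “Trend”.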

The next step is “Trend” qualification: with another set of Machine Learning algorithms, we determine whether the trend is decreasing, stable, or rising. This is how the research team can, for example, easily spot trends that became 1,000% more popular in the last month.
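To make the 1,000% figure concrete, here is a small Python sketch of how growth in article volume could be measured and labeled. The production service uses Machine Learning for this step; the fixed-threshold rule, the function names, and the 10% tolerance below are stand-in assumptions, not the actual method.

```python
def growth_percent(previous, current):
    """Percentage change in article volume between two periods."""
    if previous <= 0:
        raise ValueError("previous period must contain at least one article")
    return (current - previous) / previous * 100.0

def qualify_trend(previous, current, tolerance=10.0):
    """Label a trend by its growth. Changes within +/- tolerance percent
    are treated as stable (the tolerance is an arbitrary assumption)."""
    change = growth_percent(previous, current)
    if change > tolerance:
        return "rising"
    if change < -tolerance:
        return "decreasing"
    return "stable"

# A topic that goes from 20 to 220 articles a month has grown by 1,000%,
# which means it is now 11 times its earlier volume.
change = growth_percent(20, 220)
label = qualify_trend(20, 220)
```

Note that “1,000% more popular” means an elevenfold volume, not a tenfold one, because the growth is measured on top of the original count.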

Our first move was to create software for classifying news, grouping it into various topics, and deciding which items are relevant for further processing by human experts. At that time we had around 20 specialists on that task, and it was important to decrease their workload. But while exploring Machine Learning technology, we realized that it can recognize the context of news and classify it into much smaller groups.

We used Word2Vec, and later Doc2Vec, ML-based technologies for text vectorization. Both convert text into a mathematical vector that represents the essence of the text. Word2Vec leverages Machine Learning in the form of Neural Networks to describe human speech in multiple dimensions. After Word2Vec and Doc2Vec do the magic, a clustering Machine Learning algorithm groups the results.

Yuriy Batora, Team Lead
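The idea of turning text into a vector can be shown with a toy Python example. It does not use a trained model: the three-dimensional word vectors are invented, and averaging word vectors is only a rough stand-in for Doc2Vec, which learns document vectors directly. The sketch shows why texts on the same topic end up close together, which is the property a clustering algorithm relies on.

```python
import math

# Made-up 3-dimensional word vectors, for illustration only. A trained
# Word2Vec model would learn much longer vectors from a large corpus.
TOY_VECTORS = {
    "insect":     [0.9, 0.1, 0.0],
    "protein":    [0.8, 0.2, 0.1],
    "food":       [0.7, 0.3, 0.0],
    "blockchain": [0.0, 0.9, 0.2],
    "token":      [0.1, 0.8, 0.3],
    "estate":     [0.1, 0.4, 0.9],
}

def document_vector(text):
    """Average the vectors of the known words in a text."""
    vectors = [TOY_VECTORS[w] for w in text.lower().split() if w in TOY_VECTORS]
    if not vectors:
        raise ValueError("no known words in text")
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity: close to 1.0 when two vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

food_a = document_vector("insect protein")
food_b = document_vector("insect food")
chain = document_vector("blockchain token")

same_topic = cosine(food_a, food_b)   # high: both texts are about insect food
cross_topic = cosine(food_a, chain)   # low: the topics are unrelated
```

In a setup like the one described in the quote, a library such as gensim would supply the trained vectors and a clustering algorithm would then group documents whose vectors are close.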

Results

The News Trend Detection Service, named Emerging Spaces, was completed in six months. It was the first project on PitchBook that provided information for the client reactively.

Today, Emerging Spaces processes around 60,000 news topics a day, and the number is constantly increasing. The workload for manual data processing has decreased by 50%, which has improved the precision and speed of news analysis for human experts. Combined with another ML-based service that helps researchers contextualize news faster, it has dramatically increased the overall efficiency of analysts.

Trend Detection Solution

Since PitchBook clients are interested in discovering new trends in which to invest, SPD Group provided that functionality to the platform. The service that our team created became a popular feature among clients and now they can receive more holistic, precise, and valuable business insights. The DataDev team is continuing to support and expand the project, constantly receiving feedback from the users. This is what PitchBook clients think about Emerging Spaces:

I just noticed your new Emerging Spaces. I didn’t know this industry exists! It was interesting to look through. Blockchain Real Estate, who knew! It’s interesting because it flips the thought process into this is interesting because it will grow. You see the biggest deals in FinTech like oh yes they’ll be a lot of movement, but the insect foods there’s more opportunity. I’m going to look into investing in one of those companies now. This report says it’s going to become a $30B dollar industry in the next 10 years and right now it’s only $21M capital invested.

Director of Research

IN NUMBERS

60K

News topics are processed daily

50%

Manual data processing workload decrease