- Industry: Finance, Investments, Technology
- Technologies: Data Crawling, Word2Vec (Doc2Vec), gensim, Clusterization, and Text Classification
- Partnership period: 6 months
- Team size: 3 experts
- Software products: Crawling and aggregating news for providing trend detection functionality
- Expertise delivered: Development from scratch, implementation, and support
The client is a leading financial data provider that covers the global venture capital, private equity, and public markets. We have been its business partner for over 13 years, starting with a small team and growing into a primary full-cycle technology provider. One of our teams had already been crawling financial news for this company. Over time, the volume of information increased from approximately 100 news sources to 2,000-3,000. It was impossible for the team to process such a large amount of information by relying only on human experts, so it was decided to automate this process. While working on a solution, the team saw that it had clear business value and could become an additional microservice for our partner.
The goal was to develop a new microservice with a visual interface on the website that could help clients find trends to invest in. Broadening the partner’s functionality with the News Trend Detection Service would help acquire new clients and increase the trust of existing ones.
The biggest challenge is the volume of information: around 50,000 articles a day must be processed, which is about 18 million a year. The right set of Machine Learning algorithms must be chosen and tuned to handle this volume properly. From a technical perspective, the team needed more powerful servers to analyze this amount of information, as Machine Learning algorithms require a lot of processing power.
The ML-oriented DataDev team took over this project. It consists of three top experts: Yuriy Batora as Team Lead and two Data Scientists, Oleksii Shashliuk and Denys Stupak. Initially, Yuriy Batora and a software developer took the initiative to find an ML solution for the large amount of data collected by our partner. Everyone understood that there was great potential in the automatic processing of unused information, and the first demo was created in three months. After the project was approved, the DataDev team was formed to handle the development.
“Our first move was to create software for classifying news, grouping it into various topics and deciding which items are relevant for further processing by human experts. At that moment we had around 20 specialists for that task, and it was important to decrease their workload. But while exploring Machine Learning technology, we realized that it can recognize the context of news and classify it into much smaller groups.”
– Yuriy Batora, Team Lead
By applying clustering and tuning its parameters, the team got the model to group news by topic. As a result, trend detection became possible: Machine Learning algorithms can group topics for a certain time span. As a topic becomes more popular, news on that topic is grouped, keywords are extracted, and a description for the new “trend” is created from those keywords.
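The keyword and description step can be illustrated with a minimal sketch. It assumes the articles of one topic have already been grouped, and it ranks keywords by how many headlines contain them. The stop-word list, the `describe_trend` function, and the sample headlines are hypothetical and are not taken from the Emerging Spaces code.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "in", "for", "and", "to", "on", "with", "is", "by"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def describe_trend(headlines, top_n=3):
    """Build a short trend description from the most widespread keywords."""
    counts = Counter()
    for headline in headlines:
        # Count each keyword once per headline (document frequency).
        counts.update(set(tokenize(headline)))
    # Sort by frequency (descending), then alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    keywords = [word for word, _ in ranked[:top_n]]
    return keywords, "Trend: " + ", ".join(keywords)


# Hypothetical headlines that a clustering step placed in the same group.
cluster = [
    "Startup raises funding for insect protein snacks",
    "Insect protein farms attract venture funding",
    "Retailers add insect protein bars",
]
keywords, description = describe_trend(cluster)
print(description)  # Trend: insect, protein, funding
```

Counting a word once per headline keeps a single long article from dominating the description of the group.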
The next step is to run a trend through another set of ML algorithms to determine whether it’s declining, constant, or rising. This is how the research team can easily spot trends that have become, for example, 1,000% more popular in the last month and pay attention to them as quickly as possible.
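To make the numbers concrete, here is a small sketch of how a trend's direction could be derived from article counts in two consecutive periods. It uses a plain threshold rule as a stand-in for the ML algorithms mentioned above, which the source does not describe. The function names and the 10% `tolerance` are assumptions chosen for illustration.

```python
def growth_percent(previous, current):
    """Percentage change in article count between two periods."""
    if previous <= 0:
        raise ValueError("previous count must be positive")
    return (current - previous) / previous * 100.0


def trend_direction(previous, current, tolerance=10.0):
    """Label a trend; `tolerance` (in percent) is an assumed threshold."""
    change = growth_percent(previous, current)
    if change > tolerance:
        return "rising"
    if change < -tolerance:
        return "declining"
    return "constant"


# A topic that went from 20 to 220 articles is 1,000% more popular,
# that is, eleven times its previous volume.
print(growth_percent(20, 220))    # 1000.0
print(trend_direction(20, 220))   # rising
print(trend_direction(100, 104))  # constant
print(trend_direction(100, 60))   # declining
```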
“We used Word2Vec, and later Doc2Vec, ML-based technologies for text vectorization. Both technologies convert text into a mathematical vector, which represents the essence of the text. Word2Vec leverages Machine Learning in the form of Neural Networks to describe human speech in multiple dimensions. After Word2Vec and Doc2Vec do the magic, a clustering Machine Learning algorithm groups the results.”
– Yuriy Batora, Team Lead
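The vectorize-then-cluster idea from the quote can be sketched with the standard library alone. This is not Doc2Vec: plain word-count vectors stand in for the learned dense vectors, and a greedy cosine-similarity grouping stands in for the clustering algorithm, which the source does not name. The function names, the `threshold` value, and the sample texts are assumptions.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Map a text to a sparse word-count vector (a stand-in for Doc2Vec)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[word] for word, count in u.items() if word in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cluster(texts, threshold=0.3):
    """Greedy grouping: a text joins the first cluster whose seed is similar enough."""
    clusters = []  # list of (seed_vector, member_indices)
    for index, text in enumerate(texts):
        vector = vectorize(text)
        for seed, members in clusters:
            if cosine(seed, vector) >= threshold:
                members.append(index)
                break
        else:
            # No existing cluster was close enough, so start a new one.
            clusters.append((vector, [index]))
    return [members for _, members in clusters]


news = [
    "blockchain real estate platform raises funding",
    "real estate startup adopts blockchain",
    "insect protein snacks reach supermarkets",
    "supermarkets stock insect protein bars",
]
groups = cluster(news)
print(groups)  # [[0, 1], [2, 3]]
```

Word counts only match texts that share exact words, whereas learned embeddings such as Doc2Vec can also place differently worded articles on the same subject close together.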
On the technical side, as mentioned before, Machine Learning requires computing power. The team started on PCs and then switched to existing servers, but it still took up to 24 hours to process a batch of news, which was too long. DataDev ended up requesting a GPU farm, a system based on video cards that can run complex algorithms quickly. This solution cut processing time from almost a day to two to three hours, creating more opportunities for timely adjustments and for adapting more algorithms.
The News Trend Detection Service, named Emerging Spaces, was completed in six months. It was the first project for our partner that provided information for the clients reactively. Here is what the user interface looks like:
Now, Emerging Spaces is processing around 60,000 news topics a day, and the number is constantly increasing. The workload for manual data processing has decreased by 50%, which has improved the precision and speed of news analysis for human experts. Combined with another ML-based service that helps researchers contextualize news faster, the overall efficiency of analysts increased dramatically.
Since the clients of a financial data provider are interested in discovering new trends in which to invest, we provided that functionality to the platform. The service that our team created became a popular feature among clients of the platform, who now receive more holistic, precise, and valuable business insights. The DataDev team continues to support and expand the project, constantly receiving feedback from users.
“I’m looking for investment opportunities and going thematically by the way you split up, it almost feels like you’ve tailored it to the stuff we care about.”
– VC Firm
“I just noticed your new Emerging Spaces. I didn’t know this industry exists! It was interesting to look through. Blockchain Real Estate, who knew! It’s interesting because it flips the thought process into this is interesting because it will grow. You see the biggest deals in fintech like oh yes they’ll be a lot of movement, but the insect foods there’s more opportunity. I’m going to look into investing in one of those companies now. This report says it’s going to become a $30B dollar industry in the next 10 years and right now it’s only $21M capital invested.”
– Director of Research