Automatic categorisation of businesses using machine learning
Objective
In this project, a machine learning model is developed that automatically categorizes companies as innovative or not, based on the text scraped from their websites. The method can provide a statistic that complements current innovation statistics, which are traditionally produced from the Community Innovation Survey. Such a statistic can be produced more frequently and can cover the entire set of companies in Belgium rather than only a relatively small sample.
The intention of this project is not to develop a new statistic for the entire population of companies. Instead, it aims to demonstrate the feasibility of the approach on a sample of Belgian companies. If successful, the approach will be scaled up to the full set of companies in Belgium.
This project was carried out as part of an internship and master’s thesis. A detailed description of the study is available as a PDF (in Dutch).
Data
The study uses the sample of Flemish companies included in the Community Innovation Survey (CIS) of 2019. The following data about each company is used: company name, URL (if known) and inno5 label. The inno5 label indicates whether a company is considered innovative or not.
Furthermore, the study collects the visible text from the websites of companies included in the CIS by means of web scraping.
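To illustrate what "visible text" can mean in practice, the following minimal sketch extracts the text a visitor would see from an HTML page using only the Python standard library. It is not the scraper used in the study: the class name VisibleTextParser, the function visible_text and the set of hidden tags are assumptions made for this example, and downloading the pages is left out.

```python
from html.parser import HTMLParser

# Tags whose content is not rendered as visible page text. This set is an
# assumption for the sketch; the study may have used a different rule.
HIDDEN_TAGS = {"script", "style", "noscript", "template", "head"}


class VisibleTextParser(HTMLParser):
    """Collects text nodes that are not nested inside a hidden tag."""

    def __init__(self):
        super().__init__()
        self.hidden_depth = 0  # > 0 while inside any hidden tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in HIDDEN_TAGS:
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in HIDDEN_TAGS and self.hidden_depth > 0:
            self.hidden_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.hidden_depth == 0:
            self.chunks.append(text)


def visible_text(html):
    """Return the visible text of an HTML document as one string."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Made-up example page, not taken from the study's data.
page = (
    "<html><head><title>Acme</title><style>p{color:red}</style></head>"
    "<body><h1>Acme NV</h1><script>var x = 1;</script>"
    "<p>We build &amp; test robots.</p></body></html>"
)
text = visible_text(page)
```

Here the page title, style sheet and script are skipped, and only the heading and paragraph text are kept. A real scraper would also need to handle text hidden through CSS or loaded by JavaScript, which this sketch ignores.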
Methods
The following methods are used in this project.
- Web scraping to collect the visible texts of the company websites.
- Natural Language Processing to clean up the scraped texts.
- Machine Learning to train a model that categorizes companies as innovative or non-innovative based on the text found on their website.
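The last two steps can be sketched as follows, assuming the website text has already been collected. The source does not state which cleaning steps or which learning algorithm were used, so everything here is a stand-in: the stop-word list is a tiny hypothetical one, the classifier is a plain multinomial naive Bayes written from scratch, and the four training texts are invented for illustration rather than taken from the CIS sample.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (hypothetical); a real one would be larger
# and language-specific, for example Dutch for Flemish websites.
STOP_WORDS = {"the", "and", "a", "of", "we", "our", "in", "for", "to"}


def clean(text):
    """Lower-case the text, keep alphabetic tokens, drop stop words."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.word_counts = {}              # label -> Counter of token counts
        self.doc_counts = Counter(labels)  # label -> number of documents
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = clean(text)
            self.word_counts.setdefault(label, Counter()).update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        # Tokens never seen during training carry no information here.
        tokens = [t for t in clean(text) if t in self.vocab]
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for t in tokens:
                score += math.log((counts[t] + 1) / denom)  # log likelihood
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy data: website text paired with an inno5-style label.
texts = [
    "We develop new patented sensor technology in our research lab",
    "Our research team launches a new software platform",
    "Family bakery selling fresh bread and pastries",
    "Local plumber for repairs and maintenance",
]
labels = ["innovative", "innovative", "non-innovative", "non-innovative"]

model = NaiveBayes().fit(texts, labels)
prediction = model.predict("New research on sensor software")
```

With real data, the texts would be the cleaned website texts and the labels the inno5 values from the survey, and the model would be evaluated on companies held out from training before drawing any conclusion about feasibility.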
Results
The results, described in detail in the PDF report (in Dutch), show that it is feasible to automatically categorize companies as innovative or non-innovative based on the text found on company websites.
Based on this positive result, a follow-up project will be started to generalize this method to the entire population of Flemish companies.