A machine-learning approach to discovering company home pages
Abstract
For many marketing and business applications, it is necessary to know the home page of a company specified only by its company name. If we require the home page for a small number of big companies, this task is readily accomplished using Internet search engines or domain registration lists. However, if the entities of interest are small companies, these approaches can lead to mismatches, particularly if a specified company lacks a home page. We address this problem with a supervised machine-learning approach in which we train a binary classification model. For each company name, we classify potential website matches based on a set of explanatory features extracted from the content of each candidate website. Our approach is related to web-based business intelligence in two ways: (1) we build the training set for our learning algorithms through crowdsourcing tools and illustrate their potential for business research, and (2) the success of our model allows one to easily use corporate home pages as data inputs into other research projects. With a crowdsourced training set, our approach identifies a correct home page, or recognizes that a valid home page does not exist, with an accuracy that is 57% better than simply taking the highest-ranked search engine result as the correct match. © 2010 IEEE.
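The classification step described above can be illustrated with a minimal sketch. The abstract does not list the features, the learning algorithm, or the decision rule, so everything here is an assumption made for illustration: the three features (name overlap with the page title, name token in the host name, inverse search rank), the plain logistic regression, the 0.5 threshold, and the toy labels that stand in for crowdsourced judgments. The function names `extract_features`, `train`, `predict_proba`, and `pick_home_page` are hypothetical.

```python
import math

def extract_features(company, url, title, rank):
    """Hypothetical features for one (company, candidate page) pair."""
    tokens = company.lower().split()
    n = max(len(tokens), 1)
    # Share of company-name tokens that appear in the page title.
    overlap = sum(t in title.lower() for t in tokens) / n
    # Whether any name token appears in the host part of the URL.
    host = url.lower().split("//")[-1].split("/")[0]
    in_domain = 1.0 if any(t in host for t in tokens) else 0.0
    # Inverse search-engine rank (rank 1 is the top result).
    return [overlap, in_domain, 1.0 / rank]

def predict_proba(w, x):
    """Probability that a candidate is the home page; w[-1] is the bias."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + w[-1]
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def train(X, y, lr=0.5, epochs=2000):
    """Logistic regression fitted by stochastic gradient descent."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = predict_proba(w, xi) - yi
            for j, xj in enumerate(xi):
                w[j] -= lr * err * xj
            w[-1] -= lr * err
    return w

def pick_home_page(w, candidates, threshold=0.5):
    """Return the best-scoring URL, or None if no candidate is convincing.

    candidates: list of (url, feature_vector) pairs for one company.
    Returning None models the "company has no home page" outcome.
    """
    if not candidates:
        return None
    url, x = max(candidates, key=lambda c: predict_proba(w, c[1]))
    return url if predict_proba(w, x) >= threshold else None

# Toy training set: 1 = labeled as the home page, 0 = not (invented data).
X = [[1.0, 1.0, 1.0], [1.0, 1.0, 0.5], [0.5, 1.0, 0.33],
     [0.5, 0.0, 1.0], [0.0, 0.0, 0.5], [0.0, 0.0, 1.0], [0.5, 0.0, 0.25]]
y = [1, 1, 1, 0, 0, 0, 0]
w = train(X, y)

# Two candidate search results for a made-up company name.
company = "Acme Plumbing"
directory = ("http://directory.example/acme",
             extract_features(company, "http://directory.example/acme",
                              "Business directory", 1))
own_site = ("http://www.acmeplumbing.example/",
            extract_features(company, "http://www.acmeplumbing.example/",
                             "Acme Plumbing - Home", 2))
print(pick_home_page(w, [directory, own_site]))
print(pick_home_page(w, [directory]))
```

In this sketch the top-ranked result is a directory listing, so a rule that always takes the first search result would return the wrong page, while the classifier can prefer the second result or return no match at all. Thresholding the best score is only one possible way to express "no valid home page exists".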