Learning the Parameters of Bayesian Networks from Uncertain Data
Abstract
The creation of Bayesian networks, a leading probabilistic modeling paradigm, often requires specifying a large number of parameters, making it highly desirable to learn these parameters from historical data. In many cases, however, much of the available data is unstructured. For example, in diagnosis networks, symptoms are usually described by a physician or technician in free text. Recent advances in unstructured-data analysis have made it possible to extract useful information from such sources. These techniques, however, are inherently uncertain; furthermore, several of them may need to be combined to extract the required information, further compounding the uncertainty. Because current learning algorithms cannot incorporate such uncertainty, common approaches either ignore it, thus reducing the resulting accuracy, or disregard the unstructured data entirely. We present an approach for learning Bayesian network parameters that explicitly incorporates the uncertainty of unstructured data. Our contributions include a generalization of the Expectation Maximization algorithm that enables it to handle any historical data containing likelihood evidence, and a methodology, built on this algorithm, that extends the structure of a Bayesian network to represent the uncertainty associated with an arbitrarily complex unstructured-analysis pipeline. We also provide extensive empirical validation of our approach, together with formal proofs of correctness and convergence for the extended algorithm.
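To make the notion of likelihood (soft) evidence concrete, the sketch below is a purely illustrative toy, not the algorithm presented in this paper: it runs EM-style estimation of a single binary variable's parameter when each observation is a likelihood vector produced by an uncertain extractor. The variable, the 0.85 reliability figure, and the helper em_bernoulli_soft are all hypothetical.

```python
# Illustrative sketch only (not the paper's algorithm): EM with likelihood
# (soft) evidence for a single binary variable X with parameter theta = P(X=1).
# Each observation is a pair (l(X=0), l(X=1)) giving the relative likelihood
# of the two states according to an uncertain extractor, not a hard value.

import numpy as np

def em_bernoulli_soft(evidence, n_iters=50, theta0=0.5):
    """Estimate theta = P(X=1) from soft evidence of shape (N, 2)."""
    theta = theta0
    for _ in range(n_iters):
        # E-step: posterior responsibility that X=1 in each instance,
        # combining the current prior with the instance's likelihood vector.
        num = theta * evidence[:, 1]
        den = num + (1.0 - theta) * evidence[:, 0]
        w = num / den
        # M-step: maximum-likelihood update from expected counts.
        theta = w.mean()
    return theta

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical setup: latent truth X=1 with probability 0.7, observed
    # only through an extractor that is correct 85% of the time.
    x = rng.random(5000) < 0.7
    report = rng.random(5000) < np.where(x, 0.85, 0.15)
    l1 = np.where(report, 0.85, 0.15)      # likelihood assigned to X=1
    ev = np.column_stack([1.0 - l1, l1])   # (l(X=0), l(X=1)) per instance
    print(round(em_bernoulli_soft(ev), 3))  # converges toward roughly 0.7
```

In this toy setting the estimate recovers the latent proportion despite every observation being uncertain; simply thresholding each extractor output to a hard value would instead bias the estimate toward the extractor's error profile.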