Unsupervised Threshold Autoencoder to Analyze and Understand Sentence Elements
Abstract
Analysis of legal and contract documents often requires both the discovery of document structure and the accurate identification of important elements such as party (buyer, supplier), nature (obligation, right) and category (warranties, delivery, etc.). Exploring novel features that improve both element classification accuracy and document structure discovery is therefore particularly important. In this paper, we develop and present novel unsupervised learning techniques to analyze a large-scale corpus of contract documents, with two goals: learning new features that enhance classification accuracy over the elements of interest, and extracting features that lead to meaningful clusterings over contract structures. In particular, we propose a novel t-threshold autoencoder neural network that flexibly controls the number of active neurons in response to sentences of different lengths at the network's input. Such an adaptive sparseness threshold enforces competition and specialization among encoding neurons and hence results in better feature learning. We also present an extension of the convolutional neural network classifier that allows for the incorporation of these augmented features, and show that higher classification accuracies over various classes of contract elements can be achieved. We further present a practical pipeline for deriving features from contract documents, along with a clustering solution based on the K-means algorithm that separates different types of sentences in the contract documents. We empirically demonstrate the performance of our techniques on a novel data corpus of Software Procurement contracts.
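
To make the idea of a length-dependent sparseness threshold concrete, the following is a minimal sketch only, not the network described in the paper. It assumes, purely for illustration, that the number of active encoding neurons is k = ceil(t * sentence_length), capped by the hidden layer size, and that all other activations are set to zero. The function names `active_count` and `threshold_activations`, the rule for k, and the example values are hypothetical.

```python
import math


def active_count(sentence_length, t, hidden_size):
    """Assumed rule (not from the paper): k grows linearly with sentence length."""
    k = math.ceil(t * sentence_length)
    # Keep at least one neuron active and never more than the layer holds.
    return max(1, min(hidden_size, k))


def threshold_activations(hidden, sentence_length, t):
    """Keep the k largest hidden activations and zero out the rest."""
    k = active_count(sentence_length, t, len(hidden))
    # Indices of the k largest activations (ties resolved by original order).
    order = sorted(range(len(hidden)), key=lambda i: hidden[i], reverse=True)
    keep = set(order[:k])
    return [h if i in keep else 0.0 for i, h in enumerate(hidden)]


# Hypothetical hidden-layer activations for one encoded sentence.
hidden = [0.9, 0.1, 0.4, 0.7, 0.2, 0.05]

# A short sentence keeps one neuron active; a longer one keeps three.
short = threshold_activations(hidden, sentence_length=4, t=0.25)
long_ = threshold_activations(hidden, sentence_length=12, t=0.25)
```

Under this assumed rule, longer sentences are allowed more active neurons, so the code stays sparse for short inputs while retaining capacity for long ones. The neurons still compete, since only the strongest activations survive the threshold.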