Publication
JCDL 2012
Conference paper
Transforming Japanese archives into accessible digital books
Abstract
Digitized physical books offer access to tremendous amounts of knowledge, even for people with print-related disabilities. Various projects and standardization activities are underway to make all of our past and present books accessible. However, digitizing books requires extensive human effort, such as correcting the results of OCR (optical character recognition) and adding structural information such as headings. Some Asian languages require extra effort to correct OCR errors because of their large and varied character sets. Japanese has used more than 10,000 characters, compared with a few hundred in English. This heavy workload is inhibiting the creation of accessible digital books. To facilitate digitization, we are developing a new system for processing physical books. We reduce and distribute the human effort and accelerate conversion by combining automatic inference and human capabilities. Our system preserves the original page images throughout the digitization process to support gradual refinement, and it distributes the work as micro-tasks. We conducted trials with the Japanese National Diet Library (NDL) to evaluate the effort required to digitize books with a variety of layouts and years of publication. The results showed that old Japanese books posed specific problems for correcting OCR errors and adding structure. Drawing on our results, we discuss further workload reductions and future directions for international digitization systems. © 2012 ACM.