Publication
JCDL 2012
Conference paper

Transforming Japanese archives into accessible digital books

View publication

Abstract

Digitized physical books offer access to tremendous amounts of knowledge, even for people with print-related disabilities. Various projects and standard activities are underway to make all of our past and present books accessible. However digitizing books requires extensive human efforts such as correcting the results of OCR (optical character recognition) and adding structural information such as headings. Some Asian languages need extra efforts for the OCR errors because of their many and varied character sets. Japanese has used more than 10,000 characters compared with a few hundred in English. This heavy workload is inhibiting the creation of accessible digital books. To facilitate digitization, we are developing a new system for processing physical books. We reduce and disperse the human efforts and accelerate conversions by combining automatic inference and human capabilities. Our system preserves the original page images for the entire digitization process to support gradual refinement and distributes the work as micro-tasks. We conducted trials with the Japanese National Diet Library (NDL) to evaluate the required effort for digitizing books with a variety of layouts and years of publication. The results showed old Japanese books had specific problems when correcting the OCR errors and adding structures. Drawing on our results, we discuss further workload reductions and future directions for international digitization systems. © 2012 ACM.

Date

Publication

JCDL 2012

Share