Field of Focus III - Funded Projects
Development of a GroundTruth for the early Chinese press
Project leader:
Matthias Arnold
This project laid the foundation for the automated processing of materials in non-Latin scripts. Taking the Chinese newspaper 晶報 Jingbao (The Crystal, Shanghai, 1919–1940) on “Early Chinese Periodicals Online” (ECPO) as an example, the project created two sets of GroundTruth data: 1) Two months of the newspaper, fully segmented (bounding boxes and labels). The data is available in web annotation XML format. 2) One complete month of full text, produced using the blind double-keying method. The data is available in XML format on GitHub. To make use of the GroundTruth data, an infrastructure developed by READ (H2020) at the EPFL was set up on an external testing server, and a prototype for model training was started. In addition, the ECPO database was expanded with a searchable full-text field, so that new data can gradually be made available to end users.
Contact:
arnold@hcts.uni-heidelberg.de