Fachgebiet Neuro-Informationstechnik

IESK-arDB: A database for off-line Arabic handwriting

Overview

The IESK-arDB is an off-line Arabic handwriting database. It contains 285 pages of 14th-century historical manuscripts, more than 6000 handwritten word images, and 8000 segmented character images. The word vocabulary covers most Arabic parts of speech, such as nouns and verbs, as well as country and city names, security terms, and words used for writing bank amounts.

 

Data Acquisition:

Manuscript page images were collected from multiple Islamic works that are thought to have been written in the 14th century. The main sources are the book of Al-FRO, written by IBN MUFLIH, and the book of FAWAID FIGHIYAH (author unknown). The handwritten word samples were collected from 22 writers from different Arabic countries and from other countries where the Arabic script is the writing medium. Writers were asked to follow the Naskh style as closely as they could, for two reasons. First, Naskh is the most commonly used writing style. Second, compared to other writing styles, Naskh emphasizes most of the letters' structural peculiarities.

 

Ground Truthing:

Manuscript page images are ground-truthed by creating a UTF-8 text file for each page image. Each line in the text file corresponds exactly to a line in the page image. For better readability, we advise setting the font to Segoe UI. Each word is fully described by a ground truth XML file that contains segmentation information along with other important entries.

    
Figure: Sample of a historical Arabic text page.
Figure: Samples of handwritten Arabic words.
Figure: Word segmentation ground truth.
Figure: Visualisation of ground truth for synthesized samples.
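The line-by-line correspondence between a page image and its UTF-8 text file can be used as sketched below. This is only an illustration, not part of the database tooling: the function names and the file path are hypothetical, and the layout of the word-level XML files is not described here, so it is left out.

```python
from pathlib import Path


def load_page_ground_truth(text_path):
    """Return the transcription lines of one page, in page order.

    Assumes one UTF-8 text file per page image, where text line i
    transcribes line i of the page image.
    """
    # "utf-8-sig" also accepts files that start with a byte order mark.
    raw = Path(text_path).read_text(encoding="utf-8-sig")
    # Blank lines are kept so that line positions stay aligned with the page.
    return [line.rstrip() for line in raw.splitlines()]


def pair_with_line_numbers(lines):
    """Attach 1-based line numbers so each entry can be matched to a page line."""
    return list(enumerate(lines, start=1))
```

For example, `pair_with_line_numbers(load_page_ground_truth("page_001.txt"))` would give a list of `(line_number, text)` pairs for a hypothetical page file `page_001.txt`.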

 

Database Characteristics

Characteristics:

  • The database contains 285 manuscript pages, 6000 word images and 8000 segmented character images.
  • Manuscript page images are saved in their original colors in PNG format, whereas word images are saved in three different versions (gray-scale, binary, and thinned).
  • The theme of most manuscript collections is Islamic jurisprudence; the handwritten words cover most Arabic parts of speech, in addition to some city names and security terms.

 Database statistics:

 Frequency analysis shows that the letter distribution in IESK-arDB follows almost the same frequency pattern as the large digital corpus used in Intellyze, which contains about 1,297,259 words or 5,122,132 letters. A normalized chi-square test shows that the letter frequencies in both sources follow nearly the same distribution, with a goodness-of-fit value of X=0.98.

Figure: Letter frequency in IESK-arDB compared to letter frequency in a large digital corpus.
Figure: Frequency distribution of Arabic letters in IESK-arDB, sorted in alphabetical order.
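A comparison of this kind can be sketched as follows. The exact normalization behind the reported value is not specified here, so the code uses a plain Pearson chi-square statistic on relative letter frequencies as one possible form; it is not meant to reproduce the figure above. The function names and the Unicode range used to select Arabic letters are assumptions made for illustration.

```python
from collections import Counter


def letter_frequencies(words):
    """Relative frequency of each Arabic letter across a list of words."""
    counts = Counter(
        ch
        for word in words
        for ch in word
        # Assumed letter range U+0621..U+064A, skipping tatweel (U+0640).
        if "\u0621" <= ch <= "\u064A" and ch != "\u0640"
    )
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()} if total else {}


def chi_square_statistic(observed, expected):
    """Pearson chi-square statistic between two relative-frequency tables.

    Only letters with a non-zero expected frequency contribute, because
    the term divides by the expected value.
    """
    return sum(
        (observed.get(ch, 0.0) - p) ** 2 / p
        for ch, p in expected.items()
        if p > 0
    )
```

Here `observed` would hold the letter frequencies of the database vocabulary and `expected` those of the reference corpus; a statistic close to zero indicates that the two frequency patterns are similar.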
Registration and Download

Registration:

Please send an e-mail with your name and affiliation to

 

Download:

Sample (no registration required) 

IESK-arDB

 

If you use this database in your research, please cite the following paper: 

[1] M. Elzobi, A. Al-Hamadi, Z. A. Aghbari, and L. Dinges, “IESK-ArDB: a database for handwritten Arabic and an optimized topological segmentation approach,” International Journal on Document Analysis and Recognition (IJDAR), vol. 16, no. 3, pp. 295–308, 2013.

[2] L. Dinges, A. Al-Hamadi, M. Elzobi, and S. El-etriby, “Synthesis of Common Arabic Handwritings to Aid Optical Character Recognition Research,” Sensors, vol. 16, no. 3, p. 346, 2016.

Last Modification: 17.01.2024 - Contact Person: Webmaster