Fachgebiet Neuro-Informationstechnik

Database Characteristics

Characteristics:

Figure: handwritten document
  • The database contains 285 manuscript pages, 6000 word images and 8000 segmented character images.
  • Manuscript page images are saved in their original colors in PNG format, whereas word images are saved in three different versions (gray-scale, binary, and thinned).
  • The theme of most manuscript collections is Islamic jurisprudence; the handwritten words cover most Arabic parts of speech, in addition to some city names and security terms.
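
The three word-image versions can be illustrated with a small sketch. It is not the procedure used to build the database, which this page does not describe. It assumes a fixed threshold of 128 for binarization and the Zhang-Suen algorithm for thinning; the function names and the toy image are hypothetical.

```python
def to_gray(rgb):
    """Convert rows of (r, g, b) tuples to 8-bit gray levels (luma weights)."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
            for row in rgb]


def to_binary(gray, threshold=128):
    """Mark dark pixels (ink) as 1 and light pixels (paper) as 0."""
    return [[1 if value < threshold else 0 for value in row] for row in gray]


def thin(binary):
    """Zhang-Suen thinning; assumes a one-pixel background margin."""
    img = [row[:] for row in binary]
    height, width = len(img), len(img[0])
    changed = True
    while changed:
        changed = False
        for step in (0, 1):
            to_clear = []
            for y in range(1, height - 1):
                for x in range(1, width - 1):
                    if img[y][x] != 1:
                        continue
                    # Neighbours clockwise, starting from north.
                    p = [img[y - 1][x], img[y - 1][x + 1], img[y][x + 1],
                         img[y + 1][x + 1], img[y + 1][x], img[y + 1][x - 1],
                         img[y][x - 1], img[y - 1][x - 1]]
                    ink_neighbours = sum(p)
                    # Number of 0 -> 1 transitions around the pixel.
                    transitions = sum(p[i] == 0 and p[(i + 1) % 8] == 1
                                      for i in range(8))
                    if not (2 <= ink_neighbours <= 6 and transitions == 1):
                        continue
                    if step == 0:
                        ok = p[0] * p[2] * p[4] == 0 and p[2] * p[4] * p[6] == 0
                    else:
                        ok = p[0] * p[2] * p[6] == 0 and p[0] * p[4] * p[6] == 0
                    if ok:
                        to_clear.append((y, x))
            # Clear all marked pixels at once, as the algorithm requires.
            for y, x in to_clear:
                img[y][x] = 0
            changed = changed or bool(to_clear)
    return img


# Toy "word image": a dark bar, three pixels thick, on light paper.
INK, PAPER = (20, 20, 20), (240, 240, 240)
rgb = [[INK if 2 <= y <= 4 and 2 <= x <= 9 else PAPER for x in range(12)]
       for y in range(7)]
gray = to_gray(rgb)
binary = to_binary(gray)
thinned = thin(binary)
for row in thinned:
    print("".join("#" if value else "." for value in row))
```

On real scans, an adaptive threshold would probably be more robust than the fixed value assumed here.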

 Database statistics:

 Frequency analysis shows that the letter distribution in IESK-arDB follows almost the same frequency pattern as the letter distribution of the large digital corpus used in Intellyze, which contains about 1,297,259 words or 5,122,132 letters. A normalized chi-square test indicates that the letter frequencies in both sources follow nearly the same distribution, with a goodness-of-fit value of X = 0.98.
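
A comparison of this kind can be sketched as follows. The page does not say how the test was normalized, so the sketch assumes one plausible reading: a Pearson chi-square statistic computed on relative letter frequencies rather than raw counts. The function names and the toy word lists are hypothetical, and the sketch does not reproduce the value reported above.

```python
from collections import Counter


def letter_frequencies(words, alphabet):
    """Relative frequency of each letter of `alphabet` in a list of words."""
    counts = Counter(ch for word in words for ch in word if ch in alphabet)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no letters from the alphabet found")
    return {ch: counts[ch] / total for ch in alphabet}


def chi_square_statistic(observed, expected):
    """Pearson chi-square statistic on relative frequencies.

    Letters with zero expected frequency are skipped to avoid division by zero.
    """
    return sum((observed[ch] - expected[ch]) ** 2 / expected[ch]
               for ch in expected if expected[ch] > 0)


# Toy data: three Arabic letters and a few short words (illustrative only).
alphabet = "ابت"
database_words = ["باب", "تاب"]
corpus_words = ["باب", "تاب", "بات"]

observed = letter_frequencies(database_words, alphabet)
expected = letter_frequencies(corpus_words, alphabet)
statistic = chi_square_statistic(observed, expected)
print(f"chi-square statistic: {statistic:.4f}")
```

A statistic of zero means the two relative frequency profiles are identical; larger values indicate a larger difference between the database and the reference corpus.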

Figure: Letter frequencies in IESK-arDB compared with the letter frequencies in the large digital corpus.
Figure: The frequency distribution of Arabic letters in IESK-arDB, sorted in alphabetical order.

Last Modification: 17.01.2024 - Contact Person: Webmaster