Database Characteristics
Characteristics:
- The database contains 285 manuscript pages, 6000 word images, and 8000 segmented character images.
- Manuscript page images are saved in their original colors in PNG format, whereas word images are saved in three different versions (gray-scale, binary, and thinned).
- Most of the manuscript collections deal with Islamic jurisprudence; the handwritten words cover most Arabic parts of speech, in addition to some city names and security terms.
Database statistics:
Frequency analysis shows that the letter distribution in IESK-arDB closely follows the letter distribution of the large digital corpus used in the Intellyze, which contains about 1,297,259 words or 5,122,132 letters. A normalized Chi-square test indicates that the letter frequencies in both sources follow nearly the same distribution, with a goodness-of-fit value of X = 0.98.
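A comparison of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used to obtain the value reported above: the function names are hypothetical, letters are taken to be the basic Arabic range U+0621 to U+064A without the tatweel, and "normalized" is read here as the Chi-square statistic divided by its degrees of freedom, which is just one possible interpretation.

```python
from collections import Counter

TATWEEL = "\u0640"  # elongation mark, not a letter


def letter_counts(words):
    """Count basic Arabic letters (U+0621..U+064A, tatweel excluded)."""
    return Counter(
        ch
        for word in words
        for ch in word
        if "\u0621" <= ch <= "\u064A" and ch != TATWEEL
    )


def chi_square_against_reference(db_counts, ref_counts):
    """Return (chi2, chi2 / dof) for database counts vs. a reference corpus.

    Expected counts are the reference proportions scaled to the database
    size. Dividing by the degrees of freedom is an ASSUMED normalization.
    """
    letters = [l for l, n in ref_counts.items() if n > 0]
    db_total = sum(db_counts.get(l, 0) for l in letters)
    ref_total = sum(ref_counts[l] for l in letters)
    dof = len(letters) - 1
    if db_total == 0 or dof < 1:
        raise ValueError("need a non-empty database and at least two letters")

    chi2 = 0.0
    for l in letters:
        expected = db_total * ref_counts[l] / ref_total
        chi2 += (db_counts.get(l, 0) - expected) ** 2 / expected
    return chi2, chi2 / dof


if __name__ == "__main__":
    # Toy word lists standing in for the database and the reference corpus.
    db = letter_counts(["باب", "بيت"])
    ref = letter_counts(["باب", "بيت", "كتاب"])
    print(chi_square_against_reference(db, ref))
```

A statistic near zero means the two letter distributions are close. With real data, the word lists would be replaced by the database ground truth and the reference corpus text.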
The letter frequencies in IESK-arDB compared to the letter frequencies in the large digital corpus.
The frequency distribution of Arabic letters in IESK-arDB, sorted in alphabetical order.