Sunday, October 27, 2019

Structure and Features of the Arabic Language

The Arabic language is a Semitic language with a complicated morphology, which differs significantly from the most widely spoken languages, such as English, Spanish, French, and Chinese. Arabic is an official language in over 22 countries. It is spoken as a first language in North Africa (Algeria, Egypt, Libya, Morocco, Tunisia, Sudan), the Arabian Peninsula (Bahrain, Kuwait, Oman, Qatar, Saudi Arabia, United Arab Emirates, Yemen), the Middle East (Iraq, Jordan, Lebanon, Syria), and other Arab countries (Mauritania, Comoros, Djibouti, Somalia). Since Arabic is the language of the Quran, the holy book of Islam, it is also spoken as a second language in several Asian countries, such as Indonesia, Pakistan, Iran, Uzbekistan, and Malaysia [52]. More than 422 million people are able to speak Arabic, which makes it the fifth most spoken language in the world, according to [53].

This chapter gives a brief description of the relevant basic elements of the Arabic language. It covers the structure of the language and the features of the Arabic writing system. It also presents the morphology of Arabic and the Arabic word classes, i.e. nouns, verbs, and particles. The challenges of the Arabic language are discussed in the last section of the chapter.

2.1. Arabic Language Structure

The Arabic language is classified into three forms: Classical Arabic (CA), Colloquial Arabic Dialects (CAD), and Modern Standard Arabic (MSA). CA is fully vowelized and includes classical historical liturgical texts and old literature. CAD consists predominantly of spoken vernaculars, and each Arab country has its own dialect. MSA is the official language and is used in news, media, and official documents [16]. Arabic is written from right to left. The Arabic alphabet consists of 28 letters, as shown in Table 2-1.

Table 2-1: The alphabet of the Arabic language

| No. | Alone Form | Transliteration | Initial Form | Medial Form | End Form |
|-----|------------|-----------------|--------------|-------------|----------|
| 1 | ا | a | ا | ـا | ـا |
| 2 | ب | b | بـ | ـبـ | ـب |
| 3 | ت | t | تـ | ـتـ | ـت |
| 4 | ث | th | ثـ | ـثـ | ـث |
| 5 | ج | j | جـ | ـجـ | ـج |
| 6 | ح | h | حـ | ـحـ | ـح |
| 7 | خ | kh | خـ | ـخـ | ـخ |
| 8 | د | d | د | ـد | ـد |
| 9 | ذ | th | ذ | ـذ | ـذ |
| 10 | ر | r | ر | ـر | ـر |
| 11 | ز | z | ز | ـز | ـز |
| 12 | س | s | سـ | ـسـ | ـس |
| 13 | ش | sh | شـ | ـشـ | ـش |
| 14 | ص | s | صـ | ـصـ | ـص |
| 15 | ض | tha | ضـ | ـضـ | ـض |
| 16 | ط | ta | طـ | ـطـ | ـط |
| 17 | ظ | tha | ظـ | ـظـ | ـظ |
| 18 | ع | aa | عـ | ـعـ | ـع |
| 19 | غ | gh | غـ | ـغـ | ـغ |
| 20 | ف | f | فـ | ـفـ | ـف |
| 21 | ق | q | قـ | ـقـ | ـق |
| 22 | ك | k | كـ | ـكـ | ـك |
| 23 | ل | l | لـ | ـلـ | ـل |
| 24 | م | m | مـ | ـمـ | ـم |
| 25 | ن | n | نـ | ـنـ | ـن |
| 26 | ه | h | هـ | ـهـ | ـه |
| 27 | و | w | و | ـو | ـو |
| 28 | ي | y | يـ | ـيـ | ـي |

The shape of a letter differs depending on its position within the word [24].
For example, the letter (ع) has the following shapes: (عـ) when it appears at the beginning of a word, as in عام, which means general; (ـعـ) when it appears in the middle of a word, as in يعرف, which means know; and (ـع) when it appears at the end of a word, as in يسمع, which means hear. Finally, the letter appears as (ع) when it comes at the end of a word but is disconnected from the letter before it, as in سرع, which means fast (see Figure 2-1).

Figure 2-1: The Formulation and Shape for the Same Letter

Thus, a three-letter word may start with a letter in initial form, followed by a letter in medial form and, finally, a letter in end form, such as [عمل] instead of [ع م ل]. The situation is more complicated, however, since a letter in the middle of a word may take the final or the initial form, as in [فرس]. This happens because some letters do not connect to the character that comes after them. These letters have only two forms: isolated (which is also used as initial) and final (also used as medial). They are (ا، د، ذ، ر، ز، و), for example: [وردة].

For the purpose of this thesis, we have defined our own transliteration scheme for the Arabic alphabet, which is presented in Table 2-1. Each Arabic letter in this scheme is mapped to one Latin transliteration. Throughout this thesis, every Arabic word is annotated as a triple so that it is clearer to a non-Arabic reader. The first element is the Arabic word itself, written in Arabic script between square brackets; the second is an English transliteration, written in italics; the third is an English translation, written between quotation marks. Figure 2-2 shows an example.
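The positional shapes described above can also be inspected programmatically. The following is a minimal sketch, not part of the original thesis, that relies on the Unicode "presentation form" characters and Python's standard `unicodedata` module; the helper name `positional_forms` is hypothetical. It shows that Ain has four shapes while a non-connecting letter such as Dal has only two, and that compatibility normalization (NFKC) folds every shape back to the single base letter.

```python
import unicodedata

POSITIONS = ["ISOLATED", "INITIAL", "MEDIAL", "FINAL"]

def positional_forms(letter_name: str) -> dict:
    """Return {position: character} for the presentation forms that exist.

    letter_name is the Unicode name fragment, e.g. "AIN" or "DAL".
    """
    forms = {}
    for position in POSITIONS:
        try:
            forms[position] = unicodedata.lookup(
                f"ARABIC LETTER {letter_name} {position} FORM"
            )
        except KeyError:
            # Non-connecting letters have no INITIAL or MEDIAL form.
            pass
    return forms

ain_forms = positional_forms("AIN")
dal_forms = positional_forms("DAL")

# NFKC normalization maps each presentation form to the base letter.
ain_base = {unicodedata.normalize("NFKC", ch) for ch in ain_forms.values()}

print(ain_forms)
print(sorted(dal_forms))
print(ain_base)
```

In practice, text is usually stored with the base letters only, and the rendering engine selects the shape; the presentation forms mostly appear in legacy or extracted text, which is why folding them with NFKC is a common first preprocessing step.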
Figure 2-2: An Example of Annotated Arabic Word

Three letters appear in different shapes, namely:

Hamza [ء]: This letter can be written on Alef [أ], below Alef [إ], on Waaw [ؤ], on Alef Maqsura [ئ], or isolated [ء].
Taa-Marbuta [ة]: This is a special form of the letter [ت]; it always appears at the end of the word.
Alef-Maqsura [ى]: This is a special form of the letter [ا]; it always appears at the end of the word.

These three letters pose some difficulties when building morphological systems. Many written Arabic texts and Arabic web sites omit the Hamza and the two dots above the Taa-Marbuta. For example, the Arabic word [مدرسة] (mdrst, school) may appear in many texts as [مدرسه] (mdrsh), which can mean either school or his teacher, without the two dots above the last letter. Comparing the last letter of the two words, it is [ة] in the first word and [ه] in the second.

Twenty-five of the Arabic letters represent consonants. The remaining three represent the weak letters, i.e. the long vowels of Arabic (vowels, for short). These letters are Alef [ا], Waaw [و], and Yaa [ي].

Moreover, the Arabic language uses diacritics, which are symbols placed above or below the letters to indicate a distinct pronunciation or grammatical form, and sometimes to give a different meaning to the whole word. Arabic diacritics include dama (ـُ), fathah (ـَ), kasra (ـِ), sukon (ـْ), double dama (ـٌ), double fathah (ـً), and double kasra (ـٍ) [54]. For instance, Table 2-2 presents different pronunciations of the letter Sad (ص).

Table 2-2: Different pronunciations of the letter Sad (ص)

| Written form | Pronunciation |
|--------------|---------------|
| صْ | /s/ |
| صٌ | /sun/ |
| صٍ | /sin/ |
| صً | /san/ |
| صِ | /si/ |
| صَ | /sa/ |
| صُ | /su/ |

In addition to the diacritics above, Arabic has one more special mark.
This mark is called the gemination mark (shaddah (شدة) or tashdeed). It is written above the letter (ـّ) to indicate that the consonant is doubled in pronunciation. Gemination occurs when the first consonant carries the null diacritic sukoon (ـْ) and the second consonant carries any other diacritic. For example, in the Arabic word (كسْسَر) (kssr, he smashed to pieces), the first syllable ends with (س) (s) and the next starts with (س) (s), so the two consonants are merged and the gemination mark indicates this union. The word is therefore written as (كسّر), although it has four letters {ك س س ر} [55].

The Arabic language has two genders, feminine (مؤنث) and masculine (مذكر); three numbers, singular (مفرد), dual (مثنى), and plural (جمع); and three grammatical cases, nominative (الرفع), accusative (النصب), and genitive (الجر). In general, Arabic words are categorized as particles (ادوات), nouns (اسماء), or verbs (افعال). Nouns in Arabic include adjectives (صفات) and adverbs (ظروف), and can be derived from other nouns, verbs, or particles. Nouns also cover proper nouns (such as the names of people, places, things, ideas, days, and months). A noun takes the nominative case when it is the subject (فاعل), the accusative when it is the object of a verb (مفعول), and the genitive when it is the object of a preposition (مجرور بحرف جر) [56].
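The orthographic variation described earlier in this section (omitted Hamza, Taa-Marbuta written without its dots, optional diacritics, and the gemination mark) is often handled by a normalization step before any morphological processing. The following is a minimal sketch of one common, lossy convention, not the method used in this thesis; the function name `normalize` and the specific choices (for example, unifying ة toward ه rather than the reverse) are assumptions made for illustration.

```python
import re

# Tanween and short-vowel marks U+064B..U+0650, plus sukun U+0652.
# Shadda (U+0651) is handled separately because it stands for a letter.
HARAKAT = re.compile("[\u064B-\u0650\u0652]")
SHADDA = "\u0651"
TATWEEL = "\u0640"

def normalize(text: str) -> str:
    """Apply a simple, lossy orthographic normalization to Arabic text."""
    text = text.replace(TATWEEL, "")           # drop the elongation stroke
    text = HARAKAT.sub("", text)               # drop vowel marks and sukun
    # After the vowel marks are gone, a shadda directly follows its base
    # letter, so the geminated consonant can be written out twice.
    text = re.sub("(.)" + SHADDA, r"\1\1", text)
    text = re.sub("[\u0622\u0623\u0625]", "\u0627", text)  # آ أ إ -> ا
    text = text.replace("\u0629", "\u0647")    # ة -> ه
    text = text.replace("\u0649", "\u064A")    # ى -> ي
    return text

school_with_dots = normalize("\u0645\u062F\u0631\u0633\u0629")     # مدرسة
school_without_dots = normalize("\u0645\u062F\u0631\u0633\u0647")  # مدرسه
smashed = normalize("\u0643\u0633\u0651\u0631")                     # كسّر

print(school_with_dots == school_without_dots)
print(len(smashed))
```

With this convention the two spellings of "school" compare equal, and the geminated word is expanded to its four underlying letters. The cost is that genuinely different words (such as "school" and "his teacher") also collapse to the same string, so the ambiguity has to be resolved later from context.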
Verbs in Arabic are divided into perfect (صيغة الفعل التام), imperfect (صيغة الفعل الناقص), and imperative (صيغة الامر). The Arabic particle category includes pronouns (الضمائر), adjectives (الصفات), adverbs (الاحوال), conjunctions (العطف), prepositions (حروف الجر), interjections (صيغة التعجب), and interrogatives (علامات الاستفهام) [57].

2.2. Arabic Morphology

Arabic is a highly sophisticated natural language with a very rich and complicated morphology. Morphology is the part of linguistics that deals with the internal structure and formation processes of words. A morpheme is often defined as the smallest meaningful unit of language, which cannot be broken down into smaller parts [58]. For example, the word apple consists of a single morpheme (the morpheme apple), while the word apples consists of two morphemes: the morpheme apple and the morpheme -s (which indicates the plural). Similarly, in Arabic, the word (سألهم, he asked them) consists of two morphemes: the verb (سأل, he asked) and the pronoun (هم, them). As these examples show, there are two types of morphemes: roots and affixes.
The root is the main morpheme of the word, supplying the main meaning, while affixes are added at the beginning, in the middle, or at the end of the root to create new words that carry additional meaning of various kinds. More generally, morphemes can be classified as (1) root morphemes and (2) affix morphemes; Figure 2-3 illustrates this classification.

Figure 2-3: Morpheme Classification

The root is the original morpheme of the word before any transformation process; it comprises the most important part of the word and cannot be reduced into smaller constituents. In other words, it is the primary unit of a word family that remains after all inflectional and derivational affixes are removed, and it can stand on its own as an independent word. Root morphemes are divided into two categories. The first category is called lexical morphemes, which covers the words in the language that carry the content of the message. Examples from English are book, compute, and write, while examples from Arabic are (قرأ, read), (لعب, play), and (كتب, write). The second category is called stop-word morphemes, which covers the function words in the language. Stop words include adverbs, prepositions, pronouns, and conjunctions. Examples from English are on, that, the, and above. Examples from Arabic are (في, in), (فوق, above), and (تحت, under).

Affix morphemes are also units of meaning; however, they cannot occur as words on their own and need to be attached to something, such as a root morpheme. There are three types of affixes in Arabic: prefixes, infixes, and suffixes. In some cases, all of these affixes can be found in one word, as in the word [والمحاربون] (and the warriors). This word has ten letters: three of them are root letters, while the others are affixes.
The root of this word is [حرب] (war). The example in Figure 2-4 shows the differences between the three main terms used in computational linguistics: roots, stems, and affixes.

Figure 2-4: The Decomposition of the Word [والمحاربون].

2.3. Arabic Language Challenges

Arabic is a challenging language in comparison with other languages such as English, for a number of reasons.

In English, prefixes and suffixes are added to the beginning or end of the root to create new words. In Arabic, in addition to prefixes and suffixes, there are infixes that can be added inside the word to create new words with related meanings. For example, in English, the word write is the root of the word writer. In Arabic, the word writer (كاتب) is derived from the root write (كتب) by adding the letter Alef (ا) inside the root. In such cases, it is difficult to distinguish between infix letters and root letters.

The Arabic language has a rich and complex morphology in comparison with English. Its richness is attributed to the fact that one root can generate several hundred words with different meanings. Table 2-3 presents different morphological forms of the root study (درس).

Table 2-3: Different morphological forms of the word study (درس)

| Word | Tense | Number | Meaning | Gender |
|------|-------|--------|---------|--------|
| درس | Past | Single | He studied | Masculine |
| درست | Past | Single | She studied | Feminine |
| يدرس | Present | Single | He studies | Masculine |
| تدرس | Present | Single | She studies | Feminine |
| درسا | Past | Dual | They studied | Masculine |
| درستا | Past | Dual | They studied | Feminine |
| يدرسان | Present | Dual | They study | Masculine |
| تدرسان | Present | Dual | They study | Feminine |
| يدرسا | Present | Dual | They study | Masculine |
| تدرسا | Present | Dual | They study | Feminine |
| درسوا | Past | Plural | They studied | Masculine |
| درسن | Past | Plural | They studied | Feminine |
| تدرسن | Present | Plural | They study | Feminine |
| سيدرس | Future | Single | He will study | Masculine |
| ستدرس | Future | Single | She will study | Feminine |
| سيدرسا | Future | Dual | They will study | Masculine |
| ستدرسا | Future | Dual | They will study | Feminine |
| سيدرسون | Future | Plural | They will study | Masculine |
| ستدرسون | Future | Plural | They will study | Feminine |

Some Arabic words have different meanings depending on the context in which they appear. In particular, when diacritics are not used, the proper meaning of an Arabic word can only be determined from the context. For instance, the word (علم) could be Science (علْم), Teach (عَلَّمْ), or Flag (عَلَمْ), depending on the diacritics [46].
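The ambiguity caused by missing diacritics can be made concrete with a few lines of code. The sketch below is an illustration only: it writes out fully vocalized spellings for the three readings (these vocalizations and the helper name `skeleton` are assumptions, not taken from the thesis), removes the diacritics, and shows that all three collapse onto the same undiacritized string.

```python
import unicodedata
from collections import defaultdict

def skeleton(word: str) -> str:
    """Drop combining marks; Arabic diacritics have a non-zero combining class."""
    return "".join(ch for ch in word if unicodedata.combining(ch) == 0)

# Fully vocalized readings of the same three-letter word, with glosses.
readings = {
    "\u0639\u0650\u0644\u0652\u0645": "science",       # ain+kasra, lam+sukun, meem
    "\u0639\u064E\u0644\u0651\u064E\u0645": "teach",   # ain+fatha, lam+shadda+fatha, meem
    "\u0639\u064E\u0644\u064E\u0645": "flag",          # ain+fatha, lam+fatha, meem
}

by_skeleton = defaultdict(list)
for word, gloss in readings.items():
    by_skeleton[skeleton(word)].append(gloss)

for bare, glosses in by_skeleton.items():
    print(bare, glosses)
```

A classifier or stemmer that sees only the bare form therefore cannot tell these readings apart without looking at the surrounding words.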
Unfortunately, Arabic writers usually do not write the gemination mark explicitly. They rely on their knowledge of the language to supply the missing letter and write the words without it. As a consequence, the morphological processing of such words is not an easy task [55].

Another challenge of automatic Arabic text processing is that proper nouns in Arabic do not start with a capital letter as in English, and Arabic letters do not have lower and upper case, which makes identifying proper names, acronyms, and abbreviations difficult.

In English, a word is a single entity. It may be a noun, a verb, a preposition, an article, etc. In Arabic, by contrast, a single word can be a complete sentence. For example, Table 2-4 shows some single Arabic words and their equivalent English translations.

Table 2-4: Example: An Arabic Word could be a Complete English Sentence

| Arabic Word | English Sentence |
|-------------|------------------|
| ذهبت | She went |
| سأقرأه | I will read it |
| سمعناه | We heard him |
| اخبرني | He told me |
| فغادر | Then he departed |

There are several free benchmark English datasets for document categorization, such as 20 Newsgroups, which contains around 20,000 documents distributed almost evenly across 20 classes; Reuters-21578, which contains 21,578 documents belonging to 17 classes; and RCV1 (Reuters Corpus Volume 1), which contains 806,791 documents classified into four main classes. Unfortunately, there is no free benchmark dataset for Arabic document classification.

In the Arabic language, synonyms and broken plural forms are widespread problems.
Examples of synonyms in Arabic are (تقدم, تعال, أقبل, هلم), which all mean (come), and (منزل, دار, بيت, سكن), which all mean (house). The problem of broken plurals occurs when an irregular noun takes a plural form that differs morphologically from its singular form. For example, the word (Doctors, اطباء) is a broken plural of the masculine singular (Doctor, طبيب).

In the Arabic language, one word may have more than one lexical category (noun, verb, adjective, etc.) in different contexts, such as (wellspring, عين الماء), (Eye, عين الانسان), and (was appointed, عين مديرا للشركه).

In addition to the different forms of an Arabic word that result from the derivational process, some words lack authentic Arabic roots, such as Arabized words borrowed from other languages, for example (programs, برامج), (geography, جغرافية), and (internet, الإنترنت), or the names of places such as countries (البلدان), cities (المدن), rivers (الانهار), mountains (الجبال), and deserts (الصحارى).
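Several of the challenges in this section can be seen with a naive "light stemmer" that only strips a few common prefixes and suffixes. The sketch below is a deliberately simplified illustration, not the preprocessing used in this thesis; the affix lists, the `min_len` threshold, and the name `light_stem` are all assumptions. It recovers a stem from [والمحاربون], but it cannot remove the infix Alef, and it cannot relate a broken plural to its singular.

```python
# Candidate prefixes are tried in order, so longer ones come first.
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "و"]
SUFFIXES = ["ون", "ين", "ات", "ان", "ة"]

def light_stem(word: str, min_len: int = 3) -> str:
    """Strip at most one prefix and one suffix, keeping at least min_len letters."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[: -len(suffix)]
            break
    return word

warriors = light_stem("والمحاربون")   # prefixes و + ال and suffix ون removed
writer = light_stem("كاتب")           # infix Alef is untouched
doctors = light_stem("اطباء")         # broken plural is untouched
doctor = light_stem("طبيب")

print(warriors, writer, doctors, doctor)
```

The stem returned for the first word still contains the infix Alef, so reaching the three-letter root [حرب] requires pattern-based analysis rather than affix stripping, and the broken plural and its singular remain two unrelated strings.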
2.4. Summary

Arabic is an international language belonging to the Semitic language family (which differs from the Indo-European languages in some respects). The Arabic alphabet consists of twenty-eight letters in addition to some variants of existing letters. Each letter can appear in up to four different shapes, depending on its position in the word. Twenty-five of the Arabic letters represent consonants. The remaining three letters represent the long vowels of Arabic. The Arabic writing system runs from right to left, and most letters in Arabic words are joined together.

Arabic has a rich and complex morphology. In many cases, one orthographic word comprises several semantic and syntactic words. Traditionally, there are two types of morphemes in Arabic: root morphemes and affix morphemes. Root morphemes are divided into two categories. The first category is called lexical morphemes, which covers the words in the language that carry the content of the message. The second category is called stop-word morphemes, which covers function words such as adverbs, prepositions, pronouns, and conjunctions. Affix morphemes cannot occur as words on their own; they need to be attached to something, such as a root morpheme. There are three types of affixes in Arabic: prefixes, infixes, and suffixes.

All Arabic words can be classified into three main categories according to their part of speech: noun, verb, and particle. Nouns and verbs in Arabic may be further divided according to number (singular, dual, and plural) and case (nominative, genitive, and accusative). The Arabic language is a challenging language in comparison with other languages and has a complicated morphological structure. Therefore, the Arabic language needs a set of preprocessing routines to be suitable for cl
