Najdi Arabic Language Data

Najdi is one of the three main Arabic dialects spoken in Saudi Arabia, along with Hejazi and Gulf Arabic. Of its three main sub-dialects (Northern, Central and Southern Najdi), the Central variant, spoken in Riyadh, is the most widely used.

As with Modern Standard Arabic, the main orthographic convention for Najdi is to omit tashkil (the diacritics that indicate short vowels) and to write only the consonant pointing (i’jam); long vowels are written with letters.

Volume of Language Data

Total number of forms

The total number of forms is approximately 18,000,000.

The following is a breakdown of the approximate number of forms in the Najdi Arabic Language Data:

  • Non-inflectional morphology:
      • About 1,000 forms for determiners, pronouns, prepositions…
      • Note: this includes forms constructed using the pronominal suffixes, but does not include the definite article or any other prefixes.
  • Inflectional and derivational morphology:
      • Verbs: 1,000,000 forms
      • Nouns: 750,000 forms
      • Adjectives: 250,000 forms

Total number of lemmas

20,000 lemmas

Features

Each form is annotated with its corresponding lemma, POS, and morphological attributes: voice, tense, mood, number, person, gender, case, state, possessive-number, possessive-person and possessive-gender.
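
As an illustration only, one annotated form could be modelled as a record with one field per attribute listed above. The class name, field names and example values are assumptions made for this sketch; the actual delivery format and tag values of the data are not specified here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FormRecord:
    """Hypothetical container for one annotated form (not the vendor's schema)."""
    form: str
    lemma: str
    pos: str
    # Morphological attributes; None means the attribute does not apply.
    voice: Optional[str] = None
    tense: Optional[str] = None
    mood: Optional[str] = None
    number: Optional[str] = None
    person: Optional[str] = None
    gender: Optional[str] = None
    case: Optional[str] = None
    state: Optional[str] = None
    possessive_number: Optional[str] = None
    possessive_person: Optional[str] = None
    possessive_gender: Optional[str] = None


# Illustrative values only: the plural noun "بيوت" with lemma "بيت".
example = FormRecord(
    form="بيوت",
    lemma="بيت",
    pos="noun",
    number="plural",
    gender="masculine",
)
```

Verb-only attributes such as voice, tense and mood simply stay unset for a noun in this sketch.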

Lemma

The canonical form for the inflected word.

POS

Part of Speech such as noun, verb, adjective, etc.

Voice

Verb form is classified as active or passive.

Tense

Specifies when the action takes place: past, present, future, etc.

Aspect

Not applicable.

Mood

Modality of the verb form: indicative, subjunctive, imperative, etc.

Person

Verb or pronoun refers to the first, second or third person.

Number

State of being singular, dual or plural.

Gender

Grammatical gender of noun, verb or adjective forms: masculine, feminine, neuter, etc.

Case

The function that the noun or adjective plays within a sentence.

Degree

Not applicable.

Definiteness State

Specifies whether a noun or adjective is definite or indefinite, i.e. whether it refers to a specific or a general concept.

Negative

Not applicable.

Contractions

Not applicable.

Pronominal Clitics

Clitic pronouns are identified and tagged.

Formality

Not applicable.

Frequency

Relative frequency of the form based on a large general-purpose corpus.
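
A minimal sketch of how a relative frequency can be computed, assuming the usual definition (occurrences of a form divided by the total number of tokens in the corpus). The function name and the toy token list are hypothetical; the corpus and the exact method behind the data are not described here.

```python
from collections import Counter


def relative_frequencies(tokens):
    """Map each form to its share of all tokens (assumed definition)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {form: n / total for form, n in counts.items()}


# Toy corpus of four tokens, for illustration only.
freqs = relative_frequencies(["في", "من", "في", "بيت"])
```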

Named Entities

Pre-defined entities are tagged as person names, places, organizations, etc.

Offensive

Indicates whether the form might be considered offensive in certain contexts.

Non-inflectional POSs

The data contains all the forms, lemmas, POS and morphological attributes (voice, tense, mood, transitivity, number, person, gender, case, state, pronominal-number, pronominal-person and pronominal-gender) for non-inflectional POSs (determiners, pronouns, prepositions…), i.e. POSs not covered by the Inflectional and Derivational Morphology data.

Inflectional Morphology Data

The Lexical Resource for Najdi contains all the forms for all POSs: nouns, verbs, adjectives, adverbs, pronouns, prepositions, conjunctions, numerals and particles.

Derivational Morphology Data

The data also includes all derivational forms, such as adjectives derived from nouns (nisba), verbal nouns, and adverbs derived from adjectives.

Extended Morphology Data

The data also covers extensions of the inflectional and derivational form lists that arise from additional morphological phenomena, such as common combinations of productive prefixes.
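
The idea of extending a form list with prefix combinations can be sketched as follows. The prefix inventories and the function name are assumptions chosen for illustration, not the set actually used in the data, and the sketch ignores spelling adjustments that real Arabic orthography requires (for example, the preposition ل merging with the definite article).

```python
from itertools import product

# Illustrative prefix slots; "" means the slot is left empty.
CONJUNCTIONS = ["", "و", "ف"]
PREPOSITIONS = ["", "ب", "ك"]
ARTICLE = ["", "ال"]


def extend_form(base):
    """Return the base form with every non-empty prefix combination attached."""
    extended = []
    for conj, prep, art in product(CONJUNCTIONS, PREPOSITIONS, ARTICLE):
        prefix = conj + prep + art
        if prefix:  # skip the combination where every slot is empty
            extended.append(prefix + base)
    return extended


extended_forms = extend_form("بيت")
```

With three conjunction options, three preposition options and two article options, a single base form yields 17 extended forms in this sketch, which gives a sense of how prefix combinations multiply the size of a form list.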

Complementary Semantic Annotations

Usage

Information regarding the usage of specific forms or lemmas is included. Words that are rarely used will be tagged as “rare”. Words borrowed from foreign languages that are not widely understood, or which are only meaningful as part of a larger phrase, will be tagged by Licensor as “foreign”.

Words or spelling variations which are not officially recognized but are widely used in texting communication will be tagged by Licensor as “SMS”.

Offensive Language

Data on offensive, vulgar and sensitive words, with all their lemmas, POS and attributes, is included. Words that are meant to demean or express hatred for a specific person or group based on race, ethnicity, sexual orientation, etc. will be tagged by Licensor as “offensive”.

Words that make explicit and offensive references to sex or bodily functions, or that are rude or in bad taste (including profanity), will be tagged as “vulgar”.

Words that are not themselves offensive or vulgar, but could be part of potentially vulgar, offensive or discomforting phrases will be tagged by Licensor as “sensitive”.

Words that could potentially be vulgar or offensive when used in a particular context will also be tagged by Licensor as “sensitive”.
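
A consumer of the data might use these tags to screen forms, for example in a suggestion or autocomplete feature. The following is a minimal sketch under the assumption that each form carries a collection of tag strings; the function name and the policy of always blocking “offensive” and “vulgar” are choices made for this example.

```python
# Tags that are always screened out in this sketch.
BLOCKED_TAGS = {"offensive", "vulgar"}


def is_allowed(tags, allow_sensitive=True):
    """Return True if none of the form's tags are blocked.

    "sensitive" forms are context-dependent, so blocking them is optional.
    """
    blocked = BLOCKED_TAGS if allow_sensitive else BLOCKED_TAGS | {"sensitive"}
    return not (set(tags) & blocked)
```

For instance, `is_allowed(["sensitive"])` is true under the default setting, while `is_allowed(["sensitive"], allow_sensitive=False)` is false.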

Categories

Categories cover frequently used words, with all their lemmas, POS and attributes. Each frequently used word is considered to fall under, and will be tagged by Licensor with, one of the following categories:

  • Animal
  • Body part
  • Cities
  • Clothing
  • Color
  • Computer
  • Family name
  • Female first name
  • Fruit/vegetable
  • Greetings (including multi-word expressions)
  • Male first name
  • Measures
  • Organization
  • Plant
  • Professions
  • Relation
  • Seasons
  • Sport
  • States
  • Transportation
  • Weather

Transitivity

The data regarding the transitivity of verbs, which determines the applicability of pronominal suffixes to verbal forms, is included.
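
How transitivity can gate pronominal suffixes is sketched below. The suffix spellings are illustrative placeholders rather than values taken from the data, and the function name and the simple concatenation are assumptions; real forms may require stem changes that this sketch does not model.

```python
# Illustrative object-pronoun suffixes (placeholders, not from the data).
OBJECT_SUFFIXES = {"3ms": "ه", "3fs": "ها", "1s": "ني"}


def object_suffix_forms(verb_form, transitive):
    """Attach object suffixes only when the verb is marked transitive."""
    if not transitive:
        return []  # intransitive verbs take no object suffix in this sketch
    return [verb_form + suffix for suffix in OBJECT_SUFFIXES.values()]


with_object = object_suffix_forms("كتب", transitive=True)
without_object = object_suffix_forms("نام", transitive=False)
```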
