Simplified Chinese (ZH) Language Data
Inflectional Morphology Data
The Lexical Resource for Simplified Chinese contains all the standard inflectional forms for nouns, verbs, adjectives, postpositions, conjunctions, etc.
Derivational Morphology Data
Contains all the standard derivational forms including common compound words.
Extended Morphology Data
Extends the inflectional and derivational form lists to cover additional morphological phenomena such as genitive forms and common contractions.
Frequency Indication
Contains the relative frequency with which each word in the above lists appears in the language.
Each word has been assigned a frequency group, where the frequency group corresponds to a normalized logarithmic scale from 0 to 255. The most frequent word in the corpus has been assigned frequency group 255, and words not appearing in the corpus have been assigned frequency group 0.
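The mapping from corpus counts to frequency groups can be sketched as follows. The exact formula used by the Lexical Resource is not stated here, so this is only one plausible reading of "normalized logarithmic scale": the function name `frequency_group`, the use of `log(1 + count)`, and the floor of 1 for words that do occur are all assumptions made for illustration.

```python
import math


def frequency_group(count: int, max_count: int) -> int:
    """Map a raw corpus count to a 0-255 group (assumed formula, not the vendor's)."""
    if max_count <= 0 or count > max_count:
        raise ValueError("max_count must be positive and at least as large as count")
    if count <= 0:
        # Words that do not appear in the corpus get group 0.
        return 0
    # Normalize on a logarithmic scale so the most frequent word maps to 255.
    scaled = math.log1p(count) / math.log1p(max_count)
    # Assumption: any word that occurs at all is kept above group 0.
    return max(1, round(255 * scaled))
```

Under these assumptions, `frequency_group(max_count, max_count)` yields 255 and `frequency_group(0, max_count)` yields 0, matching the two endpoints described above; the values in between depend on the formula chosen.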
Complementary Semantic Annotations
Named Entities Morphology Data
Contains data on named entities, including person names, places, companies, and organizations.
Offensive Language Flag
Contains information per word indicating if the word might be considered offensive in certain contexts.
Regional Variants
In addition to the lexical data for Simplified Chinese, the Lexical Resource also contains the equivalent lexical data for the following regional variants:
- Traditional Chinese (under development).
Volume of Language Data
Total number of forms
75,000 forms
- Verbs: 18,000 forms (24%)
- Nouns: 50,000 forms (66%)
- Adjectives: 5,000 forms (7%)
- Other: 2,000 forms (3%)
Total number of lemmas
75,000 lemmas
Features
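Each form is annotated with the lemma (root form) and POS. The morphological attributes of the annotation scheme (tense, person, number, gender, degree, etc.) are listed below, but they are not applicable to Simplified Chinese.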
Lemma
The canonical form for the inflected word.
POS
Part of Speech such as noun, verb, adjective, etc.
Voice
Not applicable.
Tense
Not applicable.
Aspect
Not applicable.
Mood
Not applicable.
Person
Not applicable.
Number
Not applicable.
Gender
Not applicable.
Case
Not applicable.
Degree
Not applicable.
Definiteness State
Not applicable.
Negative
Not applicable.
Contractions
Not applicable.
Pronominal Clitics
Not applicable.
Formality
Not applicable.
Frequency
Relative frequency of the form based on a large general-purpose corpus.
Named Entities
Pre-defined entities are tagged as person names, places, organizations, etc.
Offensive
Indicates whether the form might be considered offensive in certain contexts.
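As a rough illustration of how the applicable features fit together, a single entry could be modeled as the record below. The delivery format of the Lexical Resource is not described here, so the class name `LexicalEntry`, its field names, and the sample values (including the frequency group) are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LexicalEntry:
    """Hypothetical record holding the features that apply to Simplified Chinese."""
    form: str                           # surface form
    lemma: str                          # canonical form
    pos: str                            # part of speech, e.g. "noun"
    frequency: int                      # frequency group, 0-255
    named_entity: Optional[str] = None  # e.g. "person", "place", "organization"
    offensive: bool = False             # possibly offensive in certain contexts

    def __post_init__(self) -> None:
        # Frequency groups are defined on a 0-255 scale.
        if not 0 <= self.frequency <= 255:
            raise ValueError("frequency group must be between 0 and 255")


# Made-up sample entry; the frequency value is illustrative only.
entry = LexicalEntry(form="北京", lemma="北京", pos="noun",
                     frequency=200, named_entity="place")
```

Features marked "Not applicable" are left out of the sketch, since they carry no value for this language.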