Usually, in this blog, we write about text analysis products such as lemmatizers or parsers and how they can help to solve issues in products that need an accurate understanding of text to function.
But today we also want to show you what is behind our technology and how we create it. That is why we decided to interview one of our expert linguists, Clara Garcia, to provide some insights.
One of our team’s latest projects involved creating a morphological analyzer for Tagalog.
In inflected languages, words are formed through morphological processes such as affixation. For example, by adding the suffix ‘-s’ to the verb ‘to dance’, we form the third person singular ‘dances’.
A morphological analyzer assigns the attributes of a given word by evaluating what morphological processes the form has undergone. If you give it the word ‘bailaré’ in Spanish, it will tell you it is the first person, singular, simple future, indicative form of the verb ‘bailar’.
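To make the ‘bailaré’ example concrete, here is a minimal sketch in Python. It is not our production analyzer: it assumes only regular Spanish verbs, whose simple future is the infinitive plus a person ending, and the function name `analyze_future` is made up for illustration.

```python
import unicodedata

# Simple-future endings for regular Spanish verbs, longest first so that
# "emos" is tried before the one-letter endings.
FUTURE_ENDINGS = [
    ("emos", "first", "plural"),
    ("éis", "second", "plural"),
    ("ás", "second", "singular"),
    ("án", "third", "plural"),
    ("é", "first", "singular"),
    ("á", "third", "singular"),
]


def analyze_future(form):
    """Return the attributes of a regular simple-future form, or None."""
    # Normalize so accented letters compare equal however they were typed.
    form = unicodedata.normalize("NFC", form.lower())
    for ending, person, number in FUTURE_ENDINGS:
        if not form.endswith(ending):
            continue
        stem = form[: -len(ending)]
        # Assumption: the future stem of a regular verb is its infinitive.
        if stem.endswith(("ar", "er", "ir")):
            return {
                "lemma": stem,
                "person": person,
                "number": number,
                "tense": "simple future",
                "mood": "indicative",
            }
    return None


print(analyze_future("bailaré"))
```

Irregular futures such as ‘tendré’ return `None` here, which is exactly the kind of case a real morphological model has to cover.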
Building a tool like this involves analyzing the grammar of the language, creating morphological models of how each part of speech (POS) inflects, and then writing software that applies those models to automatically detect which attributes should be assigned to a particular form. In this case I will talk specifically about Tagalog, since we have just developed a morphological analyzer for it.
The first difference between common and more “exotic” languages is the amount of literature and resources you can get. Finding enough literature to create all morphological models for an “exotic” language can be challenging.
Apart from that, creating the tool depends on the language's inflectional system and whether it is very complex or relatively simple.
It is doable in both cases, but a more complex inflectional system also requires more complex software. Both the English and Tagalog inflectional systems are manageable enough to create a fine-grained analyzer.
Tagalog is mainly spoken in the Philippines and belongs to the Austronesian family. As I said, its inflectional system is manageable: only verbs and pronouns inflect.
I will focus on verbs, which are a bit more complex. They inflect to mark aspect, focus/voice and mood, and they do it through affixation and reduplication.
Tagalog is different from other languages in that it uses reduplication to mark aspect (most languages that use reduplication do so to mark intensity, to form plurals, or for onomatopoeia, among other uses).
It also has a rich affixation system, with suffixes, infixes, prefixes and circumfixes that mark focus and mood. It is therefore an interesting language for this kind of morphological tool.
I will show a couple of examples of the steps to follow in Tagalog:
For the contemplated form of the verb ‘to eat’ (seen in the table above), we follow these steps:
For the progressive form of the verb ‘to read’, we follow these steps:
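As a rough illustration of how reduplication and affixation combine, here is a minimal Python sketch. It assumes the actor-focus ‘-um-’ conjugation only (the examples above may rely on a different affix), it treats the first consonant(s) plus vowel of the root as the reduplicated syllable, and the helper names are hypothetical.

```python
VOWELS = "aeiou"


def first_syllable(root):
    """Return the initial consonant(s) plus first vowel of the root."""
    for i, ch in enumerate(root):
        if ch in VOWELS:
            return root[: i + 1]
    raise ValueError(f"no vowel in root: {root!r}")


def infix_um(stem):
    """Insert the infix -um- right before the first vowel of the stem."""
    for i, ch in enumerate(stem):
        if ch in VOWELS:
            return stem[:i] + "um" + stem[i:]
    raise ValueError(f"no vowel in stem: {stem!r}")


def um_forms(root):
    """Build the three aspect forms of an -um- verb from its root."""
    # Reduplication: copy the first syllable in front of the root.
    reduplicated = first_syllable(root) + root
    return {
        "completed": infix_um(root),            # infix only
        "progressive": infix_um(reduplicated),  # reduplication + infix
        "contemplated": reduplicated,           # reduplication only
    }


print(um_forms("kain"))  # root of 'to eat'
print(um_forms("basa"))  # root of 'to read'
```

For ‘kain’ this yields ‘kumain’, ‘kumakain’ and ‘kakain’; for ‘basa’ it yields ‘bumasa’, ‘bumabasa’ and ‘babasa’. Other affix classes and loanwords with consonant clusters would need their own rules.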
As with most languages, the hardest part of morphological analysis is covering all the possible phonological processes in the language. It is challenging to account for all of them, particularly the least productive ones, which we commonly call exceptions.
One example of these phonological processes in Tagalog involves roots that have ‘o’ as the vowel in the final syllable. They change ‘o’ into ‘u’ when a suffix is added: ‘suntok’ > ‘suntukin’. Accounting for all of them, when there can be dozens of such processes, is one of the main difficulties I found.
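The ‘o’ > ‘u’ rule on its own can be sketched in a few lines of Python. This is only an illustration of that single process under the assumption that the last vowel of the root is the vowel of its final syllable; the function name `add_suffix` is hypothetical, and every other process a real analyzer must handle is deliberately left out.

```python
VOWELS = "aeiou"


def add_suffix(root, suffix):
    """Attach a suffix, raising a final-syllable 'o' to 'u' first."""
    # Walk backwards to find the last vowel, i.e. the final syllable's vowel.
    for i in range(len(root) - 1, -1, -1):
        if root[i] in VOWELS:
            if root[i] == "o":
                root = root[:i] + "u" + root[i + 1:]
            break
    return root + suffix


print(add_suffix("suntok", "in"))  # suntukin
```

Each additional process would be one more rule of this kind, and the order in which the rules apply starts to matter once there are dozens of them.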
This tool can be helpful for many tasks, some but not all related to NLP.
In the ‘assignment of attributes’ process, the word is lemmatized and stemmed. Knowing the reduplication and/or affixes that apply to a word, we can find its lemma. This is useful for many NLP processes, for example concordances or POS tagging.
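Going in the analysis direction, finding the lemma means undoing the affixation and reduplication. The sketch below assumes the form is an actor-focus ‘-um-’ verb and uses no lexicon, so it will misanalyze ordinary words that merely look reduplicated or happen to contain ‘um’; the names `analyze_um_verb`, `split_onset` and `strip_reduplication` are invented for the example.

```python
VOWELS = "aeiou"


def split_onset(word):
    """Split a word into its initial consonants and the remainder."""
    for i, ch in enumerate(word):
        if ch in VOWELS:
            return word[:i], word[i:]
    return word, ""


def strip_reduplication(word):
    """Return the root if the first syllable is doubled, else None."""
    onset, rest = split_onset(word)
    if not rest:
        return None
    syllable = onset + rest[0]
    remainder = word[len(syllable):]
    return remainder if remainder.startswith(syllable) else None


def analyze_um_verb(form):
    """Guess lemma and aspect of an -um- verb form, or return None."""
    onset, rest = split_onset(form)
    if rest.startswith("um") and len(rest) > 2:
        stripped = onset + rest[2:]  # remove the infix
        root = strip_reduplication(stripped)
        if root is not None:
            return {"lemma": root, "aspect": "progressive", "affix": "-um-"}
        return {"lemma": stripped, "aspect": "completed", "affix": "-um-"}
    root = strip_reduplication(form)
    if root is not None:
        return {"lemma": root, "aspect": "contemplated", "affix": "-um-"}
    return None


for form in ("kumain", "kumakain", "kakain"):
    print(form, analyze_um_verb(form))
```

All three forms come back with the lemma ‘kain’ and a different aspect. In practice the candidate lemma would be checked against a list of known roots before being accepted.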
It can also be applied to search engines, so that when you look up an inflected verb or noun, the engine finds its lemma and suggests everything related to it. For example, if you look up something about ‘walking’, you will get more results if the search engine knows that the base form of the verb is ‘walk’ and from there can access the whole verb paradigm.
This tool can also help us with database indexing and therefore with information retrieval. We don’t just have words but also their attributes, lemma, and stem. We can easily access specific elements through all this information.
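A toy version of lemma-based indexing and search might look like the following Python sketch. The `LEMMAS` table is hand-written sample data standing in for the analyzer's output, and `build_index` and `search` are hypothetical names, not part of any existing product.

```python
from collections import defaultdict

# Toy form -> lemma table standing in for the analyzer (made-up sample data).
LEMMAS = {"walk": "walk", "walks": "walk", "walked": "walk", "walking": "walk"}


def lemma_of(token):
    """Return the lemma of a token, or the token itself if unknown."""
    return LEMMAS.get(token, token)


def build_index(docs):
    """Map each lemma to the ids of the documents that contain a form of it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in text.lower().split():
            index[lemma_of(token)].add(doc_id)
    return index


def search(index, query):
    """Look up a single-word query by its lemma rather than its surface form."""
    return sorted(index.get(lemma_of(query.lower()), set()))


DOCS = ["She walks to work", "They walked home", "Running is fun"]
INDEX = build_index(DOCS)
print(search(INDEX, "walking"))  # [0, 1]
```

The query ‘walking’ appears in none of the documents, yet it matches the two that contain other forms of ‘walk’, because both the index and the query are reduced to the lemma.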
A morphological analyzer can also be used as part of a machine translation system, reducing the complexity of the input and helping to understand the syntax. The words become a bag full of information pieces (lemma + tense + aspect + person, etc.).
If you want to replicate our morphological analyzer, download our presentation with the Python script and some examples: