On the Stanford parser (and Bitext parser)

In some of our recent talks, colleagues have asked us about the Stanford parser and how it compares to Bitext technology (notably at our last workshop on Semantic Analysis of Big Data in San Francisco and at our presentation at the Semantic Garage, also in San Francisco).

We have revisited this parser and, as expected, the results were impressive. Sentences taken from news sources are parsed elegantly, each with a clean dependency tree. The parser is based on a probabilistic approach, i.e., it needs to be trained on a hand-annotated corpus with POS (Part of Speech) tags and syntactic structure; typically, the Wall Street Journal corpus from the Penn Treebank.

The question arises whether this approach is effective when a different type of text needs to be addressed (such as Social Media content or User Generated Reviews). In that case, new corpora must be hand-tagged to train the parser.

Consider parsing tweets, which often feature ungrammatical sentences, slang, abbreviations and emoticons. If this also needs to be done in languages other than English, hand-tagging corpora to train a parser can be a hurdle for fully automating tasks like Social Media Analysis.

At Bitext, we have had to face this situation, since our customers deal with many different types of data in multiple languages.

We follow a linguistic approach whereby we can quickly develop grammars which describe the dependency structure of sentences at various levels of detail, depending on whether we analyze Social Media content or News texts.

Our linguistic parser is grammar-independent and dictionary-independent: for a new language we use the same parser and change only the linguistic data sources. Most importantly, we do not need hand-tagged corpora to “train” our parser, since it does not require any training.
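As a rough illustration of keeping the engine separate from the linguistic data, here is a minimal sketch. It is not Bitext's implementation: the function `attach`, the toy lexicons and the two-word grammars are all hypothetical, and the engine only links adjacent words. The point is simply that the same code handles English and Spanish once the data tables are swapped.

```python
def attach(tokens, lexicon, grammar):
    """Link adjacent words using rules supplied as data.

    grammar maps (left_pos, right_pos) -> (relation, head_side),
    where head_side says which of the two words is the head.
    """
    tags = [lexicon.get(t.lower(), "UNK") for t in tokens]
    deps = []
    for i in range(len(tokens) - 1):
        rule = grammar.get((tags[i], tags[i + 1]))
        if rule is None:
            continue  # no rule for this pair of tags
        relation, head_side = rule
        head, dep = (i, i + 1) if head_side == "left" else (i + 1, i)
        deps.append((relation, tokens[head], tokens[dep]))
    return deps


# Hypothetical English data: the adjective precedes the noun it modifies.
EN_LEXICON = {"red": "ADJ", "car": "NOUN"}
EN_GRAMMAR = {("ADJ", "NOUN"): ("amod", "right")}

# Hypothetical Spanish data: the adjective follows the noun it modifies.
ES_LEXICON = {"rojo": "ADJ", "coche": "NOUN"}
ES_GRAMMAR = {("NOUN", "ADJ"): ("amod", "left")}

print(attach(["red", "car"], EN_LEXICON, EN_GRAMMAR))     # [('amod', 'car', 'red')]
print(attach(["coche", "rojo"], ES_LEXICON, ES_GRAMMAR))  # [('amod', 'coche', 'rojo')]
```

Nothing in `attach` is language-specific, and no annotated corpus is involved; covering a new language or text type in this toy setup means writing new tables, not retraining.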

The need for hand-tagged training data is the downside of the probabilistic approach; on the upside, the probabilistic information associated with rules can prove very useful. We are looking into ways to incorporate statistical information into our grammars so that the parser can select the correct analysis in cases where structural ambiguity is pervasive.

A hybrid approach combining linguistic knowledge and statistical information could help resolve these ambiguous sentences, which are more frequent than we usually think.
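To make the idea concrete, here is a minimal sketch of how rule probabilities can pick between two structurally valid analyses. It uses the classic prepositional-phrase attachment ambiguity in "saw the man with the telescope". The grammar, the probabilities and the helper names (`rules`, `score`) are invented for illustration; they do not come from any real treebank or from either parser discussed here.

```python
from math import prod

# Hypothetical rule probabilities (illustrative numbers only).
# Probabilities of rules sharing a left-hand side sum to 1.
RULE_PROB = {
    ("VP", ("V", "NP", "PP")): 0.3,
    ("VP", ("V", "NP")): 0.7,
    ("NP", ("NP", "PP")): 0.2,
    ("NP", ("Det", "N")): 0.8,
    ("PP", ("P", "NP")): 1.0,
}


def rules(tree):
    """Yield every (parent, children) rule used in a tree.

    A tree is a nested tuple (label, child, ...); a leaf is (pos, word).
    """
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return  # lexical leaf: no grammar rule to score
    yield (label, tuple(child[0] for child in children))
    for child in children:
        yield from rules(child)


def score(tree):
    """Score an analysis as the product of its rule probabilities."""
    return prod(RULE_PROB[rule] for rule in rules(tree))


THE_MAN = ("NP", ("Det", "the"), ("N", "man"))
WITH_TELESCOPE = ("PP", ("P", "with"), ("NP", ("Det", "the"), ("N", "telescope")))

# Reading 1: the telescope is the instrument of seeing (PP attaches to the verb).
VERB_ATTACH = ("VP", ("V", "saw"), THE_MAN, WITH_TELESCOPE)
# Reading 2: the man has the telescope (PP attaches to the noun phrase).
NOUN_ATTACH = ("VP", ("V", "saw"), ("NP", THE_MAN, WITH_TELESCOPE))

best = max([VERB_ATTACH, NOUN_ATTACH], key=score)
print("verb attachment" if best is VERB_ATTACH else "noun attachment")
```

With these made-up numbers the verb attachment scores higher. In a hybrid setup, the hand-written grammar would still produce both candidate analyses, and the statistics would only be used to rank them.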
