In some of our recent talks, colleagues have asked us about the Stanford parser and how it compares to Bitext technology (namely at our last workshop on Semantic Analysis of Big Data in San Francisco, and in our presentation at the Semantic Garage, also in San Francisco).
We have revisited this parser and, as expected, the results were impressive. Sentences taken from news sources are parsed elegantly, with a clean dependency tree for their constituents. The parser is based on a probabilistic approach, i.e., it needs to be trained on a corpus hand-tagged with POS (Part of Speech) tags, typically the Wall Street Journal section of the Penn Treebank.
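To give a feel for what running such a pre-trained parser looks like in practice, here is a minimal sketch using Stanza, the Stanford NLP Group's current Python toolkit (the post refers to the older Java parser, but the workflow is comparable); the sentence is just an example:

    # Minimal sketch using Stanza, the Stanford NLP Group's Python toolkit.
    # The models downloaded here are pre-trained on treebank data; no rules
    # are written by hand.
    import stanza

    stanza.download("en")  # fetch pre-trained English models (one-time step)
    nlp = stanza.Pipeline("en", processors="tokenize,pos,lemma,depparse")

    doc = nlp("The board approved the merger on Friday.")
    for sentence in doc.sentences:
        for word in sentence.words:
            head = sentence.words[word.head - 1].text if word.head > 0 else "ROOT"
            print(word.text, "<-" + word.deprel + "-", head)

On well-edited text like this, the output is a clean head-dependent list; the question is what happens when the input stops looking like newswire.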
The question is whether this approach remains effective when a different type of text needs to be handled, such as Social Media content or user-generated reviews. In that case, new corpora must be hand-tagged to retrain the parser.
Think about parsing tweets, which often feature ungrammatical sentences, slang, abbreviations and emoticons. If this also needs to be done in languages other than English, hand-tagging corpora to train a parser becomes a serious hurdle for fully automating tasks like Social Media Analysis.
At Bitext, we have had to face this situation because our customers deal with many different types of data in multiple languages.
We follow a linguistic approach: we can quickly develop grammars that describe the dependency structure of sentences at various levels of detail, depending on whether we are analyzing Social Media content or news texts.
Our linguistic parser is grammar-independent and dictionary-independent: for a new language we use the same parsing engine and simply change the linguistic data sources. Most importantly, we do not need hand-tagged corpora to "train" our parser, since it requires no training at all.
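To illustrate the general idea (this is a toy example built with NLTK's generic DependencyGrammar class, not Bitext's actual formalism), a small hand-written dependency grammar can parse a tweet-like sentence with no training corpus at all:

    # Toy hand-written dependency grammar, sketched with NLTK's generic
    # DependencyGrammar class (not Bitext's own formalism). Each rule lists
    # the dependents a head word may govern; no training data is involved.
    import nltk

    grammar = nltk.DependencyGrammar.fromstring("""
    'love' -> 'I' | 'pic' | '!'
    'pic' -> 'this' | 'lol'
    """)

    parser = nltk.ProjectiveDependencyParser(grammar)
    for tree in parser.parse("I love this pic lol !".split()):
        print(tree)

Covering a new text type or language then becomes a matter of writing or adapting rules and dictionaries rather than tagging a new corpus.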
That is the upside; the downside is that we forgo the probabilistic information associated with rules, which can prove very useful. We are looking into ways to incorporate statistical information into our grammars so that the parser can select the correct analysis in cases where structural ambiguity is pervasive.
A hybrid approach combining linguistic knowledge and statistical information could help disambiguate such sentences, which are more frequent than we usually think.
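One way to picture such a hybrid setup (purely a hypothetical sketch, not our implementation) is to let the hand-written grammar enumerate the candidate analyses and then use corpus-derived arc weights to pick the most plausible one:

    # Hypothetical sketch of hybrid disambiguation: a rule-based grammar
    # proposes candidate dependency analyses, and corpus-derived weights
    # choose between them. All names and numbers here are illustrative.

    # Two candidate analyses of the classic "I shot an elephant in my pajamas";
    # each candidate is a list of (head, dependent) arcs.
    candidates = {
        "PP attaches to verb": [("shot", "I"), ("shot", "elephant"), ("shot", "in")],
        "PP attaches to noun": [("shot", "I"), ("shot", "elephant"), ("elephant", "in")],
    }

    # Toy attachment probabilities, as they might be estimated from a corpus.
    arc_weights = {
        ("shot", "in"): 0.7,      # shooting events often take a locative PP
        ("elephant", "in"): 0.3,  # "elephant in pajamas" is far less common
    }

    def score(arcs):
        """Sum corpus-derived weights; arcs the corpus says nothing about get 0.5."""
        return sum(arc_weights.get(arc, 0.5) for arc in arcs)

    best = max(candidates, key=lambda name: score(candidates[name]))
    print(best)  # -> "PP attaches to verb"

The linguistic rules still define what counts as a valid analysis; the statistics only decide between analyses the grammar already licenses.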