Named Entity Recognition

Long Hoang Nguyen September 26, 2016 at 3:49 pm

I am in the group of people designing a conversational bot application, Alquist, in the eClub Summer Camp 2016. Intelligent bots need to understand what people are asking. The two basic problems are recognizing the intent and the entities in a user’s query.

My task is creating a Named Entity Recognition (NER) system for Czech. Every machine learning algorithm needs to be trained, so we had to find a suitable dataset. Luckily, the Institute of Formal and Applied Linguistics created one called CNEC 3.0. It consists of almost 9,000 sentences and a two-level hierarchy of 46 named entity types. There are also embedded named entities (a person has a first name and a last name) and special tags (for foreign or misspelled words). For comparison, the English CoNLL-2003 dataset has only 4 categories but considerably more data. Our task is thus more difficult.

To measure improvement during training we need to choose a metric. There are several performance metrics and, unfortunately, each published paper uses a different one. They measure per-entity performance (ignoring embedded named entities), structural performance (comparing both the outer tag and the embedded tags), or per-word performance (using the IOB inside-outside-beginning scheme). The latest paper reported an F1 score of 0.84 with all 46 tags (it is not clear whether embedded tags were considered), using a maximum entropy classifier.

We decided first to use the IOB scheme for multiword named entities, and to use only the 8 supertypes instead of all 46 entity types. This results in 8 × 2 + 1 (outside) = 17 possible tags. We planned to try two different approaches; the first uses Conditional Random Fields (CRF) with hand-picked feature functions (collected from research papers).
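The 17-tag set can be generated mechanically from the supertypes. A minimal sketch (the one-letter supertype codes below are an assumption modeled on the examples later in the post, e.g. "i" for institutions and "g" for geography; the real CNEC inventory should be checked):

```python
# Hypothetical one-letter supertype codes (assumption, not the official CNEC list).
supertypes = ["a", "c", "g", "i", "m", "o", "p", "t"]

# Each supertype gets a _b (begin) and _i (inside) variant, plus a single "O" tag.
tags = [f"{s}_{pos}" for s in supertypes for pos in ("b", "i")] + ["O"]

print(len(tags))  # 8 * 2 + 1 = 17
```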

Here is an example of an embedded named entity, where A is a special container tag (not a named entity by itself):

<A<ic VYSOKÉ UČENÍ TECHNICKÉ V <gu BRNĚ>>, <gs Antonínská> <ah 548/1>, <gu Brno> <az 601 90>>

Using the structural performance metric, we would have to identify the container and all the inner tags too. This is very difficult for cases like BRNĚ, which is both a city on its own and part of another named entity. It is also difficult to train a NER system this way, as it requires chunking instead of going word by word.

Per-entity metric ignores the embedded tags and the container:

<ic VYSOKÉ> <ic UČENÍ> <ic TECHNICKÉ> <ic V> <ic BRNĚ> <O ,> <gs Antonínská> <ah 548/1> <O ,> <gu Brno> <az 601> <az 90>

Per-word performance uses the IOB scheme instead:

<ic_b VYSOKÉ> <ic_i UČENÍ> <ic_i TECHNICKÉ>  <ic_i V>  <ic_i BRNĚ> <O ,>  <gs_b Antonínská> <ah_b 548/1> <O ,> <gu_b Brno> <az_b 601> <az_i 90>

Per-word performance with supertags (what we use):

<i_b VYSOKÉ> <i_i UČENÍ> <i_i TECHNICKÉ>  <i_i V>  <i_i BRNĚ> <O ,>  <g_b Antonínská> <a_b 548/1> <O ,> <g_b Brno> <a_b 601> <a_i 90>

CRFs model the conditional probability p(Y|X) (where X is the observed sequence and Y is the sequence of labels) as the exponential of a weighted sum of (usually boolean) feature functions, divided by a normalizer. Training the CRF sets the weight of every feature function.
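To make the definition concrete, here is a toy sketch of that probability: the unnormalized score is the weighted sum of feature functions over all positions, and the normalizer Z(x) sums the exponentiated score over every possible label sequence. (A brute-force Z is exponential in sentence length; real CRF implementations use dynamic programming. The feature function and labels below are made-up examples.)

```python
import math

def score(weights, feature_functions, sentence, labels):
    """Unnormalized log-score: weighted sum of feature functions over positions."""
    s = 0.0
    for i in range(len(sentence)):
        prev = labels[i - 1] if i > 0 else "O"
        for w, f in zip(weights, feature_functions):
            s += w * f(sentence, i, labels[i], prev)
    return s

def prob(weights, ffs, sentence, labels, all_labelings):
    """p(y|x) = exp(score(y)) / Z(x); Z enumerated by brute force here."""
    z = sum(math.exp(score(weights, ffs, sentence, y)) for y in all_labelings)
    return math.exp(score(weights, ffs, sentence, labels)) / z

# Toy feature: fires when a capitalised word is labelled as geography-begin.
cap_geo = lambda s, i, y, prev: 1.0 if y == "g_b" and s[i][0].isupper() else 0.0
print(prob([1.0], [cap_geo], ["Brno"], ["g_b"], [["O"], ["g_b"]]))
```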

The common features for NER are:

  • morphological attributes (is_capitalised, suffixes, contains_numbers, etc.) – named entities are often capitalised
  • POS tags – We can sometimes deduce the named entity given the sentence structure
  • Bag of words
  • Gazetteers – usually a list of named entities, geolocation/names/organisations
  • Brown clusters – Brown clustering is a hierarchical clustering scheme that starts with each word in its own cluster; the two clusters whose merge causes the smallest loss in global mutual information are then merged repeatedly. This can be seen as a binary tree, or a sequence of merges, and it assigns a binary string to every word (cat could have 1010, dog 1011). We can take prefixes of the code to get different granularities (in other words, cut the hierarchical clustering at different levels). We used Brown clusters because word embeddings are very difficult to encode in a CRF; we computed them on a dump of the Czech Wikipedia.
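The prefix trick from the Brown-cluster bullet can be sketched in a few lines (the cluster bitstrings below reuse the toy cat/dog values from the text; real ones would come from clustering the Czech Wikipedia dump):

```python
# Toy Brown-cluster bitstrings; real values come from running Brown clustering.
brown = {"cat": "1010", "dog": "1011"}

def brown_prefix_features(word, lengths=(2, 4)):
    """Cut the cluster hierarchy at several depths by taking bitstring prefixes."""
    code = brown.get(word)
    if code is None:
        return {}  # out-of-vocabulary word: no cluster features
    return {f"brown_{n}": code[:n] for n in lengths}

print(brown_prefix_features("cat"))  # {'brown_2': '10', 'brown_4': '1010'}
```

Note that at prefix length 3 cat and dog share the code 101, i.e. shorter prefixes group related words into coarser clusters.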

Examples of feature functions:

  • f(sentence, index, current_word, previous_word) = 1 iff current_word = York and previous_word = New
  • f(sentence, index, current_word, previous_word) = 1 iff is_capitalised(current_word)
  • f(sentence, index, current_word, previous_word) = 1 iff is_NOUN(current_word) and is_VERB(previous_word)
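The three bullet points translate directly into code. A sketch, where is_capitalised and the POS dictionary are toy stand-ins for real morphological and POS lookups:

```python
def is_capitalised(word):
    return word[:1].isupper()

pos = {"York": "NOUN", "visited": "VERB"}  # toy POS dictionary (assumption)

def f_new_york(sentence, index, current_word, previous_word):
    # Fires exactly on the bigram "New York".
    return 1 if current_word == "York" and previous_word == "New" else 0

def f_cap(sentence, index, current_word, previous_word):
    # Fires on any capitalised current word.
    return 1 if is_capitalised(current_word) else 0

def f_noun_after_verb(sentence, index, current_word, previous_word):
    # Fires when a noun directly follows a verb.
    return 1 if pos.get(current_word) == "NOUN" and pos.get(previous_word) == "VERB" else 0
```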


Thanks to the way feature functions are defined, we can also encode features in a window around the word; for example, we can encode the previous and next POS tags in a feature function. This gives CRFs greater modelling power than HMMs, which can only look back one state due to the Markov assumption. CRF feature functions can ask global questions about the sentence.

Using a CRF with the feature functions described above, we have so far achieved a micro-averaged F1 score of 0.925 on the 17 tags. The next step is improving accuracy. We plan to experiment with the tag set, following the current state-of-the-art algorithms. A large problem we want to attack is the morphological richness of Czech, which creates problems for both embeddings and gazetteers: we frequently cannot find the specific inflected form of a word in the embedding dictionary or gazetteer, which requires applying a lemmatizer or stemmer.
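The lemmatization fallback for lookups can be sketched as follows (the gazetteer and lemma table are toy data; a real system would use a Czech lemmatizer, which is exactly the open problem described above):

```python
gazetteer = {"Brno", "Praha"}   # toy gazetteer of city names
lemmas = {"Brně": "Brno"}       # toy lemmatizer output for one inflected form

def in_gazetteer(word):
    # Try the surface form first, then fall back to the lemma, so inflected
    # Czech forms like "Brně" still hit the gazetteer entry "Brno".
    return word in gazetteer or lemmas.get(word, word) in gazetteer

print(in_gazetteer("Brně"))  # True
```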

Link to GitHub repository: