Recap of the last months, a.k.a. how we teach Alquist to talk

Petr Marek
April 8, 2017 at 5:09 pm
Roman working on Alquist

Lots of things happened during the last months. The biggest news is that Alquist will go to testing this Monday. Despite the original plan, it will be available only to Amazon employees for the next thirty days, but still, they are real users. I can’t wait to see how Alquist will perform.

Alquist has evolved a lot since the last blog post. It has progressed from simple, aimless conversations to a focused speaker. Alquist can now speak about movies, sports results, news, holidays and encyclopedic knowledge, and it will cover books and video games very soon.

How do we know which topics Alquist should learn? Amazon offered all teams the possibility to run a closed beta test. We used this opportunity, of course, as some of you might know. We decided to make the beta test more “open” because we had space for 2,000 testers. So we publicly announced the test and waited for the results. To be honest, we used only a tiny fraction of the available space. But it was still enough to learn from mistakes and to find ideas on how to improve Alquist. I would like to thank all of you who helped us. Thank you!

The public launch should happen at the beginning of May. Until then you can follow the progress of Alquist on Twitter or Facebook, where you can find some cool demo videos of Alquist in action.

Voice Controlled Smart Home

Petr Kovar
March 24, 2017 at 3:29 pm

Do you remember Iron Man’s personal assistant called J.A.R.V.I.S.? It is just fictional technology from a superhero movie, but I am getting close to it with HomeVoice. HomeVoice is designed to become your personal voice controlled assistant whose primary task is to control and secure your smart home. You can switch the lights, ask for a broad range of values (temperature, humidity, light states, etc.), manage your smart home devices and also provide HomeVoice with feedback to make it even better.

Let’s start at the beginning. My name is Petr Kovar, and I study cybernetics and robotics at CTU in Prague. I came to eClub Prague more than a year ago to participate in the development of the Household Intelligent Assistant called Phoenix. Under the supervision of Jan Sedivy I built up sufficient know-how about speech recognition, natural language understanding, speech synthesis and bots in general. A few months later I turned to Jan Sedivy again for help with the specification of my master’s thesis.

As time went on, we decided to utilize the accumulated experience for the development of a voice controlled smart home. I started with the selection of the smart home technology. I decided to use Z-Wave, the leading wireless home control technology on the market. As the controller I selected a Raspberry Pi; it runs Raspbian equipped with a Z-Wave module and the Z-Way control software.

The main task was to monitor my house by voice using a mobile device. I decided to write an Android app called HomeVoice. The app turns any Android tablet or smartphone into a smart home remote control. It works both locally and over the internet (using remote access via find.z-wave.me). Whereas other Z-Way Android apps offer only one-way communication (the tablet downloads data from the control unit on demand), HomeVoice receives push notifications informing the user as soon as the control unit discovers an alarm or something urgent. Imagine that you are at work when suddenly a fire breaks out in your home. HomeVoice informs you about it in less than 500 ms, which gives you enough time to arrange appropriate rescue actions.

HomeVoice supports custom hot-word detection (similar to “Hey, Siri” or “Ok, Google”), transcribes speech to text, understands natural language and responds using synthesized speech. Many different technologies are used to achieve this behavior, from CMUSphinx (hot-word detection), through the SpeechRecognizer API and the cloud service wit.ai (natural language understanding), to the TextToSpeech API (speech synthesis). HomeVoice interconnects all these technologies into a complex app and adds its own context processing and dialog management.

It is still quite far from Iron Man’s J.A.R.V.I.S., but I hope that someday HomeVoice will become a useful smart home assistant.

Automatic ontology learning from semi-structured data

Filip Masri
March 16, 2017 at 5:48 pm
Reconstruction of the relations in the table.

Today I am going to write about the topic of my diploma thesis, “Automatic ontology learning from semi-structured data.” I try to exploit semi-structured data such as web tables in HTML to create domain-specific ontologies.

What is an ontology?

The term ontology was defined by Thomas Gruber: “An ontology is a specification of a conceptualization. That is, an ontology is a description (like a formal specification of a program) of the concepts and relationships that can exist for an agent or a community of agents.”
Two basic building blocks of an ontology are concepts and relations. Concepts represent classes of entities, and their individual members are called instances. Relations among concepts are called semantic relations. Moreover, ontologies can be created for different domains and serve as a foundation for a knowledge database containing instances of given concepts.

What is the approach of the proposed work?

Lots of domain-specific information is presented on web pages as tabular data, for example in HTML <table> elements. However, retrieving suitable web tables from pages and reconstructing the relations among their entities consists of several subtasks.
First, we have to identify which tables are suitable for retrieval from the pages; this process is called WEB table type classification. WEB table header classification then identifies the header rows/columns. Finally, the table has to be transformed into an ontology, and that process is called table understanding.

 

WEB table type classification

There are several types of tables on the web, such as ENTITY, RELATION, MATRIX, LAYOUT and OTHER tables. A machine learning algorithm (Random Forest) classifies these types. The classifier uses several features, such as the average number of cells, the number of image and header elements, the cell length deviation, etc.

 

  • ENTITY table – referring to a single entity
  • LAYOUT table – positioning elements in the web page
  • MATRIX table – capturing more complex relations
  • RELATION table – containing more instances of the same class
  • OTHER table – tables where one is unsure about the content
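To make the classification step concrete, here is a minimal sketch with scikit-learn; the feature names, numbers and training rows are invented for illustration and the thesis implementation may use a different feature set.

```python
# A minimal sketch of table-type classification with a Random Forest
# (hypothetical features and toy data -- not the thesis training pipeline).
from sklearn.ensemble import RandomForestClassifier

# one feature vector per table:
# [avg_cell_count, n_image_elements, n_header_elements, cell_length_std]
X_train = [
    [12.0, 0, 2, 3.1],   # an ENTITY-like table
    [48.5, 5, 0, 11.7],  # a LAYOUT-like table
    [30.0, 0, 6, 2.4],   # a RELATION-like table
]
y_train = ["ENTITY", "LAYOUT", "RELATION"]

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print(clf.predict([[25.0, 1, 4, 2.9]]))  # predicted table type
```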

WEB table header classification

Once we identify the table type, we have to locate the table header. One would object that all header cells are marked with a <th> element. Unfortunately, that is not true. Thus, a classification method (again a Random Forest) was chosen to predict whether a table column/row is a HEADER or a DATA column/row. Table understanding depends a lot on correctly locating the header.

Table understanding

The final process is to mine relations among entities from the table. The relations are derived from a table annotated with header location marks. More specifically, the reconstruction of relations uses heuristic rules, resulting in a graph of entities, as shown in the following figure. MobilePhone is a class. RAM and Item Weight are properties belonging to MobilePhone, and they have a Quantitative value as their range. Finally, iOS is an instance of the OS class (operating system) and belongs to the MobilePhone class.

Reconstruction of the relations in the table.

What is the application?

This method can be applied when building domain-specific knowledge databases that should later be integrated with more general ontologies/concepts. More domain ontologies are learned by crawling sites with similar content (like mobile phones on amazon.com, gadgetsndtv.com, etc.). The derived ontologies differ in structure and content. Therefore, methods for merging the ontologies should be the next step of the project.

Part of the ontology generated by crawling gadgetsndtv.com

New projects – join us!

Jan Sedivy
March 13, 2017 at 8:03 am

eClub will again organize the Summer Camp (ESC). ESC 2016 was incredibly successful. Five eClubbers worked on the question answering system YodaQA. At the end of the summer, they entered the Amazon Alexa Prize competition and got among the twelve teams selected for the development of a social bot, receiving a $100k scholarship for it. Currently, they are busy working on the first version of Alquist.

We would like to continue in the direction of developing dialog applications in ESC 2017. The social bot task is very challenging, and many of the required technologies are still in development. While developing YodaQA, we looked at the well-known classical NLP algorithms as well as at new, mainly neural-network-based ones, such as LSTMs and GRUs, to process text in many different ways.

It is the beginning of March, but we are already prepared to incubate new, eager students interested in joining us on this journey toward smarter systems. We have strong support not only from Amazon but also from the local company Seznam. Seznam is one of the few companies competing successfully with Google on its domestic market. They are a 100% machine learning company with many problems of mutual interest.

Here is a sneak preview of this year’s projects. Join us and start working with us tomorrow, if you are interested. We are offering a scholarship equivalent to what you would earn working for a company. We are also moving to the new CVUT building; it is gorgeous, and we will move in as soon as it opens, which is just a matter of weeks. Join us! You can start any time.

Automatic email reply generation
In this project, we want to research methods for the automatic generation of short responses to emails or social network messages. On a cell phone especially, it can be a great advantage to choose from a selection of semantically diverse replies. We first want to cover messages a few words long. The initial steps will include a review of recurrent neural network architectures and the construction of a meaningful training set.

Amazon Echo conversational application
We have a set of tasks we would like to cover as spoken dialogs. We want to design interactive conversational bots for Amazon Echo. We want the application to be engaging, entertaining and informative, bringing the user the latest news from specific areas, such as sports, celebrities, movies, etc. This project is suitable for students who are just entering the field with little or no experience.

Knowledge extraction
There is a vast amount of information on the Internet. A lot of it is in the form of unstructured text. In this project, we want to review the methods for retrieving and extracting this information and learning the dependencies of statements in the texts. We want to create ontologies from selected, limited content and store the knowledge for further use. These are very challenging problems, but do not hesitate to join: we have students currently working on these topics, and we know what the first steps are.

Text summarization
Journalists write Internet news in a particular language, frequently using idioms, slang or infrequent expressions. In this project, we want to extract what is important and create a summary in clear language. Initially, we want to summarize long sentences. Next, we will select a suitable method, implement it and test it on a chosen domain.

Events Extraction from Text
This project is an extension of the previous one. We want to design and implement a system for extracting events from the Internet. The primary goal is selecting news messages based on identified topics (or events). Extraction of economic events like mergers & acquisitions, stock splits, dividend announcements, etc. plays an essential role in decision making, in risk analysis applications and in monitoring systems.

Young Transatlantic Innovation Leaders Initiative

Radka Lamková
January 19, 2017 at 4:44 pm

Emerging entrepreneurs and innovative community leaders can now apply for the 2017 Young Transatlantic Innovation Leaders Initiative (YTILI). Sponsored by the U.S. Department of State, it will give 100 outstanding leaders from across Europe the chance to expand their leadership and entrepreneurial experience through fellowships at businesses and civil society organizations across the U.S. in 2017. Selected Fellows will build networks and lasting partnerships to attract investments and support for their entrepreneurial ventures.

Applications are due by February 6, 2017.  You can find more information on the YTILI  Fellowship eligibility criteria and the application instructions at: ytili.state.gov.

In addition to the YTILI Fellowship Program, the YTILI Network will offer ongoing professional and networking opportunities to anyone interested in being connected.  As part of the YTILI Network, you’ll have the chance to connect with senior business and NGO leaders in the United States and Europe working to create change in their communities.

Copied from ytili

We were selected to The Alexa Prize

Radka Lamková
January 17, 2017 at 10:26 am

The Alexa Prize is a competition organized by Amazon for student teams. The goal is to create a chatbot for Amazon Alexa which is able to communicate with a human about any topic, like politics, sports, news, movies and so on. The grand challenge is to engage a human for twenty minutes. Our team from eClub Prague is participating!

Alquist the stupid

I was working in the student research incubator eClub during the summer. My job was to create a system which allows other developers to write chatbots easily. The system was intended for a leading company of the Czech internet, which wanted to use chatbots as assistants for internet shopping. So Alquist the dialogue manager (1.0) was born. However, the company lost interest because of some internal changes. You can try the demo here (the loading takes some time because we use a slow server offered for free). This was the beginning of the events that followed.

Alquist 2.0 the smart


We discovered the Alexa Prize during October. We decided to form a team and enroll. The unthinkable happened: we were selected as one of the twelve teams out of a hundred registered teams across the globe. It happened probably thanks to our experience with chatbots (Alquist the dialogue manager 1.0) and with question answering (YodaQA). Our team consists of leader Jan Pichl, Jakub Konrád, Long Hoang Nguyen (a.k.a. Roman), Martin Matulík and our faculty advisor Jan Šedivý (and me of course). I am still amazed that we were selected together with teams from universities such as Berkeley, Princeton or Carnegie Mellon.

We received free access to Amazon AWS, a few Alexa devices and some money to help us develop the best chatbot. We decided to call our chatbot Alquist again. It is the same name we used for the system I was developing during the summer, but it would be a pity to reserve it for a relatively stupid (in terms of AI) system. Alquist is the name of a character from Karel Čapek’s R.U.R. This play, written by one of the best Czech authors, carries great symbolism: the word “robot” was used there for the first time. And Alquist sounds good too.


We have read many scientific papers on the topic of intelligent chatbots. Our current approach is to combine neural networks with a little bit of RDF databases and some predefined rules. I can’t be more precise now. The working prototypes of all teams will be available at the beginning of April, and the end of the competition is planned for November.

I am really thrilled because we are working on cutting-edge technology which can have a huge impact and can extend the abilities of humankind. It is nothing like what I experienced during game development. I believe that this competition can bring us great conversational AI, whichever team creates it. That would be amazing!

Copied from petr-marek.com/blog

YodaQA on Docker

Martin Matulík
October 13, 2016 at 11:35 am


YodaQA is a question answering system started by Petr Baudis and currently being developed in eClub. YodaQA is quite a complex set of NLU and ML components. These include datasets to look up potential answers, parsers and other text processing tools to analyse a question, web front-ends, etc. All components are implemented as web services which can be used separately as well as in the main QA pipeline. The main pipeline is built on top of Apache UIMA, which supports communication between the components. All components also offer REST APIs so they can be used in other projects. My task was to make the components easy to use and portable. I decided to use the Docker deployment platform.

What is Docker?

Docker is a platform that allows applications to run in sandboxed environments (called containers) hosted in an operating system such as Linux. The advantage of containers over more commonly used virtual machines is their efficiency and low overhead. Another Docker advantage is the simplicity of creating, configuring and deploying the applications inside the containers; it can be done easily with a few commands. The Docker Compose tool even allows the user to easily run multiple applications communicating among themselves. The containers are created from blueprints called images. Docker users share images and can improve existing ones.

Deploying to Docker

To run a container, you need an image. Each image is derived from a base image; e.g., the image of an application written in Python will be based on the Python image. The base images are available from the Docker repository.

The images are created using scripts called Dockerfiles.


The Dockerfile is a sequence of commands which first sets up the OS environment of a container. This includes obtaining the base image (using the FROM keyword), putting everything required for running the application into the image (like the code itself, dependencies or a basic shell for manual control, using the ADD keyword) and opening a communication port for the APIs (using the EXPOSE keyword).

Once the Dockerfile is completed or obtained, the image can be built with the docker build command. The container can then be run using the docker run command.
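For illustration, a Dockerfile for a small Python web service could look roughly like the sketch below. It is a hypothetical example, not one of the actual YodaQA Dockerfiles; the file names and port are made up.

```dockerfile
# Hypothetical Dockerfile for a small Python web service exposing a REST API.

# obtain the base image from the Docker repository
FROM python:3

# put the application code and its dependencies into the image
ADD . /app
WORKDIR /app
RUN pip install -r requirements.txt

# open a communication port for the API
EXPOSE 5000
CMD ["python", "server.py"]
```

The image would then be built with docker build -t demo-service . and a container started with docker run -p 5000:5000 demo-service.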

YodaQA Docker cloud

The address of the server is cloud.ailao.eu. Currently, YodaQA is composed of these Docker components:

  • Live demo – the main question answering client, available on port 4567.
  • Other versions of the demo – two more versions of YodaQA: the first answers questions related to movies (port 4000), the other is able to answer questions in Czech.
  • Datasets – DBpedia (port 3037) and Wikipedia (port 8983) data dumps.
  • Labels – this component allows the user to link a query string to a DBpedia entity (ports 5000, 5001).
  • Czech parser – a parser built on Google’s SyntaxNet which is able to assign part-of-speech tags to Czech words (port 4571).
  • Javadoc – the complete documentation generated for the main YodaQA client (port 13880).

Documentation

The YodaQA Javadoc documentation is on the server. More detailed documentation is in the wiki pages available at http://3c.felk.cvut.cz/dokuwiki/doku.php?id=yodaqa. They cover all parts of the question answering pipeline as well as the setup and function of various external components.

Knowledge base completion

Michal Pokorný
October 7, 2016 at 11:01 am

This blog post is mirrored on eClub’s blog from my personal homepage.

My name is Michal and I came to eClub Prague to work on an awesome master’s thesis. I am interested in AI and ML applications and I sought the mentorship of Petr Baudiš aka Pasky. The project we settled on for me is researching knowledge base completion.

You may already know something about knowledge bases. Knowledge bases, also known as knowledge graphs, are basically knowledge represented as graphs: vertices are entities (e.g., Patricia Churchland, neurophilosophy, University of Oxford) and edges are relations between entities (e.g., Patricia Churchland studied at University of Oxford).

A small neighbourhood of Patricia Churchland in a hypothetical knowledge base

Knowledge graphs are useful[citation needed]. The most famous knowledge graph is Google’s eponymous Knowledge Graph. It’s used whenever you ask Google a question like “Which school did Patricia Churchland go to?”, or maybe “patricia churchland alma mater” – the question is parsed into a query on the graph: “Find a node named Patricia Churchland, find all edges going from that node labeled studied at, and print the labels of nodes they point at.” Look:

How Google uses knowledge bases
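In code, that lookup is nothing mysterious. Here is a toy sketch with made-up triples (obviously not Google’s actual infrastructure), just to show what a query on the graph amounts to:

```python
# A toy knowledge graph as (subject, relation, object) triples and the lookup
# that a question like "patricia churchland alma mater" maps to.
triples = {
    ("Patricia Churchland", "studied at", "University of Oxford"),
    ("Patricia Churchland", "field of work", "neurophilosophy"),
}

def query(subject, relation):
    return [o for (s, r, o) in triples if s == subject and r == relation]

print(query("Patricia Churchland", "studied at"))
# ['University of Oxford']
```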

Google’s Knowledge Graph is based on Freebase (wiki), which is now frozen, but its data is still publicly available. Other knowledge bases include DBpedia, which is created by automatically parsing Wikipedia articles, and Wikidata, which is maintained by an army of volunteers with too much free time on their hands.

Because knowledge graphs are useful, we would like them to contain all true facts about the world, but the world is big[citation needed]. And since we want to represent everything in knowledge graphs, we’d need a really big number of editors. Real-life knowledge graphs miss a lot of facts. Persons might not have their nationalities assigned, cities might be missing their population numbers, and so on.

Someone had the bright idea that maybe we could replace some of the editing work by AI, and indeed, so we can! When we try to add missing true facts into a partially populated knowledge base, we are doing knowledge base completion.

Researchers have thrown many wildly different ideas at the problem, and some of them stuck. For example:

  • Extracting relations from unstructured text. This means I take a knowledge base, some piece of text (I’m using Wikipedia articles), and I try to fill the gaps using the text.
    The canonical approach is to train a classifier for each relation type, for example the relation actor played in movie. The classifier takes a sentence as its input, and outputs a number between 0 and 1, which represents the classifier’s estimate on the likelihood this sentence represents the relation. So, on sentences like University of Oxford is in Oxford, UK. or Marie Curie died of radium poisoning., we’d like to see low scores. On the other hand, sentences like Arnold Schwarzenegger played in the movie Terminator. should score close to 1.
    A problem with this approach is that we’d need a really large training set to train our classifier well, and who has the time to label 10k+ sentences these days? Fortunately, we can use a neat trick called distant supervision. Interested? Read about it in this paper.
  • Mining for graph patterns. In real life, we know that if Peter is the father of John and Kate is John’s mother, it’s pretty likely that Peter and Kate might be married. So, if our knowledge base contains the facts Peter is the father of John and Kate is the mother of John, then if the knowledge base doesn’t say that Peter is married to Kate, we could expect that to be an error of omission. On the other hand, if we add the fact that Peter is married to Marie, that counts as evidence against Peter being married to Kate.
    There are algorithms that look for such patterns (and more complex ones) in the incomplete knowledge graph and then use these patterns on the same graph to assign likelihoods to missing relations. One is called PRA (Path Ranking Algorithm), another one SFE (Subgraph Feature Extraction). Matt Gardner has an implementation of both.
  • Embeddings. This means that we invent a space with, say, 50 dimensions, and somehow represent entities and relations within that space. We choose this representation so that the embedding then informs us about which relations might be true, but missing in the knowledge graph.
    For example, say we represent people as points in a 2D plane, and we represent the is mother of relation as a step to the right and up, and the is father of relation as a step to the right and down. Of course, you can’t really represent all the complexity of family relationships this way, but if you tried to get as close as you could, you’d end up with a figure where two people who are siblings would probably end up close together. Behold: the embedding according to is mother of and is father of just told us something about who is a sibling of whom! (A toy sketch of this picture follows the list.)
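Here is that 2D family-relations picture as a tiny numpy sketch, with relations as translation vectors in the spirit of TransE. The coordinates are invented and this is only an illustration of the idea, not the embedding method used in the thesis.

```python
# Toy embedding: entities are 2D points, relations are translation steps.
import numpy as np

entity = {
    "Peter": np.array([0.0,  0.0]),
    "Kate":  np.array([0.0, -2.0]),
    "John":  np.array([1.0, -1.0]),
}
relation = {
    "is_father_of": np.array([1.0, -1.0]),  # a step right and down
    "is_mother_of": np.array([1.0,  1.0]),  # a step right and up
}

def implausibility(subj, rel, obj):
    """Smaller value = the triple fits the embedding better."""
    return np.linalg.norm(entity[subj] + relation[rel] - entity[obj])

print(implausibility("Peter", "is_father_of", "John"))  # 0.0  -> plausible
print(implausibility("John", "is_father_of", "Peter"))  # ~2.8 -> implausible
```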

In my thesis, I plan to replicate the architecture of Google’s Knowledge Vault (paper, wiki).

Google created Knowledge Vault, because they wanted to build a knowledge graph even bigger than The Knowledge Graph.

The construction of Knowledge Vault takes Knowledge Graph as its input, and uses three different algorithms to infer probabilities for new relations. One of these algorithms extracts new relations from webpages. The second one uses PRA to predict new edges from graph patterns. The third one learns an embedding of Knowledge Graph and predicts new relations from this embedding.

Each of these different approaches yields its own probability estimates for new facts. The final step is training a new classifier that takes these estimates and merges them into one unified probability estimate.
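A toy sketch of that fusion step might look like this; the numbers are invented, and a plain logistic regression stands in for whatever classifier Knowledge Vault actually uses.

```python
# Merge per-algorithm probability estimates for candidate triples into one score.
from sklearn.linear_model import LogisticRegression

# columns: [p_text_extraction, p_graph_patterns, p_embedding]
X_train = [
    [0.9, 0.7, 0.8],
    [0.2, 0.1, 0.3],
    [0.6, 0.4, 0.7],
    [0.1, 0.2, 0.1],
]
y_train = [1, 0, 1, 0]  # was the candidate triple actually true?

fusion = LogisticRegression().fit(X_train, y_train)
print(fusion.predict_proba([[0.5, 0.6, 0.4]])[:, 1])  # unified probability estimate
```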

Finally, you take all the predicted relations and their probability estimates, you store them, and you have your own Knowledge Vault. Unlike the input knowledge graph, this output is probabilistic: for each Subject, Relation, Object triple, we also store the estimated probability of that triple being true. The output is much larger than the input graph, because it needs to store many edges that weren’t in the original knowledge graph.

Why is this useful? Because the individual algorithms (extraction from text, graph pattern mining and embeddings) have complementary strengths and weaknesses, so combining them gets you a system that can predict more facts.

Simplified Knowledge Vault architecture

My system is open-source and extends Wikidata. The repository is at https://github.com/MichalPokorny/master.

So far, I have been setting up my infrastructure. A week back, I finally got the first version of my pipeline, with the stupidest algorithm and the smallest data I could use, running and predicting end-to-end!

The system generated 116 predictions with an estimated probability higher than 0.5. Samples include:

Subject | Relation | Object | Probability | Is this fact true?
Northumberland | occupation | Film director | 0.5913 | False
mathematician | occupation | Film director | 0.6201 | False
Jacob Zuma | member of political party | Zulu people | 0.5159 | False
Mehmed VI | country of citizenship | Ottoman Empire | 0.5479 | True
Brian Baker | country of citizenship | Australia | 0.5523 | False
swimming | occupation | Film director | 0.6229 | False
West Virginia | member of political party | Tennessee | 0.5107 | False
Tamim bin Hamad Al Thani | country of citizenship | Princely Family of Liechtenstein | 0.5289 | False
Sheldon Whitehouse | member of political party | Washington, D.C. | 0.5086 | False
Liberation Tigers of Tamil Eelam | country of citizenship | United States | 0.5349 | False
Italian Communist Party | country of citizenship | United States | 0.5545 | False
Henri Matisse | country of citizenship | French | 0.5004 | False
Lawrence Ferlinghetti | country of citizenship | United States | 0.5471 | True
Michael Andersson | occupation | Film director | 0.6283 | False
Brian Baker | country of citizenship | Canada | 0.5187 | False
John E. Sweeney | member of political party | Republican Party | 0.5036 | True

Okay sooo… not super impressive, but pretty good for a first shot. At least it does a bit better than rolling a bunch of dice :)

It’s basically a logistic regression over a bag of words. The dataset is 10,000 Wikipedia articles about persons. My task now is to get better results by using smarter algorithms, adding the other approaches (graph patterns and embeddings), running over a larger dataset and fleshing out the architecture.
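In spirit, the baseline is no more complicated than this sketch; the toy sentences and labels are invented for illustration, while the real pipeline trains one such model per relation over the Wikipedia articles.

```python
# Bag-of-words logistic regression for one relation, e.g. country_of_citizenship.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

sentences = [
    "Lawrence Ferlinghetti is an American poet living in the United States.",
    "Mehmed VI was the last sultan of the Ottoman Empire.",
    "Northumberland is a county in North East England.",
    "Swimming has been part of the modern Olympic Games since 1896.",
]
labels = [1, 1, 0, 0]  # does the sentence express the relation for its subject?

model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(sentences, labels)
print(model.predict_proba(["Brian Baker is a citizen of Australia."])[:, 1])
```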

I’d enjoy talking at length about the various design choices and cool tools I’m using, but I was told 1 A4 page would be quite enough, so I’ll cut my proselytization short just about now.

Let’s get back to work. Have a fine day, and may your values be optimally satisfied!

Named Entity Recognition

Long Hoang Nguyen
September 26, 2016 at 3:49 pm
wordcloud

I am in the group of people designing the conversational bot application Alquist in the eClub Summer Camp 2016. Intelligent bots need to understand what people are asking. The two basic problems are recognizing the intent and the entities in a user’s query.

My task is creating a Named Entity Recognition (NER) system for Czech. Every machine learning algorithm needs to be trained, so we had to find a suitable dataset. Luckily, the Institute of Formal and Applied Linguistics created one called CNEC3.0 (http://ufal.mff.cuni.cz/cnec/cnec2.0). It consists of almost 9,000 sentences and a two-level hierarchy of 46 named entity types. There are also embedded named entities (a person would have a first name and a last name) and special tags (for foreign or misspelled words). For comparison, the English CoNLL-2003 dataset (http://www.cnts.ua.ac.be/conll2003/ner/) has only 4 categories but considerably more data. Our task is thus more difficult.

To measure improvement during training we need to choose a metric. There are several performance metrics, and unfortunately, each published paper uses a different one. They measure per-entity performance (ignoring embedded named entities), structural performance (we compare both the outside tag and the embedded tags) or per-word performance (the IBO inside-outside-beginning scheme, https://en.wikipedia.org/wiki/Inside_Outside_Beginning). The latest paper reported an F1 score (https://en.wikipedia.org/wiki/F1_score) of 0.84 with all 46 tags (it is not clear whether they considered embedded tags or not), using a maximum entropy classifier.

We decided to first use IBO for multiword named entities and to use only the 8 supertypes instead of all 46 entity types. This results in 8*2 + 1 (outside) = 17 possible tags. We would then try two different approaches, the first being Conditional Random Fields (CRF) with hand-picked feature functions (extracted from research papers).

Here is an example of an embedded named entity, where A is a special container tag; it is not a named entity by itself:

<A<ic VYSOKÉ UČENÍ TECHNICKÉ V <gu BRNĚ>>, <gs Antonínská> <ah 548/1>, <gu Brno> <az 601 90>>

Using the structural performance metric, we would have to identify the container and all the inner tags too. This is very difficult for cases like BRNĚ, which is both a city by itself and a part of another named entity. It is also difficult to train a NER system like that, as it requires chunking instead of going word by word.

Per-entity metric ignores the embedded tags and the container:

<ic VYSOKÉ> <ic UČENÍ> <ic TECHNICKÉ> <ic V> <ic BRNĚ> <O ,> <gs Antonínská> <ah 548/1> <O ,> <gu Brno> <az 601> <az 90>

Per-word performance uses IBO instead:

<ic_b VYSOKÉ> <ic_i UČENÍ> <ic_i TECHNICKÉ>  <ic_i V>  <ic_i BRNĚ> <O ,>  <gs_b Antonínská> <ah_b 548/1> <O ,> <gu_b Brno> <az_b 601> <az_i 90>

Per-word performance with supertags (what we use):

<i_b VYSOKÉ> <i_i UČENÍ> <i_i TECHNICKÉ>  <i_i V>  <i_i BRNĚ> <O ,>  <g_b Antonínská> <a_b 548/1> <O ,> <g_b Brno> <a_b 601> <a_i 90>

CRFs try to model the conditional probability p(Y|X) (where X is the observed sequence and Y is a sequence of labels) as a weighted sum of (usually boolean) feature functions, divided by a normalizer. Training the CRF sets the weight for every feature function.
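Written out in the standard linear-chain CRF notation (a textbook formulation, independent of our particular implementation), that is:

\[
p(Y \mid X) \;=\; \frac{1}{Z(X)} \exp\!\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, X, t) \right),
\qquad
Z(X) \;=\; \sum_{Y'} \exp\!\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, X, t) \right)
\]

where the f_k are the feature functions, the λ_k are the weights learned during training, and Z(X) is the normalizer summing over all possible label sequences Y′.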

The common features for NER are:

  • morphological attributes (is_capitalised, suffixes, contains_numbers, etc.) – named entities are often capitalised
  • POS tags – We can sometimes deduce the named entity given the sentence structure
  • Bag of words
  • Gazetteers – usually a list of named entities, geolocation/names/organisations
  • Brown Clusters – Brown clustering https://en.wikipedia.org/wiki/Brown_clustering is a hierarchical clustering scheme that begins with each word being its own cluster; the clusters which cause the smallest loss in global mutual information are then merged. This can be seen as a binary tree or a sequence of merges, and it results in a binary string for every word (cat could have 1010, dog 1011). We can take prefixes of the code to get different granularities, in other words, cut the hierarchical clustering at different levels (a toy illustration follows this list). We used Brown clusters because word embeddings are very difficult to encode in a CRF. We computed the clusters on a dump of the Czech Wikipedia.
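The prefix idea in two lines of Python; the bit strings below are invented, whereas real codes come from clustering a corpus (here, the Czech Wikipedia dump).

```python
# Toy illustration of Brown-cluster bit strings and prefix granularities.
codes = {"cat": "1010", "dog": "1011", "Praha": "0110", "Brno": "0111"}
for word, code in codes.items():
    # a short prefix = coarse cluster, a longer prefix = finer cluster
    print(word, "coarse:", code[:2], "fine:", code[:4])
```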

Examples of feature functions:

  • f(sentence, index, current_word, previous_word) = 1 iff current_word = York and previous_word = New
  • f(sentence, index, current_word, previous_word) = 1 iff is_capitalised(current_word)
  • f(sentence, index, current_word, previous_word) = 1 iff is_NOUN(current_word) and is_VERB(previous_word)
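A minimal, self-contained sketch of this setup with the sklearn-crfsuite library is shown below. The feature set and the one-sentence training data are invented for illustration; the real CZ_NER features are richer (POS tags, gazetteers, Brown-cluster prefixes, and so on).

```python
# Sketch of a CRF with hand-written per-token features (sklearn-crfsuite).
import sklearn_crfsuite

def word2features(sent, i):
    word = sent[i]
    feats = {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),             # capitalisation cue
        "word.suffix3": word[-3:],                  # simple morphological suffix
        "word.has_digit": any(c.isdigit() for c in word),
    }
    if i > 0:                                       # window feature: previous word
        feats["prev.lower"] = sent[i - 1].lower()
    else:
        feats["BOS"] = True
    return feats

def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

# One toy sentence with IBO supertype tags (i = institution, g = geography)
X_train = [["VYSOKÉ", "UČENÍ", "TECHNICKÉ", "V", "BRNĚ", ",", "Brno"]]
y_train = [["i_b", "i_i", "i_i", "i_i", "i_i", "O", "g_b"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit([sent2features(s) for s in X_train], y_train)
print(crf.predict([sent2features(["Univerzita", "Karlova"])]))
```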

 

Thanks to the way feature functions are defined, we can also encode features in a window around the word; for example, we can encode the previous and next POS tags in a feature function. This gives CRFs greater modelling power than HMMs, which can only look back one state due to the Markov assumption. The CRF feature functions can ask global questions about the sentence.

Using a CRF with the described feature functions, we have so far achieved a micro-averaged F1 score of 0.925 for the 17 tags. The next step is improving the accuracy. We plan to experiment with the tags based on the current state-of-the-art algorithms. A large problem we want to attack is how to deal with the morphological richness of Czech, which causes trouble for both embeddings and gazetteers: we frequently cannot find the specific form of a word in the embedding dictionary or gazetteer, which requires applying a lemmatizer or stemmer.

Link to GitHub repository:
https://github.com/nguyeho7/CZ_NER

Alquist dialogue manager

Petr Marek
September 25, 2016 at 8:25 am

Intelligent chatbots are hot, and we decided to design our own during the eClub Summer Camp 2016. Bots streamline the user’s interaction into a messenger type of application, creating the same UI for many apps. They relieve users from the maze of IDs and passwords and from downloading an app for each service. Users just log in to their preferred messaging app, load the bot’s profile and start talking to it. Bots can also be huge money savers for companies: they don’t have to build expensive apps for their business, they just integrate a bot into a messaging service and communicate and offer their services that way.

 

Our goal is to simplify bot development. With our framework, a developer does not have to write the whole bot from scratch each time. The Alquist dialogue manager (ADM) is our solution. The design does not provide an interactive UI; instead, the Alquist framework relies on standards such as YAML and JSON, leaving large space for customization. We also expect that the NLU models will be developed from larger data sets, for which such standards are much more suitable. The initial assumption is also to run the Alquist DM on premise. I will explain why it is special and why you should care.

 

The first important feature is versatility. The dialogue is defined by nodes in a YAML file. The ADM executes the dialogue by walking through the nodes and jumping between them as the bot’s creator defines. In each dialogue node the developer has a choice of procedures: showing a text, NLU processing, saving data into the context (that is what we call the bot’s memory), comparing data from the context, showing predefined answers as buttons, etc. You can mix all of these nodes or add some of your own, creating a unique bot.
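Purely as an illustration of the node-and-jump idea, a dialogue definition could look something like the snippet below. The node names, types and keys here are hypothetical; they do not reproduce the actual Alquist YAML schema.

```yaml
# Hypothetical dialogue sketch -- not the real Alquist YAML format.
greet:
  type: text                 # show a message to the user
  message: "Hi! Do you want the weather or the news?"
  next: ask_topic

ask_topic:
  type: nlu                  # run intent/entity recognition on the reply
  transitions:
    weather: show_weather    # jump depending on the recognized intent
    news: show_news

show_weather:
  type: text
  message: "It looks sunny today."
  next: ask_topic
```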

 

We also implemented the capability to change intent during the dialogue. What does this mean? If you have a bot talking about weather and news, you can switch between these two topics anytime. During the news conversation, you can change the topic to weather with a single sentence carrying the “weather” intent. The bot’s creator can define how an intent triggered in a given state will be processed to make the step to the next node. This feature is great for more advanced projects.

 

In order for the bot to understand what you are saying, it uses natural language understanding (NLU). We currently use Alquist’s NLU implemented with Wit.ai. However, this is only a temporary solution; we are developing our own tools, mainly in order to support other languages.

 

We are constantly developing, improving and adding new features to Alquist. Alquist is an open source project. You can view all of Alquist’s code and documentation on GitHub: https://github.com/AlquistManager. Don’t forget to try the actual demo at https://alquistmanager.github.io/alquist-client/?e=https://alquist.herokuapp.com.

 

By the way, Alquist is named after a character from R.U.R. by Karel Čapek.