Automatic ontology learning from semi-structured data

Filip Masri March 16, 2017 at 5:48 pm

Today I am going to write about the topic of my diploma thesis, “Automatic ontology learning from semi-structured data.” I try to exploit semi-structured data such as web tables (HTML <table> elements) to create domain-specific ontologies.

What is an ontology?

Thomas Gruber defined the term ontology as follows: “An ontology is a specification of a conceptualization. That is, an ontology is a description (like a formal specification of a program) of the concepts and relationships that can exist for an agent or a community of agents.”
Two basic building blocks of an ontology are concepts and relations. Concepts represent classes of entities, and their individual members are called instances. Relations among concepts are called semantic relations. Moreover, ontologies can be created for different domains and serve as a foundation for a knowledge database containing instances of given concepts.
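
To make these building blocks concrete, here is a minimal Python sketch of an ontology as a set of concepts, their instances, and semantic relations. The class names, the method names, and the relation name "hasOS" are illustrative assumptions, not part of the thesis.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """A class of entities, e.g. MobilePhone; its members are instances."""
    name: str
    instances: set[str] = field(default_factory=set)


@dataclass
class Ontology:
    concepts: dict[str, Concept] = field(default_factory=dict)
    # Semantic relations stored as (source concept, relation name, target concept).
    relations: set[tuple[str, str, str]] = field(default_factory=set)

    def add_concept(self, name: str) -> Concept:
        # Create the concept on first use, otherwise return the existing one.
        return self.concepts.setdefault(name, Concept(name))

    def add_instance(self, concept: str, instance: str) -> None:
        self.add_concept(concept).instances.add(instance)

    def add_relation(self, source: str, name: str, target: str) -> None:
        self.add_concept(source)
        self.add_concept(target)
        self.relations.add((source, name, target))


onto = Ontology()
onto.add_relation("MobilePhone", "hasOS", "OS")  # "hasOS" is a hypothetical relation name
onto.add_instance("OS", "iOS")
```

A knowledge database built on top of such a structure would then mostly add instances to concepts that already exist.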

What is the approach of the proposed work?

A lot of domain-specific information is presented on web pages as tabular data, for example in HTML <table> elements. However, retrieving suitable web tables from pages and reconstructing the relations among their entities involves several subtasks.
First, we have to identify which tables on a page are suitable for retrieval; this step is called WEB table type classification. Next, WEB table header classification identifies the header rows and columns. Finally, the table has to be transformed into an ontology, a process called Table understanding.

WEB table type classification

There are several types of tables on the web: ENTITY, RELATION, MATRIX, LAYOUT and OTHER tables. A machine learning classifier (a Random Forest) assigns each table to one of these types. The classifier uses features such as the average number of cells, the number of image elements, the number of header elements, and the deviation of cell length.
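
As a rough illustration of what such features could look like, the following sketch computes a few of them for a single, non-nested table using only the Python standard library. The feature names, and the reading of "average number of cells" as cells per row, are assumptions; the exact feature set used in the thesis is not reproduced here.

```python
from html.parser import HTMLParser
from statistics import pstdev


class TableFeatureParser(HTMLParser):
    """Collects simple counts from the markup of one (non-nested) table."""

    def __init__(self):
        super().__init__()
        self.rows = 0
        self.header_cells = 0
        self.images = 0
        self.cell_texts = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows += 1
        elif tag in ("td", "th"):
            self._in_cell = True
            self.cell_texts.append("")
            if tag == "th":
                self.header_cells += 1
        elif tag == "img":
            self.images += 1

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current cell.
        if self._in_cell:
            self.cell_texts[-1] += data


def table_features(table_html):
    parser = TableFeatureParser()
    parser.feed(table_html)
    parser.close()
    lengths = [len(text.strip()) for text in parser.cell_texts]
    return {
        "avg_cells_per_row": len(lengths) / parser.rows if parser.rows else 0.0,
        "image_elements": parser.images,
        "header_elements": parser.header_cells,
        "cell_length_deviation": pstdev(lengths) if lengths else 0.0,
    }


# Made-up sample table, for illustration only.
sample = (
    "<table><tr><th>Model</th><th>RAM</th></tr>"
    "<tr><td>iPhone 7</td><td>2 GB</td></tr></table>"
)
features = table_features(sample)
```

Feature dictionaries like this one, computed for a set of labelled tables, could then be turned into vectors and passed to a random forest implementation such as scikit-learn's RandomForestClassifier.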

ENTITY TABLE - referring to a single entity

LAYOUT TABLE - positioning elements in the web page

MATRIX TABLE - capturing more complex relations

RELATION TABLE - listing several instances of the same class

OTHER TABLE – tables where one is unsure about the content

WEB table header classification

Once we identify the table type, we have to locate the table header. One might object that all header cells are marked with a <th> element. Unfortunately, that is not true in practice. Thus, a classification method (again a Random Forest) was chosen to predict whether a given table row or column is a HEADER or a DATA row or column. Table understanding depends heavily on locating the header correctly.
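
To see why the <th> element alone is not enough, consider the sketch below. It computes a few per-row features for a table whose header is written with plain <td> cells. The features (share of <th> cells, share of cells containing digits, average cell length) and the sample values are assumptions chosen for illustration, not the feature set from the thesis.

```python
def row_features(row):
    """row is a list of (tag, text) pairs, e.g. [("td", "RAM"), ("td", "2 GB")]."""
    texts = [text.strip() for _, text in row]
    n = len(row) or 1  # avoid division by zero for empty rows
    return {
        "th_ratio": sum(tag == "th" for tag, _ in row) / n,
        "numeric_ratio": sum(any(ch.isdigit() for ch in t) for t in texts) / n,
        "avg_length": sum(len(t) for t in texts) / n,
    }


# A table whose header row uses <td> instead of <th> (made-up sample values).
rows = [
    [("td", "Model"), ("td", "RAM"), ("td", "Item Weight")],
    [("td", "iPhone 7"), ("td", "2 GB"), ("td", "138 g")],
]
feats = [row_features(row) for row in rows]
```

Here th_ratio is 0 for both rows, so the markup gives no hint, while numeric_ratio still separates the header row from the data row. A classifier trained on several such signals can pick up patterns that a single rule would miss.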

Table understanding

The final step is to mine the relations among entities from the table. The relations are derived from a table annotated with header location marks. More specifically, the reconstruction uses heuristic rules and results in a graph of entities, as shown in the following figure. MobilePhone is a class. RAM and Item Weight are properties of MobilePhone, and they have a Quantitative value as their range. Finally, iOS is an instance of the OS (operating system) class and belongs to the MobilePhone class.

Reconstruction of the relations in the table.
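
The kind of heuristic rules involved can be sketched as follows. This is a simplified illustration, not the rule set from the thesis: it assumes a RELATION table with one header row, assumes the first column names the instance, takes the class name as given, and uses hypothetical predicate names ("hasProperty", "instanceOf", "range").

```python
import re

# Cells such as "2 GB" or "138 g" are treated as quantitative values (assumption).
QUANTITY = re.compile(r"^\s*\d+(\.\d+)?\s*\w*\s*$")


def table_to_triples(class_name, header, rows):
    """Turn a header-annotated relation table into (subject, predicate, object) triples."""
    triples = set()
    for column in header[1:]:
        triples.add((class_name, "hasProperty", column))
    for row in rows:
        subject = row[0]  # assumed: the first column identifies the instance
        triples.add((subject, "instanceOf", class_name))
        for column, cell in zip(header[1:], row[1:]):
            if QUANTITY.match(cell):
                # Numeric cell: the property ranges over quantitative values.
                triples.add((column, "range", "QuantitativeValue"))
            else:
                # Textual cell: treat it as an instance of a class named after the column.
                triples.add((cell, "instanceOf", column))
            triples.add((subject, column, cell))
    return triples


# Made-up sample row, for illustration only.
header = ["Model", "RAM", "Item Weight", "OS"]
rows = [["iPhone 7", "2 GB", "138 g", "iOS"]]
triples = table_to_triples("MobilePhone", header, rows)
```

With this input, RAM and Item Weight get a quantitative range, while iOS becomes an instance of OS, which mirrors the structure described above.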

What is the application?

This method can be applied when building domain-specific knowledge databases that should later be integrated with more general ontologies/concepts. Several domain ontologies are learned by crawling sites with similar content (such as mobile phones on amazon.com, gadgetsndtv.com, etc.). The derived ontologies differ in structure and content. Therefore, methods for merging the ontologies should be the next step in the project.

Part of the ontology generated by crawling gadgetsndtv.com
