Today I am writing about the topic of my diploma thesis, “Automatic ontology learning from semi-structured data.” The thesis exploits semi-structured data, such as HTML tables on the web, to create domain-specific ontologies.
What is an ontology?
What is the approach of the proposed work?
Web table type classification
Tables on the web come in several types: ENTITY, RELATION, MATRIX, LAYOUT, and OTHER. A machine learning algorithm (Random Forest) assigns each table to one of these types. The classifier uses features such as the average number of cells, image elements, header elements, and cell length deviation.
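To make the feature list above more concrete, here is a minimal sketch of how such features might be computed from the markup of a single table, using only the Python standard library. The names `TableFeatureParser` and `table_features` are hypothetical, and reading "average number of cells" as cells per row is my assumption here, not necessarily the definition used in the thesis.

```python
from html.parser import HTMLParser
from statistics import mean, pstdev


class TableFeatureParser(HTMLParser):
    """Collects simple counts from the markup of one HTML table."""

    def __init__(self):
        super().__init__()
        self.rows = []          # one list of cell texts per <tr>
        self.image_count = 0    # number of <img> elements
        self.header_count = 0   # number of <th> elements
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            if not self.rows:   # tolerate a cell without an enclosing <tr>
                self.rows.append([])
            self.rows[-1].append("")
            self._in_cell = True
            if tag == "th":
                self.header_count += 1
        elif tag == "img":
            self.image_count += 1

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


def table_features(table_html):
    """Return a feature dictionary for one table (assumed feature definitions)."""
    parser = TableFeatureParser()
    parser.feed(table_html)
    parser.close()
    rows = [row for row in parser.rows if row]
    lengths = [len(cell) for row in rows for cell in row]
    return {
        "avg_cells_per_row": mean([len(row) for row in rows]) if rows else 0.0,
        "image_elements": parser.image_count,
        "header_elements": parser.header_count,
        "cell_length_deviation": pstdev(lengths) if lengths else 0.0,
    }


sample = """
<table>
  <tr><th>Model</th><th>RAM</th></tr>
  <tr><td>Phone A</td><td>4 GB</td></tr>
  <tr><td>Phone B</td><td>6 GB</td></tr>
</table>
"""
features = table_features(sample)
```

A feature dictionary like this, computed for many labelled tables, could then be passed to any random forest implementation for training.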
Web table header classification
Once the table type is identified, the table header has to be located. One might object that all header cells are marked with a <th> element. Unfortunately, that is not true. Therefore, a classification method (again Random Forest) predicts whether each table column or row is a HEADER or a DATA column or row. Table understanding depends heavily on locating the header correctly.
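As a rough illustration of why the markup alone is not enough, the sketch below computes a few per-row features that a header/data classifier might use. The feature set (`th_ratio`, `digit_ratio`, `mean_length`, `row_index`) and the function name `row_features` are assumptions for illustration only; the features actually used in the thesis are not listed here.

```python
def row_features(rows):
    """Return one feature dict per row; each row is a list of (tag, text) cells."""
    features = []
    for index, row in enumerate(rows):
        n = len(row) or 1  # avoid division by zero for empty rows
        features.append({
            "row_index": index,
            # share of cells that are explicitly marked as header cells
            "th_ratio": sum(tag == "th" for tag, _ in row) / n,
            # share of cells containing at least one digit
            "digit_ratio": sum(
                any(ch.isdigit() for ch in text) for _, text in row
            ) / n,
            # average text length of the cells in the row
            "mean_length": sum(len(text) for _, text in row) / n,
        })
    return features


# A table whose header row uses <td> instead of <th>.
example_rows = [
    [("td", "Model"), ("td", "RAM")],
    [("td", "Phone A"), ("td", "4 GB")],
]
example_features = row_features(example_rows)
```

In this example the first row has a `th_ratio` of zero although it is the header, so a classifier has to rely on other signals, such as the absence of digits or the row position.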
The final step is to mine relations among entities from the table. The relations are derived from a table annotated with header location marks. More specifically, heuristic rules reconstruct the relations, resulting in a graph of entities, as shown in the following figure. MobilePhone is a class. RAM and Item Weight are properties belonging to MobilePhone, and they have a Quantitative value as their range. Finally, iOS is an instance of the OS (operating system) class and belongs to the MobilePhone class.
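The MobilePhone example can be sketched as a small set of triples. The two rules below (a column whose values all look like quantities becomes a property with a quantitative range; any other column becomes a class whose values are instances) are simplified assumptions of mine, not the heuristic rules from the thesis. The names `table_to_triples` and `identifier`, and the relation labels, are likewise hypothetical.

```python
import re

# A value such as "4 GB" or "138 g": a number followed by an optional unit.
QUANTITY = re.compile(r"^\s*\d+(\.\d+)?\s*[A-Za-z]*\s*$")


def identifier(label):
    """Turn a header label such as 'Item Weight' into 'ItemWeight'."""
    return "".join(part[:1].upper() + part[1:] for part in label.split())


def table_to_triples(class_label, header, rows):
    """Derive (subject, relation, object) triples from a table with a known header."""
    triples = set()
    for column, name in enumerate(header):
        values = [row[column] for row in rows if column < len(row)]
        if values and all(QUANTITY.match(value) for value in values):
            # Assumed rule 1: quantity columns become properties of the class.
            prop = identifier(name)
            triples.add((prop, "domain", class_label))
            triples.add((prop, "range", "QuantitativeValue"))
        else:
            # Assumed rule 2: other columns become classes linked to the main class.
            column_class = identifier(name)
            triples.add((class_label, "has" + column_class, column_class))
            for value in values:
                triples.add((value.strip(), "instanceOf", column_class))
    return triples


header = ["RAM", "Item Weight", "OS"]
rows = [
    ["4 GB", "138 g", "iOS"],
    ["6 GB", "172 g", "Android"],
]
triples = table_to_triples("MobilePhone", header, rows)
```

Storing the result as plain triples keeps the sketch independent of any particular ontology library; the same triples could later be serialized to RDF.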
What is the application?
This method can be applied when building domain-specific knowledge bases that should later be integrated with more general ontologies and concepts. Multiple domain ontologies can be learned by crawling sites with similar content (such as the mobile phone listings on amazon.com, gadgets.ndtv.com, etc.). The derived ontologies differ in structure and content. Therefore, methods for merging the ontologies should be the next step in the project.