Understanding the skills in a company using artificial intelligence
Unstructured data exists in many forms, both within the company and on the internet: from office files to internal wiki articles and forum posts to external social media accounts and blog posts. Everywhere there is data that reveals something about projects and people. The problem: this data is unstructured. It does not follow any particular scheme, mixes important and unimportant information, and is written in different languages.
Despite these challenges, our task is to aggregate the data so that meaningful profiles and descriptions emerge, allowing an employee, for example, to grasp a project profile at a glance. How exactly does this work? We rely on state-of-the-art machine learning technologies, established open-source tools, and our own implementations. Key technologies include:
• Natural Language Processing (NLP)
• Ontology Learning
• Web Crawling
• Scalable Search Engines
Natural language processing (NLP) encompasses algorithms for analyzing human language. With techniques such as part-of-speech tagging and dependency parsing, word types can be identified and connected sentence fragments extracted. It is therefore possible to distinguish potentially important technical terms from simple auxiliary verbs and other unimportant words, or to find particularly meaningful sentences in a text. To identify important terms reliably, it helps to have an ontology, that is, a kind of branched dictionary that the algorithms can consult. Semantic relations, for example between similar technology terms, further improve the search and matching functions later on. Creating and, above all, maintaining such an ontology manually is extremely laborious and hardly feasible. For this reason, we use an ontology learning approach designed specifically for this purpose.
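The idea of separating candidate technical terms from auxiliary verbs and function words can be sketched as follows. This is a deliberately simplified stand-in: the word list and the `candidate_terms` function are made up for illustration, whereas in practice a real part-of-speech tagger (for example from spaCy or Stanford CoreNLP) would decide which tokens to keep.

```python
import re

# Hypothetical closed-class word list standing in for a real POS tagger,
# which would label auxiliaries and function words far more reliably.
AUXILIARIES_AND_STOPWORDS = {
    "is", "are", "was", "were", "be", "been", "has", "have", "had",
    "the", "a", "an", "and", "or", "of", "in", "on", "with", "for", "to",
}

def candidate_terms(text):
    """Return tokens that are unlikely to be auxiliaries or function words."""
    tokens = re.findall(r"[A-Za-z][A-Za-z0-9+-]*", text)
    return [t for t in tokens if t.lower() not in AUXILIARIES_AND_STOPWORDS]

print(candidate_terms("The project has been migrated to Elasticsearch and Python"))
# ['project', 'migrated', 'Elasticsearch', 'Python']
```

The surviving tokens are only candidates; whether "migrated" or "Elasticsearch" is actually a relevant skill term is what the ontology decides in the next step.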
As the name implies, ontology learning is not about creating an ontology by hand but about automatically learning terms and their connections. For this purpose we start from manually predefined terms and relations, which are then compared with other terms and successively extended through machine learning methods in the form of a neural network. We analyze large pools of data and translate the terms they contain into mathematical representations, so-called vectors (Word2Vec). Semantic similarities between terms are reflected in the distances between these vectors. By comparing new terms with the already known ones through clustering, we can then determine which terms are potentially relevant and which area they fall into. For machine learning, however, we need large amounts of suitable data, which we first have to crawl.
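The comparison step can be illustrated with a toy example. The three-dimensional vectors and term clusters below are invented for demonstration; real Word2Vec embeddings have hundreds of dimensions and would be trained on a large corpus, for example with Gensim. The sketch only shows the core operation: assigning a newly discovered term to its nearest known term by cosine similarity.

```python
from math import sqrt

# Invented toy vectors; real embeddings come from training on a corpus.
known_terms = {
    "java":   [0.9, 0.1, 0.0],   # cluster: programming languages
    "python": [0.8, 0.2, 0.1],
    "scrum":  [0.1, 0.9, 0.2],   # cluster: methodologies
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def nearest_known_term(vector):
    """Assign a new term's vector to the most similar known term."""
    return max(known_terms, key=lambda term: cosine(known_terms[term], vector))

# A newly discovered term whose vector lies in the programming-language region:
print(nearest_known_term([1.8, 0.2, 0.0]))  # java
```

Because the new vector points in almost the same direction as the one for "java", the term would be filed into the programming-language area of the ontology.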
For this purpose, we use our own web crawlers, which analyze, for example, project portals, job boards, and other relevant online media and build a so-called corpus from their data, i.e. a collection of natural-language texts that our algorithms can then analyze.
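The corpus-building part of such a crawler boils down to stripping markup from fetched pages and keeping only the visible text. A minimal sketch using Python's standard-library HTML parser, applied to a hard-coded sample page rather than a live fetch:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text from an HTML page, skipping scripts and styles."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = False

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip = True

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = False

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

# In a real crawler the HTML would come from fetched pages (e.g. via urllib
# or a framework such as Scrapy); here a sample string stands in:
html = "<html><body><h1>Project X</h1><p>Built with Java.</p><script>x()</script></body></html>"
parser = TextExtractor()
parser.feed(html)
corpus_document = " ".join(parser.chunks)
print(corpus_document)  # Project X Built with Java.
```

Each crawled page yields one such text document; the collection of these documents forms the corpus the learning algorithms train on.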
To analyze and search through pools of data quickly, and in doing so also identify minor spelling variations or synonyms via the ontology, we use efficient, scalable search technologies such as Elasticsearch, so that the results of our analysis reach the end user as quickly as possible.
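Tolerance for spelling variations is something Elasticsearch supports directly through fuzzy matching in its query DSL. The sketch below shows such a query body as a Python dictionary; the index name "profiles" and the field name "skills" are invented for illustration, but `"fuzziness": "AUTO"` is a real query DSL option that lets Elasticsearch accept small edit-distance variations of the search term.

```python
# Hypothetical query body: the field "skills" is an assumption about the
# index mapping, not part of Elasticsearch itself. "fuzziness": "AUTO"
# tolerates small misspellings such as "Elastiksearch".
query = {
    "query": {
        "match": {
            "skills": {
                "query": "Elastiksearch",
                "fuzziness": "AUTO",
            }
        }
    }
}

# With the official Python client this would be sent roughly as:
# es.search(index="profiles", body=query)
```

Synonyms, by contrast, are typically handled at indexing or analysis time via a synonym filter fed from the ontology, so that a search for one term also finds documents using its siblings.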
The technologies we work with include:
- NLP frameworks: Apache OpenNLP, Stanford CoreNLP, and spaCy
- Programming languages: Java, Python
- Machine learning: Gensim, Deeplearning4j
- Search engines: Elasticsearch, Apache Solr