The CLARIN:EL Research Infrastructure implements, provides, maintains and supports language processing tools and web services. Through the CLARIN:EL Central Inventory, users can access and use all the available language processing web services, such as tools for word analysis, word recognition, sentence splitting, part-of-speech tagging, morphological and syntactic analysis, named entity recognition, term extraction and sentiment analysis.

Language processing tools and web services work as follows: they receive a text as input (compatible with the specifications, which are described in detail), process it according to their function, and output the processed result (annotated text).

Access to the metadata descriptions of all the available language processing web services and tools is open to all users, registered and unregistered, through the CLARIN:EL Central Inventory. In addition, registered CLARIN:EL users can use the language processing web services themselves, always in accordance with the relevant licensing terms.

 

CLARIN:EL Language Processing tools & web services

Tokenization

Tokenization is used for detecting words and phrases in texts. More specifically, these tools are used for splitting textual content (e.g. a document) into smaller units, such as sentences, words, punctuation marks, numbers or symbols. These units are called tokens.
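
As a rough illustration of the splitting described above, the following Python sketch tokenizes a text with a single regular expression. It is not the method used by any of the tools listed below; the pattern and the function name tokenize are assumptions made for this example only.

```python
import re

# Illustrative pattern (an assumption, not taken from any CLARIN:EL tool):
# 1. numbers, optionally with decimal or thousands separators (3.5, 1,000)
# 2. runs of word characters (Unicode letters, digits, underscore)
# 3. any single character that is neither a word character nor whitespace
TOKEN_PATTERN = re.compile(r"\d+(?:[.,]\d+)*|\w+|[^\w\s]")

def tokenize(text):
    """Split text into word, number and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

print(tokenize("The moon costs 3.5 euros, right?"))
# ['The', 'moon', 'costs', '3.5', 'euros', ',', 'right', '?']
```

A real tokenizer also has to handle abbreviations, clitics, URLs and language-specific conventions, which this sketch ignores.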

Available tools & web services

ILSP Sentence splitter and Tokenizer for Greek

HTokenizer

OpenNLP Tokenizer (English)

OpenNLP Tokenizer (German)

OpenNLP Tokenizer (Portuguese)

Lemmatization

Lemmatization groups together the different inflected forms of a word under a single base form, called the lemma. The output of lemmatization is a proper word. For example, a lemmatizer should map gone, going and went to go.
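
The mapping in the example above can be sketched in Python as a simple dictionary lookup. The table LEMMA_TABLE and the function lemmatize are hypothetical and exist only for this illustration; real lemmatizers rely on large lexicons or morphological analysis rather than a hand-written table.

```python
# Toy lookup table: inflected form -> lemma (hypothetical data for illustration).
LEMMA_TABLE = {
    "gone": "go",
    "going": "go",
    "went": "go",
    "goes": "go",
}

def lemmatize(token):
    """Return the lemma of a token, or the token unchanged if it is unknown."""
    return LEMMA_TABLE.get(token.lower(), token)

print([lemmatize(t) for t in ["gone", "going", "went"]])
# ['go', 'go', 'go']
```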

Available tools & web services

ILSP Lemmatizer

PoS Tagging

PoS Tagging is used for annotating every word of a text with the corresponding part-of-speech tag (e.g. noun, verb, adjective, adverb, etc.) based on its context and definition. The result is a PoS tag assigned to each token of the text.
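
To make the role of context concrete, here is a minimal Python sketch of a lexicon-based tagger with one contextual rule. The lexicon, the tag names and the rule are assumptions chosen for this example; none of the taggers listed below is claimed to work this way.

```python
# Toy lexicon: word -> possible tags, most likely first (hypothetical data).
LEXICON = {
    "i": ["PRON"],
    "want": ["VERB"],
    "to": ["PART"],
    "a": ["DET"],
    "the": ["DET"],
    "book": ["NOUN", "VERB"],   # ambiguous: "the book" vs. "to book"
    "flight": ["NOUN"],
}

def tag(tokens):
    """Assign one tag per token, using the previous tag to resolve ambiguity."""
    tagged = []
    previous = None
    for token in tokens:
        candidates = LEXICON.get(token.lower(), ["X"])  # "X" marks unknown words
        if len(candidates) > 1 and previous == "PART" and "VERB" in candidates:
            choice = "VERB"      # after the particle "to", prefer the verb reading
        else:
            choice = candidates[0]
        tagged.append((token, choice))
        previous = choice
    return tagged

print(tag(["I", "want", "to", "book", "a", "flight"]))
print(tag(["the", "book"]))
```

In the first call the word book is tagged as a verb because it follows to; in the second it falls back to its most likely tag, noun.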

Available tools & web services

ILSP Feature-based multi-tiered POS Tagger

OpenNLP Part-of-Speech Tagger (English)

OpenNLP Part-of-Speech Tagger (German)

OpenNLP Part-of-Speech Tagger (Portuguese)

Named Entity Recognition

Named Entity Recognition is used in various information extraction applications for the automatic recognition and classification of Named Entities in texts into predefined classes such as Person, Location, Organization and GPE (geo-political entity). The result is a tag with the corresponding category for each named entity identified in the text.
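
One simple way to picture this kind of output is a gazetteer lookup, sketched below in Python. The gazetteer entries and the function find_entities are hypothetical; the tools listed below may use statistical or other methods, and this sketch is not meant to represent them.

```python
# Toy gazetteer: token sequence -> entity class (hypothetical data).
GAZETTEER = {
    ("Maria", "Callas"): "Person",
    ("Athens",): "Location",
    ("Greece",): "GPE",
}
MAX_ENTITY_LENGTH = max(len(key) for key in GAZETTEER)

def find_entities(tokens):
    """Return (entity text, class) pairs, preferring the longest match."""
    entities = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(MAX_ENTITY_LENGTH, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in GAZETTEER:
                entities.append((" ".join(candidate), GAZETTEER[candidate]))
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return entities

print(find_entities("Maria Callas performed in Athens".split()))
# [('Maria Callas', 'Person'), ('Athens', 'Location')]
```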

Available tools & web services

GrNE-Tagger (Greek)

OpenNLP Name Finder (English)

Sentence Splitting

Sentence splitting is used for detecting sentences in texts. More specifically, these tools identify sentence boundaries by locating punctuation marks and determining whether each one marks the end of a sentence or not.
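
The decision of whether a punctuation mark ends a sentence can be sketched in Python with two simple checks: the mark must not belong to a known abbreviation, and it must be followed by whitespace or the end of the text. The abbreviation list and the function split_sentences are assumptions for this illustration, not the logic of any tool listed below.

```python
import re

# Hypothetical abbreviation list; a real splitter needs a far larger,
# language-specific one (or a trained model).
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "e.g.", "i.e.", "etc."}

def split_sentences(text):
    """Split text at ., ! or ? unless the mark is part of an abbreviation or a number."""
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?]+", text):
        end = match.end()
        last_word = text[start:end].split()[-1].lower()
        at_boundary = end == len(text) or text[end].isspace()
        if last_word in ABBREVIATIONS or not at_boundary:
            continue  # e.g. "Dr." or the dot inside "3.5"
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)  # trailing text without final punctuation
    return sentences

print(split_sentences("Dr. Smith paid 3.5 euros. Was it worth it? Yes!"))
# ['Dr. Smith paid 3.5 euros.', 'Was it worth it?', 'Yes!']
```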

Available tools & web services

ILSP Sentence splitter and Tokenizer for Greek

OpenNLP Sentence Detector (English)

OpenNLP Sentence Detector (German)

OpenNLP Sentence Detector (Portuguese)

Dependency Parsing

Dependency parsers create a tree representation for each input sentence, in which each word depends on a head word and is assigned a label describing its relation to that head (e.g. subject, object, etc.). Thus, in the sentence "Astronomers discovered a new moon", a dependency parser recognizes that the words Astronomers and moon are, respectively, the subject and the object of the word discovered.
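
The tree for the example sentence can be written down as a small Python data structure. The parse below is written by hand for illustration and is not the output of any tool listed here; the relation labels (nsubj, obj, det, amod) follow the Universal Dependencies naming convention, which is an assumption about the label set.

```python
from dataclasses import dataclass

@dataclass
class Token:
    index: int      # 1-based position in the sentence
    form: str       # the word itself
    head: int       # index of the head word; 0 means the root of the tree
    relation: str   # label of the relation to the head

# Hand-written parse of "Astronomers discovered a new moon".
PARSE = [
    Token(1, "Astronomers", 2, "nsubj"),
    Token(2, "discovered", 0, "root"),
    Token(3, "a", 5, "det"),
    Token(4, "new", 5, "amod"),
    Token(5, "moon", 2, "obj"),
]

def dependents(parse, head_form, relation):
    """Return the words attached to head_form with the given relation."""
    head_indices = {t.index for t in parse if t.form == head_form}
    return [t.form for t in parse
            if t.head in head_indices and t.relation == relation]

print(dependents(PARSE, "discovered", "nsubj"))  # ['Astronomers']
print(dependents(PARSE, "discovered", "obj"))    # ['moon']
```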

Available tools & web services

ILSP Dependency parser

Chunking

Chunking involves segmenting a text into groups of words that are related to each other at the syntactic level, such as nominal groups or verbal groups, without further specifying their internal structure or their syntactic role in the sentence.
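
A minimal rule-based sketch in Python shows what such flat groups look like. It assumes the input has already been PoS-tagged; the tag names, the grouping rules and the function chunk are illustrative assumptions and do not describe the tools listed below.

```python
# Tags that may appear inside each chunk type (illustrative assumption).
NP_TAGS = {"DET", "ADJ", "NOUN"}
VP_TAGS = {"AUX", "VERB"}

def chunk(tagged_tokens):
    """Group consecutive (word, tag) pairs into flat NP and VP chunks."""
    chunks = []  # list of (label, [words])
    previous_tag = None
    for word, tag in tagged_tokens:
        label = "NP" if tag in NP_TAGS else "VP" if tag in VP_TAGS else "O"
        # A determiner right after a noun opens a new nominal group.
        starts_new_np = tag == "DET" and previous_tag == "NOUN"
        if chunks and label != "O" and chunks[-1][0] == label and not starts_new_np:
            chunks[-1][1].append(word)
        else:
            chunks.append((label, [word]))
        previous_tag = tag
    # Tokens labelled "O" belong to no chunk and are dropped from the result.
    return [(label, " ".join(words)) for label, words in chunks if label != "O"]

tagged = [("Astronomers", "NOUN"), ("discovered", "VERB"),
          ("a", "DET"), ("new", "ADJ"), ("moon", "NOUN")]
print(chunk(tagged))
# [('NP', 'Astronomers'), ('VP', 'discovered'), ('NP', 'a new moon')]
```

Note that the result only delimits the groups: it does not say which nominal group is the subject or the object, which is the task of a dependency parser.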

Available tools & web services

HNPChunker (Greek)

OpenNLP Chunker (English)

Manual Text Annotation

Annotation is the practice of adding interpretative linguistic information, also known as tags or labels, to words or sets of words in a text or a corpus. Annotation can be applied both to raw data and to data that have already been processed. It can be done automatically (see all the tools and web services above) or manually, by human annotators.

Available platforms for collaborative text annotation

Hypothesis

WebAnno

Inception