The CLARIN:EL Research Infrastructure provides language processing tools and web services. Through the CLARIN:EL Central Inventory, users can access and use all the available language processing web services, such as word analysis tools, word recognition tools, sentence splitting and part-of-speech tagging tools, morphological and syntactic analysis tools, named entity recognition tools and term extraction tools.

Language processing tools and web services work as follows: they receive text as input (which must comply with the documented input specifications), process it (according to the capabilities of each tool) and output the processed result (annotated text).

Access to the metadata descriptions of all the available language processing web services and tools is open to everyone (i.e. registered and unregistered users) through the CLARIN:EL Central Inventory. In addition, registered CLARIN:EL users can use the language processing web services themselves, always in accordance with the relevant licensing terms.

CLARIN:EL Language Processing tools & web services

Tokenization

Tokenization is used for detecting words and phrases in texts. More specifically, these tools are used for splitting textual content (e.g. a document) into smaller units, such as sentences, words, punctuation marks, numbers or symbols. These units are called tokens.
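As a rough illustration of the idea, the sketch below splits a text into word, number and punctuation tokens with a single regular expression. The pattern and the function name `tokenize` are assumptions made for this example; they do not reproduce the rules of any tool listed below.

```python
import re

# Illustrative pattern (an assumption, not the rule set of any CLARIN:EL tool):
# a number with an optional decimal part, a run of word characters,
# or any single symbol that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\d+(?:[.,]\d+)?|\w+|[^\w\s]")


def tokenize(text):
    """Split text into word, number, punctuation and symbol tokens."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("The fee is 12.50 euros, right?"))
# ['The', 'fee', 'is', '12.50', 'euros', ',', 'right', '?']
```

Real tokenizers also handle abbreviations, clitics, URLs and language-specific conventions, which a single pattern like this one does not cover.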

Available tools & web services

ILSP Sentence splitter and Tokenizer for Greek

ILSP Neural NLP Toolkit

HTokenizer

OpenNLP Tokenizer (English)

OpenNLP Tokenizer (German)

OpenNLP Tokenizer (Portuguese)

Lemmatization

Lemmatization groups together the different inflected forms of a word under a single base form, called the lemma. The output of lemmatization is a proper word. For example, a lemmatizer should map gone, going and went to go.
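The mapping from inflected forms to a lemma can be pictured as a simple lookup. The sketch below uses a tiny hand-written table; the table and the function name `lemmatize` are hypothetical, and real lemmatizers rely on large lexica, morphological rules or trained models instead.

```python
# Toy lookup table (an assumption for illustration): inflected form -> lemma.
LEMMA_TABLE = {
    "gone": "go",
    "going": "go",
    "went": "go",
    "goes": "go",
}


def lemmatize(token):
    """Return the lemma of a token, or the lower-cased token if it is unknown."""
    key = token.lower()
    return LEMMA_TABLE.get(key, key)


print([lemmatize(t) for t in ["gone", "going", "went"]])
# ['go', 'go', 'go']
```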

Available tools & web services

ILSP Lemmatizer

ILSP Neural NLP Toolkit

PoS Tagging

PoS Tagging is used for annotating every word of a text with the corresponding part of speech tag (e.g. noun, verb, adjective, adverb, etc.) based on its context and definition. The result is a PoS tag assigned to each token of the text.
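To show what "based on its context" can mean in practice, the following minimal sketch combines a toy lexicon with one context rule. The lexicon, the tag names and the rule are assumptions chosen for the example; the taggers listed below use far richer models.

```python
# Toy lexicon (hypothetical): each word lists its possible tags, most likely first.
LEXICON = {
    "the": ["DET"], "a": ["DET"], "i": ["PRON"], "to": ["PART"],
    "want": ["VERB"], "read": ["VERB"], "flight": ["NOUN"],
    "book": ["NOUN", "VERB"],  # ambiguous word
}


def pos_tag(tokens):
    """Assign one tag per token, using the previous tag to resolve ambiguity."""
    tagged = []
    for token in tokens:
        # Unknown words default to NOUN (a simplification).
        candidates = LEXICON.get(token.lower(), ["NOUN"])
        tag = candidates[0]
        # Context rule: right after the particle "to", prefer a verb reading.
        if tagged and tagged[-1][1] == "PART" and "VERB" in candidates:
            tag = "VERB"
        tagged.append((token, tag))
    return tagged


print(pos_tag(["I", "want", "to", "book", "a", "flight"]))
print(pos_tag(["I", "read", "the", "book"]))
```

In the first sentence book is tagged as a verb because it follows to; in the second it keeps its default noun reading.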

Available tools & web services

ILSP Feature-based multi-tiered PoS Tagger

ILSP Neural NLP Toolkit

OpenNLP Part-of-Speech Tagger (English)

OpenNLP Part-of-Speech Tagger (German)

OpenNLP Part-of-Speech Tagger (Portuguese)

Named Entity Recognition

Named Entity Recognition is used in various information extraction applications for the automatic recognition and classification of Named Entities in texts into predefined classes such as Person, Location, Organization and GPE (geo-political entity). The result is a tag with the corresponding category for each named entity identified in the text(s).
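One simple way to picture the task is a gazetteer lookup that prefers the longest matching name. The sketch below is only that: the gazetteer entries and the function name `tag_entities` are invented for illustration, and the tools listed below do not necessarily work this way.

```python
# Hypothetical gazetteer: lower-cased token sequences mapped to an entity class.
GAZETTEER = {
    ("maria", "callas"): "Person",
    ("athens",): "GPE",
    ("mount", "olympus"): "Location",
    ("united", "nations"): "Organization",
}
MAX_LEN = max(len(key) for key in GAZETTEER)


def tag_entities(tokens):
    """Return (entity text, class) pairs, trying the longest match first."""
    entities = []
    i = 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in GAZETTEER:
                entities.append((" ".join(tokens[i:i + n]), GAZETTEER[key]))
                i += n
                break
        else:
            # No entity starts at this position; move on to the next token.
            i += 1
    return entities


print(tag_entities("Maria Callas sang in Athens".split()))
# [('Maria Callas', 'Person'), ('Athens', 'GPE')]
```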

Available tools & web services

GrNE-Tagger (Greek)

ILSP Neural NLP Toolkit

OpenNLP Name Finder (English)

Sentence Splitting

Sentence splitting is used for detecting sentences in texts. More specifically, these tools identify the boundaries of a sentence by making use of punctuation marks and further detecting whether they mark the end of a sentence or not.
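The decision described above, whether a punctuation mark really ends a sentence, can be sketched with a short abbreviation list. The list and the function name `split_sentences` are assumptions for this example and are far from complete.

```python
import re

# Small illustrative abbreviation list (an assumption, not exhaustive).
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "e.g.", "i.e.", "etc."}


def split_sentences(text):
    """Split text at ., ! or ? unless the mark belongs to a known abbreviation."""
    sentences = []
    start = 0
    # Candidate boundaries: a punctuation mark followed by whitespace.
    for match in re.finditer(r"[.!?](?=\s)", text):
        end = match.end()
        last_word = text[start:end].split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue  # the full stop belongs to an abbreviation, not a boundary
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


print(split_sentences("Dr. Smith arrived. Was he late? No."))
# ['Dr. Smith arrived.', 'Was he late?', 'No.']
```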

Available tools & web services

ILSP Sentence splitter and Tokenizer for Greek

ILSP Neural NLP Toolkit

OpenNLP Sentence Detector (English)

OpenNLP Sentence Detector (German)

OpenNLP Sentence Detector (Portuguese)

Dependency Parsing

Dependency parsers create a tree representation for each input sentence, in which each word depends on a head word and is assigned a label describing its relation to that head (e.g. subject or object of a verb). Thus, in the sentence Astronomers discovered a new moon, a dependency parser recognizes that the words Astronomers and moon are, respectively, the subject and the object of the word discovered.
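The example sentence can be written down as a small data structure in which every token stores the position of its head and the relation label. The analysis below is hand-written, and the label names (nsubj, obj, det, amod) follow Universal Dependencies naming as an assumption; the output format of the parsers listed below may differ.

```python
from dataclasses import dataclass


@dataclass
class Token:
    index: int     # 1-based position in the sentence
    form: str      # the word as it appears in the text
    head: int      # index of the head word; 0 marks the root
    relation: str  # label of the relation to the head


# Hand-written analysis of the example sentence (not parser output).
SENTENCE = [
    Token(1, "Astronomers", 2, "nsubj"),
    Token(2, "discovered", 0, "root"),
    Token(3, "a", 5, "det"),
    Token(4, "new", 5, "amod"),
    Token(5, "moon", 2, "obj"),
]


def dependents(sentence, head_form):
    """List the (word, relation) pairs that depend directly on the given head."""
    head = next(t for t in sentence if t.form == head_form)
    return [(t.form, t.relation) for t in sentence if t.head == head.index]


print(dependents(SENTENCE, "discovered"))
# [('Astronomers', 'nsubj'), ('moon', 'obj')]
```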

Available tools & web services

ILSP Dependency parser

ILSP Neural NLP Toolkit

Chunking

Chunking involves the identification and segmentation of a text into groups of words, which are related to each other at the syntactic level, such as nominal groups or verbal groups, without further specifying their internal structure or their syntactic role in the sentence.
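A minimal sketch of the idea, assuming the text has already been PoS-tagged: consecutive tokens with nominal tags form a nominal group (NP) and consecutive tokens with verbal tags form a verbal group (VP). The tag sets and the function name `chunk` are assumptions for the example, and two adjacent nominal groups would be merged by this simplification.

```python
# Tag sets chosen for illustration only.
NOMINAL_TAGS = {"DET", "ADJ", "NOUN"}
VERBAL_TAGS = {"AUX", "VERB"}


def chunk(tagged_tokens):
    """Group maximal runs of nominal or verbal tokens into labelled chunks."""
    chunks = []
    current_label, current_words = None, []
    for word, tag in tagged_tokens:
        if tag in NOMINAL_TAGS:
            label = "NP"
        elif tag in VERBAL_TAGS:
            label = "VP"
        else:
            label = None  # token outside any chunk
        # Close the open chunk whenever the chunk type changes.
        if label != current_label and current_words:
            chunks.append((current_label, current_words))
            current_words = []
        current_label = label
        if label is not None:
            current_words.append(word)
    if current_words:
        chunks.append((current_label, current_words))
    return chunks


tagged = [("The", "DET"), ("new", "ADJ"), ("moon", "NOUN"), ("was", "AUX"),
          ("discovered", "VERB"), ("by", "ADP"), ("astronomers", "NOUN")]
print(chunk(tagged))
# [('NP', ['The', 'new', 'moon']), ('VP', ['was', 'discovered']), ('NP', ['astronomers'])]
```

Note that the chunks carry no internal structure and no syntactic role, in line with the description above.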

Available tools & web services

ILSP Neural NLP Toolkit

HNPChunker (Greek)

OpenNLP Chunker (English)

Manual & semi-automatic Text Annotation

Annotation is the practice of adding interpretative linguistic information, also known as tags or labels, to words or sets of words in a text or a corpus. Annotation can be applied both to raw data and to data that have already been processed. It can be done automatically (see all the previous tools and web services), manually by human annotators (see Hypothesis, WebAnno & Inception), or semi-automatically, through semi-automatic processes built into the tool (see WebAnno & Inception).

Available platforms for collaborative text annotation

Hypothesis

WebAnno

Inception

Text classification

Text classification is the task of assigning one or more predefined categories (also called labels or classes) to a given text. Text classifiers can be used to organize, structure and categorize almost any kind of text, from documents and news articles to medical studies and files. For example, news articles can be organized by thematic categories (e.g. economy, society, sports, politics, etc.). Text classification is one of the fundamental tasks in NLP. It is based on machine learning and/or rule-based techniques and has broad applications such as topic labeling, sentiment analysis, spam detection and intent detection.
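The rule-based variant mentioned above can be sketched as keyword matching over thematic categories. The keyword lists and the function name `classify` are invented for this example; practical classifiers usually learn such associations from labeled data.

```python
import re

# Hypothetical keyword lists for three thematic categories.
CATEGORY_KEYWORDS = {
    "economy": {"inflation", "bank", "market", "budget"},
    "sports": {"match", "goal", "team", "league"},
    "politics": {"election", "parliament", "minister", "vote"},
}


def classify(text):
    """Return the category whose keywords occur most often, or None if none occur."""
    words = set(re.findall(r"\w+", text.lower()))
    scores = {
        category: len(words & keywords)
        for category, keywords in CATEGORY_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(classify("The team scored a late goal to win the league match"))
# sports
```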

Available tools & web services

ILSP Neural NLP Toolkit

Verbal Aggression Analysis

Verbal Aggression Analyzers are tools based on language technology that automatically detect and classify specific types of verbal attacks (e.g. criticism, swearing, irony, ousting, physical abuse, etc.) expressed against specific targets.

Available tools & web services

Verbal Aggression Analysis on Greek Twitter

Twitter Verbal Aggression Analysis (English)

More information and detailed instructions on the processing of language resources are available in the CLARIN:EL User Manual.

See also: