From GDPR to CLARIN:EL

2023

From GDPR to CLARIN:EL

Family:

Named Entity Recognition Tools

How can Natural Language Processing help in personal data protection?

To celebrate European Data Protection Day (28 January), we dedicate this month to resources which are hosted in the CLARIN:EL Infrastructure and constitute members of the Named Entity Recognition Tools Resource Family.

In 2016, the European Parliament adopted a set of provisions for the protection of personal data (General Data Protection Regulation/GDPR). Such data are the first and last name of an individual, his/her residential address as well as his/her email address. The identification of an individual can be achieved through these data. One way of ensuring the protection of personal data is anonymisation and pseudo-anonymisation, i.e. hiding or modifying the values of such data. The first step is of course to identify these data. This process can be achieved quickly and accurately by using natural language processing (NLP) tools that, among other things, identify named entities, i.e. names of individuals and other entities such as places, organisations, geopolitical organisations and even artwork titles.

Prokopis Prokopidis refers to technologies developed at the Institute for Language and Speech Processing which are related to anonymisation processes:

In recent years, there has been an increasing demand for accessibility to texts of various kinds as well as for the protection of personal data contained in these texts. For example, the anonymisation of judicial decisions can help to ensure that these decisions are both compatible with the GDPR and national legislation terms and accessible to any interested citizen.

ILSP/Athena RC develops automatic anonymisation solutions incorporating the Institute's NLP tools. One set of such NLP tools, based on cutting-edge technologies such as deep learning, is the ILSP Neural NLP Toolkit¹.The toolkit integrates modules for text segmentation, POS tagging, lemmatization, dependency parsing, text classification and named entity recognition.

In the context of Named Entity (NE) recognition, two complementary approaches are used. The first approach is based on a NE recognition model trained with deep learning techniques on manually annotated data², according to an annotation scheme comprising of 18 NE categories such as PERSON (full name of an individual), ORG (organisations and companies), GPE/LOC/FAC (cities, countries, locations, names of buildings), etc. Τhe accuracy of NE recognition for the category of PERSON is 95%. The second approach relies on regular expressions rules for the identification of specific phrases related to street addresses, telephone numbers, IBANs, social security numbers, VAT numbers, court decisions, laws, e-mail accounts, URLs, etc.

ILSP implements anonymization solutions using automatic annotations generated from the toolkit and other NLP tools. In particular, automatic annotations from the stages of lemmatization and named entity recognition are taken into account. Thus, in court judgment texts, full names, previous judgments and addresses are annotated as candidate phrases for replacement with abbreviated forms and/or suspension points. Also, in certain contexts, the full names which correspond to specific categories of actors of a court case are not annotated. This occurs when they appear in appropriate contexts, for example, after words such as attorney and associate judge, which are automatically being lemmatized.

ILSP Neural NLP Toolkit is freely available for research purposes through CLARIN:EL under CLARIN:EL Terms of Service Licence. CLARIN:EL registered users can use it as a service as well through the CLARIN:EL Workflow Registry.

¹P. Prokopidis and S. Piperidis. 2020. A Neural NLP toolkit for Greek. In Proceedings of 11th Hellenic Conference on Artificial Intelligence.

²Bartziokas, Mavropoulos, and Kotropoulos. 2020. Datasets and Performance Metrics for Greek Named Entity Recognition. In Proceedings of the 11th Hellenic Conference on Artificial Intelligence.