Public sector data used as language resources

According to the E-Government Survey 2016 report, published by the United Nations Department of Economic and Social Affairs (DESA) and the Division for Public Administration and Development Management (DPADM), open data can help to improve accountability and transparency. Beyond improving the control procedures of public administration, the report reaches the following major conclusion:

"Making data available online for free also allows the public – and various civil society organizations – to reuse and remix them for any purpose. This can potentially lead to innovation and new or improved services, new understanding and ideas." (p. 24)

The reuse of the public sector’s open data usually focuses on the content. However, in the case of Language Technology, reuse of data concentrates on the language material itself, treating data as Language Resources.

Language Resources are used as training or testing/evaluation data to build and improve language processing systems. Furthermore, Language Resources are used to create terminological dictionaries or to enrich general language dictionaries with new words, examples of use, etc. More specifically, bilingual resources are valuable for training Machine Translation systems, improving their quality, and enabling them to adapt to new domains and different registers.

On that basis, the EU decided to use the large volumes of language data produced daily by all Member States to train its Automated Translation platform. The platform is based on the European Commission’s Machine Translation service, MT@EC, which has been in operation since 2013 and is currently available to EU institutions and Member States’ public services, and it is used to produce translations from and into any official EU language. The technology behind the platform is statistical machine translation (SMT), which learns to translate from existing human translations. In short, training the system requires appropriate language resources.
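
To make concrete what "learning to translate from existing human translations" can mean, here is a minimal sketch of IBM Model 1, a classic word-translation model often used as a building block in SMT. It is not a description of how MT@EC itself is implemented; the toy corpus and the function names (train_ibm_model1, best_translation) are assumptions made purely for illustration.

```python
from collections import defaultdict

# Toy parallel corpus (illustrative assumption): German source, English target.
corpus = [
    ("das Haus".split(), "the house".split()),
    ("das Buch".split(), "the book".split()),
    ("ein Buch".split(), "a book".split()),
]


def train_ibm_model1(pairs, iterations=20):
    """Estimate t[(target_word, source_word)] = P(target | source) with EM."""
    # Start with equal weight for every word pair seen in the same sentence pair.
    t = {}
    for src, tgt in pairs:
        for sw in src:
            for tw in tgt:
                t[(tw, sw)] = 1.0

    for _ in range(iterations):
        count = defaultdict(float)  # expected pair counts
        total = defaultdict(float)  # expected counts per source word
        # E-step: share each target word among the source words of its sentence.
        for src, tgt in pairs:
            for tw in tgt:
                norm = sum(t[(tw, sw)] for sw in src)
                for sw in src:
                    frac = t[(tw, sw)] / norm
                    count[(tw, sw)] += frac
                    total[sw] += frac
        # M-step: renormalise the expected counts into probabilities.
        for (tw, sw), c in count.items():
            t[(tw, sw)] = c / total[sw]
    return t


def best_translation(t, source_word):
    """Return the target word with the highest probability for source_word."""
    candidates = {tw: p for (tw, sw), p in t.items() if sw == source_word}
    return max(candidates, key=candidates.get)


if __name__ == "__main__":
    table = train_ibm_model1(corpus)
    for word in ["das", "Haus", "Buch", "ein"]:
        print(word, "->", best_translation(table, word))
```

Even with three sentence pairs, the model separates "Haus" from "das" because "das" also occurs with "Buch". Real systems rely on the same effect at scale, which is why large volumes of parallel text matter.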

Thus, in April 2015 the European Commission launched the European Language Resource Coordination (ELRC) programme in order to identify and gather language data produced by national public services and governmental institutions across 30 European countries. All data gathered in the ELRC initiative will be provided to the European Commission for use in the Automated Translation platform.

"Athena" RC has set up the ELRC-SHARE repository, a simplified version of the structure of the CLARIN:EL repositories, for collecting, documenting, and searching or browsing the language resources.

More information about the European Language Resource Coordination can be found here.