From Japan to CLARIN:EL
The summer months of June, July and August are dedicated to the language resources that are hosted in the CLARIN:EL Infrastructure and belong to the Tatoeba Resource Family.
Trang Ho, a front-end developer who conceived and founded Tatoeba, wanted to help people trying to learn a foreign language. So she built a database of sentences with their translations in different languages, together with a search engine. Users can search for particular words or phrases and see how they are used in the context of a sentence, along with their translations in the desired languages. Capturing this concept, the database was given the name Tatoeba, which means "for example" in Japanese!
The idea itself is not new: similar guides have been around for many years, first on paper and later in digital format. What is original in this case is the collaborative enrichment of the database through crowdsourcing. The first database was created in 2006 and hosted on Sourceforge under the codename multilangdict. Since then, it has been continuously enriched by the voluntary contributions of thousands of members, who add new material to existing languages, as well as translations of words and phrases into new languages.
The text corpus extracted from the Tatoeba database and hosted in the CLARIN:EL Infrastructure is the 2015 edition. It contains sentences in 117 languages, from English and Greek to Scottish Gaelic, and a total of about 13 million tokens. Through the CLARIN:EL Infrastructure the user can also access subsets of this corpus that pair English, German and Portuguese with Greek, in two different formats (TMX and Moses). All the resources of the Tatoeba family are freely available for research purposes through CLARIN:EL under a CC-BY-NC-ND License of Use (Attribution, Non-Commercial use, No Derivatives). The format of the material is intended mainly for processing by natural language processing tools and services rather than for reading by human users. Anyone interested in viewing or reading the multilingual material, however, can visit the web interface of Tatoeba.
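For readers who want to process one of the TMX subsets programmatically, the following is a minimal sketch in Python, using only the standard library. It assumes the file follows the usual TMX layout (translation units `tu` containing language variants `tuv` with an `xml:lang` attribute and a `seg` element). The sample document, the function name `read_pairs` and the language codes `en` and `el` are illustrative assumptions, not taken from the actual CLARIN:EL files, whose exact header and language codes may differ.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical sample in TMX layout; not an excerpt from the real corpus.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en" datatype="plaintext" segtype="sentence"
          adminlang="en" creationtool="example" creationtoolversion="0"
          o-tmf="none"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Thank you.</seg></tuv>
      <tuv xml:lang="el"><seg>Ευχαριστώ.</seg></tuv>
    </tu>
  </body>
</tmx>
"""


def read_pairs(tmx_text, src="en", tgt="el"):
    """Return (source, target) sentence pairs found in a TMX string."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        # Map each language code in this translation unit to its segment text.
        segments = {}
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                segments[tuv.get(XML_LANG)] = "".join(seg.itertext()).strip()
        # Keep only units that contain both requested languages.
        if src in segments and tgt in segments:
            pairs.append((segments[src], segments[tgt]))
    return pairs


for source, target in read_pairs(SAMPLE_TMX):
    print(source, "|", target)
```

For a downloaded file, the same function could be applied to the file's contents after reading it as text. The Moses format is typically simpler still: two plain-text files, one per language, aligned line by line, which can be paired by reading both files in parallel.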
Today, the web interface of Tatoeba contains more than 10 million sentences in 417 languages (even in Klingon, the language of Star Trek!), with new examples being added every day. As of April 2022, audio material has been added as well: almost 1 million sentences in 38 languages are accompanied by audio files with recordings of words and phrases by native speakers, giving users the correct pronunciation.
If you want to, you can start contributing your own sentences and translations to Tatoeba!
In the Preview box below you can see an example of a phrase in XML format, as it is available for download through the CLARIN:EL Infrastructure, as well as the same phrase as it appears in the Tatoeba web interface accompanied by its available translations in different languages.
Resource information
Portuguese, German, English, Greek (+ 113 more languages)
Preview
