From Japan to CLARIN:EL

7 July 2022

Family:
Tatoeba

Tatoeba is a collection of sentences and translations. It's collaborative, open, free and even addictive!

The summer months of June, July and August are dedicated to language resources that are hosted in the CLARIN:EL Infrastructure and belong to the Tatoeba Resource Family.

Trang Ho, a front-end developer and the originator and founder of Tatoeba, wanted to help people trying to learn a foreign language. So she built a database of sentences with their translations in different languages, together with a search engine. Users can search for particular words or phrases and see how they are used in the context of a sentence, along with their translations in the desired languages. To capture this concept, the database was given the name Tatoeba, which means "for example" in Japanese!

The idea is not original: similar guides have been around for many years, first on paper and later in digital form. What is original in this case is the collaborative enrichment of the database through crowdsourcing. The first database was created in 2006 and hosted on SourceForge under the codename multilangdict. Since then, it has been continuously enriched by the voluntary contributions of thousands of members, who add new material to existing languages as well as translations of words and phrases into new languages.

The text corpus extracted from the Tatoeba database and hosted in the CLARIN:EL Infrastructure is the 2015 edition. It contains sentences in 117 languages, from English and Greek to Scottish Gaelic, for a total of about 13 million tokens. Through the CLARIN:EL Infrastructure users can also access subsets of this corpus that pair Greek with English, German or Portuguese, in two different formats (TMX and Moses). All the resources of the Tatoeba family are freely available for research purposes through CLARIN:EL under a CC-BY-NC-ND License of Use (Attribution, Non-Commercial use, No Derivatives). The material is formatted primarily for processing by natural language processing tools and services rather than for reading by people. Anyone interested in viewing or reading the multilingual material can, however, visit the web interface of Tatoeba.
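
For readers who want to process the bilingual subsets, the sketch below shows one way to read sentence pairs from the TMX and Moses formats, using only the Python standard library. It assumes the usual conventions for these formats: TMX as XML with `tu`, `tuv` and `seg` elements tagged by `xml:lang`, and Moses as two plain-text files aligned line by line. The sample sentences and the function names `read_tmx_pairs` and `read_moses_pairs` are illustrative and are not taken from the CLARIN:EL files, whose exact layout may differ.

```python
import xml.etree.ElementTree as ET

# xml:lang belongs to the predefined XML namespace, so ElementTree
# exposes the attribute under this fully qualified key.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical, simplified TMX sample; not copied from the actual corpus.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Good morning.</seg></tuv>
      <tuv xml:lang="el"><seg>Καλημέρα.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Thank you.</seg></tuv>
      <tuv xml:lang="el"><seg>Ευχαριστώ.</seg></tuv>
    </tu>
  </body>
</tmx>
"""


def read_tmx_pairs(tmx_text, src_lang, tgt_lang):
    """Return (source, target) sentence pairs from a TMX document string."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        # Collect the segment text of each language variant in this unit.
        segments = {}
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                segments[tuv.get(XML_LANG)] = "".join(seg.itertext()).strip()
        # Keep only units that contain both requested languages.
        if src_lang in segments and tgt_lang in segments:
            pairs.append((segments[src_lang], segments[tgt_lang]))
    return pairs


def read_moses_pairs(src_text, tgt_text):
    """Pair up two line-aligned texts (line i of one translates line i of the other)."""
    src_lines = src_text.splitlines()
    tgt_lines = tgt_text.splitlines()
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target must have the same number of lines")
    return list(zip(src_lines, tgt_lines))


if __name__ == "__main__":
    for english, greek in read_tmx_pairs(SAMPLE_TMX, "en", "el"):
        print(english, "|", greek)
```

With real files, the same functions could be fed the contents of a downloaded `.tmx` file, or of the two Moses text files opened with `encoding="utf-8"`.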

Today, the web interface of Tatoeba contains more than 10 million sentences in 417 languages (even Klingon, the language of Star Trek!), with new examples being added every day. As of April 2022, audio material has been added as well: almost 1 million sentences in 38 languages are accompanied by audio recordings of words and phrases by native speakers, so that users can hear the correct pronunciation.

If you want to, you can start contributing your own sentences and translations to Tatoeba!

In the Preview box below you can see an example of a phrase in XML format, as it is available for download through the CLARIN:EL Infrastructure, as well as the same phrase as it appears in the Tatoeba web interface accompanied by its available translations in different languages.

Resource information

Resource name: Tatoeba
Repository: Athena Research Centre
Resource type: Corpus
Media type: Text
Language: Portuguese, German, English, Greek (+ 113 more languages)
Size: 12,790,000 tokens
Format: XML

Preview

[Image: preview of the resource]

Other members of the same family
Discover the 6 resources of the same family (subsets of the presented corpus) in the Institutional Repository of the Athena Research Centre:
  • Tatoeba subcorpus DE-EL (Moses)
  • Tatoeba subcorpus DE-EL (TMX)
  • Tatoeba subcorpus EL-PT (Moses)
  • Tatoeba subcorpus EL-PT (TMX)
  • Tatoeba subcorpus EN-EL (Moses)
  • Tatoeba subcorpus EN-EL (TMX)

Site content licensed under CC BY-NC-SA 4.0

© CLARIN:EL 2025