Using OCR, CAT and crowdsourcing to translate Classical Arabic works

Posted on Tue 12 January 2021 in Language

I was recently exploring how to translate Classical Arabic/Islamic works into English using a combination of CAT (Computer-Assisted Translation), OCR (Optical Character Recognition) and crowdsourcing among students of Arabic studies. The lazy approach to translating books is to combine OCR with MT (machine translation) tools such as Google or Bing Translator to get an Arabic-to-English translation. This does not work for a few reasons:

  1. Machine translation is not as good as human translation, as explained here
  2. Machine translation handles Modern Standard Arabic (MSA) sentences somewhat accurately, but existing MT tools do not cover many Islamic terms, so even basic sentences from these works can come out wrong.

Below I will discuss the existing challenges and then present my theoretical solution.

The Challenges

First, although there are many translations of famous works (like the Quran and the 6 Hadith books), there are also lots of niche works that have not been translated into English. Second, copyright is not strictly respected: many PDFs of Arabic-English translations are floating around, presumably without the permission of the respective authors. This hurts expert translators, who spend many thousands of hours going through the volumes of a single work and charge a fair price for their translations. Their works are then scanned and uploaded without their permission, and they lose valuable income that could enable them to create more translations.

Thirdly, many existing translations are only available as physical (printed) books. For many translated works beyond the Quran/Hadith, no digital copies, APIs or even copyable texts exist. Some PDF readers are able to copy the English text (even though these are scanned versions of the physical books), but attempting to copy the Arabic text produces malformed text (likely because the text is copied left-to-right while Arabic is a right-to-left language).

Lastly, the economics of translating niche works make many of them unfeasible for expert translators. A simple example is a work like al-Jami al-Saghir by Imam Muhammad al-Shaybani. A niche book like this might only appeal to advanced students of Islamic knowledge within the Hanafi school of jurisprudence, so the likely readership is too small to pay for the translation effort.

Outlining the Steps

  1. Sourcing PDF versions of Arabic-only Classical works - this will not affect copyright (to the best of my knowledge), as most of the authors lived 1000+ years ago and many of the printed works are just reprints, so even old editions will work.
  2. Choosing source material for the Translation Memory of the CAT tool - in this case I think the Arabic-English text of the Hadith books is the best choice.
  3. Appealing to students of Arabic studies to participate in crowdsourcing the translations - this is a win-win, as it helps them improve their Arabic translation skills.

The Solution

The following is my proposed process for achieving the desired outcome. I will mention the software I propose to use in each step below.

PDFs and Scanning - OCR

To make the text in the PDFs copyable, I propose using 2 pieces of software:

  • Tesseract - this will be the engine that reads the text. Arabic training data can be sourced from here
  • PDF to TXT (with OCR) - this is the only listed option for PDF-to-text built on top of Tesseract. Running this script on top of the OCR engine will enable scanning of the sourced PDFs
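
As a rough illustration of the OCR step, here is a minimal sketch that calls the Tesseract command-line tool from Python. It is not the PDF to TXT script mentioned above. It assumes the `tesseract` binary is installed with the Arabic trained data (`ara`), and that the PDF pages have already been exported as images; the function names and file names are hypothetical.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, lang="ara"):
    """Return the argument list for one Tesseract CLI call.

    Tesseract writes the recognised text to `<output_base>.txt`.
    """
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def ocr_page(image_path, output_base, lang="ara"):
    """Run Tesseract on one page image and return the recognised text."""
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")


def ocr_book(page_images, work_dir):
    """OCR the page images in sorted order and join the text into one string."""
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    pages = []
    for image in sorted(Path(p) for p in page_images):
        # One text file per page, named after the image (assumed naming scheme).
        pages.append(ocr_page(image, work_dir / image.stem))
    return "\n\n".join(pages)
```

Calling `ocr_book(["page-001.png", "page-002.png"], "ocr-out")` would then return the text of both pages, provided Tesseract is available on the machine. The output will still need proofreading, which is where the crowdsourcing step comes in.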

Translation Management - CAT

The best translation management tool with multi-user support that I found is Weblate. I consider it the best option because:

  • It is used by 1000+ other projects for translations
  • It has multi-user support
  • It has built-in version-control
  • It has Translation Memory support
  • It is popular on GitHub (it had the most stars of the tools I compared)

The source material for the Translation Memory can be obtained from the sunnah.com API.
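
To give an idea of what loading that material could look like, here is a minimal sketch that turns Arabic-English pairs into a TMX file, a common exchange format for translation memories. Fetching the pairs from the sunnah.com API is not shown. The assumption that Weblate will import this exact file is mine and should be checked against its documentation; the function name and the header values are hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree writes this namespaced key out as the standard `xml:lang` attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def pairs_to_tmx(pairs, source_lang="ar", target_lang="en"):
    """Build a TMX 1.4 document from (source_text, target_text) pairs."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "tm-sketch",  # hypothetical tool name
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "plain",
        "adminlang": "en",
        "srclang": source_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for source_text, target_text in pairs:
        unit = ET.SubElement(body, "tu")  # one translation unit per pair
        for lang, text in ((source_lang, source_text), (target_lang, target_text)):
            variant = ET.SubElement(unit, "tuv", {XML_LANG: lang})
            ET.SubElement(variant, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")
```

The returned string can be written to a `.tmx` file with UTF-8 encoding and then imported into the CAT tool.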

Once the data is loaded into the Translation Memory, it can be applied app-wide. Students can then use the CAT tool to translate pieces of content much faster, because phrases that already appear in the Hadith translations are suggested automatically, and the content can be reviewed by others before being approved.
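
The idea behind a Translation Memory suggestion can be sketched in a few lines: compare the new segment against stored source segments and offer the stored translation when the two are similar enough. This is only an illustration using Python's standard `difflib`; it is not how Weblate actually ranks matches, and the sample entries, function name and threshold are my own assumptions.

```python
from difflib import SequenceMatcher

# Illustrative entries only; a real memory would hold thousands of segments.
SAMPLE_MEMORY = [
    ("إنما الأعمال بالنيات", "Actions are judged by intentions"),
    ("الدين النصيحة", "Religion is sincere counsel"),
]


def suggest(segment, memory, threshold=0.75):
    """Return (score, source, target) tuples for similar entries, best first."""
    matches = []
    for source, target in memory:
        # ratio() is 1.0 for identical strings and 0.0 when nothing matches.
        score = SequenceMatcher(None, segment, source).ratio()
        if score >= threshold:
            matches.append((score, source, target))
    return sorted(matches, key=lambda match: match[0], reverse=True)
```

A translator working on a segment that already exists in the memory would get the stored English text back with a score of 1.0, while a segment with no close match would get no suggestion and would be translated from scratch.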

Books, Authors and Works

A starting list of books I found is here: List of Sunni books

This can be a starting point for sourcing/finding various books to translate.

There should be some kind of standard for approving which books are put forward for crowdsourced translation. I do not have such a standard yet, but it can be developed over time.

The Goals

The first and primary goal for such a task is to Obtain the Pleasure of God. Although we might shy away from admitting it on technical/personal blogs, deep inside a person's heart, this is always the main goal.

Another important goal for me is to document and develop a system where both the content and the technology behind it are Open Source and accessible to all. I would likely propose that all the crowdsourced translations carry a permissive license, like the Wikipedia content license, so that the translated works are accessible to all (and so that any edits and improvements can be made via the system by students pursuing Arabic studies).