Update1: Using OCR, CAT and crowdsourcing to translate Classical Arabic works

Posted on Wed 28 July 2021 in Language

In my last article on Using OCR, CAT and crowdsourcing to translate Classical Arabic works I spoke about various tools for digitizing Classical Arabic books and then translating them. Since then I have made a series of discoveries and this update is to document those discoveries.

PDFs, OCR and Scanning

Going through the process of finding PDF texts and then using OCR to obtain a digital form of the text is laborious. I was fortunate to discover that my idea for digitizing Arabic texts was already thought about in the mid-00s by others. The project name and website is Shamela and not much will be known to English speakers about it because most Arabic-related discussions happen to be in ... Arabic.

There is a historical record available here. Firstly, the history of Shamela remains very unclear to this day. Based on the link above, the project was founded by an Egyptian and is now funded by an organization in Saudi Arabia. The about page does not say much either. Having a record of who is creating the digital versions of Classical Arabic works is important for authenticity. While the Quran is well-preserved (Allah has Promised to preserve it), even the Ahadith experienced attempts at fabrication. The attempts at fabrication are what led to compilation efforts like Sahih Bukhari (even though the oral tradition for Hadith is still relied upon to this day). The PDF on the Ar-Rawda website here provides the names of employees working for Shamela, so we have some basic information. There are 30 people working in the digitizing/verification/data-entry department and 4 programmers working on the software. They all appear to be Egyptians. I will discuss the issue of authenticity a bit more in the next section.

This leads me to the second point, regarding KITAB. KITAB is a project funded by the European Research Council (ERC) and the Aga Khan University(see KITAB homepage). One of their projects is the Open Islamicate Texts Initiative (OpenITI) which is attempting to verify the digital copies provided on Shamela and other collections(Shamela, Shamela-Ibadiyya, Shamela-Shia, Hindawi, Zaydiyya and more). The full list of collections can be found on the OpenITI github page.

Book Verifications(and tagging)

The way OpenITI does verification is confusing to outsiders. Their Google Doc isn't too clear but I was able to gather some basic information from tickets like these on GitHub

Also, their documentation is scattered all over the place and concealed within Google Doc links, making them non-indexable to search engines. Even with these drawbacks, their structure for storing books and book metadata is excellent. The structure is broken down like so:

  • All authors are listed under their date-of-death(Hijri calendar). Example: 0179MalikIbnAnas
  • Within each folder, the various books of the author are listed in their folders. Example: 0179MalikIbnAnas.Muwatta + author information stored in a YAML file. Example: 0179MalikIbnAnas.yml
  • Within each book folder are the various versions of that book. Their format is like so: date_of_death+author_name+book_name+source+source_ID+language, where the source ID is taken from the source URL. An example is: 0179MalikIbnAnas.Muwatta.Shamela0001699-ara1 . The source URL is: https://shamela.ws/index.php/book/1699
  • Each version of the book has its own YAML file. Example: 0179MalikIbnAnas.Muwatta.Shamela0001699-ara1.yml
  • Then there is another file called: 0179MalikIbnAnas.Muwatta.yml . This might be some metadata related to the book, but it isn't completely clear
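To make the naming scheme above concrete, here is a minimal Python sketch that splits a version identifier into its parts. The pattern is inferred only from the single example in this article and is not taken from OpenITI's own tooling. The names VersionId, parse_version_id and shamela_url are hypothetical, and reading the trailing digit after the language code as a version counter is my assumption.

```python
import re
from typing import NamedTuple


class VersionId(NamedTuple):
    death_year: int   # Hijri year of the author's death
    author: str
    book: str
    source: str       # collection name, e.g. "Shamela"
    source_id: str    # zero-padded ID within that collection
    language: str     # e.g. "ara"
    version: int      # trailing number (assumed to be a version counter)


# Pattern inferred from "0179MalikIbnAnas.Muwatta.Shamela0001699-ara1".
# Real identifiers may contain characters this sketch does not allow.
_PATTERN = re.compile(
    r"^(?P<year>\d{4})(?P<author>[A-Za-z]+)\."
    r"(?P<book>[A-Za-z]+)\."
    r"(?P<source>[A-Za-z]+?)(?P<source_id>\d+)-"
    r"(?P<lang>[a-z]{3})(?P<version>\d+)$"
)


def parse_version_id(text: str) -> VersionId:
    """Split a version identifier into its components."""
    m = _PATTERN.match(text)
    if m is None:
        raise ValueError(f"unrecognised identifier: {text!r}")
    return VersionId(
        death_year=int(m["year"]),
        author=m["author"],
        book=m["book"],
        source=m["source"],
        source_id=m["source_id"],
        language=m["lang"],
        version=int(m["version"]),
    )


def shamela_url(v: VersionId) -> str:
    """Rebuild the Shamela book URL by dropping the zero padding."""
    if v.source != "Shamela":
        raise ValueError("only Shamela sources are handled in this sketch")
    return f"https://shamela.ws/index.php/book/{int(v.source_id)}"


v = parse_version_id("0179MalikIbnAnas.Muwatta.Shamela0001699-ara1")
print(v.author, v.book, shamela_url(v))
```

Other collections will need their own URL rules, which is why shamela_url refuses anything that is not a Shamela source.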

The process for their tagging/verification seems to follow these steps:

  1. Finding an online PDF that is the same version as the source-text
  2. Finding a hard-copy reference of that book on WorldCat
  3. Doing some text-tagging and annotation for their own flavour of Markdown(which requires a closed-source Windows-only editor to use)
  4. Reviewing this tagging/annotation

Authenticity - who checks the scanners?

I found the process of verifying the Shamela (and other) books by using digitally-scanned copies of their hard-copy counterparts to be circular. The reason is that the point of failure lies in the anonymous scanners of the hard-copy books.

Why is this a point of failure?

To answer this, I provide a series of questions:

  • Who scanned the book?
  • When did the scans take place?
  • How certain is it that the claimed scanned version is actually the same as its hard-copy counterpart?

I contacted the Shamela team via email to understand how they digitized books. Perhaps they used hard-copy books and did their own scanning. I asked them the following questions:

My query to you is about how you authenticate or verify the books that you are digitizing?

What method do you follow? Do you keep a digital PDF and a hard copy of each book?

Their (rather short) response was:

Yes. In books we digitize, we keep a pdf and text copies. No hard copies here

I sent a follow-up email querying the hard-copy problem and received the same response as above, albeit in larger font. This all but confirmed that Shamela are also just using scanned copies of books found on the internet and leads me to the next point ...

The OpenITI team are able to verify books according to version because the Shamela team are essentially just finding these scanned copies online themselves and then digitizing them. Hence the circular problem. The text and digitally-scanned copy will match, but it doesn't seem like OpenITI or even Shamela are asking: do the digitally-scanned copies match their hard-copy versions?

Connecting the physical to the digital

While it may be daunting, verifying the hard copies against the digitally-scanned ones would close the loop on authenticity. The OpenITI team is taking care of making sure there are digitally-scanned copies that match up to the text; what remains is to connect the digitally-scanned copy to its hard-copy counterpart.

A simple method of verifying (and reaping rewards for practicing the Sunnah) would be to use an odd number (other reference) like 3: pick 3 randomly-chosen points in the text, then compare each of these to the digital scan and then to the hard copy.
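The spot-check described above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the checkpoints are page numbers, a person has already transcribed the passage at each checkpoint from all three sources, and the function names pick_checkpoints and compare_sources are hypothetical.

```python
import random


def pick_checkpoints(page_count, k=3, seed=None):
    """Pick k distinct random page numbers (k must be odd) from 1..page_count."""
    if k % 2 == 0:
        raise ValueError("k must be an odd number")
    if k > page_count:
        raise ValueError("cannot pick more checkpoints than there are pages")
    rng = random.Random(seed)  # a seed makes the choice reproducible
    return sorted(rng.sample(range(1, page_count + 1), k))


def compare_sources(pages, digital_text, scan, hard_copy):
    """Return the pages where the three transcribed passages do not all agree.

    Each of digital_text, scan and hard_copy maps a page number to the
    passage a human reader transcribed from that source.
    """
    mismatches = []
    for page in pages:
        if not (digital_text[page] == scan[page] == hard_copy[page]):
            mismatches.append(page)
    return mismatches


pages = pick_checkpoints(page_count=400, k=3, seed=2021)
print("Check these pages in the text, the scan and the hard copy:", pages)
```

Recording the seed alongside the result lets someone else repeat exactly the same check later.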

I initially considered the number 33, but besides tasbih-fatimi, I found no significance to that number specifically(although it is an odd number too).

CAT tools - OmegaT, AI/ML and ethics

Regarding CAT tools, Weblate is not a suitable option. It is geared towards Open Source projects and not books. I spent a lot of (perhaps pointless) time looking at the various CAT tools available.

I did learn a lot about CAT-related software:

  • Various types of CAT tools exist, some using only human-driven data and some using a mix of Machine Translations(MT) that are then edited by humans
  • Human-driven data is stored in Translation Memories(called TMs), which are also now being used for Machine Learning in something called: Adaptive MT
  • The industry has had its own 'cloud' moment in the form of Matecat and others that followed, which are able to benefit from the massive corpora created by users
  • Adding this "Adaptive MT" was something a lot of the proprietary players did to either keep up with the competition or to just add marketing spin to their products(I didn't find any studies to indicate its effectiveness for translators, although it is marketed as such)
  • There is some strange and possibly unethical funding model with some of these Open Source projects(which I will discuss below)
  • There is only 1 major Open Source tool in comparison to the proprietary ones and it is called: OmegaT
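To illustrate what a Translation Memory does, here is a rough Python sketch of a fuzzy TM lookup. It is not how OmegaT or any other CAT tool implements matching; I am using difflib from the Python standard library as a stand-in, and the function name best_tm_match, the 0.75 threshold and the sample memory are all made up for the example.

```python
from difflib import SequenceMatcher


def best_tm_match(segment, memory, threshold=0.75):
    """Return (source, target, score) for the closest stored segment.

    memory maps previously translated source segments to their
    translations. Returns None if nothing scores at or above threshold.
    """
    best = None
    best_score = 0.0
    for source, target in memory.items():
        # ratio() is a similarity score between 0.0 and 1.0
        score = SequenceMatcher(None, segment, source).ratio()
        if score > best_score:
            best, best_score = (source, target), score
    if best is None or best_score < threshold:
        return None
    return best[0], best[1], best_score


# Hypothetical memory: source segment -> earlier human translation
memory = {
    "قال مالك": "Malik said",
    "حدثني يحيى عن مالك": "Yahya related to me from Malik",
}

print(best_tm_match("حدثني يحيى عن مالك", memory))
```

An exact repeat of a stored segment scores 1.0, while a segment with nothing similar in the memory returns None and would have to be translated from scratch.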

The situation behind what I would personally consider unethical(and not proven in a court of law) is the way in which some Open Source projects were funded and how their business models evolved. What I discovered was that both MateCat and ModernMT were partially or fully-funded by the EU and research grants, but these projects evolved into subscription-services. My personal view is that using research/grant money intended to build out an Open Source product and then 'forking' it into your own paid-for proprietary cloud service once a proven business model exists is somewhat unethical(evidenced by ModernMT's roadmap that essentially puts the Open Source project into maintenance mode).

The idea behind Adaptive MT sounds really cool to have for an undertaking like translating Classical Arabic works, but I reasoned that it won't be of much use because there isn't much existing data to train the MT(via TMs or Translation Memories).

I previously discussed that one of the goals of the translation was to crowdsource the effort so that lots of content can be translated simultaneously. The advantages of web-based tools over OmegaT are the minimal install requirements and more polished UIs. On the other hand, offline collaboration becomes possible with software like OmegaT. OmegaT's Team Project feature also requires a server to be set up, but this would be the case for a web-based tool as well.

Creating an umbrella for the different projects

I am busy formulating an outline for the different projects and I will place them under an umbrella website. Having a structure is important for being able to know what the goals are and then working towards them with small iterations.

I will announce the umbrella website name and the structure in the next update.



The original content of this blog is a Waqf solely for the Pleasure of Allah. You are hereby granted full permission to copy, download, distribute, publish and share this content without modification under condition that full attribution is given to this author by creating a link either above or below the content that links back to the original source of the content. For any questions or ambiguity, you are requested to contact me via email for clarification.