Tarkeeb - the concept of sentence analysis (or Linguistic Annotation) in Arabic

Posted on Fri 05 March 2021 in Language

During my time studying intermediate Arabic, mainly through studying Al-Ajurumiyah[1] in Arabic grammar, I was introduced to the word 'Tarkeeb'. I initially struggled to grasp what Tarkeeb means among the mountain of Arabic terms used for different concepts in the language (parts of speech, grammatical concepts, morphological concepts and more), until I realized that I had already been exposed to Tarkeeb via The Quran Corpus.

Defining Tarkeeb

Tarkeeb is basically the equivalent of what is shown in the Quran Corpus Treebank. I did find a definition in a PDF, which is the only definition or discussion of 'Tarkeeb' I found on the internet that relates to the concept of Linguistic Annotation. I have removed the link to the PDF for various reasons, but I will copy the definition here, because digitized works are poorly preserved across the many hundreds of Islamic/Arabic websites that depend on 3rd-party services (like sites hosted on wordpress.com) and get shut down after a few years.

The definition says:

Tarkeeb, in the English language, could be best translated as “Sentence Parsing”; though, to explain the concept of it to an English speaker may prove difficult, as neither does English nor – to the best of our knowledge – any other language have such a component as “Tarkeeb”. That is, the critical analysis of speech and text; breaking it down sentence by sentence, and analysing those sentences, analysing each and every word in the sentence, tracing them back to their root forms, understanding each and every word individually, its role in the sentence, why it was inserted, what effect it has on the word(s) before it and the word(s) after it, and thereafter joining that sentence together, piece by piece, like a jigsaw puzzle, after having dissected and fully understood it.

English does have what they refer to as “Sentence Parsing”, but this can never be compared with “Tarkeeb” in Arabic. Also, “sentence parsing”, as a subject taught formally in schools died out a long time ago. And again, that is besides the fact that Tarkeeb is incomparably more advanced and sophisticated as compared to “Sentence Parsing” in English. Nevertheless, English speakers who had studied sentence parsing should then at least have a vague idea of what Tarkeeb is about.

The author writes:

as neither does English nor – to the best of our knowledge – any other language have such a component as “Tarkeeb”

This claim stems from a drawback in how the field itself is defined. A layperson would not be able to explain the idea behind Tarkeeb without digging deeply into the academic areas of language. I happened to do so myself and found that the equivalent of Tarkeeb in other languages is Linguistic Annotation. The field is so broad that there is even a generalized framework for Linguistic Annotation called Universal Dependencies.

Here is the explanation of UD:

Universal Dependencies (UD) is a framework for consistent annotation of grammar (parts of speech, morphological features, and syntactic dependencies) across different human languages.
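To make the three layers in that description concrete, here is a rough sketch in Python that reads a tiny sentence in CoNLL-U, the ten-column, tab-separated text format that UD treebanks are distributed in. The sample annotation is my own illustration of the textbook sentence "daraba zaydun amran" ("Zayd hit Amr"), transliterated to ASCII; it is not taken from any UD treebank, and the names Token, parse_feats, parse_conllu and describe are made up for this example.

```python
from dataclasses import dataclass

# The ten tab-separated CoNLL-U columns, in order.
COLUMNS = ["id", "form", "lemma", "upos", "xpos",
           "feats", "head", "deprel", "deps", "misc"]

# Illustrative annotation written for this post (an assumption, not data
# from a real treebank). "_" marks an empty column.
SAMPLE = "\n".join([
    "# text = daraba zaydun amran",
    "1\tdaraba\tdaraba\tVERB\t_\tAspect=Perf\t0\troot\t_\t_",
    "2\tzaydun\tzayd\tPROPN\t_\tCase=Nom\t1\tnsubj\t_\t_",
    "3\tamran\tamr\tPROPN\t_\tCase=Acc\t1\tobj\t_\t_",
])


@dataclass
class Token:
    id: int
    form: str
    upos: str      # part of speech
    feats: dict    # morphological features
    head: int      # id of the governing word, 0 for the root
    deprel: str    # syntactic relation to the head


def parse_feats(raw):
    """Turn 'Case=Nom|Number=Sing' into a dict; '_' means no features."""
    if raw == "_":
        return {}
    return dict(pair.split("=", 1) for pair in raw.split("|"))


def parse_conllu(text):
    """Parse plain word lines only. Multiword-token ranges (ids like 1-2)
    and empty nodes (ids like 1.1) are deliberately not handled here."""
    tokens = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        row = dict(zip(COLUMNS, line.split("\t")))
        tokens.append(Token(int(row["id"]), row["form"], row["upos"],
                            parse_feats(row["feats"]),
                            int(row["head"]), row["deprel"]))
    return tokens


def describe(tokens):
    """One readable line per word: form, part of speech, relation, head."""
    by_id = {t.id: t for t in tokens}
    lines = []
    for t in tokens:
        governor = by_id[t.head].form if t.head else "ROOT"
        lines.append(f"{t.form} ({t.upos}) --{t.deprel}--> {governor}")
    return lines


if __name__ == "__main__":
    for line in describe(parse_conllu(SAMPLE)):
        print(line)
```

Each printed line pairs a word with its part of speech and the word it depends on, which is the dependency layer of the annotation in its simplest form.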

How I discovered Linguistic Annotation

As I progress in my learning of Classical Arabic, the geek in me has been exploring various software related to different areas of language (and Arabic). Despite what the Quran Corpus offers on its download page, it is very much a closed-source project that cannot be hosted locally. I wanted to host my own instance of the software in order to experiment with applying Linguistic Annotation to other Islamic texts like Hadith (a future article will be dedicated to explaining this fully). Through a major yak-shave, I uncovered both a more accurate definition (see below) and label (Linguistic Annotation) for the concept, as well as a bunch of Tools for Corpus Linguistics.

A clearer definition of Tarkeeb

Tarkeeb would basically be:

The annotation of the (textual) Arabic language with various concepts related to understanding language (like parts of speech, morphological features and dependency graphs)
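As a minimal sketch of that definition, the Python below keeps the text untouched and stores the annotations beside it: one span per word carrying a part of speech and morphological features, plus relations between spans forming a small dependency graph. Everything here is an assumption made for illustration: the Span and Relation classes, the feature names, and the sentence itself (the textbook "daraba zaydun amran", "Zayd hit Amr", in ASCII transliteration). It is not the data model of the Quran Corpus or of any annotation tool.

```python
from dataclasses import dataclass, field

# ASCII transliteration of the textbook sentence "Zayd hit Amr".
TEXT = "daraba zaydun amran"


@dataclass
class Span:
    start: int   # character offset where the word begins
    end: int     # character offset just past the word
    pos: str     # part of speech
    features: dict = field(default_factory=dict)  # morphological features


@dataclass
class Relation:
    head: int        # index of the governing span
    dependent: int   # index of the governed span
    label: str       # grammatical role of the dependent


# Hypothetical annotations written by hand for this example.
spans = [
    Span(0, 6, "verb (fi'l)", {"tense": "past"}),
    Span(7, 13, "noun (ism)", {"case": "nominative"}),
    Span(14, 19, "noun (ism)", {"case": "accusative"}),
]
relations = [
    Relation(head=0, dependent=1, label="subject (fa'il)"),
    Relation(head=0, dependent=2, label="object (maf'ul bihi)"),
]


def surface(span):
    """Return the piece of TEXT that a span covers."""
    return TEXT[span.start:span.end]


def explain(spans, relations):
    """Describe each word: what it is, its features, and its role."""
    role_of = {r.dependent: r for r in relations}
    lines = []
    for i, span in enumerate(spans):
        feats = ", ".join(f"{k}={v}" for k, v in span.features.items())
        rel = role_of.get(i)
        if rel is None:
            role = "head of the sentence"
        else:
            role = f"{rel.label} of {surface(spans[rel.head])}"
        lines.append(f"{surface(span)}: {span.pos}; {feats}; {role}")
    return lines


if __name__ == "__main__":
    for line in explain(spans, relations):
        print(line)
```

Keeping the annotations separate from the text means more layers can be added later without ever editing the original words.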

A picture is worth a thousand words, so here is a borrowed image from the Quran Corpus:

Surah Ikhlas

My definition is probably inaccurate to academic linguists, but it is easier for laypeople to understand. A case in point is the Wikipedia explanation (linked above), which is too burdensome for laypeople because it assumes the various domain-specific definitions that only academic linguists would know.

Annotation software

I shared a link above to various Open Source Corpus Linguistics tools, but the ones best suited for Tarkeeb are the annotation tools. I tried a few that were closest in structure to what the Quran Corpus has, like Flat. I was biased towards Python options, as those would be the easiest for me to modify if I needed to change the code. After some deeper investigation I found INCEpTION via WebAnno. INCEpTION is really amazing and has practically everything I need for a possible clone of the Quran Corpus. In a future article I will discuss how to install and run INCEpTION locally.

Many of the advanced Open Source language-annotation projects are written in Java. The complexity of some of these projects is profound, and I have a new-found respect for the often-derided language (which I am still not a fan of, due to its C-like syntax and how hard it is to grasp compared to Python).


If you have any thoughts on Tarkeeb, learning Arabic, or Open Source software (related to what I discussed above), drop me an email and we can share our thoughts on the subject.

[1] Al-Ajurumiyah is a famous Arabic grammar book written by Abu `Abdillah Muḥammad b. Muḥammad b. Dāwūd as-Sinhājī.