MYSHBAH.AI — مشباح | Bridging Arabic Heritage & Artificial Intelligence

مشباح

MYSHBAH · AI

Bridging Arabic Heritage & Artificial Intelligence

The first academically verified Arabic heritage dataset company. Pre-Islamic and early Islamic classical Arabic content (structured, annotated, and authenticated by Arab scholars) for the sovereign Arabic AI programmes building the next generation of Arabic language models.

Request a Scoping Conversation · How It Works

~0.5% Arabic on the Web · Fanar 2.0

~18% Native Arabic Tokens · Jais

~41% Arabic Tokens · Fanar 1.0

13% Arabic Books · ALLaM

~0% Pre-Islamic Heritage · Any Model

The Problem

Arabic AI has never seen the language at its roots

The pre-Islamic Jahiliyya era produced the foundational texts of the Arabic language — the Mu'allaqat, the tribal histories, the poetry that preserved an entire civilisation's memory for over fourteen centuries. Arab scholars have verified, annotated, and taught this heritage for generations.

Arabic AI has never encountered it. The knowledge exists: it sits in university libraries, doctoral theses, and annotated manuscripts that only specialists know how to find. MYSHBAH.AI was founded to build the bridge from the shelf to the system.

"The question is not whether AI will mediate the world's relationship with Arabic heritage. It already does. The question is whether that mediation will be grounded in verified scholarly knowledge — or in Wikipedia and the contents of the internet."

Targeted Demonstration — Five AI Systems · Four Questions

Five AI systems, including two purpose-built Arabic LLMs, tested in Arabic on the War of al-Basus with no prior context (2026)

Claude: Could not identify the camel's owner. Could not reproduce the Verses of Death. Admitted ignorance: the content does not exist in its training data.

Copilot: Misattributed the camel. Fabricated a verse and presented it as authentic classical text.

Gemini: Misattributed the camel. Disclosed its source: Wikipedia and Facebook. Illustrated historical figures with actors from a modern television drama.

Fanar 2.0 ★: Misattributed the camel and confused the woman's name with the camel's name. Inverted the war's causation entirely. The Arabic-centric model performed worse than the general-purpose systems.

Jais v2 ★: Misattributed the camel. Fabricated martial verses in the wrong genre. Contradicted itself on the most basic relationship in the story within a single response.

A conflict taught in Arab school curricula — misrepresented by every AI system tested, including those built specifically for Arabic.

★ Purpose-built Arabic-centric LLM

The Solution

Verified datasets — civilisational memory, machine-readable

A verified dataset is not a technology product. It is a scholarly judgement, made permanent. When a professor of classical Arabic confirms that a root attribution is correct, that confirmation becomes part of what Arabic AI will learn. The scholar's name, institution, and published expertise are embedded in every licensing agreement.

The mishkah (المشكاة, the niche) holds the light. The misbah (المصباح, the lamp) is the light. In this framework the scholar is not a service provider; the scholar is the product.

"The dataset does not exist without the scholar."

Step 01

Verified Arab Academic Sources

Content is drawn from Arab academic critical editions, research centre publications, university faculty research, and doctoral theses across the Arab world, provided each source carries verifiable academic attribution. No machine-translated content.

Step 02

Automated First-Pass Annotation

Each text is computationally segmented and morphologically tagged at scale — producing a structured draft ready for expert review.

Step 03

Three-Scholar Human Verification

Morphology, cultural context, and literary structure — each dimension verified by a named expert with published academic credentials.

Step 04

Delivery — Base Format & Pre-embedded

Standard structured data format for any AI pipeline, or pre-embedded for immediate deployment in Azure AI Search and other vector databases.
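
As a minimal sketch of what the four steps above might produce, assuming a Python pipeline: each record keeps its source, its machine first-pass tags, and one named sign-off per verification dimension, and only fully signed-off records are exported. Every name here (`VerseRecord`, `Verification`, `to_jsonl`, the field names, and the JSON Lines output) is an illustrative assumption rather than MYSHBAH.AI's published schema, and the sample values are placeholders.

```python
import json
from dataclasses import dataclass, field, asdict

# The three verification dimensions named in Step 03.
DIMENSIONS = ("morphology", "cultural_context", "literary_structure")


@dataclass
class Verification:
    dimension: str    # one of DIMENSIONS
    scholar: str      # named expert (placeholder values in the usage below)
    institution: str


@dataclass
class VerseRecord:
    source: str    # Step 01: the academic edition or thesis the text came from
    text: str      # the verse itself
    tokens: list = field(default_factory=list)         # Step 02: machine first-pass tags
    verifications: list = field(default_factory=list)  # Step 03: scholar sign-offs

    def is_authenticated(self) -> bool:
        """True only when every dimension has at least one named sign-off."""
        signed = {v.dimension for v in self.verifications}
        return set(DIMENSIONS) <= signed


def to_jsonl(records) -> str:
    """Step 04: serialise fully authenticated records, one JSON object per line."""
    return "\n".join(
        json.dumps(asdict(r), ensure_ascii=False)
        for r in records
        if r.is_authenticated()
    )


# Usage with placeholder values only (no real verse, scholar, or institution).
record = VerseRecord(source="<critical edition>", text="<verse text>")
for dim in DIMENSIONS:
    record.verifications.append(Verification(dim, "<scholar name>", "<institution>"))
print(to_jsonl([record]))
```

Gating the export on `is_authenticated()` is one way to make the attribution a precondition of delivery rather than metadata added afterwards; a record with only a machine draft or a partial review never reaches the output.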

The Three-Scholar Model

One text. Three dimensions of verified knowledge.

Morphology & Diacritics

Classical Arabic morphology and pre-Islamic prosody. Root attribution, diacritical rendering, grammatical tagging, and disambiguation of ambiguous classical forms.

Cultural & Tribal Context

Pre-Islamic Arabian history and Jahiliyya social structures. Tribal references, honour codes, historical allusions, and the social meaning of specific lexical choices.

Literary Structure

Classical Arabic poetry and the Mu'allaqat canon. Metre, rhyme, intertextual references, thematic classification, and verse-level literary significance.

Research Agenda

The gap is not just in the data — it is in how we measure what is missing

The Benchmark Problem

Existing Arabic LLM benchmarks, including ArabicMMLU and ACVA, measure factual recall, mathematical reasoning, and dialectal comprehension. No shared standard exists for civilisational awareness: the capacity to represent the pre-Islamic and classical foundations upon which the Arabic language was built. These benchmarks represent real progress, but they do not measure the foundational layer.

The Digitisation Imperative

Decades of verified classical Arabic scholarship sit in university archives across the Arab world, untouched by any AI training pipeline. Arabic represents approximately 0.5% of indexed web content despite 400 million native speakers. The material exists. The pipeline does not. Nationally sponsored digitisation programmes, developed in partnership between Arab universities, language research centres, and national archives, represent the most direct path forward.

The Partnership Call

A properly scaled benchmark for classical Arabic heritage knowledge requires hundreds of questions spanning multiple episodes, poetic genres, and historical periods, designed through coordinated collaboration between classical Arabic scholars, Arabic NLP researchers, and AI evaluation teams. MYSHBAH.AI is building the data infrastructure that makes this possible. We call on Arab universities, language research centres, and Arabic LLM developers to build it together.
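
To make the coverage requirement concrete, here is a small sketch, assuming benchmark questions are tagged along the three axes named above (episode, genre, period). The `BenchmarkItem` class, the `coverage` and `gaps` helpers, and the labels are hypothetical illustrations, not an existing benchmark format; a real taxonomy would be defined by the participating scholars.

```python
from collections import Counter
from dataclasses import dataclass

AXES = ("episode", "genre", "period")


@dataclass(frozen=True)
class BenchmarkItem:
    question: str
    episode: str
    genre: str
    period: str


def coverage(items):
    """Count questions per label on each axis."""
    return {axis: Counter(getattr(item, axis) for item in items) for axis in AXES}


def gaps(items, required):
    """List the required labels on each axis that no question covers yet."""
    counts = coverage(items)
    return {
        axis: sorted(label for label in labels if counts[axis][label] == 0)
        for axis, labels in required.items()
    }


# Illustrative labels only; questions and genres are placeholders.
items = [
    BenchmarkItem("<question 1>", "War of al-Basus", "<genre A>", "pre-Islamic"),
    BenchmarkItem("<question 2>", "War of al-Basus", "<genre B>", "pre-Islamic"),
]
required = {
    "period": ["pre-Islamic", "early Islamic"],
    "genre": ["<genre A>", "<genre B>"],
}
print(gaps(items, required))
```

A tally like this shows at a glance where a question set is thin, for example a set drawn entirely from one episode, before any model is evaluated against it.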

Built For

Three pathways to Arabic heritage intelligence

نموذج (Model)

Sovereign Arabic LLM Programmes

Training and RAG deployment rights for national Arabic AI models. Verified classical Arabic heritage content that measurably improves cultural and historical knowledge performance.

بحث (Search)

RAG-Enabled Enterprise AI

Annual deployment licences for organisations that use Azure OpenAI or similar services and need a verified Arabic heritage knowledge base without training their own model.

تراث (Heritage)

Academic & Cultural Institutions

Research partnerships with Arab universities and cultural institutions. Scholar attribution, co-authorship on the Arabic Heritage benchmark study, and archive collaboration.

The Pilot

'Antara ibn Shaddad's Mu'allaqa — the opening commission

The pilot corpus is 'Antara ibn Shaddad's Mu'allaqa — one of the seven canonical pre-Islamic odes, among the most studied texts in Arabic literary scholarship, and the centrepiece of Arabic literature curricula across the Arab world.

This is not a test of whether the content matters. Every Arabic AI that encounters a question about 'Antara, pre-Islamic Arabian tribal culture, or the Mu'allaqat tradition will demonstrate the gap immediately. The pilot demonstrates that MYSHBAH.AI can close it.

"The difference between an AI that has read the poem and an AI that has studied under its greatest living interpreter."

What the Pilot Delivers

—Verified structured corpus of the full Mu'allaqa — morphologically annotated, culturally contextualised, three-scholar authenticated
—Pre-embedded version compatible with Azure AI Search and major vector databases for immediate deployment
—Academic verification certificate with named scholars, institutional affiliations, and methodology documentation
—Live before/after demonstration: model responses with and without the verified corpus
—Full schema documentation ready to scale to the Seven Mu'allaqat and the Antara subject dataset
—A clear pathway to the Seven Mu'allaqat corpus and extended subject datasets — available to scope upon completion of the pilot

Scope

Scoped to your programme's requirements — corpus size, annotation depth, and deployment format.

Request a Scoping Conversation

Academic Partnership · Currently in Formation

For scholars of Arabic heritage — an invitation

There was a time when Arabic was to the world what English is today. From the 8th to the 13th century, to be a scholar anywhere between the Atlantic and the Indian Ocean was to read and write in Arabic. Persian scientists, Turkish administrators, and Jewish and Christian scholars in Andalusia all conducted their intellectual lives in Arabic — because Arabic was where human knowledge lived.

That civilisation did not begin with Islam. It began before it — in the poetry and oral histories of the Jahiliyya era. The scholars who have spent careers in this heritage are the only ones who can ensure that Arabic AI learns it correctly.

We invite scholars to:

—Verify a defined classical Arabic heritage corpus — with permanent attribution on every licensing agreement
—Contribute as a named co-author on the benchmark evaluation study — the next stage of MYSHBAH.AI's research programme on classical Arabic heritage knowledge in AI
—Open archive materials — manuscripts, unpublished annotated editions, rare texts — for structured digital preservation

The Arabic Heritage Data Report

MYSHBAH.AI has produced the first cross-model evaluation of the pre-Islamic Arabic heritage data gap in large language models: five AI systems, including two purpose-built Arabic LLMs, all failing consistently on foundational pre-Islamic knowledge. Academic partnerships with Arab universities, language research centres, and NLP institutions are currently in formation. Scholars joining now will be primary named authors on the benchmark study that follows.

Academic Partnership Enquiry