This morning when you woke up, you might have asked your smartphone assistant about the weather or sought advice from ChatGPT about work. Technology has become ingrained in our lives, but there’s a delicate issue: Do the “smart” systems we use actually speak Uzbek or do they simply translate English thoughts into Uzbek? This isn’t just a linguistic question; it’s a matter of digital sovereignty.
The era of digital silence is ending, but…
Over the last two years, Generative Artificial Intelligence (GenAI) has undergone a revolution. Models like ChatGPT, Gemini, and Claude can now “understand” almost every language. However, there’s a significant difference between “understanding” and “feeling.” Large Language Models (LLMs) learn primarily from open data on the internet. English accounts for roughly 45-50% of the data in Common Crawl and other large datasets, while Uzbek doesn’t even reach 0.1%. In technical terms, this is the “low-resource language” problem. What does this mean? For AI, Uzbek is not a “native language” but a “secondary language” learned with the help of a dictionary. As a result, when we interact with AI in Uzbek, we often get artificial, “polished,” and culturally Western phrasing.
The risk of “cultural hallucination”
If you ask ChatGPT to “Write about business ethics based on Uzbek national values,” it will likely present you with Western corporate culture concepts using Uzbek words. The problem is that language is not just a collection of words; it’s a code. It encapsulates a nation’s history, worldview, and logical thinking patterns. If we don’t convert high-quality academic and artistic information in Uzbek into digital format and “feed” it to AI models, future generations will begin to adopt foreign cultural patterns through AI. Analysis: AI currently plays merely a “translator” role for Uzbek users. Our goal should be to teach it to “think” in Uzbek.
Technological barrier: The agglutinative language problem
Uzbek is an agglutinative language: words are built by attaching suffixes to a root. Unlike English or Russian, a single Uzbek root can take five, six, or more suffixes and carry the meaning of an entire sentence (for example, “kelolmaydiganlardanmisiz”: “are you one of those who can’t come?”). Many global models struggle with such words because their tokenizers, trained mostly on other languages, split them into many small subword pieces (tokens) that rarely line up with Uzbek morphemes. More tokens per word means a higher cost per request, and the fragmented pieces tend to lower the quality of the response.
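To make the tokenization problem concrete, here is a minimal sketch in Python. It is a toy, not how any particular model works: production systems learn their subword vocabularies from data (for example with byte-pair encoding), whereas this example uses a simple greedy longest-match splitter and two small, hand-written vocabularies that are purely hypothetical. The point is only to show how the same word breaks into far more pieces when the vocabulary was not built with Uzbek in mind.

```python
def greedy_tokenize(word, vocab):
    """Split `word` by greedy longest match against `vocab`.

    Falls back to a single character when nothing in the vocabulary
    matches, so the tokens always join back into the original word.
    """
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest candidate starting at position i first.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            # No vocabulary entry matched: emit one character.
            tokens.append(word[i])
            i += 1
    return tokens


# Hypothetical vocabulary with only short, generic fragments
# (an assumption, standing in for a vocabulary learned mostly from other languages).
generic_vocab = {"an", "is", "ma", "la", "da", "el", "ol", "ga"}

# Hypothetical vocabulary that happens to contain the Uzbek morphemes
# (an assumption, standing in for a vocabulary learned from a large Uzbek corpus).
uzbek_aware_vocab = {"kel", "ol", "may", "digan", "lar", "dan", "mi", "siz"}

word = "kelolmaydiganlardanmisiz"

generic_tokens = greedy_tokenize(word, generic_vocab)
aware_tokens = greedy_tokenize(word, uzbek_aware_vocab)

print(len(generic_tokens), generic_tokens)
print(len(aware_tokens), aware_tokens)
```

With the Uzbek-aware vocabulary, the word splits into its eight morphemes (kel-ol-may-digan-lar-dan-mi-siz). With the generic one, it shatters into roughly twice as many fragments, most of them single letters. Since usage is typically billed and processed per token, that gap is one way the cost and quality difference described above shows up.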
Solution: National corpus and open-source initiatives
So, what should be done? It’s a mistake to wait for an external miracle. The solution is to create and develop a national digital corpus of the Uzbek language.
Digitization: Thousands of books, newspaper archives, and scientific articles in libraries must be converted to machine-readable format.
Voice data: Speech-to-Text technologies that understand various Uzbek dialects require thousands of hours of audio recordings.
Collaboration: This task cannot be accomplished by the state or the private sector alone. IT companies, linguists, and the government need to work together on open-source projects.
Language as an economic tool
In the future, a nation’s power will be measured not by its territory or mineral resources, but by its position in the digital world. If the Uzbek language doesn’t find its place in the AI language family, we will remain mere consumers of technology. For Uzbek to survive in the digital age, it’s not enough for it to remain just a “language of beautiful literature.” It must become a “Data Language.” It is not too late to start today, but tomorrow it will be. If you work in IT, contribute to open datasets in Uzbek (for example, Mozilla Common Voice or Wikipedia). This is the most significant investment in our future.