Large language models (LLMs) represent a significant advance in artificial intelligence, designed to understand, produce, and manipulate human language with high accuracy. These models, such as OpenAI's GPT-4, are built using deep learning techniques, specifically transformer architectures, which allow them to process and generate text after training on vast amounts of data. LLMs can perform a wide range of tasks, including answering questions, summarizing documents, translating languages, and even creating original content.
Their ability to mimic human language understanding and production makes them powerful tools in a variety of fields, from customer service and content creation to research and development. However, their use also raises important ethical concerns, including bias, misinformation, and the responsible use of AI technology.
Training Large Language Models
Continual pre-training
Continual pre-training of LMs, specifically domain-adaptive pre-training (continual DAP-training), is a method for continually training an LM on a sequence of unlabeled domain corpora, adapting the LM to those domains and improving its end-task performance. In this approach, a pre-trained model is further trained on new datasets; it is like adding new chapters to an already written book. DAP-training can be achieved by directly updating the LM or by training only a small set of additional parameters. While additional modules such as adapters and prompts can be effective, knowledge transfer among them is usually challenging and can be imprecise.
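As a concrete illustration, here is a minimal sketch of the directly-updating-the-LM variant, using the Hugging Face transformers and datasets libraries. The checkpoint name, corpus path, and hyperparameters are placeholders, not part of any specific DAP-training recipe.

```python
# Minimal sketch of continual domain-adaptive pre-training (DAP-training):
# take an already pre-trained causal LM and keep training it on an unlabeled
# domain corpus. Checkpoint name and corpus path are placeholders.
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)
from datasets import load_dataset

model_name = "gpt2"  # any pre-trained causal LM checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Unlabeled domain corpus: one text sample per line (hypothetical path).
corpus = load_dataset("text", data_files={"train": "domain_corpus.txt"})["train"]
corpus = corpus.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="dap-model", num_train_epochs=1,
                           per_device_train_batch_size=4),
    train_dataset=corpus,
    # Causal LM objective: the collator derives labels from the inputs.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # continue pre-training on the new domain
```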
Instruction tuning
Fine-tuning is the process of adjusting the parameters of a large pre-trained language model for a specific domain. Although pre-trained language models such as GPT have extensive linguistic knowledge, they lack domain expertise. Fine-tuning overcomes this limitation by letting the model learn from domain-specific data, making it more accurate and effective for targeted applications. Instruction tuning is a form of fine-tuning in which the training data consists of instruction-response pairs, teaching the model to follow natural-language instructions. By exposing the model to a specific domain during fine-tuning, it gains a deeper understanding of that domain's nuances. Fine-tuning bridges the gap between a general-purpose language model and a specialized one, unlocking the full potential of LLMs in specific domains or applications.
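A minimal sketch of instruction tuning with PyTorch and Hugging Face transformers follows. The checkpoint, the prompt template, and the toy example pair are all illustrative assumptions, not a prescribed format.

```python
# Minimal sketch of instruction tuning: fine-tune a pre-trained causal LM on
# (instruction, response) pairs. Model name and data are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

examples = [  # hypothetical domain-specific instruction data
    {"instruction": "Summarize: LLMs are trained on large text corpora.",
     "response": "LLMs learn language patterns from large text corpora."},
]

model.train()
for ex in examples:
    text = f"### Instruction:\n{ex['instruction']}\n### Response:\n{ex['response']}"
    batch = tokenizer(text, return_tensors="pt")
    # Standard causal-LM loss over the full sequence; production setups often
    # mask the instruction tokens so only the response contributes to the loss.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```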
From scratch
Building a large language model (LLM) from scratch is a challenging but rewarding process that requires deep knowledge of machine learning, natural language processing, and software engineering. In this approach, the model is designed and built entirely from the ground up, meaning no pre-trained model or existing code is reused. Building an LLM from scratch and adapting an existing LLM both involve training models to understand and produce human-like text, but they differ significantly in scope, complexity, and resource requirements. Here is an overview of what the from-scratch process involves (a minimal code sketch follows the list):
Architectural design
Development of a new neural network architecture for language modeling
Deciding on layers, nodes, attention mechanisms, and other details of the model
Data collection
Gathering a large dataset of diverse texts to train the model
Ensuring the quality, diversity, and representativeness of the data
Training process
Initializing model parameters randomly
Using powerful computational resources
Implementing training regimes, optimization algorithms, and regularization techniques to ensure efficient learning
Evaluation and iteration
Continuously evaluating the model based on various metrics
Adjusting architecture and training processes based on performance feedback
Repeating multiple training and evaluation cycles
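The sketch below ties the list together: a tiny decoder-only transformer defined from scratch in PyTorch, with randomly initialized parameters and a next-token cross-entropy loss. All sizes are illustrative; production LLMs scale these dimensions by orders of magnitude.

```python
# Minimal from-scratch sketch: a tiny decoder-only transformer language model
# in PyTorch. Sizes are illustrative; real LLMs scale these up enormously.
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    def __init__(self, vocab_size=32000, d_model=256, n_heads=4,
                 n_layers=4, max_len=512):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)   # token embeddings
        self.pos = nn.Embedding(max_len, d_model)      # learned positions
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model,
                                           batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)     # next-token logits

    def forward(self, ids):
        t = ids.size(1)
        x = self.tok(ids) + self.pos(torch.arange(t, device=ids.device))
        # Causal mask: each position may attend only to earlier positions.
        mask = torch.triu(torch.full((t, t), float("-inf"),
                                     device=ids.device), diagonal=1)
        return self.head(self.blocks(x, mask=mask))

model = TinyLM()                       # parameters initialized randomly
ids = torch.randint(0, 32000, (2, 16)) # dummy token batch
logits = model(ids)                    # (2, 16, vocab_size)
# Next-token prediction: shift targets left by one position.
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, 32000), ids[:, 1:].reshape(-1))
```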
Learning from human feedback
These advanced techniques aim to align LLMs with human preferences and ethical guidelines.
Direct Preference Optimization (DPO): This method trains the model directly on human preference data, typically pairs of a preferred and a rejected response to the same prompt, without fitting a separate reward model. Optimizing on this explicit feedback guides the training process so that the model's outputs align with what humans find useful and appropriate.
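A minimal sketch of the DPO objective in PyTorch, assuming per-response log-probabilities (summed over response tokens) have already been computed for the trained policy and a frozen reference model; beta is the standard DPO temperature hyperparameter.

```python
# Sketch of the DPO loss: given a human-preferred response y_w and a rejected
# response y_l, push the policy's log-probability margin above the frozen
# reference model's margin. Inputs are summed response log-probabilities.
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    policy_margin = policy_logp_w - policy_logp_l
    ref_margin = ref_logp_w - ref_logp_l
    # -log(sigmoid(x)) is computed as softplus(-x) for numerical stability.
    return F.softplus(-beta * (policy_margin - ref_margin)).mean()

# Toy usage with made-up log-probabilities for a batch of two pairs:
loss = dpo_loss(torch.tensor([-5.0, -6.0]), torch.tensor([-7.0, -6.5]),
                torch.tensor([-5.5, -6.2]), torch.tensor([-6.8, -6.4]))
```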
Reinforcement Learning from Human Feedback (RLHF): RLHF combines reinforcement learning techniques with human feedback to improve the model’s performance. Human evaluators provide feedback on the model’s outputs, which is then used to fine-tune the model through reinforcement learning algorithms. This method ensures that the model not only performs well on traditional metrics but also produces outputs that are more aligned with human values and preferences.
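RLHF is typically a multi-stage pipeline; the sketch below covers only its first stage, reward modeling, with a toy scalar head and placeholder embeddings standing in for a full LM backbone. The trained reward model's scores would then drive RL fine-tuning with an algorithm such as PPO.

```python
# Sketch of the first RLHF stage: train a reward model on human preference
# pairs with a Bradley-Terry loss, so preferred outputs receive higher scores.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardHead(nn.Module):
    """Maps a pooled sequence embedding to a scalar reward (backbone omitted)."""
    def __init__(self, hidden_size=256):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1)

    def forward(self, h):            # h: (batch, hidden_size) pooled states
        return self.score(h).squeeze(-1)

reward_model = RewardHead()
h_chosen = torch.randn(4, 256)       # embeddings of human-preferred outputs
h_rejected = torch.randn(4, 256)     # embeddings of rejected outputs
# Preference loss: -log sigmoid(r_chosen - r_rejected), via softplus.
loss = F.softplus(reward_model(h_rejected) - reward_model(h_chosen)).mean()
loss.backward()
```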
Additional Considerations
Ethical Issues and Bias
Training LLMs requires careful attention to ethical issues and potential biases in the data. Ensuring diversity in the training data and implementing techniques to reduce biases are critical steps.
Scalability
The ability to scale training processes to larger datasets and more complex models is a significant challenge that requires robust infrastructure and efficient algorithms.
Efficiency
Developing methods to train models more efficiently, such as optimizing computational resources and reducing training time, is an ongoing area of research.
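Two widely used efficiency techniques, sketched in PyTorch with a placeholder model: mixed-precision computation via autocast, and gradient accumulation, which simulates a larger effective batch size within a fixed memory budget.

```python
# Sketch of two common training-efficiency techniques: bfloat16 autocast
# (mixed precision) and gradient accumulation. The tiny model and synthetic
# data are placeholders for an actual LLM training setup.
import torch
import torch.nn as nn

model = nn.Linear(128, 10)                 # stand-in for an LLM
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
accum_steps = 4                            # effective batch = 4 * micro-batch

for step in range(16):
    x = torch.randn(8, 128)                # micro-batch of inputs
    y = torch.randint(0, 10, (8,))
    with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
        loss = nn.functional.cross_entropy(model(x), y) / accum_steps
    loss.backward()                        # gradients accumulate across steps
    if (step + 1) % accum_steps == 0:
        optimizer.step()                   # one update per accumulation window
        optimizer.zero_grad()
```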
Large language models are at the forefront of advancements in natural language processing, with diverse training methods that cater to different needs and applications. Training from scratch provides full control but requires extensive resources. Fine-tuning and continual pre-training offer practical ways to adapt and update existing models. Techniques like DPO and RLHF ensure that models align with human preferences and ethical standards. As LLMs continue to evolve, these training methods will play a critical role in shaping the future of AI in language processing.
MCINEXT, in partnership with MCI (Mobile Telecommunication Company of Iran), is proud to present its language models: Sialk (1.3 billion parameters), Ahoran (8 billion parameters), and Ava (13 billion parameters). These models were produced, respectively, by training from scratch, continual pre-training, and instruction fine-tuning.
Introduction to Language Models
Sialk Language Model
The Sialk language model was developed from scratch by MCINEXT as a 1.3-billion-parameter Persian model. It was trained on Persian-language datasets and, thanks to its small size, is highly efficient in terms of speed.
Ahoran Large Language Model
The Ahoran large language model was developed by MCINEXT through continual pre-training to create a native Persian large language model with 8 billion parameters. Ahoran has been continually trained on Persian text corpora and is regularly updated with new datasets.
Ava Large Language Model
Ava, developed by MCINEXT in collaboration with MCI, is a Persian language model with 13 billion parameters. Built on Cohere's Aya model and fine-tuned by the MCINEXT team, Ava can answer general-knowledge questions and especially excels at RAG (Retrieval-Augmented Generation).
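For readers unfamiliar with the pattern, here is a minimal, library-free sketch of the RAG flow: retrieve relevant passages, then condition generation on them. The keyword retriever and prompt template are toy stand-ins, not Ava's actual interface.

```python
# Minimal sketch of the RAG (Retrieval-Augmented Generation) pattern: retrieve
# the most relevant documents for a query, then build a grounded prompt for
# the language model. The retriever here is a toy keyword scorer.
def retrieve(query, documents, k=2):
    """Rank documents by naive keyword overlap with the query."""
    words = set(query.lower().split())
    ranked = sorted(documents,
                    key=lambda d: -len(words & set(d.lower().split())))
    return ranked[:k]

def build_prompt(query, passages):
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

docs = ["Tehran is the capital of Iran.",
        "Persian is an Indo-Iranian language.",
        "RAG grounds answers in retrieved text."]
query = "What is the capital of Iran?"
prompt = build_prompt(query, retrieve(query, docs))
# `prompt` would then be sent to the language model for grounded generation.
```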