The American University of Armenia (AUA) AGBU Papazian Library has announced the launch of the project titled Digitizing Armenian Linguistic Heritage (DALiH): Armenian Multivariational Corpus and Data Processing, coordinated by Victoria Khurshudyan, National Institute for Oriental Languages and Civilizations (INALCO), SeDyL.
Funded by the French National Research Agency, the project aims to build the first open-access, open-source unified digital linguistic platform covering the full spectrum of Armenian language varieties. In particular, annotated corpora will be compiled for Classical Armenian and Modern Western Armenian, along with a pilot corpus for Middle Armenian, three pilot corpora for dialects, and an updated Modern Eastern Armenian corpus built on the existing one.
As a project partner, the Digital Library of Classical Armenian Literature (Digilib) of AUA will provide its collection of digitized texts in Classical Armenian and Western Armenian. Additionally, Digilib will support the implementation of the DALiH project through the digitization of other relevant materials.
“Discussions about the project started during the ‘Digital Armenian’ conference held in Paris in October 2018. I am excited that our efforts in making the project of Digitizing Armenian Linguistic Heritage a reality have paid off, and we can now announce its successful launch,” states Hovhannes Kizogyan, technical director of Digilib. “We plan to organize workshops for linguists in computational technology to equip them with the skills needed for the project.”
Within the DALiH project, research will be conducted from both natural language processing (NLP) and linguistic perspectives to provide full grammatical annotation and automatic speech recognition (ASR) models for the above-mentioned Armenian varieties. “Multi-approach deep-learning and rule-based resources will be designed in order to process the written and oral databases and to cross-check their value for further corpus enlargement, in a context of multiparameter language variation for an under-resourced language,” states the press release issued by INALCO, the coordinating organization of the project.
NLP-based linguistic research, in particular on automatic language identification, the calculation of the distance between varieties, and lexical and morphological disambiguation, will be conducted with the aim of revisiting existing research questions and introducing new problems supported by the written and oral data made available by the project.