Edit a project
Working environment
Research unit affiliated with the CNRS
IRIF
LIPN
LMF
Information about the project's scientific lead:
First name
Last name
Email
Presentation of the scientific project
Project name
Publishable summary of the project (10 lines max.)
The reconstruction of set-theoretic types for dynamic languages can be prohibitively time-consuming. Our current prototype, [CDuce DynLang](https://www.cduce.org/dynlang/), takes an untyped program written in an "idealized" dynamic language, reconstructs an annotation tree for the program, and checks its typing. Although checking whether an annotation tree makes a program well-typed is extremely fast, the reconstruction of the annotation tree can be very slow. The goal of this project is to optimize the type inference process by fine-tuning a Large Language Model (LLM) to produce a (possibly partial) annotation tree for a program. The type-checker will subsequently verify the annotation tree, resorting to the slower reconstruction only if the check fails. By integrating the fine-tuned LLM into the type inference process, we aim to significantly reduce the time required for reconstructing set-theoretic types in dynamic languages while maintaining a high level of accuracy.
Thematic fields addressed (comma-separated)
Type
proof-of-concept
prototype
maintenance/evolution of existing software
pre-maturation
Describe the innovations expected at the end of the project (5 lines max.)
By integrating the fine-tuned LLM into the type inference process, we aim to significantly reduce the time required for reconstructing set-theoretic types in dynamic languages while maintaining a high level of accuracy.
Scientific case describing the project, the state of the art, prior work, and expected results (20 lines max.)
The reconstruction of set-theoretic types for dynamic languages can be prohibitively time-consuming. Our current prototype, [CDuce DynLang](https://www.cduce.org/dynlang/), takes an untyped program written in an "idealized" dynamic language, reconstructs an annotation tree for the program, and checks its typing. Although checking whether an annotation tree makes a program well-typed is extremely fast, the reconstruction of the annotation tree can be very slow. The goal of this project is to optimize the type inference process by fine-tuning a Large Language Model (LLM) to produce a (possibly partial) annotation tree for a program. The type-checker will subsequently verify the annotation tree, resorting to the slower reconstruction only if the check fails.
The project will be developed in the following phases:
1. Randomly generate well-formed programs to create a diverse dataset.
2. Run type inference on the generated programs to produce a dataset of (program, annotation tree) pairs.
3. Clean up the annotations in the dataset, keeping only the most relevant information to improve the LLM's performance. This is a key step of the research, since the goal is to determine which information in an annotation is relevant. In particular, we must strike a balance: enough information for the subsequent partial checking/inference to run efficiently, but not so much that the subsequent inference misses important solutions.
4. Split the dataset into training and testing subsets to evaluate the LLM's accuracy and generalization capabilities.
5. Fine-tune an LLM on the LIP6 cluster, utilizing its computational resources for efficient model training.
By integrating the fine-tuned LLM into the type inference process, we aim to significantly reduce the time required for reconstructing set-theoretic types in dynamic languages while maintaining a high level of accuracy.
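The check-then-fallback strategy described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the CDuce DynLang implementation: `infer_with_fallback`, `predict`, `check`, `reconstruct`, and the toy stand-ins are hypothetical names, and partial annotation trees are not modeled.

```python
from typing import Callable, Optional, Tuple, TypeVar

Program = TypeVar("Program")
Annotation = TypeVar("Annotation")


def infer_with_fallback(
    program: Program,
    predict: Callable[[Program], Optional[Annotation]],
    check: Callable[[Program, Annotation], bool],
    reconstruct: Callable[[Program], Annotation],
) -> Tuple[Annotation, str]:
    """Return an annotation accepted by the checker and the path that produced it."""
    candidate = predict(program)  # hypothetical model call; may return None
    if candidate is not None and check(program, candidate):
        return candidate, "predicted"  # fast path: the cheap check succeeded
    return reconstruct(program), "reconstructed"  # slow path: full reconstruction


# Toy stand-ins, for illustration only: a "program" is a Python literal and
# its "annotation" is the name of its Python type.
def toy_predict(program):
    return "int"  # a model that always guesses "int"


def toy_check(program, annotation):
    return type(program).__name__ == annotation


def toy_reconstruct(program):
    return type(program).__name__


print(infer_with_fallback(3, toy_predict, toy_check, toy_reconstruct))
print(infer_with_fallback("a", toy_predict, toy_check, toy_reconstruct))
```

Because the checker is the final judge on the fast path, a wrong prediction costs only one cheap check plus the reconstruction that would have run anyway.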
Description of the engineering need
Estimate the engineering time you need for your project:
Frequency
2 days/week
3 days/week
Duration
1 month
2 months
3 months
4 months
5 months
6 months
What specific engineering skills are expected; in particular, which languages or frameworks will be used in the project, which existing tools, and what knowledge is required?
To successfully complete the project, the following programming skills and knowledge areas are desired:
1. Proficiency in OCaml, the language of our prototype, and possibly in Python, which is widely used in machine learning and natural language processing projects. The exact language requirements may depend on the existing prototype and on the LLM framework chosen.
2. Familiarity with the basic principles of Large Language Models (LLMs) or other statistical models: understanding how LLMs work and how they are applied may be useful for fine-tuning the model in this project.
3. Experience with machine learning frameworks: familiarity with popular machine learning libraries such as TensorFlow, PyTorch, or scikit-learn can be advantageous for implementing and fine-tuning the LLM.
Describe the activities that will be assigned to the engineer(s) (if possible, with a schedule of expected tasks)
Tasks for the Software Development Engineer:
1. Program generation.
2. Modify the prototype to generate the dataset with cleaned annotations. The prototype should be flexible enough to seamlessly test the different annotation variants suggested by the researchers.
3. Fine-tuning of the LLM.
4. Benchmarking.
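As a rough illustration of the program-generation task and of the dataset split used for benchmarking, the following Python sketch generates random expressions and splits them reproducibly. Everything here is an assumption made for illustration: the toy grammar is not the DynLang syntax (and ignores variable scoping), `random_program` and `split_dataset` are hypothetical helpers, and the annotation is a placeholder string standing in for the output of the prototype.

```python
import random
from typing import List, Sequence, Tuple


def random_program(rng: random.Random, depth: int = 3) -> str:
    """Generate a syntactically well-formed expression of a toy language."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(["x", "42", "true", '"s"'])  # leaves
    kind = rng.choice(["fun", "app", "if"])
    if kind == "fun":
        return f"(fun x -> {random_program(rng, depth - 1)})"
    if kind == "app":
        return f"({random_program(rng, depth - 1)} {random_program(rng, depth - 1)})"
    return (
        f"(if {random_program(rng, depth - 1)} "
        f"then {random_program(rng, depth - 1)} "
        f"else {random_program(rng, depth - 1)})"
    )


def split_dataset(
    pairs: Sequence[Tuple[str, str]], test_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str]]]:
    """Shuffle with a fixed seed and return (train, test)."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rng = random.Random(0)  # fixed seed so the dataset can be regenerated
programs = [random_program(rng) for _ in range(100)]
# Placeholder annotation: in the project this would come from the prototype.
pairs = [(p, "<annotation tree>") for p in programs]
train, test = split_dataset(pairs)
print(len(train), len(test))
```

Fixing the seeds keeps benchmarks comparable when the researchers change the annotation format and the dataset has to be regenerated.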
Comments