Abstract: This thesis explores the application of language models to the challenge of information extraction from unstructured data, specifically Curriculum Vitae (CV) documents. Traditional algorithmic approaches often struggle with the complexity and noise inherent in such data, prompting the exploration of more advanced techniques. A benchmark of 63 annotated samples was developed to lay the foundation for model evaluation and for comparison with existing commercial tools. Several small language models, including SmolLM-1.7B, LLaMA3.2-1B, and Qwen2-1.5B, as well as larger models such as LLaMA3.2-8B and GPT-4o-mini, were fine-tuned and tested on the benchmark. The findings reveal that large transformer models outperform the other tools in accuracy, while smaller models offer a practical trade-off for resource-constrained environments. The study provides performance and error analyses, as well as insights into the impact of different training data sizes on the small models.