Detection and Description of Neologisms in Korean Lexicography: Methodological Issues in Corpus Balance, Word Unit Bias and LLM Assistance

Nam, Kilim; Lee, Soojin; Jung, Hae-Yun

doi:10.5788/35-1-2045

Lexikos

On-line version ISSN 2224-0039
Print version ISSN 1684-4904

Abstract

NAM, Kilim; LEE, Soojin and JUNG, Hae-Yun. Detection and Description of Neologisms in Korean Lexicography: Methodological Issues in Corpus Balance, Word Unit Bias and LLM Assistance. Lexikos [online]. 2025, vol.35, pp.414-438. ISSN 2224-0039. https://doi.org/10.5788/35-1-2045.

This study explores the potential application of large language models (LLMs) in Korean neologism extraction and dictionary compilation while critically examining the limitations of existing methods, including the bias toward news-oriented data and morphological neologisms. By analysing data from news corpora alongside messenger and online post corpora, the study identifies significant limitations in current news-centred approaches, particularly in detecting the first occurrences and extracting neologisms related to everyday topics. Experimental results involving LLMs demonstrate their potential to address the limitations of news-biased neologism extraction by suggesting unregistered words from diverse web-based contexts. However, issues such as duplication and overgeneration persist. In tasks involving semantic neologism recommendation and dictionary microstructure creation, LLMs performed relatively well with high-frequency and news-biased topics when provided with additional contextual prompts, yet revealed limitations with low-frequency and non-news-biased neologisms. These findings suggest that the performance of current LLMs heavily relies on the diversity of training data and user-provided contextual information. The results of this study underscore the need for further investigation into the critical challenges in neologism research, lexicography, and corpus linguistics, as well as the role lexicography might play in enhancing the performance of LLMs.

Keywords: lexicography; neologisms; unregistered words; news corpus; semantic neologism; representativeness; balance; lexicographic data; macro-structure; large language models.
