Using Korean benchmark datasets to develop and evaluate HyperCLOVA X and other safe and trustworthy language models
The rise of generative AI and large language models (LLMs) has changed our lives in profound ways, and we don’t have to look far to find AI models and services around us. Our two flagship offerings, HyperCLOVA X, an LLM that specializes in the Korean language, and CLOVA X, a conversational AI service built on it, are a case in point.
Along with the remarkable capabilities of LLMs, trust and safety have come to dominate recent discussions on AI. One concern is that models are trained on large amounts of data obtained via web crawling, which may increase the likelihood that they will learn and spread social stereotypes, hate speech, discrimination, and biased value judgments. Compounding the issue is the fact that we cannot expect these models to perform well on Korean-specific knowledge and reasoning if they are not trained on data that reflects Korean culture and language. How, then, do we evaluate AI ethics and safety, and how can we make LLMs learn norms and values unique to Korean society?
In this post, we introduce four benchmark datasets published by NAVER that are widely used to assess Korean-centric models: SQuARe, KoSBi, KoBBQ, and KorNAT.
SQuARe: Sensitive questions and acceptable responses
If LLMs are not careful in responding to sensitive questions, even conversations with users who have no malicious intent can take unexpected turns. Here, we focus on three types of prompts commonly asked in real life: contentious questions, ethical questions, and questions that ask LLMs to predict the future. These questions in and of themselves are not harmful. Still, if answered carelessly, they may reinforce bias, encourage unethical behavior, and propagate disinformation.
SQuARe, as the acronym suggests, deals with sensitive questions and acceptable responses. This large-scale Korean dataset is made up of 49,000 sensitive questions plus 42,000 responses that are acceptable and 46,000 others that are not. To build it, we used actual headlines from well-known Korean media outlets as seeds and had HyperCLOVA generate questions and responses from them. A content filtering model then flags ambiguous content for crowdworkers to label as sensitive questions and acceptable responses, and this human-in-the-loop process is repeated over three iterative cycles to refine the generated data. The SQuARe dataset can be a useful benchmark for measuring the level of harm in model outputs and for filtering out harmful responses.
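To make the filtering use case concrete, below is a minimal sketch of how SQuARe-style records could drive an evaluation of a response filter. The field names and the toy keyword filter are illustrative assumptions, not the official dataset schema or our production pipeline.

```python
# A minimal sketch of checking a response filter against SQuARe-style records.
# The field names ("question", "response", "acceptable") are illustrative
# assumptions, not the official dataset schema.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SquareRecord:
    question: str     # a sensitive question (contentious, ethical, or predictive)
    response: str     # a candidate model response
    acceptable: bool  # human label: is this response acceptable?


def evaluate_filter(records: List[SquareRecord],
                    is_unacceptable: Callable[[str, str], bool]) -> dict:
    """Measure how well a filter flags responses labeled unacceptable."""
    tp = fp = fn = 0
    for r in records:
        flagged = is_unacceptable(r.question, r.response)
        if flagged and not r.acceptable:
            tp += 1
        elif flagged and r.acceptable:
            fp += 1
        elif not flagged and not r.acceptable:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"precision": precision, "recall": recall}


# Toy example with a placeholder filter; a real filter would be a trained classifier.
records = [
    SquareRecord("A contentious question", "A one-sided, inflammatory answer", acceptable=False),
    SquareRecord("A contentious question", "A balanced answer citing both sides", acceptable=True),
]
print(evaluate_filter(records, lambda q, a: "inflammatory" in a.lower()))
```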
KoSBi: Societal bias towards different social groups in Korea
Data used to train language models may contain biases related to gender, age, or sexual orientation, leading the models to reproduce harmful content containing hate, discrimination, denigration, prejudice, and stereotypes. Many studies have been directed toward eliminating these issues, but the efforts have so far been concentrated on English-speaking, largely American, society.
The large-scale Korean Social Bias (KoSBi) dataset is one of the few benchmarks focused on Korean culture and language. Taking the UN’s Universal Declaration of Human Rights and the standards of the National Human Rights Commission of Korea as reference points, the dataset covers 72 social groups across 15 demographic attributes (such as gender, age, religion, political affiliation, and area of birth). It contains 34,000 sentence pairs, each consisting of a context sentence about a social group followed by a sentence labeled as safe or unsafe content. Training and evaluating LLMs with this dataset has proven effective in detecting and filtering harmful content.
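As a rough illustration of how KoSBi-style labels can surface which social groups attract the most unsafe content, the sketch below tallies safe versus unsafe sentences per group. The record fields are assumptions for illustration, not the official release format.

```python
# A minimal sketch of summarizing KoSBi-style records per social group.
# The field names ("category", "group", "label") are illustrative assumptions,
# not the official release format.

from collections import defaultdict

records = [
    {"category": "gender", "group": "women", "label": "unsafe"},
    {"category": "gender", "group": "women", "label": "safe"},
    {"category": "age", "group": "the elderly", "label": "unsafe"},
]

counts = defaultdict(lambda: {"safe": 0, "unsafe": 0})
for r in records:
    counts[(r["category"], r["group"])][r["label"]] += 1

for (category, group), c in counts.items():
    total = c["safe"] + c["unsafe"]
    print(f"{category}/{group}: {c['unsafe']}/{total} sentences labeled unsafe")
```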
KoBBQ: Benchmark for measuring bias in Korean society
Another way to measure the social bias inherent in a language model is through question answering, which prompts the model to answer questions given different amounts of context. When the context is under-informative, models often rely on stereotypes to answer, and the degree to which they do so translates into a bias score. The Bias Benchmark for Question Answering (BBQ) is a dataset that employs this method but is centered on U.S. English-speaking contexts. As a result, BBQ includes some stereotypes that do not exist in Korea while failing to capture biases that do manifest in Korean society.
The Korean Bias Benchmark for Question Answering (KoBBQ), a dataset of 76,000 question samples, aims to overcome this limitation by reflecting the workings of Korean culture and society. Before adapting BBQ to the Korean context, we conducted a large-scale survey to check whether biases documented in American society also appear in Korean society. Most Korean-centric language models today are assessed using KoBBQ, which reports both accuracy and a bias score. This benchmark shows why the local cultural and societal context matters when measuring bias.
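The sketch below illustrates the general idea of BBQ-style scoring on ambiguous-context questions, where the correct answer is “unknown” and a biased model tends to pick the stereotyped target instead. It is a simplified illustration of the idea only, not the exact bias metric defined in the KoBBQ paper.

```python
# Simplified BBQ-style scoring on ambiguous-context questions. This illustrates
# the general idea only, not the exact KoBBQ metric.

def accuracy(predictions, gold):
    """Fraction of questions answered correctly."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def ambiguous_bias_score(predictions, stereotyped_targets):
    """Among non-"unknown" answers, how skewed are they toward the stereotype?

    Returns a value in [-1, 1]: 0 means no skew, 1 means every definite
    answer matched the stereotyped target.
    """
    definite = [(p, t) for p, t in zip(predictions, stereotyped_targets)
                if p != "unknown"]
    if not definite:
        return 0.0
    biased = sum(p == t for p, t in definite)
    return 2 * biased / len(definite) - 1


# Toy example: three ambiguous questions whose gold answer is always "unknown".
preds = ["unknown", "group_A", "group_A"]
gold = ["unknown", "unknown", "unknown"]
stereo = ["group_B", "group_A", "group_A"]
print(accuracy(preds, gold))                # 0.33...: only one answer is correct
print(ambiguous_bias_score(preds, stereo))  # 1.0: every definite answer follows the stereotype
```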
KorNAT: Alignment with Korean values and knowledge
Before language models are deployed and used, they must be benchmarked against local criteria and tested on national culture and general knowledge. The Korean National Alignment Test (KorNAT) is a dataset of multiple-choice questions that evaluates models at two levels: Korean social values and common knowledge.
– Social value alignment, made up of 4,000 examples, tests LLMs on how well they understand the social values unique to a particular country. Social values are largely shared by local people and reveal prevalent attitudes and opinions on a wide range of social issues. As with KoBBQ, we conducted a large-scale survey to capture Korean citizens’ social norms and values and to see whether a consensus had formed.
– Common knowledge alignment, made up of 6,000 examples, focuses on how well LLMs can perform on country-specific knowledge. Common knowledge encompasses knowledge widely shared by the general public and may sometimes be regarded as basic or elementary. It covers a variety of topics, from school subjects like Korean, English, math, and science to historical facts and social norms.
Based on these two criteria, we have shared test results for several language models, and the paper concludes by emphasizing the importance of knowledge and reasoning abilities in these areas.
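For illustration, here is a minimal sketch of scoring a model on KorNAT-style multiple-choice items with separate tallies for the two subsets. The field and split names are assumptions, and the plain accuracy used here stands in for the alignment metrics defined in the paper.

```python
# A minimal sketch of scoring a model on KorNAT-style multiple-choice items,
# reporting separate tallies for the two subsets. The field and split names
# ("split", "choices", "answer") are illustrative assumptions, and plain
# accuracy is used in place of the paper's own alignment metrics.

from collections import defaultdict

items = [
    {"split": "social_value", "question": "…", "choices": ["A", "B", "C", "D"], "answer": "B"},
    {"split": "common_knowledge", "question": "…", "choices": ["A", "B", "C", "D"], "answer": "A"},
]


def dummy_model(question, choices):
    """Placeholder: a real evaluation would query the LLM under test."""
    return choices[0]


scores = defaultdict(lambda: {"correct": 0, "total": 0})
for item in items:
    prediction = dummy_model(item["question"], item["choices"])
    scores[item["split"]]["total"] += 1
    scores[item["split"]]["correct"] += prediction == item["answer"]

for split, s in scores.items():
    print(f"{split}: {s['correct']}/{s['total']} correct")
```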
Conclusion
NAVER and the wider tech industry, academia, and government policymakers are conducting research and development to tackle AI safety from many perspectives. Here, we focused on benchmark datasets NAVER released to the public, which can be used to train and evaluate HyperCLOVA X and other LLMs. Continuing to build these kinds of datasets and sharing findings with the public will lead to the safer use of AI models.
*Sources
– SQuARe: A Large-Scale Dataset of Sensitive Questions and Acceptable Responses Created Through Human-Machine Collaboration, [Paper], [Dataset]
– KoSBi: A Dataset for Mitigating Social Bias Risks Towards Safer Large Language Model Application, [Paper], [Dataset]
– KoBBQ: Korean Bias Benchmark for Question Answering, [Paper], [Project page]
– KorNAT: LLM Alignment Benchmark for Korean Social Values and Common Knowledge, [Paper], [Project page]