Introduction
When establishing documentation, reporting or auditing standards, we need clear terminology.
- Hazards describe potential sources of adverse outcomes. In physical analogies, bleach, radioactive material, and a swimming pool are each a hazard – each carries the potential for an adverse outcome, depending on how people act around it.
- Harms describe the adverse outcomes that materialise from a hazard: spilled bleach can cause a chemical burn, radioactive material can accelerate the growth of cancerous cells, and a non-swimmer can drown in deep water.
- Risks describe the likelihood of a hazard causing harm, together with the severity of its impact. When a risk or its impact is uncertain, one possible regulatory strategy is for policy makers, organizations, and other stakeholders to adopt the precautionary principle, particularly when the underlying science is unsettled or the impact cannot be determined.
Adopting this terminology, language model (LM) behaviors are hazards, and an expansive literature documents a wide array of potential harms to various human groups. However, the risk of harm depends on the context or application in which an LM is deployed and on its intended audience.
If generating false or misleading information is identified as a hazard, this behaviour may pose a high risk when a user asks an LM for political information, but perhaps a low risk in creative writing applications.
We argue that the current practices for establishing and understanding LM risks in situ are inadequate for two reasons.
- First, taxonomies of LM harms are invaluable for mapping the harm landscape but too broad for individual risk assessments; a “one size fits all” approach cannot accommodate the generality of LMs while still mapping to specific risks in their downstream applications. Varying requirements between models and contexts make it inappropriate to transfer entire taxonomy-based assessment procedures from one exercise to another.
- Second, model-specific standards like model cards or data statements are well suited to individual artifacts but too narrow, because some risks may be shared across artifacts and pooling this knowledge is helpful. Not all risks are present in every application scenario or deployment, and each deployment has different priorities. It is not clear how to efficiently map general knowledge about LM risks and harms to individual application scenarios; thus, we need a framework for adapting these tools to their contexts.
RiskCards provide a decomposition and specification of ethical issues and deployment risks in context, and of how these interact with people and organizations. RiskCards are motivated by four principles:
- Risk-Centric: Unlike other work, RiskCards do not investigate individual models, tasks, or datasets. Instead, they propose a structure centered on risks – for naming, delineating, describing, detecting, and comparing them. A structured description of each risk and the harm it can evoke creates a common knowledge base for risk understanding and mitigation.
- Participatory: Specific risks can be added and edited by anyone – thus avoiding a situation where the positionality of academic or industry labs dictates which risks are most pertinent to focus on and how they manifest harm.
- Dynamic: The open-source nature of this resource allows new cards to be incorporated or existing cards to evolve, merge or split. This dynamism in documentation is important for handling emergent properties of LMs (new risks that emerge as they scale).
- Qualitative: Automated evaluation of risks, e.g. via benchmarks, can be a brittle assessment tool that handles changes in temporal, linguistic, social, or cultural context poorly. To complement automated evaluation procedures, RiskCards are designed to be flexible and reflective, centering the importance of human-led evaluation in interpreting risk and harm.
The general goal of RiskCards is to provide paths for developing, deploying, and using LMs safely. This is achieved by
- (i) pooling knowledge from risk assessments across AI trainers and evaluators, such as by sharing sample prompts which do and do not instantiate harmful outputs, and
- (ii) presenting concise and standardised risk summaries to enable informed and intentional choices about how downstream users should work with an LM and its outputs.
There are many uses for RiskCards, including:
- (i) auditors conduct due diligence on a model using RiskCards prior to acquisition or downstream use
- (ii) AI trainers pair model releases and model cards with tagged RiskCards, which are structured so as to be comparable across models
- (iii) researchers draw on the set of RiskCards to identify new and emergent risks which have yet to be tackled or benchmarked
- (iv) red-teamers base their explorations on the set of existing RiskCards, using them as guidance and inspiration for an exercise
- (v) policy makers determine minimum standards and guardrails that must be developed before deploying systems
- (vi) people at large can use RiskCards to challenge developer assumptions and demand safeguards or restitution.
Structure of a RiskCard
Each RiskCard must:
- Name and describe a risk: Each RiskCard begins with a concise name for the risk followed by a brief description. The description should be sufficient to make it clear how the risk presents and also delineate the scope of the risk. It may be helpful to include exemplifying references.
- Provide evidence or a realistic scenario of risk impact: It is important that RiskCards are grounded in a concrete risk with demonstrable harm. To this end, each card should contain a credible citation or clear example scenario demonstrating how the relevant risk causes harm.
- Situate that risk with respect to existing taxonomies of LM risk/harm: To aid selection and comparison of relevant risks, each RiskCard should include the risk’s placement within taxonomies of harm. Some risks might not fit in any of these categories, and if so, that should be stated; other risks may fit in more than one category, and if so, all categories which capture essential aspects of the risk should be named.
- Describe who may be affected, and how, if the risk manifests (i.e. its impact): A range of actors can suffer a range of harms from a risk. Relevant intersections of these should be noted on the card, as pairs of actor and harm type.
- Clarify what is required for the risk to manifest: Not all outputs present a risk simply from being read. Sometimes they may have to be used in a specific setting, or more than once, for a risk to be relevant. The conditions required for harm to present should be specified.
- Give concrete examples of harmful generations from existing LMs: The RiskCard should give examples of prompt-output pairs that demonstrate the risk. These should, where possible, be from real exchanges with an LM, but we recommend not identifying which model or platform was used, because models change rapidly over time and the output will not remain representative. Thus, sample prompt-output pairs are intended as exemplars, not an exhaustive list, acting as inspiration for further probes. An illustrative sketch of this card structure follows the list.
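As an illustration only, the six required elements could be captured in a simple structured record. The sketch below expresses them as a Python dataclass; the field names and example values are hypothetical, not a prescribed RiskCards schema.

```python
# Illustrative only: the six required RiskCard elements as a Python dataclass.
# Field names and the example card are hypothetical, not a prescribed schema.
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class RiskCard:
    name: str                          # concise name for the risk
    description: str                   # how the risk presents, and its scope
    evidence: List[str]                # citations or scenarios showing demonstrable harm
    taxonomy_placement: List[str]      # categories from existing harm taxonomies (may be empty)
    affected: List[Tuple[str, str]]    # (actor, harm type) pairs
    harm_conditions: str               # what must happen for harm to manifest
    sample_pairs: List[Tuple[str, str]] = field(default_factory=list)  # (prompt, output) examples


example_card = RiskCard(
    name="Plausible false claims presented as fact",
    description="The model produces fluent but unsupported factual claims.",
    evidence=["Scenario: a user repeats a fabricated statistic in a published report."],
    taxonomy_placement=["misinformation"],
    affected=[("text consumer", "social & societal")],
    harm_conditions="The output is read and treated as truthful.",
    sample_pairs=[("What share of voters supported the policy?",
                   "Exactly 63.4% of voters supported it.")],
)
```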
RiskCards further establish possible dimensions of harm: who is at risk, what categories of harm can arise, and which actions or conditions are required for harm to materialise.
Dimensions of Harm
Categorising risks in RiskCards involves describing who can be harmed when the risk manifests, what kind of harm may be done, and what conditions must be present for this harm to materialise.
Building these descriptions in a structured way, from combinations of a fixed list of actors and categories of harm, makes it easier to identify relevant RiskCards for a new LM application.
Who can be at risk?
- (i) Model providers bear responsibility for models they provide access to. For example, the way that a model’s capabilities are presented may bring reputational risks.
- (ii) Developers are at risk of harm in some situations, as they interact with material during the course of their work, and perhaps store it on hardware that they are responsible for.
- (iii) Text consumers are those who read the output text; they may be reading it in any context, including directly from the model as it is output, or indirectly, such as in a screenshot of a social media post.
- (iv) Publishers are those who publish or share model outputs.
- (v) Finally, external groups of people represented in generated text can be harmed by the text, for example when text contains false information or propagates stereotypes. These groups can be particularly vulnerable: not only do they lack agency in the process, they may not even be aware that text about them has been generated.
What kind of harms can result from risks?
- (i) Representational harms arise through (mis)representations of a group, such as over-generalised stereotypes or erasure of lived experience.
- (ii) Allocative harms arise when resources are unjustly allocated differently, or re-allocated, because of a model output. This can include lost opportunities or discrimination.
- (iii) Quality-of-service harms are defined by Shelby et al. (2022) as “when algorithmic systems disproportionately fail for certain groups of people along the lines of identity,” and include impacts such as alienation, increased labor, or service/benefit loss.
- (iv) Inter & intra-personal harms occur when the relationship between people or communities is mediated or affected negatively due to technology. This could cover privacy violations or using generated language to brigade.
- (v) Social & societal harms describe societal-level effects that result from repeated interaction with LM output; for example, misinformation, electoral manipulation, and automated harassment.
- (vi) Legal harms describe outputs which are illegal to generate or own in some jurisdictions. For example, blasphemy is still illegal in many jurisdictions (USCIRF, 2019), including in the anglosphere, and written CSAM (Child Sexual Abuse Material) is illegal to create or own in many jurisdictions. Copyrighted material presents another kind of legal risk. LMs can lead to breaches of the law through multiple routes, and this is captured by the ‘legal harms’ category.
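As an illustration only, these actor roles and harm categories could be written as small enumerations, so that a card can record the actor–harm pairs described above; the identifiers below are hypothetical, not a prescribed vocabulary.

```python
# Illustrative only: the actor roles and harm categories as Python enums,
# with one (actor, harm type) pair as a RiskCard might record it.
from enum import Enum


class Actor(Enum):
    MODEL_PROVIDER = "model provider"
    DEVELOPER = "developer"
    TEXT_CONSUMER = "text consumer"
    PUBLISHER = "publisher"
    EXTERNAL_GROUP = "external group represented in generated text"


class HarmCategory(Enum):
    REPRESENTATIONAL = "representational"
    ALLOCATIVE = "allocative"
    QUALITY_OF_SERVICE = "quality of service"
    INTER_INTRA_PERSONAL = "inter & intra-personal"
    SOCIAL_SOCIETAL = "social & societal"
    LEGAL = "legal"


# e.g. a stereotype-propagation risk might list this pairing on its card:
impact = (Actor.EXTERNAL_GROUP, HarmCategory.REPRESENTATIONAL)
```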
What actions are required for harm to manifest?
Many risks require some kind of action or set of conditions in order to yield harm.
- Some text can inflict harm by being read (Kirk et al., 2022a); for example, the propagation of negative stereotypes about real people, or graphic descriptions of violent acts.
- Other text requires situational context for the risk of harm to manifest: for example, authoring many fake comments evincing a certain view and posting them online as genuine, in an astroturfing effort (Keller et al., 2020).
- In other cases, text can be harmful in one setting but fine in another. For example, the tendency of large LMs to generate plausible-sounding false claims can be harmful, but only if the output is presented as truthful.
When adding this information to a RiskCard, assessors should consider what has to happen for harm to manifest. They can consider whether there are situations in which the generated text would not cause harm, as well as the steps and external contexts required for harm to come to pass.
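To sketch how such structured conditions could support selecting relevant cards for a new deployment, the snippet below filters hypothetical cards by whether any of their harm-manifestation conditions hold in a given context. It is illustrative only; selecting and interpreting cards remains a human-led, qualitative exercise.

```python
# Illustrative sketch: filter RiskCards by whether their harm-manifestation
# conditions can hold in a given deployment. Card contents are hypothetical.
from typing import Dict, List, Set


def relevant_cards(cards: List[Dict], deployment_conditions: Set[str]) -> List[Dict]:
    """Keep cards where at least one required condition holds in this deployment."""
    return [
        card for card in cards
        if set(card.get("conditions", [])) & deployment_conditions
    ]


cards = [
    {"name": "Plausible false claims", "conditions": ["output presented as truthful"]},
    {"name": "Astroturfing at scale", "conditions": ["outputs posted online as genuine comments"]},
]

# A news summariser presents its output as truthful, so only the first card is selected.
print(relevant_cards(cards, {"output presented as truthful"}))
```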
Example RiskCards

