Category: Safety & Alignment

- Operator System Card
- Deliberative alignment: reasoning enables safer language models
- Sora System Card
- An update on our safety & security practices
- Disrupting a covert Iranian influence operation
- GPT-4o System Card
- Finding GPT-4’s mistakes with GPT-4
- Expanding on how Voice Engine works and our safety research
- OpenAI safety practices
- OpenAI’s commitment to child safety: adopting safety by design principles
- Democratic inputs to AI grant program: lessons learned and implementation plans
- How OpenAI is approaching 2024 worldwide elections
- Practices for Governing Agentic AI Systems
- Superalignment Fast Grants
- Weak-to-strong generalization
- Frontier risk and preparedness
- DALL·E 3 system card
- GPT-4V(ision) system card
- OpenAI Red Teaming Network
- Using GPT-4 for content moderation
- Confidence-Building Measures for Artificial Intelligence: Workshop proceedings
- Frontier Model Forum
- Moving AI governance forward
- Frontier AI regulation: Managing emerging risks to public safety
- Insights from global conversations
- Governance of superintelligence
- Language models can explain neurons in language models
- Our approach to AI safety
- Planning for AGI and beyond
- How should AI systems behave, and who should decide?
- Forecasting potential misuses of language models for disinformation campaigns and how to reduce risk
- Our approach to alignment research
- A hazard analysis framework for code synthesis large language models
- AI-written critiques help humans notice flaws
- Best practices for deploying language models
- Measuring Goodhart’s law
- Economic impacts research at OpenAI
- Lessons learned on language model safety and misuse
- Aligning language models to follow instructions
- Summarizing books with human feedback
- Improving language model behavior by training on a curated dataset
- Learning to summarize with human feedback
- Safety Gym
- Benchmarking safe exploration in deep reinforcement learning
- Fine-tuning GPT-2 from human preferences
- Testing robustness against unforeseen adversaries
- Why responsible AI development needs cooperation on safety
- Transfer of adversarial robustness between perturbation types
- Introducing Activation Atlases
- AI safety needs social scientists
- Learning complex goals with iterated amplification
- Improving language understanding with unsupervised learning
- AI safety via debate
- Preparing for malicious uses of AI
- Learning from human preferences
- Attacking machine learning with adversarial examples
- Adversarial attacks on neural network policies
- Faulty reward functions in the wild
- Semi-supervised knowledge transfer for deep learning from private training data
- Concrete AI safety problems
- Adversarial training methods for semi-supervised text classification