Research Engineer, Model Evaluations

Anthropic, Seattle, WA 98113

Posted 2 months ago

About the role:

We are looking for Research Engineers to build evaluations for our Claude family of Large Language Models. Your job will be to design and implement evaluations that allow Anthropic researchers, decision makers, and members of the public to understand Claude's abilities and personality. As a Research Engineer focused on evaluations, you'll work closely with our research team to design experiments and build evaluation infrastructure. You'll help establish Anthropic as the leader in well-characterized AI systems whose performance is exhaustively measured and validated across a wide range of important tasks, turning ambiguous notions of "intelligence" into clear metrics.

Responsibilities:

  • Designing and running a new evaluation that tests Claude's reasoning capabilities, and creating a compelling visualization that illustrates the results (a minimal harness sketch follows this list)

  • Running experiments to determine how prompting techniques affect results on industry benchmarks

  • Improving the tooling that researchers use to implement evaluations

  • Explaining our evaluations and their results to internal decision makers and stakeholders

  • Collaborating with a research team to develop a robust evaluation for a new model capability they are developing
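
For concreteness, here is a minimal sketch of the kind of evaluation harness this work involves. It is illustrative only, not Anthropic's actual tooling: the complete function, the toy items, and the substring grading are all assumptions made up for this example.

    from typing import Callable, Dict, List

    # Hypothetical reasoning items with known answers; real evaluations
    # draw on much larger, carefully curated datasets.
    EVAL_ITEMS: List[Dict[str, str]] = [
        {
            "prompt": "If all bloops are razzies and all razzies are "
                      "lazzies, are all bloops lazzies? Answer yes or no.",
            "answer": "yes",
        },
        {
            "prompt": "A bat and a ball cost $1.10 in total. The bat costs "
                      "$1.00 more than the ball. How many cents does the "
                      "ball cost?",
            "answer": "5",
        },
    ]

    def run_eval(complete: Callable[[str], str]) -> float:
        """Score any prompt -> text model function on the items above."""
        correct = 0
        for item in EVAL_ITEMS:
            response = complete(item["prompt"])
            # Naive substring grading for illustration; production graders
            # are far more careful about parsing and false matches.
            if item["answer"].lower() in response.lower():
                correct += 1
        return correct / len(EVAL_ITEMS)

Because the harness depends only on a prompt-to-text callable, the same items can be rerun under different prompting techniques (zero-shot, few-shot, and so on) simply by wrapping that callable, which is how prompting experiments on benchmarks are often structured.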

You may be a good fit if you:

  • Have significant experience with Python programming and/or machine learning research

  • Are excellent at data visualization (see the plotting sketch after this list)

  • Have experience using Large Language Models such as Claude

  • Are results-oriented, with a bias towards flexibility and impact

  • Pick up slack, even if it goes outside your job description

  • Enjoy pair programming (we love to pair!)

  • Want to learn more about machine learning research

  • Care about the societal impacts of your work

  • Have clear written and verbal communication

  • Want to design and implement rigorous evaluations to deeply understand the capabilities, personality, and safety of large language models like Claude

  • Are excited to turn fuzzy notions of "AI intelligence" into clear, well-defined metrics that provide insight to researchers, decision makers, and the public

  • Are energized by the challenge of assessing and steering powerful AI to be safe and beneficial
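
As a taste of the visualization side, the sketch below plots benchmark accuracy under three prompting techniques with error bars. It uses matplotlib; the scores, intervals, and technique names are made up for illustration.

    import matplotlib.pyplot as plt

    # Made-up scores and 95% confidence half-widths for one model under
    # three prompting techniques on a single benchmark.
    techniques = ["zero-shot", "few-shot", "chain-of-thought"]
    scores = [0.62, 0.71, 0.78]
    errors = [0.04, 0.04, 0.03]

    fig, ax = plt.subplots(figsize=(5, 3))
    ax.bar(techniques, scores, yerr=errors, capsize=4)
    ax.set_ylabel("Accuracy")
    ax.set_ylim(0, 1)
    ax.set_title("Accuracy by prompting technique (illustrative)")
    fig.tight_layout()
    fig.savefig("prompting_comparison.png")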

Strong candidates may also have experience with:

  • Building user interfaces for data analysis

  • Developing robust evaluation metrics for language models (see the statistics sketch after this list)

  • Handling textual dataset sourcing, curation, and processing tasks at scale

  • Statistics
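
To illustrate the statistics involved, the sketch below computes a percentile-bootstrap confidence interval for an evaluation's accuracy, so scores can be reported with their uncertainty. The function name and the 95% level are illustrative choices, not a prescribed method.

    import random
    from typing import List, Tuple

    def bootstrap_accuracy_ci(
        outcomes: List[bool],      # per-item correctness from an eval run
        n_resamples: int = 10_000,
        confidence: float = 0.95,
        seed: int = 0,
    ) -> Tuple[float, float, float]:
        """Return (accuracy, ci_low, ci_high) via percentile bootstrap."""
        rng = random.Random(seed)
        n = len(outcomes)
        accuracy = sum(outcomes) / n
        # Resample items with replacement, recomputing accuracy each time.
        resampled = sorted(
            sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
        )
        lo = int((1 - confidence) / 2 * n_resamples)
        hi = int((1 + confidence) / 2 * n_resamples) - 1
        return accuracy, resampled[lo], resampled[hi]

With 200 items and an observed accuracy of 0.80, the resulting 95% interval is on the order of ±0.05, which is one reason small benchmark differences between prompting techniques deserve skepticism.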

Deadline to apply: None. Applications will be reviewed on a rolling basis.

