Research Engineer, Model Evaluations

Anthropic, Seattle, WA 98113

Posted 2 months ago

About the role:

We are looking for Research Engineers to build evaluations for our Claude family of Large Language Models. Your job will be to design and implement evaluations that allow Anthropic researchers, decision makers, and members of the public to understand Claude's abilities and personality. As a Research Engineer focused on evaluations, you'll work closely with our research team to design experiments and build evaluation infrastructure. You'll help establish Anthropic as the leader in well-characterized AI systems whose performance is exhaustively measured and validated across a wide range of important tasks, turning ambiguous notions of "intelligence" into clear metrics.

Responsibilities:

  • Designing and running a new evaluation that tests Claude's reasoning capabilities, and creating a compelling visualization that illustrates the results (a minimal harness sketch follows this list)

  • Running experiments to determine how prompting techniques affect results on industry benchmarks

  • Improving the tooling that researchers use to implement evaluations

  • Explaining our evaluations and their results to internal decision makers and stakeholders

  • Collaborating with a research team to develop a robust evaluation for a new model capability they are developing
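
For concreteness, here is a minimal sketch of the kind of evaluation harness this work involves. It is illustrative only, not Anthropic's actual tooling: the complete function, the toy items, and the substring grading are all assumptions made up for this example.

    from typing import Callable, Dict, List

    # Hypothetical reasoning items with known answers; real evaluations
    # draw on much larger, carefully curated datasets.
    EVAL_ITEMS: List[Dict[str, str]] = [
        {
            "prompt": "If all bloops are razzies and all razzies are "
                      "lazzies, are all bloops lazzies? Answer yes or no.",
            "answer": "yes",
        },
        {
            "prompt": "A bat and a ball cost $1.10 in total. The bat costs "
                      "$1.00 more than the ball. How many cents does the "
                      "ball cost?",
            "answer": "5",
        },
    ]

    def run_eval(complete: Callable[[str], str]) -> float:
        """Score any prompt -> text model function on the items above."""
        correct = 0
        for item in EVAL_ITEMS:
            response = complete(item["prompt"])
            # Naive substring grading for illustration; production graders
            # are far more careful about parsing and false matches.
            if item["answer"].lower() in response.lower():
                correct += 1
        return correct / len(EVAL_ITEMS)

Because the harness depends only on a prompt-to-text callable, the same items can be rerun under different prompting techniques (zero-shot, few-shot, and so on) simply by wrapping that callable, which is how prompting experiments on benchmarks are often structured.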

You may be a good fit if you:

  • Have significant experience with Python programming and/or machine learning research

  • Are excellent at data visualization (see the plotting sketch after this list)

  • Have experience using Large Language Models such as Claude

  • Are results-oriented, with a bias towards flexibility and impact

  • Pick up slack, even if it goes outside your job description

  • Enjoy pair programming (we love to pair!)

  • Want to learn more about machine learning research

  • Care about the societal impacts of your work

  • Have clear written and verbal communication

  • Want to design and implement rigorous evaluations to deeply understand the capabilities, personality, and safety of large language models like Claude

  • Are excited to turn fuzzy notions of "AI intelligence" into clear, well-defined metrics that provide insight to researchers, decision makers, and the public

  • Are energized by the challenge of assessing and steering powerful AI to be safe and beneficial
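
As a taste of the visualization side, the sketch below plots benchmark accuracy under three prompting techniques with error bars. It uses matplotlib; the scores, intervals, and technique names are made up for illustration.

    import matplotlib.pyplot as plt

    # Made-up scores and 95% confidence half-widths for one model under
    # three prompting techniques on a single benchmark.
    techniques = ["zero-shot", "few-shot", "chain-of-thought"]
    scores = [0.62, 0.71, 0.78]
    errors = [0.04, 0.04, 0.03]

    fig, ax = plt.subplots(figsize=(5, 3))
    ax.bar(techniques, scores, yerr=errors, capsize=4)
    ax.set_ylabel("Accuracy")
    ax.set_ylim(0, 1)
    ax.set_title("Accuracy by prompting technique (illustrative)")
    fig.tight_layout()
    fig.savefig("prompting_comparison.png")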

Strong candidates may also have experience with:

  • Building user interfaces for data analysis

  • Developing robust evaluation metrics for language models (see the statistics sketch after this list)

  • Handling textual dataset sourcing, curation, and processing tasks at scale

  • Statistics
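
To illustrate the statistics involved, the sketch below computes a percentile-bootstrap confidence interval for an evaluation's accuracy, so scores can be reported with their uncertainty. The function name and the 95% level are illustrative choices, not a prescribed method.

    import random
    from typing import List, Tuple

    def bootstrap_accuracy_ci(
        outcomes: List[bool],      # per-item correctness from an eval run
        n_resamples: int = 10_000,
        confidence: float = 0.95,
        seed: int = 0,
    ) -> Tuple[float, float, float]:
        """Return (accuracy, ci_low, ci_high) via percentile bootstrap."""
        rng = random.Random(seed)
        n = len(outcomes)
        accuracy = sum(outcomes) / n
        # Resample items with replacement, recomputing accuracy each time.
        resampled = sorted(
            sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
        )
        lo = int((1 - confidence) / 2 * n_resamples)
        hi = int((1 + confidence) / 2 * n_resamples) - 1
        return accuracy, resampled[lo], resampled[hi]

With 200 items and an observed accuracy of 0.80, the resulting 95% interval is on the order of ±0.05, which is one reason small benchmark differences between prompting techniques deserve skepticism.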

Deadline to apply: None. Applications will be reviewed on a rolling basis.

