Cybersecurity Benchmark Summer Intern

Location: Remote
Company: Pilotcrew AI
Type: Internship (month-to-month; extension based on performance and project requirements)
Eligibility: Currently pursuing a Master’s or PhD degree

About Pilotcrew AI

Pilotcrew AI builds infrastructure for AI agent evaluation. We benchmark large language models, run automated agent evaluations, power human-in-the-loop assessments, and host AI arenas for competitive testing. Our mission is to make AI agents measurable, reliable, and production-ready through structured, scalable evaluation systems.

Role Overview

We are building a large-scale benchmark for evaluating the cybersecurity capabilities of frontier LLMs. To grow this benchmark, we need hands-on security engineers who can craft real-world vulnerability tasks that are genuinely difficult for state-of-the-art LLMs and agentic systems.

Your core output: carefully designed benchmark instances (real software vulnerabilities paired with well-formed task specifications and validated evaluation oracles) that expose the limits of current AI systems and drive progress in AI safety research.
Key Responsibilities

- Create cybersecurity benchmark tasks that frontier LLMs fail to solve
- Build and maintain containerised benchmark environments (Docker, libFuzzer, ASan/MSan)
- Produce multi-level difficulty variants: Level 0 (no description) through Level 3 (patch diff supplied)
- Collaborate with researchers to analyse and document failure patterns of AI agents
- Write clear, reproducible vulnerability descriptions (≤200 words) usable as task prompts
- Stress-test tasks against frontier LLM agents (OpenHands, Codex CLI) and document failure modes
- Ensure benchmark quality: no data duplication, sufficient locating information, 96%+ precision targets
- Follow responsible disclosure for any zero-day vulnerabilities discovered during benchmark work

Required Skills

Vulnerability Research Expertise

- Familiarity with memory safety vulnerabilities in C/C++ codebases, including heap/stack overflows, use-after-free, null dereferences, and uninitialized memory
- Demonstrated ability to reproduce known CVEs and write proof-of-concept (PoC) inputs that reliably trigger sanitizer crashes (ASan, MSan, UBSan)
- Comfort navigating large, unfamiliar codebases (100k–7M+ lines) to locate vulnerable code paths

Fuzzing & Toolchain

- Exposure to coverage-guided fuzzing concepts and tools such as libFuzzer, AFL++, or OSS-Fuzz through academic, personal, or open-source projects
- Familiarity with compiling projects with sanitizer flags (AddressSanitizer, MemorySanitizer) using GCC or Clang
- Familiarity with Docker for building and distributing reproducible execution environments

Patch & Exploit Analysis

- Ability to read unified diffs and extract semantic meaning about the vulnerability being patched
- Understanding of one-day / N-day attack workflows: from patch diff to working PoC
- Basic understanding of Git workflows and experience using version control to investigate code changes and identify the source of issues
Communication & Rigor

- Ability to write concise, technically precise vulnerability descriptions (target: ≤200 words) that contain sufficient localisation information for reproduction without leaking the fix
- Comfort scripting in Python or Bash to automate build, evaluation, and filtering pipelines

Nice to Have

- Familiarity with evaluating AI coding agents (OpenHands, Codex CLI, SWE-agent, or similar)
- Familiarity with large language model APIs and prompt engineering for automated quality-judgment pipelines
- Research background: publications or detailed write-ups on vulnerability discovery, fuzzing, or program analysis
- Awareness of software vulnerability reporting practices, CVEs, and responsible/coordinated vulnerability disclosure processes
- Knowledge of broader vulnerability classes beyond memory safety: logic flaws, cryptographic weaknesses, web/mobile vulnerabilities
- Hands-on CTF (Capture the Flag) competition experience, especially in pwn or reverse-engineering categories
- Familiarity with symbolic execution or static analysis tools (angr, CodeQL, Infer)

What We Value

- Strong curiosity and research mindset
- Ability to translate theory into practical systems
- Ownership and bias toward execution
- Comfort working with ambiguity and evolving problem spaces
- Clear and structured technical communication
- Ability to thrive in a fast-paced startup environment with high ownership

The internship is offered on a month-to-month basis and may be extended depending on individual performance and project requirements.
