How OpenAI, Razorpay and Addepar are building the future of product security
It’s inspiring to see product security teams innovating so rapidly with Large Language Models. On the other hand, I keep hearing from industry colleagues that most tinkering feels like it should be useful but rarely leads to the desired adoption. How can we bridge the gap between the promise of LLMs and the difficulty of solving real, well-defined problems?
In this post, I’m going to look at three recent examples from the security teams of Addepar, Razorpay and OpenAI to see what we can learn about building LLM solutions that solve real problems and gain adoption at your organization.
Let’s start with a lightning round on the three projects.
Secure code reviews at Razorpay
Razorpay does manual PR-level reviews of their P0 “crown jewel” components. They feel this is highly leveraged since the P0 components present the most potential risk, but they recognize that the manual reviews are tedious and slow. Their goal: making manual PR reviews faster and less tedious.
To accomplish this, they compared the ability of CodeBison, Gemini 1.0, Gemini 1.5 and GPT-4 to find vulns in individual files. They let the models do most of the work with minimal prompting. The result: they now use these LLM scans for their mandatory secure code reviews of their crown jewel repos.
They measured their accuracy by seeing how well they were able to discover vulns in OWASP’s JuiceShop repository. This led to a reported 75% accuracy and “minimal false positives.”
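Razorpay hasn’t published their exact prompts, but the core loop is simple to sketch: hand the model one file plus a short instruction and collect its findings. The snippet below is my own minimal sketch of that shape, not their code; the prompt wording, the gpt-4 model choice and the example file path are all placeholder assumptions.

```python
# A minimal sketch of a single-file vuln scan, not Razorpay's actual code.
# The prompt wording, model choice and example path are placeholder assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are a security reviewer. List potential vulnerabilities in the file "
    "below. For each finding, give the line, the weakness class (e.g. SQLi, "
    "XSS, SSRF) and a one-sentence justification. If there is nothing, say so."
)

def scan_file(path: str) -> str:
    with open(path) as f:
        source = f.read()
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"File: {path}\n\n{source}"},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(scan_file("app/routes/payments.py"))  # hypothetical path
```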
Surfacing the riskiest PRs at Addepar with RedFlag
Addepar, similar to Razorpay, does manual PR reviews for their bi-weekly platform releases. With ~400 individual PRs per release, that’s a lot of reviewing work, so they set out to identify the PRs most worth reviewing. This is pretty similar to what we do at Remy for design documents.
Addepar describes their approach best with the diagram below. For every release, the prodsec team gets a list of PRs that seem worth reviewing, based on the code changes themselves and context from linked Jira tickets. RedFlag also specifies which files would be most worth reviewing and auto-generates a security test plan to give the prodsec team some inspiration.
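RedFlag itself is open source and worth reading directly, but to make the shape of that pipeline concrete, here’s a rough sketch of the triage step. The ask_llm helper and the TriageResult structure are hypothetical stand-ins of mine, not RedFlag’s actual interfaces.

```python
# A rough sketch of a RedFlag-style triage step, not Addepar's actual code.
# `ask_llm` is a hypothetical callable that sends a prompt to whatever model
# you have access to and returns its text response.
from dataclasses import dataclass

@dataclass
class TriageResult:
    needs_review: bool
    files_to_review: list[str]
    test_plan: str

def triage_pr(diff: str, jira_context: str, ask_llm) -> TriageResult:
    # 1. Decide whether the PR is worth a manual security review.
    verdict = ask_llm(
        "Does this change warrant a manual security review? Answer YES or NO.\n"
        f"Jira context:\n{jira_context}\n\nDiff:\n{diff}"
    )
    needs_review = verdict.strip().upper().startswith("YES")

    files, plan = [], ""
    if needs_review:
        # 2. Ask which changed files deserve the reviewer's attention.
        files = ask_llm(
            "List the changed files most worth reviewing, one per line.\n"
            f"Diff:\n{diff}"
        ).splitlines()
        # 3. Auto-generate a security test plan as a starting point.
        plan = ask_llm(
            "Draft a short security test plan for this change.\n"
            f"Jira context:\n{jira_context}\n\nDiff:\n{diff}"
        )
    return TriageResult(needs_review, files, plan)
```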
They tuned the pipeline by measuring its accuracy and specificity against manually labelled data from previous commits (PRs carrying a boolean “Needs Review” label).
They were able to reduce the average number of PRs reviewed from 400 to fewer than 100. On their manually labelled dataset, they achieved 92% accuracy, with a few false positives and no false negatives. If you think this might be useful for your team, the whole project is open source.
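If you want to run the same kind of check on your own pipeline, the comparison can be as small as the sketch below, assuming you have a set of past PRs with their manual “Needs Review” labels (the labelled_prs shape here is my own assumption, not Addepar’s tooling).

```python
# A minimal sketch of checking a triage step against manually labelled PRs.
# `labelled_prs` is a hypothetical list of (diff, jira_context, needs_review) tuples.
def evaluate(labelled_prs, triage_fn) -> None:
    tp = fp = tn = fn = 0
    for diff, jira_context, labelled_needs_review in labelled_prs:
        predicted = triage_fn(diff, jira_context).needs_review
        if predicted and labelled_needs_review:
            tp += 1
        elif predicted and not labelled_needs_review:
            fp += 1
        elif not predicted and labelled_needs_review:
            fn += 1  # the dangerous case: a risky PR slips through unreviewed
        else:
            tn += 1
    total = max(tp + fp + tn + fn, 1)  # guard against an empty dataset
    print(f"accuracy: {(tp + tn) / total:.0%}")
    print(f"false positives: {fp}, false negatives: {fn}")
```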
Message triage, IR and project risk classification at OpenAI
We have the least information about OpenAI’s project, since it’s just an open source repo without a blog post explaining the context. Their focus was on automating some specific, tedious manual Slack workflows: triaging requests to the right on-call person, helping engineers decide whether their project needs a security review, and helping with fact-gathering in incident response scenarios.
They built Slack bots for each of these scenarios, but unfortunately we don’t know how successful the deployment has been, since we only have access to the README files and source code.
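Their repo is the best reference, but as a rough illustration of the triage pattern (not OpenAI’s actual bot), a Slack Bolt handler that routes a mention to the right on-call rotation might look like this; classify_request, the rotation names and the response wording are all placeholders of mine.

```python
# A rough illustration of LLM-assisted Slack triage, not OpenAI's actual bot.
import os
from slack_bolt import App

app = App(
    token=os.environ["SLACK_BOT_TOKEN"],
    signing_secret=os.environ["SLACK_SIGNING_SECRET"],
)

# Hypothetical rotation names for this sketch.
ONCALL = {
    "appsec": "@appsec-oncall",
    "infra": "@infrasec-oncall",
    "privacy": "@privacy-oncall",
}

def classify_request(text: str) -> str:
    """Hypothetical: ask an LLM which rotation should own this request.
    Stubbed with a trivial default so the sketch runs end to end."""
    return "appsec"

@app.event("app_mention")
def route_to_oncall(event, say):
    team = classify_request(event["text"])
    say(f"Routing this to {ONCALL.get(team, '@security-oncall')}, "
        "they'll follow up in this thread.")

if __name__ == "__main__":
    app.start(port=3000)
```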
What can we learn about solving real problems with LLMs?
I definitely encourage you to go and read the original blog posts and repos for these projects. But first, here are my top four learnings from these projects, as well as my own experience building Remy, an LLM-powered tool to help prodsec teams find and review the riskiest engineering plans from across their organization.
Build toward focus, not completion
The thing each of these projects has in common is that they focus the efforts of prodsec team members rather than replacing them! We think this is a great place to start building with LLMs today. Building something that focuses your team’s valuable time is going to take an order of magnitude less effort than attempting to fully automate the work, but will probably give you comparable results.
Don’t build a chat bot!
Another thing these projects have in common: none of them is a chat app. LLMs are most powerful when used in pre-determined ways. In general, I suggest focusing on LLM solutions whose starting point is something that already happens in your organization: a PR is created, a design document is written, or an engineer asks for help in your Slack channel.
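Concretely, a PR-triggered flow can be as small as a webhook receiver that fires whenever GitHub reports a newly opened PR. The route and the placeholder review function below are illustrative assumptions, not a specific product.

```python
# A sketch of "hook into existing events": a tiny webhook receiver that kicks
# off an LLM review whenever a pull request is opened.
from flask import Flask, request

app = Flask(__name__)

def review_pull_request(repo: str, pr_number: int) -> None:
    """Placeholder: fetch the diff and run whatever LLM analysis you've built."""
    print(f"queuing review for {repo}#{pr_number}")

@app.route("/webhooks/github", methods=["POST"])
def on_github_event():
    payload = request.get_json()
    # GitHub sends a pull_request event with an "action" field when a PR opens.
    if (request.headers.get("X-GitHub-Event") == "pull_request"
            and payload.get("action") == "opened"):
        review_pull_request(payload["repository"]["full_name"], payload["number"])
    return "", 204
```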
Build for your own idiosyncratic processes
Let’s be honest. You’re a busy prodsec team. You’re not going to build an AI AppSec Engineer, and most likely you won’t take what you build and scale it up across the whole world. So see whether you can find idiosyncratic processes that might benefit from some LLM automation. It doesn’t matter if these are highly specific to your organization. In fact, that’s even better, because it pretty much guarantees you’re solving a real problem instead of just tinkering.
Don’t stress the evaluations
None of these projects attempted any super sophisticated evaluation of their results. For something home-brewed, I suggest writing down a list of 3-4 requirements and manually checking that they’re met after running your flow on 1-2 dozen different items (PRs, commits, Slack messages). This can be done by hand, without a purpose-built evaluation pipeline, and will still give you a good degree of confidence. Trust me, you’ll find plenty of corner cases with time!
In summary: build toward focus, helping your team use their time more efficiently; hook into existing events instead of requiring someone to open your app to start using it; build for your own idiosyncratic processes without worrying about scale or generalizability; and resist the temptation to over-invest in evaluations. Do that, and I can pretty much guarantee you’ll solve a real problem at your organization and get the adoption you were hoping for. Good luck!
If you don’t want to build your own LLM solution but still want to help your team focus, consider checking out Remy. We built Remy to help prodsec teams scale secure design reviews and catch issues early, reducing engineering rework and vuln management pains.
Companies like Instacart use Remy every day to identify the riskiest engineering plans across their organization, and to review them with speed and consistency.