Chris Butler | How I Test AI Agents at GitHub

May 27

Chris Butler - Product Operations at GitHub

In this episode, Chris gives a candid behind-the-scenes look into what’s working, what’s failing, and why experimentation itself may become one of the most important capabilities in the AI era.

Summary

In this episode I’m joined by Chris Butler. He’s a longtime product leader and operator whose career spans companies such as Microsoft, Google, Facebook, and now GitHub, where he works on agentic workflows across the organization.

We explore how AI is reshaping the way modern product teams think, collaborate, and ship, and its ripple effects on how we manage process and decision-making. Chris and I chat about the messy realities behind agentic systems, such as why removing too much friction can actually hurt decision quality and why qualitative research matters more now than ever before.

Chris gives a candid behind-the-scenes look into what’s working, what’s failing, and why experimentation itself may become one of the most important capabilities in the AI era.

If you’ve been wondering what testing AI agents actually looks like inside a cutting-edge company, this episode is for you.

Takeaways

AI is collapsing traditional product development workflows, but not necessarily eliminating the need for product managers, engineers, or designers. Instead, roles are decomposing into smaller tasks where humans and machines each handle different types of work.
Removing all friction from product development can actually reduce decision quality. Chris argues that tension between desirability, viability, and feasibility perspectives is still critical because reasoning often happens through human discussion, not just inside individual minds or AI systems.
AI-generated “rude feedback” tools can help teams improve ideas faster because people are often more receptive to harsh critique from a machine than from another human. GitHub experimented with sarcastic AI Q&A systems that surfaced weak assumptions and missing details without the reputational risk of peer criticism.
The future of AI inside organizations may be less about autonomous agents replacing humans and more about “process as code.” GitHub is experimenting with natural-language policy documents that both humans and agents can read to automate operational workflows, release management, and risk detection.
Product teams are at risk of building faster without learning faster. Chris warns that vibe coding and rapid prototyping can unintentionally reduce time spent talking to customers and conducting qualitative research, which still remains essential for understanding mental models and uncovering hidden assumptions.
Agentic workflows become most valuable when they reduce operational toil instead of replacing human judgment. GitHub is using AI to automate repetitive coordination tasks like release tracking, documentation generation, and status updates so teams can spend more time on strategic thinking and collaboration.
Internal experimentation matters just as much as customer-facing innovation. Chris emphasizes that many AI workflow experiments inside GitHub are intentionally small, lightweight tests designed to explore possibilities quickly before deciding whether to scale, modify, or abandon them.
The biggest long-term challenge for enterprise AI adoption may not be model capability, but integration, governance, and organizational coordination. Authentication, permissions, fragmented tooling, disconnected workflows, and siloed information remain major barriers to making agentic systems truly useful at scale.

Guest Links

LinkedIn: https://www.linkedin.com/in/chrisbu/

GitHub Next: https://githubnext.com/

Transcript

David J Bland (00:01.283)

Welcome to the podcast, Chris.

Chris Butler (00:03.074)

Thank you. Thanks for having me. I'm excited to talk to you and talk about all of the crazy, agentive stuff we've been doing recently.

David J Bland (00:10.393)

Oh, I am so excited to talk to you. I feel like we've known each other virtually for a while and we're part of the kind of same peer groups and everything. And I've definitely been following your work for a while. And I am really excited to learn and dig in to some agentic stuff with you and how you test it. But maybe before we jump into that, can you give people a little background on yourself and how you ended up at GitHub?

Chris Butler (00:25.912)

Yeah. Yeah.

Yeah, absolutely. I guess I'd call myself like a product manager hiding in an operations role, which is kind of my role today. But I've been a product manager for a very long time. Like the places you would have heard of would be like Microsoft, Waze, Kayak, Facebook Reality Labs, Google, Cognizant. I even did my own startup. I worked for a venture capital firm as, basically, like an experimenter and person that helped with the portfolio companies. But I have a lot of different experiences. I also have a lot of weird side projects. So I write sci-fi on the side, do role-playing games, and a bunch of other stuff.

But what really brought me to GitHub is, I guess, like the last decade, I've been working adjacent to, like, when we called it just machine learning and not AI, kind of how do we think about the ways that people engage with these new technologies, like more the HCI point of view or like ethics or how do we actually build things that are non-deterministic that also help people. And so that probably started when I was part of a small, like, boutique design consultancy called Philosophie. And a lot of that work was working with very large companies that needed help. This was like kind of the era of the innovation lab, if you remember that time. And so that then took me to a couple other places. I ended up at Google in the core machine learning group where I worked on like strategy operations and special projects. And then there were a couple like reorgs for that team and then saw kind of like serendipitously through the product ops community that there was a need for a product ops

Chris Butler (02:00.303)

person at GitHub. And so when I joined GitHub, I was actually the what we call portfolio ops, but it was essentially for all of Copilot inside of GitHub. And I was also the lead responsible AI champ for all of GitHub. And then about a year ago, I got an opportunity. We had put together a couple small experiments just on the side using AI to help PMs. One, something we can talk about a bit more, was codenamed Bloom. And it still kind of lives on through a lot of evolution. But that allowed us to create a small team called Synapse, which was me and a couple of engineers that would go off and think about how do we reduce toil for our cross-functional teams, but in particular using AI. And so this was still pretty early inside of GitHub. There was more usage, and GitHub was like the first group to actually come out with like code suggestions and stuff like that, which came out of our GitHub Next team.

But really just thinking about how does it change the way that we do our process automation was a key component of this team. And so over the last year, we've built real tools that people use with process automation. We've had to build things, throw them away, and then build new things over and over again. But where we're at now, I would say there's a lot of things that I think a large group like GitHub, where there's so many different groups that are all executing very rapidly in a reactive type of environment, the technology landscape is constantly changing, there's lots of competitive threats. We have a real need to try to at least simplify and reduce the busy work or the toil that people end up doing to organize at that scale. And so a lot of the work I do today is about how do we apply sometimes agentic workflows, sometimes just coding actual deterministic processes, but a lot of it is how do we help people not have to do as much busy work, kind of boring, toilsome work in their day-to-day.

David J Bland (03:54.134)

Amazing. It's such an incredible journey. I think with GitHub, it's kind of funny. I mean, I had a dormant GitHub account for so long. I used to code a long time ago at tech startups and then it just wasn't part of my day to day. And then with this vibe coding stuff, I was like, man, I'm here in Lovable and I was like, I kind of want to just connect this to GitHub. And I was like, wow, I'm actually like open sourcing stuff in GitHub now, which I find, you know.

Chris Butler (04:04.055)

Yeah.

David J Bland (04:19.801)

That's probably not unique to me. I feel like a lot of people are having this moment where, you know, like, oh, maybe I need to connect this to GitHub. Are you seeing stuff like that with this resurgence of like vibe coding?

Chris Butler (04:29.761)

Yeah, I definitely, I mean, I have like an engineering background and when I did my own startup, like I'd spent about a third of my time coding. But like the last long time I've been trying to push the narrative that actually, for PMs, sometimes it hurts them to be more technical.

And now all of that is collapsing. Like I actually got in trouble at a, it was like a Google offsite where I had said that like, PMs are not here to be technical. And a lot of PMs that were inside this team had like PhDs in, like, VLSI design and stuff like that. And so it personally offended a lot of people there. I actually had to like send out an apology message. But like in the end, like I do think that there is something about what should a product manager understand? Or if we're gonna even decompose the roles further, we're gonna say like, what is the person that is trying to make sure we're building the thing that is right for the customer and the business, right? And we know a lot about the customer and we're trying to actually like reduce or wrangle the uncertainty of the world. We're trying to make sure we build alignment. Like all that stuff is part of that role. And now we have to understand like, what is technical enough? Or what does it even mean to be like technical to be able to use these tools, I think is a really interesting question. I would say that reading code is not necessarily one of those things. But understanding maybe architecture or when you should say this is probably a bad way to do it, that's a hard thing for me to figure out right now. So yeah, it's definitely, I'm wrong, I guess, is what I would say about PMs not being technical.

David J Bland (05:53.274)

I commented on this Claude LinkedIn post a few weeks ago or several weeks ago and they did this, you know, DVF kind of thing with the three circles and they had it linear at first and it was like, well that's, you know, we've been trying to do these three circles for a while now and I feel like it's crazy on LinkedIn right now. I think I had like 40-some thousand views on just that comment because I was one of the first comments on it. And some people were like laughing at me and then some people were genuinely, like, enjoying the comments. I don't know if they thought I was not being serious, but I was just trying to explain it. Like, hey, we've been trying to do these three circles for a while. You know, those three circles like DVF, kind of like desirable, viable, feasible, and then PMs, it's so interesting to me, right? Depending on the region of the world I'm in, if I go to EU, it feels as if, you know, PMs are not necessarily caring about viability as much, whereas here maybe they are. I don't know, maybe it's just a company that I'm speaking to. But when you think of those three circles, I usually think PM really helping with viability and sometimes desirability too. Not as much with feasibility, but I mean, are you seeing that changing with, you know, all this resurgence of AI and just these circles collapsing?

Chris Butler (06:56.525)

Yep. Yep.

Chris Butler (07:03.115)

Yeah. I mean, I do think that one of the problems that we see is almost any like vibe-coded product that gets famous immediately has security issues, immediately has scale issues, immediately gets taken down for like a million different reasons. And so I think what it's showing is that, yes, we're trying to like hand wave our way away from that idea of like the, you know, feasibility type of thing, but we really still need it. And so that's why I kind of say that I don't know if like the PM role is going away necessarily. I also don't think that the engineering role is going away. I don't think the designer role is going away, but I think we're going to decompose the set of tasks that they have to do where we're going to have a machine do certain tasks. But I think I've always found that like it's really important even in that like trifecta, the reason why we have that trifecta is because of tensions between those different people.

Right. And so I think we can start to simulate this. And that's actually some things that I've started to build is like simulations or proxies of like roles inside the team to help make sure that the PM understands say that like, this is actually a problem that we should consider from a security standpoint or privacy or something else. But I think there's still a huge value in like requiring multiple people and multiple viewpoints that are maybe simulated to get to a better spot. So I think that tension is really still valuable. Even because it's like the idea of the program manager inside of Microsoft was always the product manager and the project manager. And it meant that you, on this tension spectrum, you would always make the wrong decision.

Chris Butler (08:27.489)

basically, because you would be holding both roles in your mind and you would never actually do the right thing for the product manager or the right thing for the project manager. And you would kind of have this inner fight, but it wouldn't actually like help you do anything better. Like there's a great book called The Enigma of Reason that I keep on like telling people to read, but it talks about like how reasoning happens and it doesn't happen inside of our heads. It happens between people talking. And so I think that's something that is very valuable. I do think like, what's really interesting about these new tools is that you're almost kind of like externalizing your kind of context and knowledge to something else so that you can then look at it as like a mirror. I almost call it like rubber ducking 2.0, if you're familiar with that terminology. But it's essentially just like you talk to an inanimate object, but it's kind of more interesting in the way that it can repackage your thoughts so that you can react to them. And so I think those tensions and that kind of mirroring and kind of reasoning between different people is still very necessary today and will be for a while, I'm pretty sure.

David J Bland (09:21.496)

Yeah, I think we have to be careful about removing the friction. I think John Cutler was talking about this the other day. Removing the friction out of everything, because there is value in the friction and the triad or whatever you want to call them, you know, the DVF circles, whoever's responsible for those. There is value in the friction between them. And I think if you just make it as efficient as possible, I think we're going to see that we're going faster, but we're not having the impact

Chris Butler (09:26.465)

Yes. Yeah.

Chris Butler (09:33.143)

Yes. Yeah.

David J Bland (09:48.438)

that we expect, because we've removed all the friction. It feels a bit counterintuitive to me.

Chris Butler (09:51.841)

Yeah, yeah, yeah, I mean, I do think, you know, one of the things we discovered, and we can talk about this more, but like, you know, there is a need for people to slow down and think still, right? We still need to have like ideas marinate. We still need to have kind of a collection of options and an argument about those options in some way. And there is a decision that's made and the humans are really the only people that can make decisions. I would argue that like when you say that a machine or AI is making a decision in some way, you've really just set up like a manufacturing plant for decision-making, but it's just an automation. It's not a decision-making capability in the way that like humans really intuitively make decisions. And so I agree with you. I think that slowing down, it's not necessarily slowing down, but it's like finding the right ways to actually protect the time for discussion between humans, for people to understand each other. Those are the parts of decision-making that are really, I think, powerful in some ways.

David J Bland (10:45.942)

Yeah, it's like you're describing a decision factory in a way.

Chris Butler (10:48.405)

Yeah, well yeah, yeah, exactly.

I mean, I've done a lot of work with this co-founded community called the Uncertainty Project, which is really just all about decision-making. Yeah. And so, you know, I'm an advisor also to Dotwork, which was the group that kind of a lot of people from there have been writing for that. And there's one talk, there's one like discussion we have in there about like this idea of the discourse around a decision is actually the really important thing. Now you can do a lot of great stuff with AI inside of that discourse, but you really need to be getting people to at least, you know, at least start to understand what are the options and,

David J Bland (10:55.441)

yeah, yeah.

Chris Butler (11:20.295)

creating those options and throwing away those options. And if you don't have that place to do that, you end up kind of like just rushing through the decision process. And I think you make worse decisions in that case.

David J Bland (11:29.56)

Yeah, I think decisions are still human. Well, they should be. There's still accountability involved. There's a social nature to how we make decisions. And I think we do have to be mindful of giving that power away. I think what I've seen, I don't know if you've seen this recently, but I've been just building, like vibe coding, little tools to do a very specific thing, you know, and one of the workshops I did, we had like a couple hundred people in there and they, you know, I could roast their stuff, but I built this little roaster tool.

Chris Butler (11:32.321)

Yeah. Yeah.

David J Bland (11:59.831)

And what I noticed was it gave similar advice that I would give to them, but it was like they found it humorous because the tool was giving them advice instead of me, but it was my underlying, you know, kind of algorithm driving that advice. Do you see that happening in your work where it's almost as if, yeah, someone can give me this advice and I might actually have a hard time with it because it's a person giving me advice and all the social dynamics and everything and hierarchy involved. But if it's a tool, it's like, maybe I'm

Chris Butler (12:03.168)

Hmph.

Chris Butler (12:17.259)

Absolutely.

David J Bland (12:29.814)

more willing to take that advice. Help me understand that.

Chris Butler (12:32.192)

Yep. Yeah, so we've built a bunch of different, like at first it was kind of just prompts that would help people with like, say a specific, like a spec or initiative brief or something like that. But it's turned into a lot of other things, but essentially this idea of like the Rude Q&A is actually very valuable, right? And it helps elevate things that maybe you're either purposely or kind of not purposely, like, kind of biased against thinking about. But the reason why it does work is, to your point, that there's no reputational risk.

Right, like, so if I got this feedback from like a peer or from a leader inside my team, I would feel like devastated probably. Now you want it to hurt a little bit, like actually good feedback does hurt a tiny bit. It's more like a sparring partner than like an actual like street fight, I would argue. Like you want it to hurt a tiny bit, but you shouldn't be injured. Like there's a bunch of stuff like that. But it is a new dynamic that like you can get this type of Rude Q&A from like a prompt or an LLM or something like that. And it does actually help. People do laugh at it, right, but they also kind of think like, I should probably have an answer for this, right? Like, and so it gives them a little bit of preparation. I do think like this like cycle of how PMs end up like polishing specs or previously would polish specs like over and over again, they would feel like really worried about publishing it to the rest of their team because of that like reputational fear. Now they can actually get much further towards a polished doc, but here's kind of like the tension or the kind of paradox about this is that that also means that when your team gets like a full polished doc that is ready to go, they don't feel like they're part of the actual alignment steps. And so this is something I've been thinking about is like, how do you actually show the work that you did to show that you actually threw away ideas or that you like said that we're not going to do this? And so it almost becomes like a decision log. But I think there's something about that idea of building up context. It's, you know, everybody keeps on saying the PRD is dead. And I agree with them. Sure. The PRD is dead, but long live the PRD context. Or what do we want to call it next? Right. Like there's this kind of

Chris Butler (14:28.672)

ball of information that is everything that we've been thinking about and doing, what we've tried, what didn't work, right? And that context is all, it's basically like the embodiment of the initiative, if you want to think about it that way. And so it gets collected over time. But what you do want to do is you want to kind of, again, have this balance between I want to be able to prepare my thinking and I want to solidify my thinking, I want to strengthen my thinking maybe before I give it out to everybody else. But there's also a real benefit, especially for small teams, that you're sharing these ideas and you're actually maybe using these tools together to get that Rude Q&A as a team rather than just you as an individual. And so we've done some experiments around that inside of GitHub as well, is actually using these tools. And it's very clunky, a lot of copy and pasting, like pull the live transcript from Zoom, like a bunch of weird stuff that we would have to, we have to wait around for a minute or two while prototypes were developed and stuff like that. But I think that that type of stuff is actually pretty interesting.

And so there's a value of that. I mean, do you remember when there was like the one hour, one page movement? If you remember that at all. Yeah. And so that was meant to like stop us from, like, over-polishing stuff. And no one really did it for a very long time because eventually someone would see like a one page, one hour doc. And they're like, this doesn't seem very well put together. And so they would get in trouble basically. So yeah.

David J Bland (15:29.612)

Yeah, yeah.

David J Bland (15:48.034)

So I love that you mentioned testing inside of GitHub and I would love to dive in a bit there. You mentioned some earlier experiments for product managers, maybe starting there. What were some things you were trying to test inside like early days when you joined?

Chris Butler (16:03.712)

Yeah. Well, so yeah, I guess like the first thing was this kind of idea of PMs, they don't get feedback early enough to be able to get something in a better state for themselves. And then also there's a lot of toil in, say, a place like GitHub, there's lots of downstream go-to-market compliance teams where I have to actually take what I'm building, what I've already created as maybe a spec for an engineer, I now have to transform this into 15 different documents that have to be approved through the life cycle of go-to-market. And this is because there's a lot of compliance, like again, legal, privacy, responsible AI, accessibility, commerce, like revenue enablement, docs, FAQ, like there's all these things that you need to like get together to be able to, you know, have this machine continue to work with so many people.

And so really the early way that we did this is we thought that, well, I've been using a lot of different prompts to help me kind of like create stuff. But at that point in time, you had to like copy in the prompt and then copy in your content and then have the conversation. Cause this is kind of like pre, you know, ChatGPT or custom GPTs, you know, things like that. Even in our case, like we just had Copilot, we didn't have things like Copilot Spaces where you could start to collect like instructions and stuff like that in a usable way, or skills or something like that. And so we built out the first prototype, as I thought we really wanted to have kind of a discussion format. Like it would be that I get all this feedback, I respond to the feedback, it updates the context, but then it also then auto generates and drafts a bunch of things, which I could then also give feedback on, which updates this context.

Chris Butler (17:43.692)

And so it was meant to build this, kind of almost, it wasn't an invisible item, but it was like a JSON file of all of the responses and content that the human had put together. And so we built this inside of GitHub Discussions. You would copy and paste whatever you had about the initiative into a discussion post. There would be like, at first it was a lot of comments and it was like, if you printed out the number of comments, it was 15 pages long. It was really bad at first, but like each one of those comments, one of them would be like, where are gaps in what we usually expect for an initiative that you need to add here? Or what things are in or not in strategic alignment with our current like strategy, right? What would a famous strategist say about this? I even had, if you're familiar with like Oblique Strategies, which is like a deck of cards, I had like an Oblique Strategies interpretation of everything. And the ones that were actually most interesting to them was one, a premortem version, which would basically give them three headlines of a future failure of this particular initiative. And then the Rude Q&A, which was again, a very sarcastic, mean lead engineer style voice of like analysis of this thing.

And so people, you know, we tested this with people. People really loved the feedback that they were getting. We tested it via kind of qualitative interviews. I'd have people, at first it was all kind of like me copying and pasting these prompts everywhere. And so I would get their initiative and I'd create the thing for them. And then I watched them react and what they would do in this case. And even the idea of the outputs, they were just like Markdown files that were generated inside of a GitHub repo. And so what we found though is like, yes, people found the feedback very helpful.

When we had our first version of this thing, the people that used it would save 45 minutes trying to prepare a bunch of documentation per initiative, essentially. And so that was really interesting. We did find that people, even though they said they liked the harsh feedback, they didn't actually engage with it very much. And so we had a bunch of ideas about why that might be. And it might be because it was in the discussion format, for example. Maybe people wanted to have a bit more hand-holding. Sometimes people would just...

Chris Butler (19:48.17)

paste in a paragraph and then get all this feedback about this like one paragraph. And so from there, we started to build things like GitHub Copilot Spaces, which is again, a bunch of prompts together and some instructions about how to take someone through that kind of like step by step. We've been starting to integrate it with other kind of automations, but also like we built a skill that is now distributed through, we have like an internal GitHub extension that has all of the internal skills for GitHub. And so now you can add this and through the CLI, you can have that back and forth and it will help you auto create the issue in the right place. It will help you generate these documents into other repos where privacy or legal might have their own repo around how to do this. So that's kind of the evolution of this over time. And yeah, we failed a lot. Still, it's like people say they like that feedback, but they don't actually use it as much as we thought they would.
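As a rough illustration of the loop Chris describes here (feedback "lenses" applied to an initiative, human responses folded back into a running JSON context), the sketch below shows one way it could look. It is not GitHub's implementation: the lens names, prompt wording, and the `InitiativeContext` class are all hypothetical, and the step that would send each prompt to a model is left out.

```python
import json
from dataclasses import dataclass, field

# Hypothetical lenses, paraphrased from the conversation; not the real prompts.
LENSES = {
    "gaps": "Where are the gaps versus what we usually expect for an initiative?",
    "strategy": "What is in or out of alignment with our current strategy?",
    "premortem": "Write three headlines about a future failure of this initiative.",
    "rude_qa": "Ask sarcastic, skeptical lead-engineer questions about this initiative.",
}


@dataclass
class InitiativeContext:
    """Running record of the initiative text plus every human response."""

    initiative: str
    responses: list = field(default_factory=list)

    def feedback_prompts(self) -> dict:
        # One prompt per lens; a real system would send each one to a model.
        return {name: f"{question}\n\n{self.initiative}" for name, question in LENSES.items()}

    def record_response(self, lens: str, answer: str) -> None:
        # Each answer to a piece of feedback becomes part of the context.
        if lens not in LENSES:
            raise ValueError(f"unknown lens: {lens}")
        self.responses.append({"lens": lens, "answer": answer})

    def to_json(self) -> str:
        # The serialized context is what later drafts would be generated from.
        return json.dumps(
            {"initiative": self.initiative, "responses": self.responses}, indent=2
        )


ctx = InitiativeContext("Add usage dashboards for admins.")
prompts = ctx.feedback_prompts()
ctx.record_response("rude_qa", "We validated demand with five customer interviews.")
saved = json.loads(ctx.to_json())
```

Keeping the responses in one serializable record is what lets later steps (drafting downstream documents, re-running feedback) start from everything said so far rather than from the original paste.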

David J Bland (20:41.823)

Yeah, how do you balance the kind of the what and the why or the qual and quant there because it feels as if you have the ability to build almost anything inside GitHub and then you're like, okay, I'm watching how people are using this and I'm wondering why. How are you making space for that? Are you going to talk to them? Do you have a team? What does that look like?

Chris Butler (21:04.351)

Yeah, so we definitely, we build a lot of stuff too. Like, I think a lot about engagement with the tools that we build. And I think where we've started to go with this agentic stuff is, yes, we can create all these great artifacts. We can create reports. We can update an issue in a certain way. But if we're not seeing engagement with that in some way, and that could be reading it. So I have a way, and this is for internal-only stuff, but I can see who actually looks at my discussion posts, for example. And kind of like Google Docs, where you can see the activity tracker and everything. It's very disappointing. That's what I would say. Almost by default, it is disappointing, is what I would

Chris Butler (21:43.1)

say. So I think like definitely using engagement as a signal, not as like a failure signal, but as like, hey, we haven't gotten it right yet.

Right. Like that's how I think about this stuff. So we were able to track, like, we were always tracking how many discussion posts were being created. Were there any comments being put on them? And then after that, it was like, how many conversations are happening with the Copilot Space? And then after that, we're trying to figure out how to now like understand the engagement with the skill, trying to discern like, was this created with assistance of something? But I actually tend to default back to qualitative. That's just because of like my background. I find the ability to be able to ask people why.

Now again, like I've done a talk previously where, and other people have said this too, it's not just me, but like sometimes like bad research is worse than no research, is what I would say. So that is true, right? And I've definitely tried over the last like long time to hone my ability to do good research. And I've learned from a lot of really great researchers about how to do that, right? But I tend to go back to qualitative. I think that the problem that we will have in this new kind of era of vibe coding is that,

You know, and again, I don't mean to demean it by calling it vibe coding, but like the idea of like PMs being builders means that we, I've found that at least we don't take the time to actually talk to people as much and do that type of qualitative interviewing. I think that's bad. And I think that like one of the biggest benefits, right, and this is a pattern that I use now, is that I can create something very quickly. I can then jump into a call with someone, show it to them, get that feedback, use the transcript to then give me kind of, you know, the beginning of a synthesis about like what to do next, essentially. Like these group meetings that used to, I used to have to like spend my entire time just like taking notes about the critique of the thing I was doing, now no longer require me to take those notes. I still need to read the output of the transcript. I still need to use the transcript to be able to pull that synthesis out in some way. But I think that is

Chris Butler (23:41.76)

That's what I think we're just not doing it as much because we feel like if we just get it out there, we'll learn from all of these quantitative signals when the reality is we should be using these things as a way to get human response, which is beyond just the behavioral side of it. All the quantitative does is behavioral. It doesn't do the all the, do I think about this thing? What's the mental model inside my head about how to do this?

David J Bland (24:05.172)

Yeah, I think something you said there that was really fascinating was that you have practiced and learned how to do better research over time through trial and error. And we had one of our mutual friends, Jim Morris, on here recently, and he was talking about, you know, vibe coding things. And that moment where you put it in front of someone is very important. And if you haven't been trained or you don't know how or you don't have a script or maybe you don't even have a way to frame it.

Chris Butler (24:12.927)

Yeah, that's right.

Chris Butler (24:17.643)

Yeah.

David J Bland (24:34.166)

It tends to be a, well, what do you think? What do you think about this? And you laugh, but I mean, it seems to be happening where, you know, we'll vibe code something, we'll put it in front of somebody, what do you think? And it's almost like one of the worst things you can ask someone, because you're not tying this back to a hypothesis or an assumption that, you know, why am I creating this? Like, I feel as if something about our loop is broken there right now, unless we can help

Chris Butler (24:36.907)

Yeah

David J Bland (25:04.168)

educate people about what to do when you put that vibe coded thing in front of somebody.

Chris Butler (25:08.683)

Yeah, I think that's, so Jim spoke at Product World kind of like right after me, I think, when I was talking a lot about this cycle of using prototyping to do exploration as a PM. And he did come up with a pretty cool tool to be able to get an actual good interview script for a prototype that does try to think about the hypotheses you have. And I do think the questions we should be asking ourselves much more are like, you

It's definitely not like, do you like this, right? Or will you pay for this? Those are both horrible questions to ask inside that thing, but it's really more about how do you understand what's going on? How do you intervene? And so I agree. I think there's a lot of people that people can learn from today to be able to ask better questions. But I think there's not a lot of infrastructure,

at least for process and culture built up around that. Like, this is a prototype that we built this week. Now we're going to test it. And at Philosophie, that was one of the things that I loved. At Philosophie I had to operate on a week-by-week basis where we were doing weekly sprints. We would create a prototype. We'd at least talk to five people by the end of the week, and we would then figure out what to do the next week. Right. And so that process was grueling in some ways. Like it was good that projects only lasted like eight weeks. And usually we'd only do about six weeks of that within that project. But I think we need to figure out what

Maybe that's another interesting thing about this is like, what is the cadence for work? Right now, there used to be down times where we would kind of be like at the end of a process or at the beginning of a process, and it would be time for us to think and reflect. But with this, like there is a hurry up kind of mentality that I think starts to eat up all the time.

And it's something I've found, I've started to see that like PMs when they're in highly reactive mode, when I am doing this, I just think about like executing this thing rather than actually try to figure out is it the right thing. And so I've been like, I think of myself as like a facilitator also, that's something I really love doing, but I've almost like shied away from it now because I'm just like so focused on the building stuff. And so I've been trying to like bring myself back into like, how do I facilitate a conversation between people, maybe with tools?

Chris Butler (27:14.788)

maybe with prototypes, but like how do we start to integrate the actual human conversations a bit more into this stuff? So I totally agree with that.

David J Bland (27:22.422)

Yeah, I was doing a series of interviews recently with executives and one of them really stuck out to me. I can't remember who I was speaking to, but he said, we want to go fast, but we don't want to be hasty. And I thought that's a really interesting way to frame this. Like, yeah, speed's important, but not at the expense of, you know, other things. And I'm wondering, you know, coming back to GitHub and your work there, it sounds like you're doing a lot of

like identifying processes, trying to reimagine those with AI and agentic stuff. Maybe you can talk about anything you can share with us publicly. What's your thought process about looking at opportunities inside GitHub, like how you're testing them? Like, just give us maybe a high level of how you're approaching them.

Chris Butler (27:57.631)

Yeah. Yeah.

Chris Butler (28:08.542)

Yeah, I think one of the things I'm

working on right now has been how do we understand the health of our release pipeline in a much better way? Like what's risky, what's not risky? How do we remove some of the siloing that happens between the team that is building the thing and the team that has to market it and the team that has to sell it? How do we smooth some of those problems over? And so we've been using a technology for the last couple of months. It's in tech preview right now called agentic workflows, which is from the GitHub Next team.

With this technology, it's Actions, like GitHub Actions under the hood, which is usually used for CI/CD type of use. But it also includes instructions plus this ability to do a safe output, so I can restrict what this actual

agent can do within the environment. And so we're starting to build this out. Of the problems that we have identified when it comes to understanding our pipeline, one is, you know, there are sometimes things that get all the way up to their release date but are not identified as being risky before that moment, essentially. And so how do we start to be more proactive about looking at all of these signals, which in some cases can be deterministic, right? Like it could be, well, we haven't actually assigned a PMM to this and we're like two days away and it's a tier one, like that's probably bad, right? That means that

we're not gonna do everything that we should. But there are other things that are kind of like, you know, the comments that have started to happen on this release. They usually should ramp up as you get closer, and they should be more substantial about what is going on inside of them. And so more subjective use cases are where these agents are really great. And so we started to build out kind of a

Chris Butler (29:45.507)

rubric of what is risk within these release tracking issues. And what's really exciting and interesting about the way we're building on this technology is that there are parts where we want a report, or we want a comment on that release tracking issue, or we want to tell the PM via Slack, you have to go and do this thing, right?

But the thing that always comes up is if we built this out as just code, we'd have to write documentation after the fact about what the policies are, when it really should be the other way around. We should write in natural language what the policies are, right? Like, what are the policies we all agree to, and be able to reference those in such a way. But what we can do now with these is I can create a single document that is essentially, here's all the policies that we care about. Some are deterministic, some are non-deterministic or subjective. And it's agent readable, it's human readable, and it can be used by a skill so that people can all

do their own analysis of their own releases. And so I've kind of been starting to call this process as code. And so the idea of the how-we-work document for each team is actually the thing that is happening. And as you change the way you do your work, right? Like, so you have a conversation in a meeting about a new process, it should be flagged that, like, this is not documented yet. And maybe we should automate this.
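To make the "process as code" idea concrete, here is a minimal sketch in Python of one deterministic policy from the rubric Chris describes (a tier one release that is two days out with no PMM assigned), kept next to a subjective policy that stays in natural language. The `Release` fields, the two-day window, and the function name are assumptions made for illustration; this is not GitHub's actual rubric or the agentic workflows format.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Release:
    """Hypothetical stand-in for a release tracking issue."""
    title: str
    tier: int                      # 1 = highest-priority launch (assumed convention)
    release_date: date
    pmm_assignee: Optional[str]    # None means no PMM has been assigned


def deterministic_risks(release: Release, today: date, window_days: int = 2) -> list[str]:
    """Return risk flags that need no judgment call."""
    flags = []
    days_left = (release.release_date - today).days
    # Policy from the conversation: tier one, close to launch, no PMM assigned.
    if release.tier == 1 and release.pmm_assignee is None and days_left <= window_days:
        flags.append(f"Tier 1 release is {days_left} day(s) out with no PMM assigned")
    return flags


# Subjective policies stay as plain language for an agent (or a person) to apply.
SUBJECTIVE_POLICIES = [
    "Comments should ramp up and become more substantial as the release date approaches.",
]

release = Release("Example launch", tier=1, release_date=date(2025, 6, 3), pmm_assignee=None)
print(deterministic_risks(release, today=date(2025, 6, 1)))
```

The point of the split is that the deterministic half can run as ordinary code, while the subjective half is the same text a human would read in the team's policy document.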

Right. so anyways, that health tracking one is one that I've been kind of working on like the last couple of weeks. And it's very exciting the way that we can start to pull out here's the process and the, not the process, but like the kind of the coding, like how we would get this done. It's no longer, I have to write a document that I submit to stakeholders and get approved. And then I have to like plan to execute and then execute. Like I just, the first thing I write is this like policy document or this rubric.

And once we get agreement across all that, there's lots of different ways we can start to generate interesting outputs of that rubric based on the actual artifacts. And that's maybe the other thing that I think is really important, is that for a lot of this agentic automation, like,

Chris Butler (31:37.853)

It really is only about the artifacts that humans deal with. Like it doesn't matter what the agents create in the kind of between the steps that they are doing. What matters is what is the artifact that is output that a human will engage with in some way. And so in our case, like those issues are the core artifacts that are really important for us to, kind of continue to have as the source of truth. It doesn't matter if we create other data sets for analysis and stuff like that. It's the artifacts that, that, that issue set that is really valuable. And so I think that's another thing that is really important when we start to adopt these

agentic technologies is that

It should really be about the artifacts that they're working on that humans use. And I would say that when we talk about any of these things, it's really about a graph of artifacts, right? Like, issues are connected to other issues. They're related through projects. They're related through initiative sets or something like that. And so agents can also go in there and start to do a better job of annotating relationships that are not as clear or are siloed. So another thing that we've been working on is release tracking issues and connecting that with customer feedback, which is like, what's the total ARR that this release will

get us, potentially, from the sales team, right? Because the sales team and the customer support team talk about features in a very different way than the product team talks about features. Actually, one of the things I did when I was on Copilot was I had to create the Copilot feature matrix, which was just a ridiculous spreadsheet at the time. But it was so hard. It was so hard to build because everybody had a different set of features depending on who you were talking to.

And so linking up that taxonomy was one of the hard problems that we had to do. And we have to use actually LLMs with prompting. And so we use a combination of semantic matching, semantic search, which is more about the words. But then there's this level of meaning, which we can start to say, does this actually make sense as being the same thing? And should it be connected? And so I see agents as being that thing that is really going and

Chris Butler (33:30.362)

like always going around your graph and connecting and annotating information that maybe it would have taken a human a very long time to read all this stuff and be aware of it and it would be stuck in their head and they would not like output in a way that was helpful. So those are kind of like some cases that we've been building inside of GitHub that I think are pretty interesting.

David J Bland (33:48.156)

I have so many questions. I'll pause and let you grab a drink here. So a couple of things. One, I don't think we talk enough, or it hasn't been written about enough, how we apply this stuff to process improvement. Even before AI became so popular, I was thinking back almost like when Lean Startup first came out, one of the corporate officers of GE read it and Eric was brought in. Eric Ries was brought in.

And he was like, hey, I need some help here. You know, we're going to do this retreat. And he tapped me to come in and help. And I realized we had, you know, some teams that were building products at GE and obviously like they're building manufacturing and there's all that complexity. A lot of it was process work though. It was like.

our hiring process needs to be more efficient. How would we use almost like lean startup principles to improve our process on hiring? And that was just one of the many, many different things we helped with. And I thought, someday I'm gonna write this really in-depth thing about lean startup for process improvement. I never did by the way, but this idea, maybe because it's just not sexy, I don't know. But I feel like sometimes, yes, you can make money by launching a new product or a new feature.

Sometimes you can make money by just making what you do more efficient internally and serving your employees better so that they in turn can serve your customers. I feel as if, you know, back then, of course, we didn't have a lot of AI that we could leverage like we do today. But I feel as if you're really onto something here inside GitHub where you're looking at these processes and you're saying, hey, how do we look

and expose risk sooner versus later? And how might we use AI to help sort of guide that process? I just feel as if, I don't know, it just brought me back to those days where I was like helping them map it all out manually. I was thinking, man, if I did that again today, that would look very different.

Chris Butler (35:36.798)

Yeah. Yeah.

Chris Butler (35:45.962)

Yeah

Chris Butler (35:53.428)

Yeah. Yeah. I mean, the, I think there are two things that you get with automation. Like one of them is that sometimes people are the process.

So like the TPM that is sitting there with three spreadsheets and the project view and is double checking all the fields, like that is something that was happening like a year ago, right? Like at the very least. And so yes, that should be more efficient. That shouldn't waste their time. They should be there like understanding the risks within the team in some way. But there are also things that we would just say like, well, what would happen if we had this functionality, right? Like what if we had this type of process or this report? There's a team that I'm working with in

which is trying to pull together a lot around customer feedback and they want to do a better job of actually marketing this highly technical thing. And so one of the reports I just completed this week was looking at everything that you've just launched, looking at the blog that GitHub has, looking at all of the stuff that we've been talking about in social media, what a customer has been saying about us, give me two to three blog posts we could write, like topics, and then give me five to 10 social media posts that we could do.

And we're not writing the whole thing, we're not auto generating it, but at least like the bullets, like for each one of the team members that is part of this, they have like big followings within the developer community. And so like just helping them to like have the fodder of like, what should I post on LinkedIn today? Or what should I post on Twitter today? Or like, what is a really important thing for us to talk about with regards to like the big news inside of GitHub? How do we like attach to that with this like new technology? That's something that like we would have had to like try to ask for a PMM to join our team.

None of these people, including me, are marketers. And so really, it's more of a grassroots, like, hey, here's at least some ideas to get you started. We could start from there. It would have taken a lot of work to be able to get to something like that previously. So it wouldn't have been work that would have been done. And then that meant that it still would have been more ad hoc for them to then decide how to talk about things as a team, because they're busy building the technology. This helps with the blank page problem a lot of the time.

Chris Butler (37:57.994)

And so anyways, I see those as two different things that process really helped give you is that yes, we should be removing toil, we should be making it so people are not the process, we should be doing all that type of stuff. But then on the other side, there are things that we just think might make things better, but we don't know until we try them.

And so this is an experiment. This social media list is just an experiment right now. We'll see. If there's no engagement with it, like no one reads it, no one picks up the topics and actually posts them, then I'll know that maybe it wasn't a valuable thing. But this is like a super easy thing for me to put together for them based on all of the context that we have inside the team.

David J Bland (38:33.365)

Yeah, that's incredible. I love that you're treating it as a test, right? Like it's a small test and if it works, we'll dial it up. If not, dampen it down and it's contained.

Chris Butler (38:37.491)

Absolutely.

Chris Butler (38:42.305)

Yeah, I'll create like a whole social media tracking project with issues and everything if it works and we need some more coordination. Like, everybody's saying the same thing or something like that. There's a lot of problems that could come out of this, right? And so we just don't know yet. That's the thing. And so it's so easy for us to just get that report in a way that is actually helpful. And so I think it's helpful, but we'll see if the team actually uses it.

David J Bland (39:03.753)

Yeah, I love that it's just like, we'll test and learn. I mean, that's what we do. When you look at all the different things, it feels like there's a lot of energy. There's a lot of excitement around this inside GitHub. I mean, how are you all at a high level thinking about what opportunities you go after or not? Do you have like a process at a high level? I mean, is it just, hey, go do your thing and ask for forgiveness later? Like what's the culture look like?

Chris Butler (39:06.717)

Yeah. Exactly.

Chris Butler (39:33.234)

Yeah, I would say that the GitHub culture is very like...

Every team kind of works on their own, and so that is something that we've been trying to help fix a little bit: sometimes there's not enough coordination between all the different teams. I would joke for a while that GitHub was like 20 products in a trench coat, essentially. And it's in a good way. Like, there's a lot of stuff that you can do with GitHub, right? So that's why this idea of centralized release tracking, centralized initiatives, helps us understand things a bit more apples to apples. And so I would say, though, that the big problems that we need to

solve are about, I think, how do we do a better job of marketing the stuff that we're building. And so a lot of the work that we're doing right now is really focused on that. But I would also say that I'm one of those people that always has another idea about a problem that I'm trying to solve. And so I definitely spend part of my time doing things that seem ridiculous in some way and maybe not helpful. But

I think that's important, right? Like the truth is, is that even when we can maybe potentially in some near future, we can understand that, like we can understand which people are in alignment with the current strategy and which people are not in alignment. I think there's a value to have a budget for experimentation. And I'm sure this is something that you talk about all the time, right? Like, but like if we don't have that experimentation, like we should be making bets on things that are orthogonal to where we believe we should be going today. And the reason why is because we need to know about them.

And we need to see if they're actually true. Right. And so that to me is the reason why I'm kind of a natural anti-authoritarian like leadership type of thing. It's just like natural propensity, unfortunately. But like that's that's why I'm always asking that question of like, well, there's something over here that doesn't it seems novel or doesn't make sense or just seems cool. And we need to try it because and we can do it much cheaper now. Right. Like.

Chris Butler (41:31.228)

One of the things I did that I thought I still really like as an experiment was, you know,

we tried to run some kickoff meetings where we were using these tools in the kickoff meeting. And it was really clunky. It was annoying to use Copilot Spaces, plus at this time it was Copilot Spark, which was our prototyping tool, plus a bunch of other prompting systems, plus Zoom transcripts. And it was just really bad. It was a bad meeting. Everybody was a good sport, but it was a bad meeting. And we did learn interesting things from that. And so I did put together a prototype of, like, what would it look like to invite the agents that we were using into the meeting,

but that would actually be autonomous in some way inside these meetings. And definitely it's not them speaking up, by the way. They should not be speaking up at all. But the idea of like tracking what is going on in the meeting actively, pulling together like possible plans, having different proxy kind of viewpoints, bringing up.

key points, and then even the idea of like, okay, so we're doing a kickoff meeting right now. There's a certain point where it's like, well, let's go look at a prototype, and that prototype is available to us. And so it did all these kinds of things inside this prototype. It was very easy for me to experiment and look at that. And it seemed very valuable, right? It was harder to understand, but the big thing that we learned was that actually the engagement of the humans was the most important part. And so that could be verbal, like we start talking about the thing that is up there.

But it could also be like, you know, you could get into crazy places where it's like eye gaze, where are we looking on the screen? But this did turn into another experiment, which was I had a problem that, you know, we would go through our weekly sync meeting or triage meeting. Basically, we'd go through the list of the items that we were working on. And then we would get like a recap transcript, but then I would have to go in and update every single item inside that project board. And so I started to pull out the transcript. Plus I created a little Chrome plugin

Chris Butler (43:24.349)

that would track my dwell time on different web pages, because I would present to the rest of the team, basically. And so what I could do is with that transcript and that dwell time report, it could actually update each of those issues with the specifics that we were talking about of that issue automatically, rather than me having to go in and do that. And I'm assuming that like in some near future, we can also just, I can get rid of my dwell time thing, and we can actually just do OCR on what I'm presenting on the URL of the page that's in the video, right?
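The transcript-plus-dwell-time idea can be sketched as a simple join on time: each thing said in the meeting is attached to whichever page was on screen when it was said. The sketch below is a rough Python illustration under assumed data shapes (`Dwell`, `Utterance`, seconds from meeting start, and the example URLs are all hypothetical). It does not reflect how Chris's Chrome plugin actually stores its data, and it stops short of writing anything back to an issue.

```python
from dataclasses import dataclass


@dataclass
class Dwell:
    url: str      # page that was on screen, e.g. an issue URL
    start: float  # seconds from meeting start
    end: float


@dataclass
class Utterance:
    time: float   # seconds from meeting start
    speaker: str
    text: str


def notes_per_issue(dwells: list[Dwell], transcript: list[Utterance]) -> dict[str, list[str]]:
    """Attach each utterance to the page that was on screen when it was spoken."""
    notes: dict[str, list[str]] = {}
    for utterance in transcript:
        for dwell in dwells:
            if dwell.start <= utterance.time < dwell.end:
                notes.setdefault(dwell.url, []).append(f"{utterance.speaker}: {utterance.text}")
                break  # an utterance belongs to at most one page
    return notes


dwells = [
    Dwell("https://example.com/issues/1", 0, 120),
    Dwell("https://example.com/issues/2", 120, 300),
]
transcript = [
    Utterance(30, "Presenter", "This one is blocked on design review."),
    Utterance(150, "Teammate", "We can ship this next week."),
]
for url, lines in notes_per_issue(dwells, transcript).items():
    print(url, lines)
```

The grouped notes per URL are what a summarization step would then turn into an update on each issue, which is the part that previously had to be done by hand.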

And so it's like those types of things that I think it is really important for us to keep saying that like, this is an annoying, inefficient, toilsome job for me to do. And how might I make it better? Because it's, again, the recaps, everybody is coming out with recaps and none of them are valuable because they just kind of get thrown away. And so like, again, how do we focus on that artifact? And that's how I think about it. So what are the artifacts that people actually use? Yeah. So maybe I'm making more of a case for my weird experimentation than I am for.

following the big problems of GitHub, but we do both, is what I would argue.

David J Bland (44:26.504)

I love that you're focused on the people and it's almost like you have to find this balance of all right, what's this tedious thing that I do that I can automate and then where are the points where the friction matters and I need to not automate right away? You know, how do we keep the human in the loop here and the people at the core of this? Because that's what's really going to make your company successful. It feels as if

There's probably something there. Maybe you already have it, but it's almost like a criteria or something you're looking at and going, okay, I'm looking at this like quadrant almost of work that I have and here's what I should automate and here's what I shouldn't. And it just, feels like you're, being very thoughtful about it.

Chris Butler (45:09.596)

Yeah, yeah, you're right. It is kind of, I don't know if it's a framework necessarily, but I would say that the things that we should be taking off of people's plates are those decompositions of actions and skills that are not necessarily helped by.

Human intuition is not necessary, right? And I actually have a slide where I differentiate between what humans are good at and what machines are good at. And machines, again, are good at doing things all the time, doing it repetitively, being able to gather lots of context and sift through it. There's being able to understand patterns of things, and we should be using it for that. Or translation, synthesis, those are all things that these technologies are really good at right now. But this idea of human intuition,

the idea of human conversation as a reasoning tool, the use of facilitated group discussions. Those are things that I think are really key. And in the end, we should be having, and even if it's on a one-on-one basis, I think most people think about the tools today as a one-on-one. I'm going to use the CLI to go back and forth to build this thing.

I think that we need to get out, that mode is very valuable because people think by themselves a lot of the time and having this reflective kind of technology is very helpful. But then there's like the part that we were talking about maybe at the beginning, which is we need to now understand from a tension standpoint within these different roles, these different kind of belief systems, like,

how do we build something that is better? How do we reason together to come to a better conclusion? And so there is like, for me, it's very clear that, I guess it's like, not very clear, but I would say like, there is a distinction between like, what humans and machines should be doing. And it's usually like, anytime that we should be having a conversation, that's something where it's definitely human to human. Whereas like, you could simulate that with a machine, but you're probably not gonna get everything that you could out of that, basically.

David J Bland (46:58.1)

I like that, I like that. So where do you think this is all headed? I mean, you're at kind of the tip of the spear on some of this from my point of view, and you're really down in this day to day and you're trying to push this stuff forward. Where do you see this sort of agentic flow? How do we think about agentic flow? Where do you see this headed over the next year?

Chris Butler (47:01.448)

Yeah.

Chris Butler (47:18.994)

Yeah. Well, I think that right now, we're trying to figure out a lot about the way that we engage with these technologies and how we should build them. And so right now, it's a lot of people kind of building lots of stuff. Inside of GitHub, we have a lot of technical people. And so there's just been like a million tools that have been created internally. Most of them will be abandoned probably by the end of the year, unfortunately. But we've all learned some valuable lessons about

how we will work with this. I think there's still a lot of kind of the old paradigm of access. Like one of the biggest problems I have with building agentic workflows right now is actually like authentication and authorization. And so we have like GitHub tokens and there's like a way that you deal with these things. I really hated them and I feel like I know how to use them now, but I still hate them. But like all that type of thing, I think that infrastructure around how do we make these things appropriate for enterprise.

use cases, that is going to be coming out in the next like 6, 12, 18 months basically. I think there's a lot of good things. Like again, agentic workflows focuses an awful lot on like what is the right output for this. So this automation should only ever create one discussion post per run, right? And it should have this tag and it should start with this. And if it does anything else, like just don't allow it to do it basically. And so it's very restrictive. And I think that is the right way for us to go for a lot of this stuff. I think the...

way that we end up working as a team is going to change over time, right? Like we're, I think the idea of the transcript that it gets kind of like, it gets created, but then people don't use it as much. I think a lot of that kind of information flow stuff is going to become much more natural as tools get better integrated. But even in our case, like it's hard to figure out the right way to integrate the GitHub graph with the M365 graph, for example.

And that's where all of our transcriptions are, is in M365. So I'd say that I think there's going to be a lot of these paper cuts and kind of security concerns. We're starting to figure out the patterns for that, is what I would say. I think when it comes to the roles, there is definitely going to be a pendulum swing towards every PM building, right? And I don't think that's bad necessarily. But I think saying that a PM will build something that will go immediately to production, I think is the wrong.

Chris Butler (49:32.861)

wrong question to ask, or the wrong statement to make. We still want to have like experts in strategy, security and privacy and all this stuff. So I don't think engineering goes away because PMs can create these prototypes. But those prototypes do shorten a lot of the kind of the work that we used to have to do previously to get to at least something that we think is valuable. And then I think the other side of it is.

Right now there's lots of metrics that are being collected, but we're not doing a good job of linking the experimentation cycle with those metrics in a very good way. It's gonna be made easier because right now I can use a skill to be able to go and look at all the Kusto data that we have, and it will do a pretty good job of actually even putting together a first report around it. But I think linking that plus the qualitative side, we still are building one-off tools or using platforms to kind of get people to actually

you know, like we want to get 10 people in today to talk to them about this prototype, right? Like that is still not quite right, in the way that it engages with the prototyping that we end up doing, right? So I see there's a lot of paper cuts and still kind of piecemeal processes that are not connected yet. And so I think we'll start to see more and more of that over the next like year or so as well.
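The restrictive-output idea Chris describes earlier in this answer (an automation should create only one discussion post per run, with a given tag and a given opening, and anything else is refused) can be sketched as a small gate between what an agent proposes and what actually gets published. This is an illustrative Python sketch only. The `DiscussionPost` type, the label, and the title prefix are made up for the example, and it is not the configuration syntax or API of GitHub's agentic workflows.

```python
from dataclasses import dataclass, field


@dataclass
class DiscussionPost:
    """Hypothetical shape of an output an agent proposes to publish."""
    title: str
    labels: list[str] = field(default_factory=list)


def check_outputs(
    posts: list[DiscussionPost],
    required_label: str = "release-health",    # assumed tag name
    title_prefix: str = "[Release health]",    # assumed required opening
    max_posts: int = 1,
) -> list[DiscussionPost]:
    """Return the posts unchanged if they satisfy the policy, otherwise raise."""
    if len(posts) > max_posts:
        raise ValueError(f"run proposed {len(posts)} posts; at most {max_posts} allowed")
    for post in posts:
        if required_label not in post.labels:
            raise ValueError(f"post {post.title!r} is missing label {required_label!r}")
        if not post.title.startswith(title_prefix):
            raise ValueError(f"post {post.title!r} must start with {title_prefix!r}")
    return posts


proposed = [DiscussionPost("[Release health] Weekly summary", ["release-health"])]
print(len(check_outputs(proposed)))
```

Putting the check outside the agent means the agent's instructions can be loose and subjective while the set of things it is able to publish stays narrow.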

David J Bland (50:48.328)

Amazing. I mean, there's so much here that I just need to kind of process. I think it's a bit of a fire hose, right? When you think about all this stuff and all the things you're doing. If folks have listened to this and are like, wow, I'm struggling with this at my company, or they just want to reach out to you, like what's the best way for them to get in touch with you?

Chris Butler (50:52.808)

Yeah, totally.

Chris Butler (51:09.746)

Yeah.

Absolutely. So you can get to me on LinkedIn. I definitely read all my messages and everything. And if people do need help with this, I'm happy to come in and help your team actually put together some of this thinking, especially around learning and development of these new tools. But also just in general, I would say everybody should be trying these things out. They should be trying to use these tools. I definitely am a big fan of agentic workflows out of GitHub Next because I think it kind of finds the right balance between a bunch of different things.

If you are a GitHub user, I would try that out. Yeah, anyways, I would love to hear if people try these things out just in general, if these patterns are helpful to them. I would love to connect on LinkedIn and just see how you're using them and also how you're failing at them. That's super interesting to me too.

David J Bland (51:57.908)

So if you're listening to this and you wanna get in touch with Chris, LinkedIn is the way. We'll also put these links in the description. I wanna thank you so much for hanging out with me. I've been looking forward to this conversation for a while and I'll probably need another week or two just to process everything. But I just wanna say I appreciate you so much for just being transparent, talking about what's working, what's not, and hanging out with us. Thank you so much.

Chris Butler (52:19.144)

Thank you for having me here. I really appreciate it.

David Bland
