Wednesday, May 28, 2008

Google's open-source balancing act

Google's open-source balancing act
Chris DiBona's job--manager of Google's open-source programs--is a balancing act.

Google consumes a lot of open-source software for its own highly profitable business. But as he oversees the search powerhouse's open-source work, DiBona has to ensure that the company reciprocates. It can't be all take and no give.



Chris DiBona, Google's manager of open-source programs

(Credit: Stephen Shankland/CNET News.com)
Free and open-source software advocates can be powerful allies--but also vocal critics. For example, some have critized Google for its lack of support for the Affero GPL license, which can require those using software for a publicly available network service to share modifications they've made to an AGPL software project.

DiBona thinks Google strikes the right balance, though, by offering its own modifications back to many open-source projects, advocating the philosophy in general, and trying to nurture the next generation of open-source programmers.

DiBona has been steeped in open-source software for more than a decade. Before his job at Google, he worked for Slashdot, still an influential virtual water cooler for open-source discussion. Slashdot was part of Linux server maker VA Linux Systems, which had a spectacular initial public offering in 1999 followed not long after by a drastic cutback.

DiBona will be preaching the open-source gospel at the Google I/O conference Wednesday--"open source is too good to be true and thus must be magic," according to the agenda--but I sat down with him beforehand to hear his view of open-source software at Google.

What's the view of open source within Google?
I asked myself, "Who am I trying to address?" The world of open-source business? No. The world of the open-source enthusiast? No. I'm really looking to work with open-source developers. We came up with these goals for our group: to support open-source development in general, which means to support open-source infrastructure; support the release of open-source code, from Google and in general; and to create more open-source developers, because especially when I started, there was a perception that Google took a lot of people from the open-source world and then went away. It was partly true, because people would come here and say, "Wow, I've been working on my open-source project forever, and I want a new problem," and we have a very good class of new problem. So they kind of went away.

That was too bad. The last thing we wanted as a company was to hurt the release of open-source software, because we consider it pretty important. We use a ton of it. Every engineer we bring on--how much open-source do they want to use? We have new packages and new libraries being brought into the company all the time. It's our group's job to track that. As we brought people in, we wanted to be sure more open-source developers were being created. So that's where we came up with the Google Summer of Code, and now we have a high-school flavor of that as well. I think we've made a very real impact in creating new people in the open-source world.

I'm curious about maintaining a balance between contributing back to upstream projects vs. maintaining your own internal forks. How do you go through that evaluation?
Google considers some projects more important than others. Obviously the Linux kernel is incredibly important. Every time you use Google, you're using a machine running the Linux kernel. We have a fairly large kernel team, and we employ people whose job is just to work on the external kernel. Andrew Morton is a good example of that. We try to make sure those guys patch out (submit their modifications to the main open-source project) whenever they can. It's usually more dictated by the engineer's time than it is any lack of desire on our part. I always wish we were able to release more, but it takes time for an engineer to do that. For the larger efforts, it's a little easier because there are more personnel on it.

The same thing goes for our compilers (software that translates programmers' code into instructions a computer understands). The great thing about our compiler team is they patch as a matter of their jobs. They're always patching out things from the compiler work we do internally to the outside world. We recently released the new linker, Gold--Ian Lance Taylor works for us on our compiler team. He's been on the GCC team forever. He used to be at Cygnus (a company that developed GCC). We have a lot of ex-Cygnus people.

Then there are Googlers who just want to patch into an existing projects. They found a bug, they want to add a feature. That takes no time at all. Our team looks at the first couple patches an engineer wants to send out, makes sure the engineer knows what they're doing with the outside world, then they're basically given free rein to do that. They keep us posted on what they're patching. We want to make sure our code gets out to the projects as fast as possible because projects keep on iterating. If you don't get your patches in, they won't get accepted, because they'll be too old or won't matter. If you've got a patch, getting it out there fast is better for us, because then as that project iterates and comes back into the company, we don't have to reapply a patch.

What are the most important open-source projects you ingest?
The kernel, compilers--GCC, the Python interpreter. Python is very important to us. Google App Engine--it's a Python hosting system, basically. Java is very important to us, and that's become open-source now. We have some very good Java people working for us--Josh Block, Neil Gafter--they've got a great handle on that technology.

Once you get past those three projects--the compilers, the languages, the kernel--then you go to the libraries. For us that's OpenSSL, zlib, PCRE. MySQL is hugely important to us. Past that, it starts tapering off pretty quick.

Has the open-sourcing of Java changed anything for you?
Not really. I think it had more impact on the outside world than for us. Java is a fairly mature language now. We've been using it for a long time. Before, it was the JCP (the Java Community Process to govern Java's future)--it had the rubric of openness around it. It was never really not so open. There are questions around what open source means now around Java, specifically J2ME (Java's mobile edition for gadgets such as cell phones) and the TCK (the technology compatibility kit).

Are you using a super-uber-customized Linux kernel, or are you guys pretty much vanilla?
I don't think there's such thing as a customized Linux kernel anymore. The kernel is incredibly flexible. It's got all these different architectures. I think the Linux kernel itself is this ubercustomized thing.

But do you have a lot of in-house customizations?
Not a lot. Google is exposed to some interesting hardware before the rest of the world. So internally we'll be sampling code for that hardware. So that's pretty custom stuff. But eventually that goes to the outside world. We funded some work with a group in Berkeley called Xorp to bring high-speed Broadcom networking chip functionality to Linux. It's not in our interest to keep control of it ourselves. So is it customized? Absolutely. But is it heavily customized? I don't think it is as heavily customized as you might think.

Is it true you still use 2.4 kernels?
In some places, sure.

How about for the core search product?
I don't know how it's partitioned out. When you think of Google, you think of search being on top of a kernel that's static. It's not always like that. It differs on data centers. I think 2.6 predominates, though.

There's been discussion about reciprocity. When General Public License (GPL) version 3 came out, the Free Software Foundation dumped the Affero clause out of GPLv3 and split it out into a separate license. Eben Moglen (co-founder of the Software Freedom Law Center and then counsel to the Free Software Foundation) said, to paraphrase, "If Google starts getting too parasitic, then we'll re-evaluate it." How worried are you of getting a negative perception of using more than you contribute?
I do worry about this. I think it is a largely incorrect perception. You can always give out more, and there are always people who will never be satisfied. Could we be giving back more? Sure. One of the ways I ameliorate that problem is (through) projects like the Summer of Code. Google is releasing every year, not counting Android or the really large open-source projects like GWT, a new project every two or three weeks. Or patching hundreds of projects a month. I conservatively estimate we're releasing about a million lines of code a year from the company.

If you talk to open-source developers--people who are working on projects--I think they understand that. It came back to who do we want to interact with. I always felt the enthusiast community would understand that eventually, and I think that's true. There are some people who are upset with us because we didn't embrace the Affero-style GPL, but it's not practical for us to do so. When they had an Affero-style clause in GPLv3, the thing I told Eben was, "Listen, you can adopt whatever you want. We'll still keep on backing up the FSF and the SFLC as much as we can, but it means we won't be able to use that license inside, because it won't be practical for us to do so." I think that's a very realistic response. The Affero GPL is out there. That's great for the people who use it. It's just not for us.

That's the thing about free software. You're not obligated to use it. We have enough fine-grained control within the company that we don't use things we don't want to use.

What are your preferred licenses?
We generally release under the Apache License--Apache 2. We think it has the fairest language of the licenses. And the GPL requires a lot of management--more than we have time for to run a project well under that license--patch flow and all that. Apache 2 encourages people to take the thing and run with it. That's what we're going for when we release code, whether it's to have people adopt technologies we really like, or for API examples. That said, we've released things under the GPL, LGPL, GPL version 3, BSD. We default to the Apache License.

To what extent to you subsidize gurus to sit around and work on important projects?
We've got people like Jeremy Allison and Andrew Morton and some of Guido (van Rossom)'s time. He's been working pretty heavily on Google App Engine and Mondrian. It's more common that we...try to make open source a part of their job, so they're patching out to the libraries they use. We think that's more healthy than having people whose job is just working on an open-source project.

You use open source a lot internally. Do you have some kind of intellectual property vetting or review before you use it?
We do. There are two ways we do this. When somebody wants to bring a piece of code in from the outside world--open-source or commercial--you need to put it inside a special directory we call "third party." They're required to put in a file called readme.google (that describes) where they got that software, how it's licensed, what category that license falls under. We look for things that are obvious. There are some projects that have dubious intellectual property provenance, and we know those, and we know the people who run them, and we tend not to use those ever.

Since Google doesn't distribute a lot of software, we have it easier than companies that ship hardware and software. We have a couple situations where that does happen--the Google Search Appliance, some of the downloadable applications. Those get a little extra attention. Similarly, when we have larger projects like Google Android, we have a higher ceremony--every two weeks we get together and see if the license picture has changed.

The tracking model works really well for us. We have tools written where a program manager or a release manager can turn on a certain level of warning within the build tool and it will tell them what open-source software they have and how they have to comply with it. At that point we set up a mirror for them as they get closer to release.

So that's the first way we track things. The second way is whenever a Googler puts in a changelist now--this is something we're just starting to do--we compare it against all known open-source code on the Internet using our Code Search product. We compare the changelist that comes from your average Google engineer against that database of code and we look for intersections. When we find an intersection, we take a look and see if it's truly a copy. And if it is, we make sure it's in the right directory and that it's properly labeled. And we call up the engineer if it isn't and make sure it gets tagged properly so we can do the right thing by these licenses.

That tool is kind of in its infancy. We're trying to figure out ways to automate what it does. But it's great because it scales programmatically. Our group's goal is not to break builds or stop development. It's to enable developers to use as much open-source as possible. We think it's healthy, because then they're not writing that code, they're writing other code.

Do you vet code for patent or copyright?
No. We have legal people on our lists. We have two main lists that track these things. Open-source licensing for incoming code and open-source releasing for outgoing code. Legal has a presence there. Patents are incredibly tricky.

Is it easier to get hired at Google if you have experience maintaining your own open-source product or patch?
If you have made a name for yourself in open source, clearly it helps. If you have a healthy project in open-source, I believe it helps. One thing I see on hiring committees is when somebody has an open-source history, it's really great. You can just look at that history. Interviews are great, but they're not very deep. They're only 45 minutes long. So how can you really get a feel for if a person is good at programming, at computer science?

Or at social relations, for that matter.
Open source really reveals that incredibly quickly. You can look at their code, at their activity on mailing lists, how they deal with bugs from real people, and real user problems. That's an incredible resource.

The Summer of Code isn't really a recruiting program. If it is, it's a really expensive one. Last year we created about 2 million lines of open-source code across the 900 students who took part. Of those probably a third are going to stick around with their projects, because the rest have to go back to college.

We have a couple students who have been in the program two or three years. The whole point is to support kids over the summer so they can go and program and not get some other job that has nothing to do with computer science. It's our fourth year doing it. This year we've go 1,109 students doing it across 95 countries.
source-http://news.cnet.com/

No comments: