On Centralisation of Code Hosting Infrastructure—An Argument

Many, many Free/Libre Open Source Software (FLOSS) projects have their code and/or related content (mostly issues, wikis and pull requests, but also documentation and websites) pooled on large cloud providers. The most notable one is GitHub, probably followed by a large margin by Bitbucket and GitLab (I don’t have the actual numbers on this, but I don’t think anyone will disagree).

Earlier this year, GitHub was bought by Microsoft. Microsoft, historically The Devil in FLOSS circles. This has created a huge outcry and an exodus from the GitHub platform to others, most notably GitLab (who played the event very cleverly, PR-wise). They saw an increase of GitHub imports by at least 30 dB.

Of course, the question is: what do you even fear Microsoft could do? As a FLOSS project, there is typically not much private information; the code, data and issues are public (but see below). And what makes Microsoft, in this context, worse than the old GitHub management? Or the current GitLab management? Bitbucket?

In this vein, I’ll discuss the pros and cons I see for the current centralisation of code hosting on major providers (or, in reality, mostly GitHub).

What Services Are We Talking About?

In case it isn’t clear, here are the services which those platforms provide:

Authentication and Authorization: They allow you to log in, and allow you to control who has write access to the project’s data, possibly in rather fine-grained steps.
Code hosting: They provide access to the code to the public.
Issue/Pull request management: They provide tools to create and manage issues and pull requests. Often, rather advanced code review tools are included.

What Are The Threats?

To figure this out, let us look at what is at stake. Let us treat the code hosting platform as an attacker and figure out what it can do. Going into classic security theory, let us look at how the platform can endanger the three security goals Confidentiality, Integrity and Availability (CIA; funny, right?).

Confidentiality

This one is easy and hard at the same time. Easy, because the vital data of FLOSS projects is public. Hard, because the part which isn’t varies from platform to platform. Here’s a common denominator:

User passwords
User emails (if different from what is in the commit logs anyways)
IP addresses used to access the service during pushes, pulls or when editing data in the web interface (e.g. issues)
Timing of interactions with the website (not necessarily correlating with commit times)

The users have to trust the platform with this data in order to use it. The platform could abuse the data for profiling or leak it to the public (probably inadvertently).

Integrity

This is a big one. The platform has control over the code and the metadata (commits, issues etc.). This could lead to major issues, for example, if the platform started to include obfuscated cryptocurrency mining code in all Node.js projects (contrived example!).

To prevent this, maintainers need to take a look at commit logs and audit new commits. Thanks to how modern distributed version control systems (DVCS) work, there is no way such a platform could inject code into old commits without it being noticed by everybody. So the only way to threaten the integrity of the code would be by appending commits to the tip of e.g. the master branch.
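Why old commits cannot be changed silently can be shown with a minimal sketch. It assumes a simplified model in which a commit ID is just a hash over the parent’s ID and the commit’s content; real DVCS hash more data than that, and the function names here are made up for illustration.

```python
import hashlib


def commit_id(parent_id, content):
    """Derive a commit ID from the parent's ID and this commit's content.

    Simplified model for illustration; not the exact format of any DVCS.
    """
    return hashlib.sha256(f"{parent_id}\n{content}".encode("utf-8")).hexdigest()


def build_history(contents):
    """Return the commit IDs of a linear history, oldest first."""
    ids = []
    parent = ""  # the root commit has no parent
    for content in contents:
        parent = commit_id(parent, content)
        ids.append(parent)
    return ids


honest = build_history(["initial import", "add parser", "fix typo"])
tampered = build_history(["initial import", "add parser + miner", "fix typo"])

# The first commit is untouched, so its ID is the same in both histories.
print(honest[0] == tampered[0])
# The modified commit and every commit after it end up with different IDs.
print(honest[1] == tampered[1], honest[2] == tampered[2])
```

Everybody who already has a copy of the honest history holds the old IDs, so a rewritten history shows up as a mismatch on their next fetch.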

Depending on the project size and organisation, such an attack is probably feasible: many people working on the master branch concurrently will lead to folks not being wary when they can’t push because a new commit has been added in the meantime.

Availability

Of course, the service can shut down at any instant. It can cease to provide free services to FLOSS projects. It could restrict its platform to projects which are not under the GPL. All of these would be threats to the availability of the service to FLOSS projects.

This is a bummer, but thanks to DVCS, at least the code is easily replicated on another platform, or even a self-hosted thing, if one platform chose to do this.

Other data can typically be exported using specialised tools; paranoid (or rather, cautious) projects could set up a periodic task which downloads all the non-code data from the service, to be able to restore it somewhere else in case it turns against them.
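Such a periodic export could look roughly like the following sketch. The `fetch_page` callable is a hypothetical stand-in for whatever API client or export tool the platform offers; nothing here is tied to a specific service, and the fake fetcher exists only to make the example runnable.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def export_issues(fetch_page, target_dir):
    """Download all issues page by page and store them in one JSON file.

    `fetch_page(n)` is a hypothetical callable which returns the list of
    issues on page `n` (starting at 1) and an empty list after the last page.
    """
    issues = []
    page = 1
    while True:
        batch = fetch_page(page)
        if not batch:
            break
        issues.extend(batch)
        page += 1

    # One timestamped file per run, so older snapshots are kept.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    target = Path(target_dir) / f"issues-{stamp}.json"
    target.write_text(json.dumps(issues, indent=2), encoding="utf-8")
    return target


def fake_fetch_page(page):
    """Stand-in for a real API client, for demonstration only."""
    pages = {
        1: [{"id": 1, "title": "Crash on start"}],
        2: [{"id": 2, "title": "Typo in README"}],
    }
    return pages.get(page, [])


with tempfile.TemporaryDirectory() as tmp:
    path = export_issues(fake_fetch_page, tmp)
    restored = json.loads(path.read_text(encoding="utf-8"))
    print(len(restored), "issues exported")
```

Run from a scheduler such as cron, this leaves a series of snapshots from which the issues could be imported elsewhere.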

What Are The Benefits?

To figure out the benefits of a large centralised platform, let us look at the drawbacks or challenges you face when providing code hosting for your (or a bunch of people’s) projects.

You need to grant those people write access to your server. They will be able to upload arbitrary data, which will then be publicly available (I’m going to leave aside the issues of code injection attacks on your domain, because I assume that whatever code hosting software you use is secure against that). This means that you either have to employ monitoring and filtering of the content which goes up, be prepared to deal with someone uploading stuff you really do not want to have under your domain (and somebody else noticing), or trust your users.

For now, I’d say that "trust your users" is going to work in this scenario. Let this be a friends&family code hosting service; nobody will intend to harm you or the service. So that’s not an issue.

However, your users will want to be able to receive contributions from the wider community. This means that "everyone" must be able to create issues and pull requests for the projects. This means that they need accounts, or that you open up the instance for anonymous comments/issues (really not a good idea).

Having to create an account is a bad user experience. Not having to create a new account for each project I want to contribute to is one of the great benefits of GitHub and similar (older) platforms. Some systems have solved this with "shim" accounts, such as the Prosody issue tracker: You "sign up" using just your email address in the same step when you send your first comment. A confirmation email is sent; once confirmed, your comment is posted and you can post further comments, until your browser forgets about the secure cookie which has been set for you.
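The shim account flow can be sketched in a few lines. This is an in-memory illustration of the steps described above, not how the Prosody issue tracker is actually implemented; the class and method names are made up, and sending the email is left out.

```python
import secrets


class ShimAccounts:
    """In-memory sketch of email-confirmed "shim" accounts."""

    def __init__(self):
        self.pending = {}   # confirmation token -> (email, comment)
        self.sessions = {}  # session cookie value -> email
        self.comments = []  # published (email, comment) pairs

    def submit(self, email, comment):
        """Hold the comment back and return the token to send by email."""
        token = secrets.token_urlsafe(32)
        self.pending[token] = (email, comment)
        return token

    def confirm(self, token):
        """Publish the held comment and return a session cookie value."""
        email, comment = self.pending.pop(token)  # KeyError if unknown
        self.comments.append((email, comment))
        cookie = secrets.token_urlsafe(32)
        self.sessions[cookie] = email
        return cookie

    def post(self, cookie, comment):
        """Post directly as long as the browser still presents the cookie."""
        email = self.sessions[cookie]  # KeyError once the cookie is gone
        self.comments.append((email, comment))


tracker = ShimAccounts()
token = tracker.submit("alice@example.org", "I can reproduce this.")
print(len(tracker.comments))  # nothing is published before confirmation
cookie = tracker.confirm(token)
tracker.post(cookie, "Here is a backtrace.")
print(len(tracker.comments))
```

The only thing standing between a stranger and a published comment is the ability to read one email.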

This type of stuff would be nice to have in software such as Gitea. However, from my experience with co-hosting a large forum community, I learnt that once you get big enough, spammers will implement the few lines of code to match your specific flavour of confirmation email (a concept they’ve been dealing with since the early days of phpBB). So this is not going to fly in the long run.

Aside from that, there’s another big issue. If you want people to post pull requests, they (with the current model of how things work) need to fork the repository on your instance. And then they need to be allowed to add code to that fork. And that’s the point where you have granted write access on your server not only to your friends&family, but to the entire internet.

You will be faced with abuse. It is just a matter of time.

I think those two (not having to create an account for each project, and not having to deal with abuse as a project maintainer) are major points which speak for the use of a few, centralised code hosting platforms. And they are big ones, compared to the threats from the previous section.

Where To Go From Here?

Of course, as I’ve written in an earlier post, I value my privacy. I also value decentralisation in general; having too much power in a few hands is rarely a good thing, because it poses the temptation to abuse it.

So what are the steps we, as the FLOSS community, have to take forward?

A federated authentication service, which provides all the typical account stuff (user name, email, possibly avatar and short bio, real name, etc.) and allows you to control precisely which data is shared with which platform. The important part is that, for example when changing your avatar or email address, you don’t have to go to all the various code hosting instances and change it there, but have a single, centralised (for you; it is still federated so that other folks could be using other services) point where you can update this info.

The service would then be responsible for pushing the changes to the consumers of your user data. It would also push deletions of (parts of) the user data when you revoke access to (parts of) it.

One way of implementing this would be using the Extensible Messaging and Presence Protocol (XMPP), standardised by the IETF. It offers a way to share this information, and to push updates to consumers. In addition, it supports federation so that each (group of) users can host their own server to prevent lots of user data accumulating in a single place.

Full disclosure: I have been working on XMPP and XMPP clients for a few years now and am a member of the XMPP Council for the term 2018/19.

A way to create pull requests cross-instance. This requires that the code hosting instances can talk to each other during user interaction (such as posting (review) comments, adding commits, etc.). You would want to have notifications about new comments in your "home" instance, and likewise you’d want your new commits to be visible on the "target" instance right away without further action.

This, again, could be solved with XMPP. Huh, I didn’t intend this post to be an advertisement piece for XMPP, but here goes. Essentially, this is the kind of ground feature set required for federated social networking. And there has been prior art for doing this type of stuff in XMPP. goffi, who is well known in the XMPP community for their work on social network use-cases and Salut à Toi, has laid the groundwork for a decentralised code forge based on XMPP.
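To make the account service from the first point more concrete, here is a minimal sketch of per-platform sharing with pushed updates and deletions. It is a toy model under my own assumptions: the class, the field names and the forge names are invented, and the "push" is a plain dictionary update where a real service would talk to the consumers over the network (for instance via XMPP).

```python
class ProfileService:
    """Sketch of a user-controlled profile store with per-consumer grants."""

    def __init__(self):
        self.profile = {}    # field name -> value
        self.grants = {}     # consumer name -> set of shared field names
        self.consumers = {}  # consumer name -> that consumer's copy of the data

    def grant(self, consumer, fields):
        """Share the given fields with a consumer and push current values."""
        self.grants[consumer] = set(fields)
        self._push(consumer)

    def update(self, field, value):
        """Change a field once; every consumer with access receives it."""
        self.profile[field] = value
        for consumer, fields in self.grants.items():
            if field in fields:
                self._push(consumer)

    def revoke(self, consumer, field):
        """Withdraw a field; the push also deletes it at the consumer."""
        self.grants[consumer].discard(field)
        self._push(consumer)

    def _push(self, consumer):
        # A real service would send this over the network; here the
        # consumer's copy is simply replaced.
        allowed = self.grants[consumer]
        self.consumers[consumer] = {
            k: v for k, v in self.profile.items() if k in allowed
        }


service = ProfileService()
service.update("email", "alice@example.org")
service.update("avatar", "cat.png")
service.grant("forge-a", ["email", "avatar"])
service.grant("forge-b", ["avatar"])

service.update("avatar", "dog.png")  # one change, pushed to both forges
service.revoke("forge-a", "email")   # forge-a loses the email address

print(service.consumers["forge-a"])
print(service.consumers["forge-b"])
```

Whether a consumer really deletes revoked data is of course a matter of trust; the service can only ask.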

However, even with such decentralised tools, we would still have to have people who provide the service. Project maintainers are not always interested in also hosting services. So, just like with email and instant messaging, the ideal would be if groups would flock together and host their own services together, spreading the load among those who also happen to have an interest in maintaining services.

Conclusion

So, uh. This went in a different direction than I originally intended. To wrap the argument up:

For the current state of the art, I think that we have to live with centralised code hosting platforms such as GitHub. They do their job fairly well. They don’t hurt (too much). They take a massive load off the FLOSS community by handling abuse and providing a rather stable and reliable service.

In the future, we might come up with technologies which allow us to avoid this and have secure, reliable and usable decentralised code hosting. I’m hoping that we get there eventually.