Git work flows in the upcoming 2.7 release
The upcoming Open Build Service (OBS) 2.7 release will deliver massive improvements to the way we deal with git sources for builds.
OBS was designed for Linux distribution creation, not software development. In your typical distribution creation work flow you get a new upstream release in the form of a tar ball from time to time, and you add patches on top of that for local fixes. Nowadays the OBS is also used to develop software projects outside the context of a distribution. That work flow has completely different requirements: developers want continuous builds, a new build for every commit.
The problem
In the past a build on the OBS would fetch the entire git repository and create a tar ball every time. Even if you only changed one single bit, the repository would be fetched and a new tar ball would be generated and stored. As you can imagine, this was not efficient at all. Another problem was that there was no simple way to try (local) builds of changes. The developer had to commit the changes to the git repository or create local patches and incorporate them into the build. Both work flows were pretty disconnected.
The solution
In the upcoming OBS 2.7 we have introduced a new format for incremental storage of git commits. This has a number of advantages:
- We can store the cloned source archive in an efficient way, which means OBS instances will use significantly less disk space for builds with git repositories as sources.
- The git checkout becomes available as a directory for the build. You have the chance to implement a variety of source services that interact with the checkout.
- The git checkout becomes available as a directory in your osc checkout. You can do your git work directly from the osc checkout and test it with a local build.
An example
A practical example based on a package may help to explain how this works. This example uses qgroundcontrol, a drone control application.
First, create the package container as usual:
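```
# create the qgroundcontrol package; home:hans is a placeholder project name
osc meta pkg -e home:hans qgroundcontrol
```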
then tell the OBS where the git repository is via:
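```
# a recent osc turns a repository URL into a source service entry;
# with an older osc you can create the _service file shown below by hand
osc add https://github.com/mavlink/qgroundcontrol.git
```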
which will create a source service (_service) file like this:
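```
<services>
  <!-- roughly; the exact set of buildtime services may differ in your setup -->
  <service name="obs_scm">
    <param name="scm">git</param>
    <param name="url">https://github.com/mavlink/qgroundcontrol.git</param>
  </service>
  <service name="tar" mode="buildtime"/>
  <service name="recompress" mode="buildtime">
    <param name="file">*.tar</param>
    <param name="compression">xz</param>
  </service>
  <service name="set_version" mode="buildtime"/>
</services>
```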
and finally add a RPM spec file (copied from the stable package in this case):
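```
# copy the spec from wherever the stable package lives (path is just an example)
cp ~/stable/qgroundcontrol/qgroundcontrol.spec .
osc add qgroundcontrol.spec
osc commit -m "build qgroundcontrol from git"
```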
When committing this, the OBS fetches the checkout, builds a tar ball and runs the build. If you have been using source services for git repositories before, nothing has really changed in your work flow yet, right? Now comes the interesting part: Working with git in your osc checkout.
Working with git in your osc checkout
Let’s say you want to fix something in your git repository AND you want to make sure it still builds.
First check out the package with:
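```
# home:hans is again a placeholder for your project
osc checkout home:hans qgroundcontrol
cd home:hans/qgroundcontrol
```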
create the git checkout with:
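```
# run the source services locally (older osc versions: "osc service localrun")
osc service run
```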
An unversioned directory called “qgroundcontrol” will be created.
You can go inside and modify any file and start a local build.
The source service will run again, fetch changes from the git repository (incremental
changes, not the entire repository), and then apply your local changes on top
of it.
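For example (the repository name and architecture are placeholders for whatever your project builds):

```
cd qgroundcontrol        # the git checkout inside the package checkout
# ... edit whatever needs fixing ...
cd ..
osc build openSUSE_Leap_42.1 x86_64 qgroundcontrol.spec
```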
You can simply commit your local changes to the package, or you can of course push them using git. How awesome is that!?
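Concretely, both of these work from the same checkout (the commit messages are only examples):

```
# record the change as part of the OBS package
osc commit -m "fix the build against the newer toolchain"

# or push it upstream straight from the embedded git checkout
cd qgroundcontrol
git commit -a -m "fix the build against the newer toolchain"
git push
```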
Some more tricks
The obs_scm service also allows you to fetch files directly from the checkout. For instance you could maintain your build description (e.g. RPM spec file) in git and let OBS use it by using the “extract” parameter for the service.
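Such a _service entry could look roughly like this; the path to the spec file inside the repository is made up for the example:

```
<service name="obs_scm">
  <param name="scm">git</param>
  <param name="url">https://github.com/mavlink/qgroundcontrol.git</param>
  <!-- file to copy out of the git checkout into the package sources -->
  <param name="extract">packaging/qgroundcontrol.spec</param>
</service>
```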
Warning: We have not yet released the new tools. If you want to try this now, please add the repository for your distribution from the openSUSE:Tools project and install osc, build and obs-service-tar_scm from there onto your workstation. You have to do the same for your OBS project: it has to build against the openSUSE:Tools project to get the tools for creating the tar ball.
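On an openSUSE workstation that boils down to something like the following (openSUSE Leap 42.1 is only an example, pick the repository matching your distribution):

```
zypper ar http://download.opensuse.org/repositories/openSUSE:/Tools/openSUSE_Leap_42.1/openSUSE:Tools.repo
zypper in osc build obs-service-tar_scm
```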
These steps won’t be necessary anymore once we have released the osc and source service updates for our distributions.
Algorithm details
As most commits only change some lines in a few files, we want to store only those changes. The OBS is pretty file-centric, so we want to keep the tar balls stored as single files, but only store the deltas between those files, like the xdelta or rsync tools do. Delta algorithms work by finding common blocks in files. Unfortunately this means that the files cannot be stored compressed, as compressed data looks close to random bytes. In the end we came up with our own algorithm for the very simplistic cpio “new” format.
As an example, consider that we have a file, say (the contents here are made up purely for illustration):
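```
The quick brown wolf jumps over the lazy old dog
```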
and we want to change it to:
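```
The quick brown fox jumps over the lazy old dog
```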
We split one of the files into blocks of fixed size:
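(This toy example uses 8-byte blocks; the real implementation uses 1024-byte blocks, see below.)

```
The quic|k brown |wolf jum|ps over |the lazy| old dog
```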
We hash the block contents with a rolling checksum, so that we can
look them up in a fast way.
Then we go through the other file and try to find those blocks:
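(The brackets mark the block-sized window we are currently hashing.)

```
[The quic]k brown fox jumps over the lazy old dog
```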
Ah, nice, found a hit. Extend it in both directions as long as the
content is the same:
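```
[The quick brown ]fox jumps over the lazy old dog
```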
So we encode “take 16 bytes from offset 0”. Then continue searching:
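```
The quick brown [fox jump]s over the lazy old dog
```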
No match. Roll in one byte:
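```
The quick brown f[ox jumps] over the lazy old dog
```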
No match. Roll in one byte:
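```
The quick brown fo[x jumps ]over the lazy old dog
```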
… roll roll roll…
A hit! Extend in both directions:
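```
The quick brown fox[ jumps over the lazy old dog]
```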
The still rolled out bytes are:
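```
fox
```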
So we encode both “take the following 3 bytes: ‘fox’” and “take 28 bytes from offset 20”.
Now we’re done, the final recipe is:
- take 16 bytes from offset 0
- take the following 3 bytes: ‘fox’
- take 28 bytes from offset 20
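To make the procedure concrete, here is a small Python sketch of the same idea. It is not the OBS implementation (it skips the real rolling checksum and simply looks each window up in a hash table), but it reproduces the recipe above for the illustration strings:

```
# Toy block-delta encoder, a minimal sketch (not the actual OBS code): index
# the stored file's fixed-size blocks, scan the changed file and emit
# copy/insert instructions. A real implementation would use a rolling
# checksum instead of re-hashing every window.

BLOCK = 8  # toy block size; the OBS delta store uses 1024-byte blocks


def make_delta(old, new, block=BLOCK):
    # hash every aligned block of the stored file for fast lookup
    index = {}
    for off in range(0, len(old) - block + 1, block):
        index.setdefault(old[off:off + block], off)

    ops = []       # the recipe: ("copy", offset, length) or ("insert", bytes)
    pending = b""  # bytes rolled in without finding a match yet
    pos = 0
    while pos < len(new):
        off = index.get(new[pos:pos + block])
        if off is None:
            pending += new[pos:pos + 1]  # no match, roll in one byte
            pos += 1
            continue
        # found a block, extend the match in both directions
        start_old, start_new = off, pos
        while (pending and start_old > 0
               and old[start_old - 1] == new[start_new - 1]):
            start_old, start_new = start_old - 1, start_new - 1
            pending = pending[:-1]
        end_old, end_new = off + block, pos + block
        while (end_old < len(old) and end_new < len(new)
               and old[end_old] == new[end_new]):
            end_old, end_new = end_old + 1, end_new + 1
        if pending:
            ops.append(("insert", pending))
            pending = b""
        ops.append(("copy", start_old, end_old - start_old))
        pos = end_new
    if pending:
        ops.append(("insert", pending))
    return ops


old = b"The quick brown wolf jumps over the lazy old dog"
new = b"The quick brown fox jumps over the lazy old dog"
print(make_delta(old, new))
# [('copy', 0, 16), ('insert', b'fox'), ('copy', 20, 28)]
```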
The result
For testing we have used 1024-byte blocks and encoded all kernel tar balls from the “kernel-vanilla” packages of all projects on the reference server. That meant we needed to encode 2525 tar balls consisting of 207 GB of data (1194 GB when converted to uncompressed cpio balls).
Our delta algorithm took each cpio file and converted it to instruction files plus data added to the delta store. The instruction files sum up to 3.6 GB whereas our delta store is 4.7 GB (after compression). So in total we use 8.3 GB, which is only 4% (!!!) of the original compressed tar balls.
Pretty significant storage reduction and cool new ways to interact with git. We are excited about these changes. What do you think? Are you going to make use of this once it’s released? Let us know on the mailing list, on IRC or in the comments!