Extending Git and Mercurial repositories with Amazon S3
I'm sure many of you have heard, or know from personal experience, that version control systems do not play nicely with binary files, large files and especially large binary files. Here and below we are talking about modern, popular distributed version control systems such as Mercurial and Git.
Often this doesn't matter. I don't know whether it is the cause or the consequence, but version control systems are used mainly for storing relatively small text files, with the occasional image or library.
If a project uses a large number of high-resolution images, sound files, source files for graphics, 3D or video editors, this becomes a problem. All of these files are usually large and binary, which means that all the benefits and conveniences of version control systems, repository hosting and the accompanying services become unavailable.
Below we will use an example to look at integrating a version control system with Amazon S3 (cloud file storage), so as to take advantage of both solutions and compensate for their shortcomings.
The solution is written in C#, uses the Amazon Web Services API, and shows an example configuration for a Mercurial repository. The code is open; the link is at the end of the article. Everything is written more or less modularly, so adding support for something other than Amazon S3 should not be too difficult. I assume configuring it for Git would be just as easy.
It all started with the idea that we need a program which, once integrated with the version control system and the repository, would work completely transparently, requiring no additional actions from the user. Like magic.
To integrate with the version control system you can use so-called hooks — events to which you can attach your own handlers. We are interested in the ones that run when data is received from or sent to another repository. In Mercurial the hooks we need are called incoming and outgoing. Accordingly, we need to implement one command for each event: one to upload updated data from the working folders to the cloud, and the other for the reverse process — downloading updates from the cloud to the working directory.
Integration with the repository is done through a metadata file, or index file, call it what you like. This file must contain a description of all tracked files, at least the path to each of them, and it is this file that is placed under version control. The tracked files themselves go into .hgignore, the ignore list, otherwise the whole point of the exercise is lost.
Integration with the repository
A file with the metadata looks something like this:
<?xml version="1.0" encoding="utf-8"?>
<assets>
<locations>
<location>Content\Textures</location>
<location>Content\Sounds</location>
<location searchPattern="*.pdf">Docs</location>
<location>Reference Libraries</location>
</locations>
<amazonS3>
<accesskey>*****************</accesskey>
<secretkey>****************************************</secretkey>
<bucketname>mybucket</bucketname>
</amazonS3>
<files>
<file path="Content\Textures\texture1.dds" checksum="BEF94D34F75D2190FF98746D3E73308B1A48ED241B857FFF8F9B642E7BB0322A"/>
<file path="Content\Textures\texture1.psd" checksum="743391C2C73684AFE8CEB4A60B0317E634B6E54403E018385A85F048CC5925DE"/>
<!-- And so on for each tracked file -->
</files>
</assets>
This file has three sections: locations, amazonS3 and files. The first two are configured by the user at the start, and the last one is maintained by the program itself to keep track of the files.
Locations lists the paths in which tracked files are searched for. These are either absolute paths or paths relative to the xml settings file. These paths must be added to the version control system's ignore file so that it does not try to track them.
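For example, with the locations from the sample above, the ignore file could look roughly like this (a sketch using Mercurial's regexp syntax; adjust the patterns to your own layout):
syntax: regexp
^Content/Textures/
^Content/Sounds/
^Docs/.*\.pdf$
^Reference Libraries/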
AmazonS3 is, as you might guess, the configuration of the cloud file storage. The first two keys are the access keys that can be generated for any AWS user; they are used to cryptographically sign requests to the Amazon API. Bucketname is the name of the bucket — the entity inside Amazon S3 that can contain files and folders and that will be used to store all versions of the tracked files.
Files does not need to be configured: the program edits this section itself while working with the repository. It contains the list of all files of the current version, with their paths and hashes. So after a pull we get a new version of this xml file, and by comparing the contents of the files section with the contents of the tracked folders we can see which files were added, which were modified and which were simply moved or renamed. During a push the comparison is performed in the opposite direction.
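Reading this metadata into memory is straightforward; the sketch below uses LINQ to XML and the element names exactly as in the sample above (the class and member names are made up for illustration and are not the tool's actual API):
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical model types; the real tool may organize this differently.
public class AssetsConfig
{
    public List<(string Path, string SearchPattern)> Locations = new List<(string, string)>();
    public string AccessKey, SecretKey, BucketName;
    public Dictionary<string, string> Files = new Dictionary<string, string>(); // path -> checksum
}

public static class AssetsConfigReader
{
    public static AssetsConfig Load(string xmlPath)
    {
        var root = XDocument.Load(xmlPath).Root; // <assets>
        var config = new AssetsConfig
        {
            AccessKey = (string)root.Element("amazonS3").Element("accesskey"),
            SecretKey = (string)root.Element("amazonS3").Element("secretkey"),
            BucketName = (string)root.Element("amazonS3").Element("bucketname"),
        };
        foreach (var loc in root.Element("locations").Elements("location"))
            config.Locations.Add(((string)loc, (string)loc.Attribute("searchPattern") ?? "*"));
        foreach (var file in root.Element("files").Elements("file"))
            config.Files[(string)file.Attribute("path")] = (string)file.Attribute("checksum");
        return config;
    }
}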
Integration with the version control system
Now about the commands. The program supports three of them: push, pull and status. The first two are meant to be attached to the corresponding hooks. Status displays information about the tracked files; its output resembles the output of hg status — you can see which files in the working directory have been added, modified or moved, and which are missing.
The push command works as follows. First, the list of tracked files — their paths and hashes — is read from the xml file; this is the last state recorded in the repository. Then information is gathered about the current state of the working folder: the paths and hashes of all tracked files. Finally, the two lists are compared.
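Gathering the second list — the current state of the working folder — might look roughly like this (a sketch; it assumes the paths in the metadata are relative to the folder containing the xml settings file, and it ignores the hash cache described later):
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

public static class WorkingFolderScanner
{
    // Builds a path -> hex-encoded SHA-256 map for every file under the configured locations.
    public static Dictionary<string, string> Scan(string baseDir,
        IEnumerable<(string Path, string SearchPattern)> locations)
    {
        var root = Path.GetFullPath(baseDir).TrimEnd(Path.DirectorySeparatorChar)
                   + Path.DirectorySeparatorChar;
        var result = new Dictionary<string, string>();

        foreach (var location in locations)
        {
            var dir = Path.GetFullPath(Path.Combine(baseDir, location.Path));
            if (!Directory.Exists(dir)) continue;

            foreach (var file in Directory.EnumerateFiles(dir, location.SearchPattern,
                                                          SearchOption.AllDirectories))
            {
                // Store the path relative to the settings file, matching the metadata.
                var relative = file.StartsWith(root) ? file.Substring(root.Length) : file;
                using (var sha = SHA256.Create())
                using (var stream = File.OpenRead(file))
                    result[relative] = BitConverter.ToString(sha.ComputeHash(stream))
                                                   .Replace("-", "");
            }
        }
        return result;
    }
}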
There can be four different situations (a sketch of the comparison in code follows the list):
- The working folder contains a new file. This happens when there is no match either by path or by hash. As a result, the xml file is updated — a record for the new file is added — and the file is uploaded to S3.
- The working folder contains a modified file. This happens when there is a match by path but no match by hash. As a result, the hash in the corresponding record of the xml file is changed and the updated version of the file is uploaded to S3.
- The working folder contains a moved or renamed file. This happens when there is a match by hash but no match by path. As a result, the path in the corresponding record of the xml file is changed, and nothing needs to be uploaded to S3 at all: the key under which a file is stored in S3 is its hash, and the path lives only in the xml file. Since the hash has not changed, uploading the same file to S3 again makes no sense.
- A tracked file has been deleted from the working folder. This happens when one of the records in the xml file does not match any local file. As a result, the record is removed from the xml file. Nothing is ever removed from S3, since its whole purpose is to store all versions of the files so that any revision can be restored.
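In code the comparison boils down to matching two path-to-hash maps; a minimal sketch (the FileDiff and StateComparer names are invented for illustration, with oldFiles coming from the xml and newFiles from scanning the working folder):
using System.Collections.Generic;
using System.Linq;

// Result of comparing the last recorded state (xml) with the working folder.
public class FileDiff
{
    public List<string> Added = new List<string>();
    public List<string> Modified = new List<string>();
    public List<(string OldPath, string NewPath)> Moved = new List<(string, string)>();
    public List<string> Deleted = new List<string>();
}

public static class StateComparer
{
    public static FileDiff Compare(Dictionary<string, string> oldFiles,
                                   Dictionary<string, string> newFiles)
    {
        var diff = new FileDiff();
        var oldByHash = oldFiles.GroupBy(kv => kv.Value)
                                .ToDictionary(g => g.Key, g => g.First().Key);

        foreach (var file in newFiles)
        {
            if (oldFiles.TryGetValue(file.Key, out var oldHash))
            {
                if (oldHash != file.Value) diff.Modified.Add(file.Key); // same path, new hash
                // same path and same hash: unchanged, nothing to do
            }
            else if (oldByHash.TryGetValue(file.Value, out var oldPath))
            {
                diff.Moved.Add((oldPath, file.Key));                    // same hash, new path
            }
            else
            {
                diff.Added.Add(file.Key);                               // no match at all
            }
        }

        foreach (var old in oldFiles)
            if (!newFiles.ContainsKey(old.Key) && !newFiles.ContainsValue(old.Value))
                diff.Deleted.Add(old.Key);                              // record with no local file

        return diff;
    }
}
During a push, the Added and Modified files are uploaded and the xml is rewritten accordingly; during a pull the same comparison is read in the opposite direction.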
There is also a fifth possible situation — unchanged. It occurs when both the path and the hash match, and in this situation no action needs to be taken.
The pull command compares the file list from the xml with the list of local files in exactly the same way, but in the opposite direction. For example, if the xml contains a record for a file that is new to us — that is, it matches neither by path nor by hash — then this file is downloaded from S3 and written locally at the specified path.
An example hgrc with the hooks configured:
[hooks]
post-update = \path\to\assets.exe pull \path\to\assets.config \path\to\checksum.cache
pre-push = \path\to\assets.exe push \path\to\assets.config \path\to\checksum.cache
Hashing
Interaction with S3 is kept to a minimum: only two operations are used, GetObject and PutObject. A file is uploaded to or downloaded from S3 only when it is new or has been modified. This is possible because the file hash is used as the object key, so all versions of all files sit in the S3 bucket without any hierarchy — without folders at all. There is an obvious downside — collisions. If two files have the same hash, the data of one of them simply takes the other's place in S3.
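With the hash as the key, both operations are tiny wrappers around the AWS SDK for .NET; a sketch (the method names and the choice of region are assumptions of this example, not the tool's actual API):
using System.Threading;
using System.Threading.Tasks;
using Amazon;
using Amazon.S3;
using Amazon.S3.Model;

public class AmazonS3RemoteStorage
{
    private readonly IAmazonS3 _client;
    private readonly string _bucket;

    public AmazonS3RemoteStorage(string accessKey, string secretKey, string bucket)
    {
        // The region is an assumption of this sketch; the real tool may configure it differently.
        _client = new AmazonS3Client(accessKey, secretKey, RegionEndpoint.USEast1);
        _bucket = bucket;
    }

    // The object key is the file's hash, so identical content is stored exactly once.
    public Task UploadAsync(string localPath, string hash) =>
        _client.PutObjectAsync(new PutObjectRequest
        {
            BucketName = _bucket,
            Key = hash,
            FilePath = localPath
        });

    public async Task DownloadAsync(string hash, string localPath)
    {
        using (var response = await _client.GetObjectAsync(_bucket, hash))
            await response.WriteResponseStreamToFileAsync(localPath, append: false, CancellationToken.None);
    }
}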
Still, the convenience of using hashes as keys outweighs the potential danger, so I did not want to give them up. We just need to take the probability of collisions into account: it can be reduced, and the consequences can be made less fatal.
Reducing the probability is very simple: use a hash function with a longer digest. In my implementation I used SHA-256, which is more than enough. However, this still does not rule out collisions entirely, so we need to be able to detect them before any changes are made.
Doing that is also not difficult. All local files are hashed anyway before a push or pull is executed, so we only need to check whether any of the hashes coincide. It is enough to perform the check during a push, before the collision gets recorded in the repository. If a conflict is detected, the user is shown a message about the problem and asked to change one of the two files and push again. Given how unlikely such a situation is, this solution is satisfactory.
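The check itself amounts to grouping the freshly computed hashes and reporting duplicates; a minimal sketch (the reporting here is simplified):
using System;
using System.Collections.Generic;
using System.Linq;

public static class CollisionCheck
{
    // files: path -> hash for the current working folder.
    // Returns true if two different files produced the same hash (a collision).
    public static bool HasCollisions(Dictionary<string, string> files)
    {
        var collisions = files.GroupBy(kv => kv.Value)
                              .Where(g => g.Count() > 1)
                              .ToList();
        foreach (var group in collisions)
            Console.Error.WriteLine(
                "Hash collision: {0} share the hash {1}. Change one of these files and push again.",
                string.Join(", ", group.Select(kv => kv.Key)), group.Key);
        return collisions.Count > 0;
    }
}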
Optimizing
This program has no hard performance requirements; whether it runs for one second or for five is not that important. Still, there are obvious places that can and should be taken into account, and probably the most obvious one is hashing.
The approach assumes that every command execution computes the hashes of all tracked files. This operation can easily take a minute or more if there are a few thousand files or if their total size exceeds a gigabyte. Spending a whole minute computing hashes is inexcusably long.
If you notice that typical repository usage does not involve changing every file right before a push, the solution becomes obvious — caching. In my implementation I decided to use a pipe-delimited file that lives next to the program and contains information about all previously computed hashes:
path to file|file hash|date the hash was computed
This file is loaded before a command runs, used during the process, then updated and saved after the command finishes. So if the hash of logo.jpg was last computed one day ago and the file itself was last changed three days ago, there is no point in recomputing its hash.
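A minimal version of such a cache, assuming the three-column pipe-delimited format shown above (the class name and details are illustrative, not the tool's actual implementation):
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Linq;

public class ChecksumCache
{
    private readonly string _cacheFile;
    // path -> (hash, when the hash was computed)
    private readonly Dictionary<string, (string Hash, DateTime ComputedAt)> _entries
        = new Dictionary<string, (string, DateTime)>();

    public ChecksumCache(string cacheFile)
    {
        _cacheFile = cacheFile;
        if (!File.Exists(cacheFile)) return;
        foreach (var parts in File.ReadLines(cacheFile).Select(line => line.Split('|')))
            if (parts.Length == 3)
                _entries[parts[0]] = (parts[1], DateTime.Parse(parts[2],
                    CultureInfo.InvariantCulture, DateTimeStyles.RoundtripKind));
    }

    // Returns the cached hash if it is newer than the file's last modification,
    // otherwise computes it with the supplied function and refreshes the entry.
    public string GetHash(string path, Func<string, string> computeHash)
    {
        var lastWrite = File.GetLastWriteTimeUtc(path);
        if (_entries.TryGetValue(path, out var entry) && entry.ComputedAt >= lastWrite)
            return entry.Hash;

        var hash = computeHash(path);
        _entries[path] = (hash, DateTime.UtcNow);
        return hash;
    }

    public void Save() =>
        File.WriteAllLines(_cacheFile, _entries.Select(e =>
            string.Join("|", e.Key, e.Value.Hash,
                e.Value.ComputedAt.ToString("o", CultureInfo.InvariantCulture))));
}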
Another optimization — perhaps a stretch to call it that — is using a BufferedStream instead of the plain FileStream for reading files, including reads done to compute hashes. Tests showed that a BufferedStream with a 1 megabyte buffer (instead of FileStream's default 8 kilobytes), when computing the hashes of 10 thousand files with a total size over a gigabyte, speeds the process up about four times on a standard HDD compared with FileStream. If there are fewer files and they are each larger than a megabyte, the difference is insignificant — around 5-10 percent.
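The hashing itself then looks something like this (a sketch of wrapping the file stream in a 1 MB BufferedStream before feeding it to SHA-256):
using System;
using System.IO;
using System.Security.Cryptography;

public static class Hashing
{
    // Computes the SHA-256 of a file, reading it through a 1 MB BufferedStream
    // instead of relying on FileStream's default 8 KB buffer.
    public static string ComputeSha256(string path)
    {
        using (var sha = SHA256.Create())
        using (var file = new FileStream(path, FileMode.Open, FileAccess.Read))
        using (var buffered = new BufferedStream(file, 1024 * 1024))
        {
            return BitConverter.ToString(sha.ComputeHash(buffered)).Replace("-", "");
        }
    }
}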
Amazon S3
Two points need clarifying here. The most important one is probably the price. As you may know, for new users the first year is free as long as you stay within the limits: 5 GB of storage, 20,000 GetObject requests per month and 2,000 PutObject requests per month. At full price this will cost you about $1 a month. For that you get redundancy across multiple datacenters within a region, and good speed.
Also, I suspect the reader has been tormented from the very beginning by the question: why reinvent the wheel when there is Dropbox? The thing is that using Dropbox directly for collaboration is risky — it simply cannot cope with conflicts.
But what if it is not used directly? In fact, in the described solution Amazon S3 can easily be replaced with Dropbox, SkyDrive, BitTorrent Sync or other similar services. In that case they act as the storage for all file versions, and the hashes are used as file names. In my solution this is implemented via FileSystemRemoteStorage, an analogue of AmazonS3RemoteStorage.
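Both backends can sit behind one small abstraction; the interface below is an assumption of this sketch (the article only names the two implementations), and FileSystemRemoteStorage simply copies blobs into a folder that Dropbox or a similar service keeps in sync:
using System.IO;

// A storage backend only has to put and get blobs addressed by their hash.
// The interface name and shape here are an assumption, not the tool's actual API.
public interface IRemoteStorage
{
    void Upload(string localPath, string hash);
    void Download(string hash, string localPath);
}

// Stores every version of every file in a local or synced folder
// (e.g. inside a Dropbox, SkyDrive or BitTorrent Sync directory),
// using the hash as the file name.
public class FileSystemRemoteStorage : IRemoteStorage
{
    private readonly string _storageRoot;

    public FileSystemRemoteStorage(string storageRoot)
    {
        _storageRoot = storageRoot;
        Directory.CreateDirectory(storageRoot);
    }

    public void Upload(string localPath, string hash) =>
        File.Copy(localPath, Path.Combine(_storageRoot, hash), overwrite: true);

    public void Download(string hash, string localPath)
    {
        var dir = Path.GetDirectoryName(Path.GetFullPath(localPath));
        if (!string.IsNullOrEmpty(dir)) Directory.CreateDirectory(dir);
        File.Copy(Path.Combine(_storageRoot, hash), localPath, overwrite: true);
    }
}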
The promised link to the source code: bitbucket.org/openminded/assetsmanager