DeltaRepo Design - Tojaj/DeltaRepo GitHub Wiki

A set of tools that generate and merge the differences between an old and a new version of repodata.

Idea

  • Repodata could be pretty big (dozens or hundreds of megabytes).
  • Changes between two versions of repodata could be very small (a single deleted package).
  • Let's make a tool that detects the changes between two versions of repodata and generates a delta (diff) from them, plus a tool that applies this delta to the old repodata.

Design ideas

  • Maximal compatibility.
  • Avoid significant changes in the repomd.xml or other repodata files.
  • A repodata delta is itself a repository.
  • Plugins - Generic DeltaRepo can create and apply delta files for primary, filelists and other. Deltas for other repodata files (groupfile, prestodelta, ..) could be handled via plugins.

DeltaRepo

File structure of a single delta repo looks like:

repodata/
  |-primary.xml.gz
  |-filelists.xml.gz
  |-other.xml.gz
  |-deltametadata.xml.xz
  |-repomd.xml

Here primary, filelists and other are in the classical format, but they contain only the changed or added packages.
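As a rough illustration of what "changed or added" could mean, the sketch below compares two repos represented as dictionaries mapping location_href to pkgId. The function name `diff_packages` and this data layout are assumptions made for the example; they are not DeltaRepo's actual API.

```python
def diff_packages(old_pkgs, new_pkgs):
    """Split packages into those shipped in the delta and those removed.

    Both arguments are hypothetical dicts: {location_href: pkgId}.
    """
    # A package whose (location, pkgId) pair is not present in the old repo
    # is new or changed, so its metadata would go into the delta.
    added_or_changed = {
        href: pkgid
        for href, pkgid in new_pkgs.items()
        if old_pkgs.get(href) != pkgid
    }
    # Locations that disappeared from the new repo are recorded as removed.
    removed = sorted(href for href in old_pkgs if href not in new_pkgs)
    return added_or_changed, removed


old = {"a-1.0.rpm": "aaa", "b-1.0.rpm": "bbb", "c-1.0.rpm": "ccc"}
new = {"a-1.0.rpm": "aaa", "b-1.0.rpm": "b2b", "d-1.0.rpm": "ddd"}
delta, removed = diff_packages(old, new)
```

In this toy input, the unchanged package is left out of the delta entirely, which is where the size savings come from.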

deltametadata.xml is described below.

deltametadata.xml

This file contains a record for each plugin used. It serves as persistent storage for delta plugins, which can keep here any configuration and data needed later, when the delta is applied.

Example:

<?xml version='1.0' encoding='UTF-8'?>
<deltametadata>
  <revision src="123" dst="456"/>
  <contenthash src="abc" dst="bcd"/>
  <timestamp src="120" dst="450"/> <!-- The highest timestamp from metadata -->
  <usedplugins>
    <plugin name="MainDeltaPlugin" version="1" src_contenthash="abc" dst_contenthash="bcd" contenthash_type="sha256">
      <removedpackage location_href="../packages/fake_bash-1.1.1-1.x86_64.rpm"/>
      <metadata database="1" type="filelists"/>
      <metadata database="1" original="1" type="other"/>
      <metadata database="1" type="primary"/>
    </plugin>
  </usedplugins>
</deltametadata>

Each <plugin> element should have a name and a version attribute. Other attributes can be added by the plugin without limitation. What a plugin stores in the deltametadata file depends on its needs; nothing prescribes what the plugin has to store.

Plugins must not read, write or modify the deltametadata file on their own! This file is parsed by deltarepo and accessed via the PluginBundle object.

Description of the example above:

As you can see, the plugin named "MainDeltaPlugin" (the core plugin for metadata of types primary, primary_db, filelists, filelists_db, other and other_db) stores a list of removed packages and a note about each metadata file that was processed.

Each metadata element contains information like:

  • Type of metadata on which this record is applied.
  • Whether a database should be generated (database attribute).
  • Whether the file in the delta repository is a real delta or just a copy of the original file (original attribute).
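The layout of the example can be explored with a short standalone script. This is only a sketch for inspecting the file format: real plugins go through the PluginBundle object rather than parsing the file themselves, and the function name `read_plugin_records` and the returned dictionary shape are assumptions of this example.

```python
import xml.etree.ElementTree as ET

# The example document from above, passed as bytes.
DELTAMETADATA = b"""<?xml version='1.0' encoding='UTF-8'?>
<deltametadata>
  <revision src="123" dst="456"/>
  <contenthash src="abc" dst="bcd"/>
  <timestamp src="120" dst="450"/>
  <usedplugins>
    <plugin name="MainDeltaPlugin" version="1" src_contenthash="abc" dst_contenthash="bcd" contenthash_type="sha256">
      <removedpackage location_href="../packages/fake_bash-1.1.1-1.x86_64.rpm"/>
      <metadata database="1" type="filelists"/>
      <metadata database="1" original="1" type="other"/>
      <metadata database="1" type="primary"/>
    </plugin>
  </usedplugins>
</deltametadata>"""


def read_plugin_records(xml_bytes):
    """Return {plugin name: record} for every <plugin> element."""
    root = ET.fromstring(xml_bytes)
    records = {}
    for plugin in root.iter("plugin"):
        records[plugin.get("name")] = {
            "version": plugin.get("version"),
            "removed": [p.get("location_href")
                        for p in plugin.findall("removedpackage")],
            # type -> (database flag, original flag). Treating a missing
            # attribute as "0" is an assumption of this sketch.
            "metadata": {
                m.get("type"): (m.get("database", "0") == "1",
                                m.get("original", "0") == "1")
                for m in plugin.findall("metadata")
            },
        }
    return records


records = read_plugin_records(DELTAMETADATA)
main = records["MainDeltaPlugin"]
```

Here `main["metadata"]["other"]` carries both flags, meaning a database is wanted and the file in the delta repo is a plain copy of the original.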

TODO: Relax NG schema (?)

repomd.xml

Classical repomd.xml, but with a contenthash (examined below).

Contenthash of deltarepo

The contenthash of a deltarepo is a string of the form:

contenthash_of_old_repo-contenthash_of_new_repo

Eg: 5a8e6bbb940b151103b3970a26e32b8965da9e90a798b1b80ee4325308149d8d-b8d60e74c38b94f255c08c3fe5e10c166dcb52f2c4bfec6cae097a68fdd75e74
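A minimal sketch of building and splitting such an identifier follows. Both helper names are hypothetical, and the split relies on the assumption that the individual contenthashes are hex digests and therefore never contain "-".

```python
def deltarepo_contenthash(old_contenthash, new_contenthash):
    """Join the contenthashes of the old and the new repo."""
    return "%s-%s" % (old_contenthash, new_contenthash)


def split_deltarepo_contenthash(value):
    """Return (old_contenthash, new_contenthash).

    Assumes hex digests, so the first "-" is the separator.
    """
    old, new = value.split("-", 1)
    return old, new


old_hash = "5a8e6bbb940b151103b3970a26e32b8965da9e90a798b1b80ee4325308149d8d"
new_hash = "b8d60e74c38b94f255c08c3fe5e10c166dcb52f2c4bfec6cae097a68fdd75e74"
combined = deltarepo_contenthash(old_hash, new_hash)
```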

Contenthash of a repo

Contenthash is an identifier based on content of repo's primary.xml.

It is a hash calculated from all packages listed in primary.xml.

Calculation algorithm:

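import hashlib

def calculate_contenthash(repo):
    pkgids = []
    for pkg in repo:
        # Assume that pkg.pkgId and pkg.location_href are never empty (never None)
        pkgids.append("%s%s%s" % (pkg.pkgId, pkg.location_href, pkg.location_base or ''))
    contenthash = hashlib.new("sha256")
    # Sorting makes the result independent of the package order in primary.xml
    for pkgid in sorted(pkgids):
        # hashlib works on bytes, so each string is encoded first
        contenthash.update(pkgid.encode("utf-8"))
    return contenthash.hexdigest()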
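To illustrate two properties of this calculation, the self-contained sketch below checks that the result does not depend on package order and that it changes when a package is dropped. The `Pkg` tuple is a hypothetical stand-in for a package record parsed from primary.xml, and the sample pkgIds are made up.

```python
import hashlib
from collections import namedtuple

# Hypothetical stand-in for a package record parsed from primary.xml.
Pkg = namedtuple("Pkg", ["pkgId", "location_href", "location_base"])


def calculate_contenthash(repo, hash_type="sha256"):
    """Hash the sorted per-package identifiers, as described above."""
    ids = sorted(
        "%s%s%s" % (p.pkgId, p.location_href, p.location_base or "")
        for p in repo
    )
    digest = hashlib.new(hash_type)
    for pkgid in ids:
        digest.update(pkgid.encode("utf-8"))
    return digest.hexdigest()


a = Pkg("1111", "Packages/a-1.0-1.noarch.rpm", None)
b = Pkg("2222", "Packages/b-2.0-1.noarch.rpm", None)

# Same packages in a different order give the same contenthash.
order_independent = calculate_contenthash([a, b]) == calculate_contenthash([b, a])
# Removing a package changes the contenthash.
detects_removal = calculate_contenthash([a, b]) != calculate_contenthash([a])
```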

"Extra" repodata files

"Extra" repodata files are metadata files for which a sophisticated delta function is not currently provided.

DeltaRepo can currently produce sophisticated deltas of the base metadata files:

  • primary.xml
  • filelists.xml
  • other.xml
  • primary.sqlite
  • filelists.sqlite
  • other.sqlite
  • repomd.xml

Example of extra repodata files:

  • comps.xml
  • deltainfo
  • pkgorigins
  • prestodelta.xml
  • ...

For these "unsupported" files the following algorithm is used:

  • If a file in the new repo is the same as the one in the old repo, just add a note to deltametadata.xml that the old file should be reused during delta application; the file is not included in the deltarepo.
  • If a file in the new repo differs from the one in the old repo, include a copy of the new file in the deltarepo. (If the file is not compressed, compress it to save some space.)
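The two rules above can be sketched as follows. This is an illustration under stated assumptions, not DeltaRepo's implementation: the function names are hypothetical, "the same" is decided by comparing sha256 checksums, compression is detected from the file suffix, and gzip is used for the compression step.

```python
import gzip
import hashlib
import os
import shutil
import tempfile


def file_checksum(path, hash_type="sha256"):
    """Checksum of a file, read in chunks."""
    digest = hashlib.new(hash_type)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def process_extra_file(old_path, new_path, delta_dir):
    """Return ("reuse", None) or ("copy", path_in_delta_dir)."""
    if (old_path and os.path.exists(old_path)
            and file_checksum(old_path) == file_checksum(new_path)):
        # Unchanged: only a note would be stored in deltametadata.xml.
        return ("reuse", None)
    name = os.path.basename(new_path)
    if name.endswith((".gz", ".bz2", ".xz")):
        # Already compressed (judged by suffix): copy as is.
        dst = os.path.join(delta_dir, name)
        shutil.copyfile(new_path, dst)
    else:
        # Not compressed: compress while copying to save space.
        dst = os.path.join(delta_dir, name + ".gz")
        with open(new_path, "rb") as src, gzip.open(dst, "wb") as out:
            shutil.copyfileobj(src, out)
    return ("copy", dst)


with tempfile.TemporaryDirectory() as tmp:
    old = os.path.join(tmp, "old_comps.xml")
    new = os.path.join(tmp, "comps.xml")
    delta_dir = os.path.join(tmp, "delta")
    os.mkdir(delta_dir)
    for path, text in ((old, b"<comps/>"), (new, b"<comps><group/></comps>")):
        with open(path, "wb") as f:
            f.write(text)
    unchanged = process_extra_file(new, new, delta_dir)
    changed = process_extra_file(old, new, delta_dir)
    copied_name = os.path.basename(changed[1])
```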

Integration into the current repodata

Tree structure:

mirror/
  +-deltarepos
  |   +-ei7as764ly-043fds4red
  |   |   +-repodata
  |   |      |-primary.xml
  |   |      |-filelists.xml
  |   |      |-other.xml
  |   |      |-removed.xml
  |   |      |-repomd.xml
  |   |
  |   +-0w78as1r9r-043fds4red
  |   |   +-repodata
  |   |      |- ...
  |   |
  |   |-deltarepos.xml.xz
  |
  +-Packages
  |   |- ...
  |
  +-repodata
      |-primary.sqlite.bz2
      |-primary.xml.gz
      |-filelists.sqlite.bz2
      |-filelists.xml.gz
      |-other.sqlite.bz2
      |-other.xml.gz
      |-repomd.xml

Notes:

  • deltarepos.xml is compressed by xz because of its great compression ratio.

deltarepos.xml

<?xml version="1.0" encoding="UTF-8"?>
<deltarepos>
  <deltarepo>
    <location href="deltarepos/deltarepo-1387087418-iUpWS4" />
    <revision src="1387077214" dst="1387087288" />
    <size total="15432" />
    <contenthash type="sha256" src="ei7as764ly" dst="043fds4red" />
    <timestamp src="1387075214" dst="1387086288" />
    <data size="15892" type="deltametadata"/>
    <data size="2829865" type="filelists"/>
    <data size="323506" type="other"/>
    <data size="520736" type="primary"/>
    <repomd>
      <timestamp>1387087412</timestamp>
      <size>4314</size>
      <checksum type="sha256">9d2c14fb2</checksum>
    </repomd>
  </deltarepo>
  <deltarepo>
    <location href="deltarepos/deltarepo-1387087345-XDsFe4" />
    <size total="7869" />
    <revision src="1387077745" dst="1387087288" />
    <contenthash type="sha256" src="0w78as1r9r" dst="043fds4red" />
    <timestamp src="1387075417" dst="1387086288" />
    <data size="16001" type="deltametadata"/>
    <data size="2831887" type="filelists"/>
    <data size="325532" type="other"/>
    <data size="520943" type="primary"/>
    <repomd>
      <timestamp>1387087340</timestamp>
      <size>4275</size>
      <checksum type="sha256">eo38KMbYO4</checksum>
    </repomd>
  </deltarepo>
</deltarepos>
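One way a client might use this index is to look up a delta that leads from its local contenthash to the target one. The sketch below is only an assumption about such a client: the function name `find_deltarepo` is hypothetical, the sample is a trimmed copy of the example above, and preferring the smallest total size is a choice made for this illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the example above (only the elements used here).
DELTAREPOS = b"""<?xml version="1.0" encoding="UTF-8"?>
<deltarepos>
  <deltarepo>
    <location href="deltarepos/deltarepo-1387087418-iUpWS4" />
    <size total="15432" />
    <contenthash type="sha256" src="ei7as764ly" dst="043fds4red" />
  </deltarepo>
  <deltarepo>
    <location href="deltarepos/deltarepo-1387087345-XDsFe4" />
    <size total="7869" />
    <contenthash type="sha256" src="0w78as1r9r" dst="043fds4red" />
  </deltarepo>
</deltarepos>"""


def find_deltarepo(xml_bytes, src, dst):
    """Return the location href of the smallest matching delta, or None."""
    root = ET.fromstring(xml_bytes)
    candidates = []
    for rec in root.findall("deltarepo"):
        ch = rec.find("contenthash")
        if ch is not None and ch.get("src") == src and ch.get("dst") == dst:
            size = int(rec.find("size").get("total"))
            candidates.append((size, rec.find("location").get("href")))
    return min(candidates)[1] if candidates else None


href = find_deltarepo(DELTAREPOS, "0w78as1r9r", "043fds4red")
missing = find_deltarepo(DELTAREPOS, "unknown", "043fds4red")
```

When no delta matches the local contenthash, a client would presumably fall back to downloading the full repodata.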

Needed changes in the current state of repodata

repomd.xml

We need to store contenthash.

Add new element "contenthash"

<?xml version="1.0" encoding="UTF-8"?>
<repomd xmlns="http://linux.duke.edu/metadata/repo" xmlns:rpm="http://linux.duke.edu/metadata/rpm">
  <revision>1355393568</revision>
  <contenthash type="sha256">5a8e6bbb940b151103b3970a26e32b8965da9e90a798b1b80ee4325308149d8d</contenthash>
  <data type="primary">
    ....
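Reading the proposed element back could look like the sketch below. The sample document is a shortened, closed version of the snippet above, and the function name `read_contenthash` is hypothetical; returning None for a repomd.xml without the element is an assumption about how older repos might be handled.

```python
import xml.etree.ElementTree as ET

# Default namespace used by repomd.xml, in ElementTree's {uri}tag notation.
REPO_NS = "{http://linux.duke.edu/metadata/repo}"

REPOMD = b"""<?xml version="1.0" encoding="UTF-8"?>
<repomd xmlns="http://linux.duke.edu/metadata/repo" xmlns:rpm="http://linux.duke.edu/metadata/rpm">
  <revision>1355393568</revision>
  <contenthash type="sha256">5a8e6bbb940b151103b3970a26e32b8965da9e90a798b1b80ee4325308149d8d</contenthash>
</repomd>"""


def read_contenthash(xml_bytes):
    """Return (hash type, hash value), or None if the element is absent."""
    root = ET.fromstring(xml_bytes)
    node = root.find(REPO_NS + "contenthash")
    if node is None or node.text is None:
        return None
    return node.get("type"), node.text.strip()


result = read_contenthash(REPOMD)
absent = read_contenthash(
    b'<repomd xmlns="http://linux.duke.edu/metadata/repo"/>'
)
```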

Use cases

./deltarepo repo1 repo2

  • Creates a directory that is named after the contenthash and contains the repository delta information.

./deltarepo --apply repo1 delta

  • Applies the delta to the repository.

Options:

$ ./deltarepo.py --help
usage: deltarepo.py [options] <first_repo> <second_repo>
       deltarepo.py --apply <repo> <delta_repo>

Gen/Apply delta on yum repository.

positional arguments:
  path1                 First repository
  path2                 Second repository or delta repository

optional arguments:
  -h, --help            show this help message and exit
  --version             Show version number and quit.
  -q, --quiet           Run in quiet mode.
  -v, --verbose         Run in verbose mode.
  -o DIR, --outputdir DIR
                        Output directory.
  -d, --database        Force database generation
  --ignore-missing      Ignore missing metadata files. (The files that are
                        listed in repomd.xml but physically doesn't exists)

Delta generation:
  -t HASHTYPE, --id-type HASHTYPE
                        Hash function for the ids (contenthash).
                        Default is sha256.

Delta application:
  -a, --apply           Enable delta application mode.