violation of rhn_erratafile_eid_file_uq (bsc#862043) - uyuni-project/uyuni GitHub Wiki

Intro

Bug reference: https://bugzilla.suse.com/show_bug.cgi?id=862043

Description: user receives an internal server error when trying to clone sles for vmware channels.

from catalina.out:

2014-02-03 09:39:29,841 [TP-Processor1] WARN  org.hibernate.util.JDBCExceptionReporter - SQL Error: 0, SQLState: 23505
2014-02-03 09:39:29,841 [TP-Processor1] ERROR org.hibernate.util.JDBCExceptionReporter - ERROR: duplicate key value violates unique constraint "rhn_erratafile_eid_file_uq"
  Detail: Key (errata_id, filename)=(1595, SuSEfirewall2-3.6_SVNr208.2.5.1.noarch) already exists.
2014-02-03 09:39:29,841 [TP-Processor1] ERROR org.hibernate.event.def.AbstractFlushingEventListener - Could not synchronize database state with session

Hunt 1

TL;DR: spacecmd softwarechannel_adderratabydate -p can publish packages from unrelated channels, by design

The bug emerges because the following constraint is violated:

    ALTER TABLE rhnErrataFile
    ADD CONSTRAINT rhn_erratafile_eid_file_uq UNIQUE 
    (
      errata_id, filename 
    )

As with all constraints, it could fail:

  • on UPDATE: I cannot find any reference to those fields ever being UPDATEd across all our code bases;
  • on INSERT.

INSERTs by code bases:

  • Python code never INSERTs into that table (except for ISS, but that is not relevant in this case);
  • Java code never INSERTs via SQL, so the only possibility is Hibernate.

Fact that error is produced through Hibernate also explains the stack traces (error pops up at Hibernate flush() time, which might be far from the point where the problem is originated).

Hibernate mapping for this table is in PublishedErrataFile.hbm.xml and mapped class is PublishedErrataFile, thus INSERT INTO rhnErrataFile queries result from PublishedErrataFile() constructor calls.

Fortunately that constructor is called only from ErrataFactory.createPublishedErrataFile(), which is called only by ErrataFactory.publishErrataPackagesToChannel(errata, channel, user, packages), which is the failing method.

Turns out, the method was written assuming no two packages will ever have the same NEVRA (according to original author), which unfortunately is a wrong assumption. In fact:

  • among different channels, this is entirely okay. For example, a lot of CentOS 6 packages are identical to their RHEL6 counterparts. In general, NEVRAs have a meaning per-distro, and we could have several;
  • a slightly more tricky case is noarch packages from the same product for different architectures. For example ant-1.7.0-200.26.1.noarch exists both in SLES11-SP2-Updates for x86_64 and SLES11-SP2-Updates for s390x because of the way OBS works, it will create noarch packages for all architectures (ironically!) with different checksums;
  • users can simply mess up with packages and erratas. I verified that adding multiple packages with same NEVRA is possible in various ways through the UI, API, spacecmd and other utility scripts. Nothing currently prevents that.

More specifically, publishErrataPackagesToChannel() will fail if:

  • more than one of the packages passed as parameter have the same NEVRA, or
  • a different package with same NEVRA was already published. That is, if publishErrataPackagesToChannel itself was called previously!

(I added test cases in a throwaway branch to prove the above)

If we get to publishErrataPackagesToChannel in those circumstances, there is little we can do - either we skip the duplicate package or we replace the previous one. Both actions could result in a potentially incorrect errata being published - in the worst case, packages from very different repos could have very different contents and same NEVRA. We should simply avoid this by fixing code elsewhere.

Our users (at least some of them) see the constraint violation despite using only clones of official channels and not messing with them manually, so some bug evidently added unwanted NEVRA-duplicated packages that get picked up by publishErrataPackagesToChannel() callers and passed as parameter.

So who added those bad NEVRA-duplicated packages to the database? Well, the only possibility is publishErrataPackagesToChannel() itself! (when I realized this, it totally blew my mind - a recursive bug!)

A couple of hops up the stack trace, publishErrataPackagesToChannel() is used for:

  • publishing custom made erratas/packages via the WebUI or API. We could add warnings or errors, but we would not solve current problem - we are not fixing incorrect manual usage anyway;
  • implementing four Errata API methods:
  • publish()
  • publishAsOriginal()
  • clone()
  • cloneAsOriginal()

All four methods call publishToChannel() which is used to publish an errata to a channel. This means that, among other things, it will computes the package list that is relevant to the errata and add them all to the channel so that the errata can actually be satisfied from that channel. That package list is passed to publishErrataPackagesToChannel(), and if it contains incorrect packages they will end up in the database corrupting it.

Turns out that:

  • publishAsOriginal() and cloneAsOriginal() feed publishErrataPackagesToChannel() with results of ErrataManager.listErrataChannelPacks();
  • publish() and clone() feed publishErrataPackagesToChannel() with results of ErrataManager.lookupPacksFromErrataForChannel().

Good news: *AsOriginal() methods are safe and will return more than one row per NEVRA only if the channel already contains such packages. So they will just copy corruption, not add new corruption.

Bad news: vanilla methods are unsafe, they can and actually will return duplicate NEVRA packages if they exist, well ...in any channel of your installation.

Indeed, safe *AsOriginal() methods were added later because of related issues, see: Bugzilla entry for cloneAsOriginal(), Bugzilla entry for publishAsOriginal(). Those methods are safe in the sense that they only copy packages from original to cloned channel - that implies they only work on cloned channels.

WARNING: Since *AsOriginal are per clones only, this is also explains why they only copy the corruption. But this also mean that flipping this method as "safe" will limit the target software working only with the cloned channels.

Anyway, the underlying unsafe query is Errata_queries.xml's find_packages_for_errata_and_channel (reformatted for readability):

    SELECT DISTINCT
      P.id,
      PN.name package_name,
      PA.label as package_arch,
      EVR.version || '-' || EVR.release || (CASE WHEN EVR.epoch IS NULL THEN '' ELSE ':' || EVR.epoch END) package_nvre
      
      FROM
        rhnErrata E
        INNER JOIN rhnErrataPackage EP ON E.id = EP.errata_id
        INNER JOIN rhnPackage P ON P.id = EP.package_id
        INNER JOIN rhnPackageArch PA ON PA.id = P.package_arch_id
        INNER JOIN rhnPackageEvr EVR ON Evr.id = P.evr_id
        INNER JOIN rhnPackage P2 ON P2.name_id = P.name_id
        INNER JOIN rhnChannelPackage CP ON CP.package_id = P2.id
        INNER JOIN rhnPackageName PN ON  PN.id = P.name_id
        INNER JOIN rhnChannel CN ON CP.channel_id = CN.id

      WHERE
        E.id = :eid AND
        CP.channel_id = :custom_cid AND
        P2.Package_arch_id = P.package_arch_id AND
        CN.org_id = :org_id AND
        (E.org_id is NULL or E.org_id = :org_id)

Which basically means:

  • take an errata E by id;
  • in the left hand hand, take all packages P that belong to E, no matter what channel they come from;
  • in the right hand, take all packages P2 that are in E's destination channel (by :custom_cid, this is the ID of the channel where E has to be published);
  • start juggling, returning packages that are in your left hand but have same name and arch (not necessarily version!) from your right hand. That is, all packages from all channels that have same name and arch to packages already in the destination channel, possibly with a higher version.

If you followed me to this point, you shoud be frozen in horror - this is a very risky way to publish errata.

It is still an open question why do we need this at all. Upstream Spacewalk developers also aren't convinced.

Still, it can be acceptable to manually publish single erratas, users can always fix bad packages by hand. It is definitely not acceptable, though, as part of a cloning process. Cloned patches should definitely take packages only from cloned channel, as *AsOriginal() methods do. Even better, only *AsOriginal() methods shoudl be used.

Turns out that:

  • cloneAsOriginal() can be used:
    • directly via API (safe)
    • via spacecmd api (safe)
    • by spacewalk-clone-by-date (safe)
    • by spacecmd softwarechannel_adderrata (safe)
    • by spacecmd softwarechannel_adderratabydate without the "-p" flag (safe)
  • publishAsOriginal() can be used:
    • directly via API (safe)
    • via spacecmd api (safe)
  • clone() can be used:
    • directly via API (possibly unsafe)
    • via spacecmd api (possibly unsafe)
    • via scripts/clone_errata/rhn_clone_errata.py. This is an old utility script to get errata from RHN into Spacewalk. We apparently don't package or support it, yet I am not sure if some customers use it (unsafe);
    • by spacecmd softwarechannel_adderrata, only for very old versions of the codebase (XMLRPC API version < 10.11). Can't happen on SUMa 1.7, not sure on 1.2, could definitely have happened in old RHN installations converted to SUMa (unsafe);
  • publish() can be used:
    • directly via API (possibly unsafe)
    • via spacecmd api (possibly unsafe)
    • via scripts/clone_errata/rhn_clone_errata.py (unsafe);
    • by spacecmd errata_publish, as we cannot really use the AsOriginal() version here, since we are not cloning (possibly unsafe);
    • by spacecmd softwarechannel_adderratabydate with the -p flag (unsafe).

The very last line is the causes problems.

Prey 1

spacecmd {SSM:0}> softwarechannel_adderratabydate -h
softwarechannel_adderratabydate: Add errata from one channel into another channel based on a date range
usage: softwarechannel_adderratabydate [options] SOURCE DEST BEGINDATE ENDDATE
Date format : YYYYMMDD
Options:
        -p/--publish : Publish errata to the channel (don't clone)

This command publishes an existing errata into a (possibly unrelated) channel. To do that, it pulls packages from whatever channel into the target channel - among other things this can insert duplicate NEVRA packages.

This opens two cans of worms:

  1. should we limit its usage to cloned channels, so that packages can only be taken from original channel, in other words, use publishAsOriginal() instead of publish()?
  2. should we allow multiple NEVRA packages in a same channel?

Possible solutions were proposed, but they are superseded by findings in Hunt 2.

Hunt 2

TL;DR: constraint is violated, but the only use of the field is to display in WebUI. No need to worry too much.

Turns out at least one customer never used unsafe methods and never added packages/patches manually, yet the constraint is violated.

Reason is:

  • Errata are solutions to problems, thus they are naturally arch-agnostic by definition;
  • one or more Packages can be associated with an Errata;
  • typically, an Errata has Packages come from different Channels with different archs. This is again by definition, and totally normal;
  • Errata can have packages with same NEVRA, because nothing forbids that;
  • Some Errata inevitably have Packages with same NEVRA, which are updates to noarch packages that are basically the same on every arch, with different MD5. An example is suse-manager-registration-tools-3970;
  • Published Errata Files are named after Package NEVRAs;
  • constraint says that, given an Errata, no two files can have the same name;
  • that implies, no two Packages with same NEVRA can be published in a same Errata;
  • spacewalk-clone-by-date publishes patches automatically after cloning, so:
  • spacewalk-clone-by-date fails.

So, if you reposync two channels with different arch and same patches (eg. SLES11 SP1 Updates for i586 and SLES11 SP1 Updates for x86_64) and after that you try spacewalk-clone-by-date, it will fail - even if no channel has duplicate NEVRA packages.

More interestingly, we never figured this out because it did not happen on our test host. This is because spacewalk-clone-by-date will reuse an old and possibly outdated cloned errata if it finds one, and in our case we were lucky enough so that all erratas were very outdated clones with only one architecture. One architecture means no packages with conflicting NEVRA (see above). Does this sound confusing?

Another interesting fact is that the Java part inserts into rhnErrataFile.filename only the package NEVRA, see ErrataFactory.publishErrataPackagesToChannel():

    ErrataFile publishedFile = ErrataFactory
        .createPublishedErrataFile(
            ErrataFactory.lookupErrataFileType("RPM"), // rhnErrataFile.type
            pack.getChecksum().getChecksum(), // rhnErrataFile.checksum_id
            pack.getNameEvra() // rhnErrataFile.filename
        );

while the Perl part inserts the whole path, as read from rhnPackage, see Errata.refresh_erratafiles. This inconsistency (except for the constraint violation) is not problematic, because the only use for this column is for WebUI display to the user!

NOTE: Perl modules are updating (refreshing) the errata by previously removing them all from the channel.

Prey 2

Possible solution: use the same path conventions for the Java and Perl parts. Since using NEVRA names causes conflicts, we should adapt the Java part to use the Perl convention of using full paths copied from rhnPackage.path. This is a good convention because:

  • those paths are unique for each package, despite not having a database constraint on them. That's because:
    • the only point in the whole Java-Python-Perl codebase where those are inserted is during reposync;
    • path contains the package file hash;
    • packages with same hash are deduplicated by reposync itself;
  • those paths are never NULL, except in case of corruption. That's because:
    • reposync always INSERTS INTO rhnPackage with a complete path;
    • the only possible update to rhnPackage.path in the whole Java-Python-Perl codebase is done by spacewalk-data-fsck. Specifically, that command will:
      • UPDATE rhnPackage.path to NULL iff the corresponding file was not found in the filesystem;
      • UPDATE rhnPackage.path to a full path iff the corresponding file was found in the filesystem;
  • those paths do not change unless spacewalk-data-fsck is involved (see above).

Thus we can just copy paths from rhnPackage.path and emit an error if any is NULL (if it is not in the filesystem, it should not be published anyway and user should use spacewalk-data-fsck first). Because of per-package uniqueness and because a same errata should definitely not contain the exact same package twice, conflicts should never happen.

Note that currently the Perl part handles the NULL case in a graceful but conflict-prone way by using the package NEVRA instead of the full path. This is conflict-prone and should be removed.

Proposed battleplan

Right now

  • fix ErrataFactory.publishErrataPackagesToChannel() so that it uses the full path in rhnErrataFile.filename. This should resolve all conflicts;
  • remove limitation in Perl module where path is cut down to just 128 symbols. This is outdated because current database limit (and OS limit) is 4K, and could in some remote cases lead to conflicts;
  • make both code paths fail if the corresponding rhnPackage.path is NULL;
  • Test on reproduced conditions on the test host, release to PTF, upstream;

Possibly later (2.1)

  • limit spacecmd softwarechannel_adderrata and softwarechannel_adderratabydate to cloned channels, that is using *AsOriginal() XMLRPC API. This limits their possible use cases, but makes them safer;
  • prepare a clenaup scripts for duplicate packages published by spacecmd softwarechannel_adderratabydate -p. Possibly run it as a schema upgrade, possibly for 2.1;
  • evaluate adding warnings/errors to avoid users add multiple NEVRA packages in same channel through all possible code paths. Prepare a semi-automatic duplicate nuke tool to cleanup old databases;
  • evaluate reshaping the workflow for spacewalk-clone-by-date, especially the cloned errata reuse algorithm, so that expected files do always get published.

Opened Wormcans

Multiple same-NEVRA packages in same channel

Customers can add multiple packages with the same NEVRA via WebUI, API, commandline tools. We might want to limit this, and if we do, we might want to prepare a semi-automatic dupe-nuking tool. Another possibility would be to just add warnings.

Improper uses of spacecmd softwarechannel_adderrata

spacecmd softwarechannel_adderrata and softwarechannel_adderratabydate can be used to publish Fedora erratas in SLES or vice versa. We might want to limit this or emit warnings.

Strange errata clone semantics

Currently, cloning a channel publishes files (or not) depending on what channels were cloned before that (because cloned errata do not get re-cloned, but are re-used as-is). Although this is by design, it is arguably sloppy design, and quite user-unfriendly.