Commit Graph

46 Commits

Author SHA1 Message Date
Ryan Gonzalez 19a705f80d Split reflists to share their contents across snapshots
In current aptly, each repository and snapshot has its own reflist in
the database. This brings a few problems with it:

- Given a sufficiently large repositories and snapshots, these lists can
  get enormous, reaching >1MB. This is a problem for LevelDB's overall
  performance, as it tends to prefer values around the confiruged block
  size (defaults to just 4KiB).
- When you take these large repositories and snapshot them, you have a
  full, new copy of the reflist, even if only a few packages changed.
  This means that having a lot of snapshots with a few changes causes
  the database to basically be full of largely duplicate reflists.
- All the duplication also means that many of the same refs are being
  loaded repeatedly, which can cause some slowdown but, more notably,
  eats up huge amounts of memory.
- Adding on more and more new repositories and snapshots will cause the
  time and memory spent on things like cleanup and publishing to grow
  roughly linearly.

At the core, there are two problems here:

- Reflists get very big because there are just a lot of packages.
- Different reflists can tend to duplicate much of the same contents.

*Split reflists* aim at solving this by separating reflists into 64
*buckets*. Package refs are sorted into individual buckets according to
the following system:

- Take the first 3 letters of the package name, after dropping a `lib`
  prefix. (Using only the first 3 letters will cause packages with
  similar prefixes to end up in the same bucket, under the assumption
  that packages with similar names tend to be updated together.)
- Take the 64-bit xxhash of these letters. (xxhash was chosen because it
  relatively good distribution across the individual bits, which is
  important for the next step.)
- Use the first 6 bits of the hash (range [0:63]) as an index into the
  buckets.

Once refs are placed in buckets, a sha256 digest of all the refs in the
bucket is taken. These buckets are then stored in the database, split
into roughly block-sized segments, and all the repositories and
snapshots simply store an array of bucket digests.

This approach means that *repositories and snapshots can share their
reflist buckets*. If a snapshot is taken of a repository, it will have
the same contents, so its split reflist will point to the same buckets
as the base repository, and only one copy of each bucket is stored in
the database. When some packages in the repository change, only the
buckets containing those packages will be modified; all the other
buckets will remain unchanged, and thus their contents will still be
shared. Later on, when these reflists are loaded, each bucket is only
loaded once, short-cutting loaded many megabytes of data. In effect,
split reflists are essentially copy-on-write, with only the changed
buckets stored individually.

Changing the disk format means that a migration needs to take place, so
that task is moved into the database cleanup step, which will migrate
reflists over to split reflists, as well as delete any unused reflist
buckets.

All the reflist tests are also changed to additionally test out split
reflists; although the internal logic is all shared (since buckets are,
themselves, just normal reflists), some special additions are needed to
have native versions of the various reflist helper methods.

In our tests, we've observed the following improvements:

- Memory usage during publish and database cleanup, with
  `GOMEMLIMIT=2GiB`, goes down from ~3.2GiB (larger than the memory
  limit!) to ~0.7GiB, a decrease of ~4.5x.
- Database size decreases from 1.3GB to 367MB.

*In my local tests*, publish times had also decreased down to mere
seconds but the same effect wasn't observed on the server, with the
times staying around the same. My suspicions are that this is due to I/O
performance: my local system is an M1 MBP, which almost certainly has
much faster disk speeds than our DigitalOcean block volumes. Split
reflists include a side effect of requiring more random accesses from
reading all the buckets by their keys, so if your random I/O
performance is slower, it might cancel out the benefits. That being
said, even in that case, the memory usage and database size advantages
still persist.

Signed-off-by: Ryan Gonzalez <ryan.gonzalez@collabora.com>
2025-02-15 23:49:21 +01:00
Gordian Schoenherr 8c3fe8dabb Fix failing system test
The fix of the -with-filter flag causes the following previously
missing source files to be downloaded, so I updated the test file.

```
rkward_0.7.5-1~bullseyecran.0.debian.tar.xz
rkward_0.7.5-1~bullseyecran.0.dsc
rkward_0.7.5.orig.tar.gz
rpy2_3.5.12-1~bullseyecran.0.debian.tar.xz
rpy2_3.5.12-1~bullseyecran.0.dsc
rpy2_3.5.12.orig.tar.gz
```
2024-12-10 11:52:55 +09:00
Gordian Schoenherr 0c76677b16 Fix -with-sources not downloading differently named sources
Such as e.g. downloading 'glibc' when the sources for 'libc6'
are requested.
2024-12-09 13:17:41 +09:00
Gordian Schoenherr 3b785e4165 Refactor Filter options into a struct
It was already a lot of options for one method and I am going to add
another one in the next commit.
2024-12-09 13:17:41 +09:00
André Roth e2cbd637b8 use new azure-sdk 2024-11-17 17:43:20 +01:00
André Roth 0936922172 only allow mirrors with architectures set 2024-11-08 17:07:37 +01:00
André Roth 62a0a1a560 log error 2024-11-08 17:07:37 +01:00
André Roth e642847a82 log filtering error 2024-11-08 17:07:37 +01:00
André Roth eafec74c29 allow to exclude provided packages from list.Search 2024-11-04 17:02:54 +01:00
André Roth 75ca51b23b improve error message 2024-10-10 12:03:13 +02:00
5hir0kur0 ab18d4835b Support version relation in Provides entries 2024-08-11 12:35:46 +02:00
5hir0kur0 02bdb7c76a Deduplicate missing dependency list 2024-07-11 18:25:49 +02:00
5hir0kur0 8d537b4e3e Fix bug in dependency resolution 2024-07-11 18:25:49 +02:00
Oliver Sauder f09a273ad7 Add publish output progress counting remaining number of packages 2022-01-27 09:30:14 +01:00
Joshua Colson 0f1575d5af Add -json flag to snapshot show|list
Signed-off-by: Joshua Colson <joshua.colson@gmail.com>
2021-09-24 10:29:33 +02:00
Andrey Smirnov b8c5303fdb Fix paths after repository transfer to aptly-dev 2018-04-18 21:19:43 +03:00
Andrey Smirnov e5198178a5 Fix incomplete dependencies with follow-all-variants
When `-dep-follow-all-variants` option is enabled, dependency resolving
process shouldn't stop even if dependency is already satisfied - there
mgiht be other ways to satisfy dependency.

Also fix issue with parsing multiarch specs like
`python:any`.
2017-09-26 00:09:15 +03:00
Andrey Smirnov 6d2f265980 Prefer exact match on package name over provides match
When searching for packages which might satisfy given dependency,
aptly was first returning packages which `Provides` mentioned
name. By default aptly is picking up only first match (unless
follow all variants options is enabled), so `Provides:` takes
precedence over exact package name match.

Invert this logic by searching first for package name match.
2017-09-25 18:24:45 +03:00
Andrey Smirnov 9ca81ff3bc Fix lint warning & add Go 1.9 to the mix 2017-09-15 22:54:39 +03:00
Andrey Smirnov 470165a419 Enable goconst & interfacer linters 2017-05-17 00:53:10 +03:00
Andrey Smirnov 51213899b7 More Go linters enabled, issues fixed
Ref: #528

Enables "staticcheck", "varcheck", "structcheck", "aligncheck"
2017-05-03 18:23:14 +03:00
Andrey Smirnov 85b4a8b1ae Add new option for detailed logging on dependency resolving
This adds command-line arg and config option, with option enabled
aptly is more verbose on internal depeendency resolving cycles:

```
Missing dependencies: file-rc (>= 0.8.16) [amd64], python:any (>= 2.7.1-0ubuntu2) [amd64], python3:any (>= 3.3.2-2~) [amd64], file-rc [amd64], perl (<< 5.17) [amd64], iptables-router (>= 1.2.3) [amd64], systemd [amd64], sgml-base (>= 1.26+nmu2) [amd64], sed (>= 4.1.2-8) [amd64]
Unsatisfied dependency: file-rc (>= 0.8.16) [amd64]
Unsatisfied dependency: python:any (>= 2.7.1-0ubuntu2) [amd64]
Unsatisfied dependency: python3:any (>= 3.3.2-2~) [amd64]
Unsatisfied dependency: file-rc [amd64]
Unsatisfied dependency: perl (<< 5.17) [amd64]
Unsatisfied dependency: iptables-router (>= 1.2.3) [amd64]
Unsatisfied dependency: systemd [amd64]
Injecting package: sgml-base_1.26+nmu4ubuntu1_all
Injecting package: sed_4.2.2-4ubuntu1_amd64
```
2017-03-28 22:58:07 +03:00
Andrey Smirnov 11d828b3b1 Add govet/golint into Travis CI build
Fix current issues
2017-03-22 21:49:16 +03:00
Andrey Smirnov 587086beb4 Misc style and simple mistakes fixes. 2016-03-28 13:34:05 +03:00
Andrey Smirnov ff52d2655a Fix package search missing duplicate packages. #225
Implement package list with duplicate entries, use it when
initiating search from PackageCollection.
2016-03-22 12:23:13 +03:00
Andrey Smirnov 403c7272cd When loading package index for the mirror, ignore duplicate packages (and print about them). #183 2015-01-31 21:27:26 +03:00
Andrey Smirnov 7beb90d4fc Strings() for PackageList: turning list into sequence of package Ids. #116 2014-11-14 00:19:58 +03:00
Andrey Smirnov 608870265c New common interface for PackgeCollection & PackageList: PackageCatalog. #80 2014-08-28 19:41:30 +04:00
Andrey Smirnov 5a42c60af4 Simplify and make more deterministic algorithm for dependency pulling. #100 2014-08-28 19:05:32 +04:00
Andrey Smirnov 9bee7cdd08 Simplify dependency verification code. #81 2014-08-27 00:03:46 +04:00
Andrey Smirnov 2a7a2de84a Allow empty source in PackageList.Filter. #64 2014-07-16 02:23:55 +04:00
Andrey Smirnov fb660efeb5 Make list sort really stable: if all properties match, use architecture
as sort key.
2014-07-14 18:51:07 +04:00
Andrey Smirnov 34f545b8cf Package index may not be indexed when schanning. 2014-07-12 18:02:13 +04:00
Andrey Smirnov e08d44ff0a Fix bug with matching in Search method. 2014-07-12 13:58:22 +04:00
Andrey Smirnov c485cf41f7 PkgQueries, concept of 'Searchable', rewrite Filter using PackageQueries. 2014-07-12 00:15:33 +04:00
Andrey Smirnov eef49516ef Fix bugs after style fixes. 2014-07-10 23:31:53 +04:00
Andrey Smirnov 9af10bc422 Refactoring: simplification. #70 2014-07-10 00:43:42 +04:00
Andrey Smirnov bdbb5acb11 Remove 'allMatches' on version equal. #70 2014-07-10 00:20:28 +04:00
Simon Aquino e1348ab88f Merge smira/master into pull_multiple_packages
Resolved conflicts arisen following smira's new commits into master.
2014-06-28 01:34:28 +01:00
Andrey Smirnov 44ce4c8a77 Insert into right position when adding as well. #67 2014-06-28 00:26:56 +04:00
Simon Aquino 3cf281965b Implementation of all-matches functionality + tests
When performing an *aptly snapshot pull*, users might list dependency
versions that can potentially match multiple packages in the source
snapshot. However, the current implementation of the 'snapshot pull'
command only allows one package to be pulled from a snapshot at a time
for a given dependency.

The newly implemented all-matches flag allows users to pull all the
matching packages from a source snapshot, provided that they satisfy the
version requirements indicated by the dependencies.

The all-matches flag defaults to false and only produces the described
behaviour when it is explicitly set to true.
2014-06-27 03:36:03 +01:00
Simon Aquino e19a615641 In PackageList, sort the package version from latest to oldest
This enables us to return the latest version of a package when no version is specified rather than the oldest.
Therefore, we don't need to modify the search algorithm to return the latest version of a package.
2014-06-13 12:06:33 +01:00
Simon Aquino ff77fbf5d9 Sort PackagesList by name and version 2014-06-13 11:42:20 +01:00
Andrey Smirnov a1e360b07b Conflict detection for packages in one list. #60 2014-05-29 18:01:07 +04:00
Andrey Smirnov ff045f9a48 Fixups after renaming debian -> deb. #21 2014-04-07 21:22:58 +04:00
Andrey Smirnov fd662c9275 Rename debian -> deb. #21 2014-04-07 21:15:13 +04:00