420 Commits

Author SHA1 Message Date
André Roth
2a07494910 fix unit tests 2025-02-15 23:49:21 +01:00
André Roth
174bdc2b5e fix golangci-lint errors 2025-02-15 23:49:21 +01:00
Ryan Gonzalez
19a705f80d Split reflists to share their contents across snapshots
In current aptly, each repository and snapshot has its own reflist in
the database. This brings a few problems with it:

- Given sufficiently large repositories and snapshots, these lists can
  get enormous, reaching >1MB. This is a problem for LevelDB's overall
  performance, as it tends to prefer values around the configured block
  size (defaults to just 4KiB).
- When you take these large repositories and snapshot them, you have a
  full, new copy of the reflist, even if only a few packages changed.
  This means that having a lot of snapshots with a few changes causes
  the database to basically be full of largely duplicate reflists.
- All the duplication also means that many of the same refs are being
  loaded repeatedly, which can cause some slowdown but, more notably,
  eats up huge amounts of memory.
- Adding on more and more new repositories and snapshots will cause the
  time and memory spent on things like cleanup and publishing to grow
  roughly linearly.

At the core, there are two problems here:

- Reflists get very big because there are just a lot of packages.
- Different reflists tend to duplicate much of the same contents.

*Split reflists* aim at solving this by separating reflists into 64
*buckets*. Package refs are sorted into individual buckets according to
the following system:

- Take the first 3 letters of the package name, after dropping a `lib`
  prefix. (Using only the first 3 letters will cause packages with
  similar prefixes to end up in the same bucket, under the assumption
  that packages with similar names tend to be updated together.)
- Take the 64-bit xxhash of these letters. (xxhash was chosen because it
  has a relatively good distribution across the individual bits, which is
  important for the next step.)
- Use the first 6 bits of the hash (range [0:63]) as an index into the
  buckets.
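
The three steps above can be sketched roughly as follows. This is only an
illustration, not the code from this commit: FNV-1a from the Go standard
library stands in for xxhash so the example has no external dependencies,
the choice of the lowest 6 bits as the "first" bits is an assumption, and
the names `bucketPrefix` and `bucketIndex` are hypothetical.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"strings"
)

// bucketCount is the number of buckets named in the description.
const bucketCount = 64

// bucketPrefix returns up to the first 3 bytes of the package name,
// after dropping a leading "lib".
func bucketPrefix(name string) string {
	name = strings.TrimPrefix(name, "lib")
	if len(name) > 3 {
		name = name[:3]
	}
	return name
}

// bucketIndex hashes the prefix and keeps 6 bits of the 64-bit result.
// Assumptions: FNV-1a replaces xxhash, and the low 6 bits are used.
func bucketIndex(name string) int {
	h := fnv.New64a()
	h.Write([]byte(bucketPrefix(name)))
	return int(h.Sum64() & (bucketCount - 1))
}

func main() {
	for _, name := range []string{"libssl3", "ssl-cert", "openssl", "libc6"} {
		fmt.Printf("%-10s prefix=%q bucket=%d\n", name, bucketPrefix(name), bucketIndex(name))
	}
}
```

Here `libssl3` and `ssl-cert` share the prefix `ssl` and therefore always
land in the same bucket, which is the grouping effect described above.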

Once refs are placed in buckets, a sha256 digest of all the refs in the
bucket is taken. These buckets are then stored in the database, split
into roughly block-sized segments, and all the repositories and
snapshots simply store an array of bucket digests.
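
A minimal sketch of the digest and segment steps, under stated
assumptions: the zero-byte separator fed into the hash, the greedy
segment splitting, the ref strings, and the names `bucketDigest` and
`splitSegments` are all made up for illustration and may differ from the
actual on-disk format.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// bucketDigest hashes all refs of one bucket in order.
// Assumption: each ref is followed by a zero byte as a separator.
func bucketDigest(refs []string) string {
	h := sha256.New()
	for _, ref := range refs {
		h.Write([]byte(ref))
		h.Write([]byte{0})
	}
	return hex.EncodeToString(h.Sum(nil))
}

// splitSegments groups refs greedily so that each segment holds roughly
// blockSize bytes of ref data. A single oversized ref gets its own segment.
func splitSegments(refs []string, blockSize int) [][]string {
	var segments [][]string
	var current []string
	size := 0
	for _, ref := range refs {
		if size > 0 && size+len(ref) > blockSize {
			segments = append(segments, current)
			current, size = nil, 0
		}
		current = append(current, ref)
		size += len(ref)
	}
	if len(current) > 0 {
		segments = append(segments, current)
	}
	return segments
}

func main() {
	// Hypothetical ref strings; the real key format is not shown here.
	refs := []string{"Pamd64 bash 5.2-1 abc", "Pamd64 bash-doc 5.2-1 def", "Pamd64 bat 0.24-1 123"}
	fmt.Println("digest:", bucketDigest(refs))
	fmt.Println("segments:", len(splitSegments(refs, 48)))
}
```

A repository or snapshot would then keep only the list of such digests,
one per bucket, instead of the refs themselves.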

This approach means that *repositories and snapshots can share their
reflist buckets*. If a snapshot is taken of a repository, it will have
the same contents, so its split reflist will point to the same buckets
as the base repository, and only one copy of each bucket is stored in
the database. When some packages in the repository change, only the
buckets containing those packages will be modified; all the other
buckets will remain unchanged, and thus their contents will still be
shared. Later on, when these reflists are loaded, each bucket is only
loaded once, which avoids loading many megabytes of data. In effect,
split reflists are essentially copy-on-write, with only the changed
buckets stored individually.
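
The sharing behavior can be illustrated with a small content-addressed
store. This is a sketch only: the `store` type, its methods, and the use
of an in-memory map are hypothetical, and three buckets are used instead
of 64 to keep it short.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// store maps a bucket digest to the bucket's refs, so identical buckets
// are kept only once no matter how many reflists point at them.
type store map[string][]string

// put saves one bucket under its digest and returns that digest.
func (s store) put(refs []string) string {
	sum := sha256.Sum256([]byte(strings.Join(refs, "\n")))
	key := hex.EncodeToString(sum[:])
	s[key] = refs
	return key
}

// save stores every bucket and returns the digest list that a repository
// or snapshot would keep in place of a full reflist.
func (s store) save(buckets [][]string) []string {
	digests := make([]string, len(buckets))
	for i, b := range buckets {
		digests[i] = s.put(b)
	}
	return digests
}

func main() {
	db := store{}
	repo := [][]string{{"bash 5.2-1", "bat 0.24-1"}, {"curl 8.5-1"}, {"zsh 5.9-4"}}
	before := db.save(repo)
	snapshot := db.save(repo) // same contents, so the same digests

	repo[1] = []string{"curl 8.6-1"} // one package changes
	after := db.save(repo)

	changed := 0
	for i := range after {
		if after[i] != before[i] {
			changed++
		}
	}
	fmt.Println("snapshot shares buckets:", strings.Join(snapshot, "") == strings.Join(before, ""))
	fmt.Println("buckets changed:", changed, "buckets stored:", len(db))
}
```

Taking the snapshot adds nothing to the store, and updating one package
adds exactly one new bucket while the old one stays available to the
snapshot.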

Changing the disk format means that a migration needs to take place, so
that task is moved into the database cleanup step, which will migrate
reflists over to split reflists, as well as delete any unused reflist
buckets.
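
Deleting unused buckets amounts to a mark-and-sweep pass over the stored
digests. The sketch below assumes an in-memory map in place of the
database, and `sweepUnused` is a hypothetical name; the migration of old
reflists is not shown.

```go
package main

import "fmt"

// sweepUnused deletes every stored bucket whose digest is not referenced
// by any reflist and returns how many buckets were removed.
func sweepUnused(stored map[string][]string, reflists [][]string) int {
	// Mark: collect every digest still referenced.
	used := make(map[string]bool)
	for _, digests := range reflists {
		for _, d := range digests {
			used[d] = true
		}
	}
	// Sweep: drop everything that was not marked.
	removed := 0
	for d := range stored {
		if !used[d] {
			delete(stored, d)
			removed++
		}
	}
	return removed
}

func main() {
	stored := map[string][]string{
		"d1": {"bash 5.2-1"},
		"d2": {"curl 8.5-1"},
		"d3": {"curl 8.6-1"},
	}
	reflists := [][]string{{"d1", "d3"}, {"d1"}}
	fmt.Println("removed:", sweepUnused(stored, reflists), "remaining:", len(stored))
}
```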

All the reflist tests are also changed to additionally test out split
reflists; although the internal logic is all shared (since buckets are,
themselves, just normal reflists), some special additions are needed to
have native versions of the various reflist helper methods.

In our tests, we've observed the following improvements:

- Memory usage during publish and database cleanup, with
  `GOMEMLIMIT=2GiB`, goes down from ~3.2GiB (larger than the memory
  limit!) to ~0.7GiB, a decrease of ~4.5x.
- Database size decreases from 1.3GB to 367MB.

*In my local tests*, publish times also dropped to mere seconds, but
the same effect wasn't observed on the server, where the times stayed
about the same. My suspicion is that this is due to I/O
performance: my local system is an M1 MBP, which almost certainly has
much faster disk speeds than our DigitalOcean block volumes. Split
reflists have the side effect of requiring more random accesses from
reading all the buckets by their keys, so if your random I/O
performance is slower, it might cancel out the benefits. That being
said, even in that case, the memory usage and database size advantages
still persist.

Signed-off-by: Ryan Gonzalez <ryan.gonzalez@collabora.com>
2025-02-15 23:49:21 +01:00
André Roth
666b5c9700 Merge pull request #1422 from aptly-dev/fix/empty-mirror-snapshot
Allow snapshotting empty mirrors
2025-01-13 12:36:01 +01:00
Mikel Olasagasti Uranga
7074fc8856 Switch to google/uuid module
The currently used github.com/pborman/uuid hasn't seen any updates in years.

Signed-off-by: Mikel Olasagasti Uranga <mikel@olasagasti.info>
2025-01-11 23:18:50 +01:00
André Roth
aa0830ff0c Revert "fix empty mirror check"
This reverts commit 09a44ba409.
2025-01-11 19:17:28 +01:00
Gordian Schoenherr
8c3fe8dabb Fix failing system test
The fix of the -with-filter flag causes the following previously
missing source files to be downloaded, so I updated the test file.

```
rkward_0.7.5-1~bullseyecran.0.debian.tar.xz
rkward_0.7.5-1~bullseyecran.0.dsc
rkward_0.7.5.orig.tar.gz
rpy2_3.5.12-1~bullseyecran.0.debian.tar.xz
rpy2_3.5.12-1~bullseyecran.0.dsc
rpy2_3.5.12.orig.tar.gz
```
2024-12-10 11:52:55 +09:00
Gordian Schoenherr
ef6815222c Add unit tests for filtering with source packages 2024-12-09 13:17:41 +09:00
Gordian Schoenherr
0c76677b16 Fix -with-sources not downloading differently named sources
For example, downloading 'glibc' when the sources for 'libc6'
are requested.
2024-12-09 13:17:41 +09:00
Gordian Schoenherr
3b785e4165 Refactor Filter options into a struct
It was already a lot of options for one method and I am going to add
another one in the next commit.
2024-12-09 13:17:41 +09:00
Christoph Fiehe
7d9f020ae8 Fix null pointer when dropping a multi dist published repo.
Signed-off-by: Christoph Fiehe <c.fiehe@eurodata.de>
2024-12-02 15:09:46 +01:00
André Roth
e2cbd637b8 use new azure-sdk 2024-11-17 17:43:20 +01:00
André Roth
9ca9569714 fix build and golangci-lint 2024-11-17 14:09:37 +01:00
Mauro Regli
1357d246d8 rename addon files to skel files 2024-11-17 14:09:37 +01:00
Mauro Regli
c75c2c7594 pass down addonpath from api and cmd context 2024-11-17 14:09:37 +01:00
Mauro Regli
17186b0c73 add GetAddonPaths to publish file 2024-11-17 14:09:37 +01:00
Mauro Regli
2aac7baf52 add AddonIndex to index_files
I had to remove "signable: false" (line 399), since that property
doesn't exist.
2024-11-17 14:09:37 +01:00
André Roth
0936922172 only allow mirrors with architectures set 2024-11-08 17:07:37 +01:00
André Roth
62a0a1a560 log error 2024-11-08 17:07:37 +01:00
André Roth
e642847a82 log filtering error 2024-11-08 17:07:37 +01:00
André Roth
26c14e218a fix lint 2024-11-08 17:07:37 +01:00
André Roth
26c775ccfd fix test
flat repos may have architecture which is needed for filtering dependencies
2024-11-08 17:07:37 +01:00
André Roth
d6284148f9 set Architectures from flat mirror
note: 'Architecture' is not official, but it is used by nvidia mirrors for the non-Debian arch 'x86_64'. Should this be supported?
2024-11-08 17:07:37 +01:00
André Roth
4c58266a87 do not set empty mirror architectures for flat mirrors 2024-11-08 17:07:37 +01:00
5hir0kur0
c8fca7953c package.go: Fix bug in providesDependency
Use package version if `Provides:` entry does not specify a version.
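
One way to read this rule, as a rough sketch: the function name
`providedVersion` and the parsing details are hypothetical and are not
taken from `package.go`.

```go
package main

import (
	"fmt"
	"strings"
)

// providedVersion splits one Provides entry into name and version.
// If the entry has no "(= version)" part, the package's own version is
// used instead.
func providedVersion(entry, pkgVersion string) (name, version string) {
	entry = strings.TrimSpace(entry)
	open := strings.Index(entry, "(")
	if open < 0 {
		return entry, pkgVersion
	}
	name = strings.TrimSpace(entry[:open])
	inner := strings.TrimSuffix(entry[open+1:], ")")
	version = strings.TrimSpace(strings.TrimPrefix(strings.TrimSpace(inner), "="))
	return name, version
}

func main() {
	for _, entry := range []string{"mail-transport-agent", "libfoo (= 1.2-3)"} {
		name, version := providedVersion(entry, "4.0-1")
		fmt.Printf("%s provides %s at version %s\n", entry, name, version)
	}
}
```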
2024-11-08 15:55:01 +01:00
André Roth
eafec74c29 allow to exclude provided packages from list.Search 2024-11-04 17:02:54 +01:00
Christoph Fiehe
f8f28e9554 Fixing tests and fix cleanup.
Signed-off-by: Christoph Fiehe <c.fiehe@eurodata.de>
2024-10-22 16:58:15 +02:00
Christoph Fiehe
ac5ecf946d Cleanup improved and redundant code removed.
Signed-off-by: Christoph Fiehe <c.fiehe@eurodata.de>
2024-10-22 16:58:15 +02:00
Christoph Fiehe
d87d8bac92 Fix test cases.
Signed-off-by: Christoph Fiehe <c.fiehe@eurodata.de>
2024-10-22 16:58:15 +02:00
Christoph Fiehe
14c29ff912 Fixing tests.
Signed-off-by: Christoph Fiehe <c.fiehe@eurodata.de>
2024-10-22 16:58:15 +02:00
Christoph Fiehe
bd64232eb6 Allow management of components
This commit allows adding, removing and updating components of published repositories without the need to recreate them.

Signed-off-by: Christoph Fiehe <c.fiehe@eurodata.de>
2024-10-22 16:58:15 +02:00
André Roth
75ca51b23b improve error message 2024-10-10 12:03:13 +02:00
André Roth
861260198a publish: persist multidist flag 2024-10-08 22:28:12 +02:00
Christoph Fiehe
4195ad90bc Allow to add a new component to a published repo
This commit modifies the behavior of the publish switch method so that new components can also be added to an already published repository. It is no longer necessary to drop and recreate the whole publish.

Signed-off-by: Christoph Fiehe <c.fiehe@eurodata.de>
2024-09-24 15:43:27 +02:00
5hir0kur0
d2332e6452 Log a warning for errors in MatchesDependency 2024-08-11 12:35:46 +02:00
André Roth
1428f54a02 make compatible with go 1.19 2024-08-11 12:35:46 +02:00
André Roth
feb87c0f19 Revert "Remove errors.Join usage for go1.19 compatibility"
This reverts commit 1339e35dd785fff114549e027d81cbe47a882e27.
2024-08-11 12:35:46 +02:00
5hir0kur0
934fa0598b Remove errors.Join usage for go1.19 compatibility 2024-08-11 12:35:46 +02:00
5hir0kur0
6d6761e234 Add unit tests for Provides entries with version 2024-08-11 12:35:46 +02:00
5hir0kur0
ab18d4835b Support version relation in Provides entries 2024-08-11 12:35:46 +02:00
André Roth
09a44ba409 fix empty mirror check 2024-07-24 21:19:47 +02:00
5hir0kur0
02bdb7c76a Deduplicate missing dependency list 2024-07-11 18:25:49 +02:00
5hir0kur0
8d537b4e3e Fix bug in dependency resolution 2024-07-11 18:25:49 +02:00
André Roth
3a286ae07f fix unit tests 2024-07-03 18:08:58 +02:00
André Roth
a93ccd4100 fix tests 2024-07-03 18:08:58 +02:00
André Roth
c1f7e5fe96 handle GpgDisableVerify and ignore-signatures consistently
and be less verbose
2024-07-03 18:08:58 +02:00
André Roth
d16110068c allow not signed mirrors without InRelease file 2024-07-03 18:08:58 +02:00
Noa Resare
b4cd86aa14 Introduce option multi-dist to the publish commands
This change makes it possible to publish multiple distributions
with packages named the same but with different content by changing
the structure of the generated pool hierarchy. The option is not
enabled by default, as this changes the structure of the output, which
could break the expectations of other tools.
2024-06-15 11:27:26 +02:00
Ryan Gonzalez
79975bf2b6 Fix reflist diffs failing to compact when one of the inputs ends
The previous reflist logic would early-exit the loop body if one of the
lists was empty, but that skips the compacting logic entirely.

Instead of doing the early-exit, we can leave a list's ref as nil when
the list end is reached and then flip the comparison result, which will
essentially treat it as being greater than all others. This should
preserve the general behavior without omitting the compaction.
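
The idea can be sketched as a lockstep walk over two sorted lists. This is
an illustration under assumptions, not the actual reflist code: the
`change` type, the `compact` helper, and the "name version" ref format are
hypothetical, and compaction here only merges a removed/added pair of the
same package into one entry.

```go
package main

import (
	"bytes"
	"fmt"
)

// change is one diff entry; a nil side means the ref is absent there.
type change struct{ left, right []byte }

// pkgName returns the text before the first space of a ref.
func pkgName(ref []byte) string {
	name, _, _ := bytes.Cut(ref, []byte(" "))
	return string(name)
}

// compact merges a trailing removed/added pair of the same package.
func compact(out []change) []change {
	n := len(out)
	if n < 2 {
		return out
	}
	a, b := out[n-2], out[n-1]
	if a.left == nil && b.right == nil {
		a, b = b, a // normalize so that a is the left-only entry
	}
	if a.left != nil && a.right == nil && b.left == nil && b.right != nil &&
		pkgName(a.left) == pkgName(b.right) {
		return append(out[:n-2], change{left: a.left, right: b.right})
	}
	return out
}

// diff walks two sorted ref lists without exiting early when one ends.
func diff(left, right [][]byte) []change {
	var out []change
	il, ir := 0, 0
	for il < len(left) || ir < len(right) {
		var l, r []byte // stays nil once its list has ended
		if il < len(left) {
			l = left[il]
		}
		if ir < len(right) {
			r = right[ir]
		}
		rel := bytes.Compare(l, r)
		if l == nil || r == nil {
			rel = -rel // flip: an ended list sorts after every real ref
		}
		switch {
		case rel == 0:
			il, ir = il+1, ir+1
			continue
		case rel < 0:
			out = append(out, change{left: l})
			il++
		default:
			out = append(out, change{right: r})
			ir++
		}
		out = compact(out) // still reached after one list has ended
	}
	return out
}

func main() {
	left := [][]byte{[]byte("bash 5.1")}
	right := [][]byte{[]byte("bash 5.2")}
	for _, c := range diff(left, right) {
		fmt.Printf("%q -> %q\n", c.left, c.right)
	}
}
```

In the `main` example the left list ends after its first ref, so the
second iteration runs with a nil left ref; because the loop body still
runs to the end, the two entries for `bash` are merged into one.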

Signed-off-by: Ryan Gonzalez <ryan.gonzalez@collabora.com>
2024-04-24 17:36:36 +02:00
Ryan Gonzalez
8d09c202db Skip loading reflists when listing published repos
The output doesn't actually depend on the reflists, and loading them for
every published repo starts to take substantial time and memory.

Signed-off-by: Ryan Gonzalez <ryan.gonzalez@collabora.com>
2024-04-24 17:35:44 +02:00