Commit Graph

78 Commits

Author SHA1 Message Date
Ryan Gonzalez 19a705f80d Split reflists to share their contents across snapshots
In current aptly, each repository and snapshot has its own reflist in
the database. This brings a few problems with it:

- Given a sufficiently large repositories and snapshots, these lists can
  get enormous, reaching >1MB. This is a problem for LevelDB's overall
  performance, as it tends to prefer values around the confiruged block
  size (defaults to just 4KiB).
- When you take these large repositories and snapshot them, you have a
  full, new copy of the reflist, even if only a few packages changed.
  This means that having a lot of snapshots with a few changes causes
  the database to basically be full of largely duplicate reflists.
- All the duplication also means that many of the same refs are being
  loaded repeatedly, which can cause some slowdown but, more notably,
  eats up huge amounts of memory.
- Adding on more and more new repositories and snapshots will cause the
  time and memory spent on things like cleanup and publishing to grow
  roughly linearly.

At the core, there are two problems here:

- Reflists get very big because there are just a lot of packages.
- Different reflists can tend to duplicate much of the same contents.

*Split reflists* aim at solving this by separating reflists into 64
*buckets*. Package refs are sorted into individual buckets according to
the following system:

- Take the first 3 letters of the package name, after dropping a `lib`
  prefix. (Using only the first 3 letters will cause packages with
  similar prefixes to end up in the same bucket, under the assumption
  that packages with similar names tend to be updated together.)
- Take the 64-bit xxhash of these letters. (xxhash was chosen because it
  relatively good distribution across the individual bits, which is
  important for the next step.)
- Use the first 6 bits of the hash (range [0:63]) as an index into the
  buckets.

Once refs are placed in buckets, a sha256 digest of all the refs in the
bucket is taken. These buckets are then stored in the database, split
into roughly block-sized segments, and all the repositories and
snapshots simply store an array of bucket digests.

This approach means that *repositories and snapshots can share their
reflist buckets*. If a snapshot is taken of a repository, it will have
the same contents, so its split reflist will point to the same buckets
as the base repository, and only one copy of each bucket is stored in
the database. When some packages in the repository change, only the
buckets containing those packages will be modified; all the other
buckets will remain unchanged, and thus their contents will still be
shared. Later on, when these reflists are loaded, each bucket is only
loaded once, short-cutting loaded many megabytes of data. In effect,
split reflists are essentially copy-on-write, with only the changed
buckets stored individually.

Changing the disk format means that a migration needs to take place, so
that task is moved into the database cleanup step, which will migrate
reflists over to split reflists, as well as delete any unused reflist
buckets.

All the reflist tests are also changed to additionally test out split
reflists; although the internal logic is all shared (since buckets are,
themselves, just normal reflists), some special additions are needed to
have native versions of the various reflist helper methods.

In our tests, we've observed the following improvements:

- Memory usage during publish and database cleanup, with
  `GOMEMLIMIT=2GiB`, goes down from ~3.2GiB (larger than the memory
  limit!) to ~0.7GiB, a decrease of ~4.5x.
- Database size decreases from 1.3GB to 367MB.

*In my local tests*, publish times had also decreased down to mere
seconds but the same effect wasn't observed on the server, with the
times staying around the same. My suspicions are that this is due to I/O
performance: my local system is an M1 MBP, which almost certainly has
much faster disk speeds than our DigitalOcean block volumes. Split
reflists include a side effect of requiring more random accesses from
reading all the buckets by their keys, so if your random I/O
performance is slower, it might cancel out the benefits. That being
said, even in that case, the memory usage and database size advantages
still persist.

Signed-off-by: Ryan Gonzalez <ryan.gonzalez@collabora.com>
2025-02-15 23:49:21 +01:00
Mikel Olasagasti Uranga 7074fc8856 Switch to google/uuid module
Current used github.com/pborman/uuid hasn't seen any updates in years.

Signed-off-by: Mikel Olasagasti Uranga <mikel@olasagasti.info>
2025-01-11 23:18:50 +01:00
Gordian Schoenherr 0c76677b16 Fix -with-sources not downloading differently named sources
Such as e.g. downloading 'glibc' when the sources for 'libc6'
are requested.
2024-12-09 13:17:41 +09:00
Gordian Schoenherr 3b785e4165 Refactor Filter options into a struct
It was already a lot of options for one method and I am going to add
another one in the next commit.
2024-12-09 13:17:41 +09:00
André Roth 0936922172 only allow mirrors with architectures set 2024-11-08 17:07:37 +01:00
André Roth 26c14e218a fix lint 2024-11-08 17:07:37 +01:00
André Roth d6284148f9 set Architectures from flat mirror
note: 'Architecture' is not official, but used by nvidia mirrors for no debian arch 'x86_64'. shold this be supported ?
2024-11-08 17:07:37 +01:00
André Roth 4c58266a87 do not set empty mirror architectures for flat mirrors 2024-11-08 17:07:37 +01:00
André Roth a93ccd4100 fix tests 2024-07-03 18:08:58 +02:00
André Roth d16110068c allow not signed mirrors without InRelease file 2024-07-03 18:08:58 +02:00
André Roth a0af6a25aa fix lint complaints 2024-04-03 10:13:24 +02:00
Robert LeBlanc c2ee086487 Fix the installer path for Ubuntu Focal
Ubuntu has started depreciating the Debian installer in focal
and moved the installer images to a different path. In versions
after focal, they are completly removed. This basically gives
us more time to figure out how to use the new system.
2024-04-03 10:13:24 +02:00
André Roth 72a7780054 fix golint complaints 2024-03-06 06:21:36 +01:00
Mauro Regli ae61706a34 Fix: Implement golangci-lint suggestions 2023-09-21 11:25:18 +02:00
Lorenzo Bolla 2fa3adee1d Don't use transactions when direct db access is enough
For read-only action transactions are not necessary and they risk to deadlock
if multiple go-routines try to read the database.
2022-01-27 09:30:14 +01:00
Oliver Sauder f09a273ad7 Add publish output progress counting remaining number of packages 2022-01-27 09:30:14 +01:00
Oliver Sauder 25d7d7c037 Solving progress not safe issue for api
Progress is not safe so for api its always nil and
code needs to take care of this
2022-01-27 09:30:14 +01:00
Oliver Sauder 1c7c07ace7 db batch may not be a global resource
This way db usage is safe.
2022-01-27 09:30:14 +01:00
Oliver Sauder f7f42a9cd8 Database changes of resources need to be atomic 2022-01-27 09:30:14 +01:00
Oliver Sauder 1e7731c317 Removed obsolete RWMutexes 2022-01-27 09:30:14 +01:00
Sylvain Baubeau ab2f5420c6 Export RemoteRepo package list 2021-11-02 15:08:19 +01:00
Joshua Colson 129eb8644d Add -json flag to mirror list|show
Signed-off-by: Joshua Colson <joshua.colson@gmail.com>
2021-09-24 10:29:33 +02:00
Joshua Colson 0f1575d5af Add -json flag to snapshot show|list
Signed-off-by: Joshua Colson <joshua.colson@gmail.com>
2021-09-24 10:29:33 +02:00
Andrey Smirnov 769e984ef4 Fix issues with progress == nil causing panics
Part of PR #459

This prepares for more methods to be exposed via the API.
2019-09-03 20:28:28 +04:00
Andrey Smirnov 77d7c3871a Consistently use transactions to update database
For any action which is multi-step (requires updating more than 1 DB
key), use transaction to make update atomic.

Also pack big chunks of updates (importing packages for importing and
mirror updates) into single transaction to improve aptly performance and
get some isolation.

Note that still layers up (Collections) provide some level of isolation,
so this is going to shine with the future PRs to remove collection
locks.

Spin-off of #459
2019-08-11 00:11:53 +03:00
Andrey Smirnov f0a370db24 Rework HTTP downloader retry logic
Apply retries as global, config-level option `downloadRetries` so that
it can be applied to any aptly command which downloads objects.

Unwrap `errors.Wrap` which is used in downloader.

Unwrap `*url.Error` which should be the actual error returned from the
HTTP client, catch more cases, be more specific around failures.
2019-08-07 20:23:05 +03:00
Shengjing Zhu 906cbf1e6f Fix time.Time msgpack decoding backwards compatibility
See https://github.com/ugorji/go-codec/issues/269
2019-07-15 21:51:09 +03:00
Andrey Smirnov 3b5840e248 Fix linter list and fix errors discovered by new staticcheck 2019-01-20 00:01:17 +03:00
Oliver Sauder e23e30eb44 Merge branch 'master' into with_installer 2018-09-21 13:26:15 +02:00
Andrey Smirnov 699323e2e0 Reimplement DB collections for mirrors, repos and snapshots
See #765, #761

Collections were relying on keeping in-memory list of all the objects
for any kind of operation which doesn't scale well the number of
objects in the database.

With this rewrite, objects are loaded only on demand which might
be pessimization in some edge cases but should improve performance
and memory footprint signifcantly.
2018-08-21 01:08:14 +03:00
Andrey Smirnov 0f4bbc4752 Implement lazy iteration (ForEach) over collections
See #761

aptly had a concept of loading small amount of info per each object
into memory once collection is accessed for the first time.

This might have simplified some operations, but it doesn't scale well
with huge aptly databases.

This is just intermediate step towards better memory management -
list of objects is not loaded unless some method is called.
`ForEach` method (mainly used in cleanup) is reimplemented to
iterate over database without ever loading all the objects into memory.

Memory was even worse with previous approach, as for each item usually
`LoadComplete()` is called, which pulls even more data into memory
and item stays in memory till the end of the iteration as it is referenced
from `collection.list`.

For the subsequent PR: reimplement `ByUUID()` and probably other methods
to avoid loading all the items into memory, at least for all the collecitons
except for published repos. When published repository is being loaded, it
might pull source local repo which in turn would trigger loading for all the
local repos which is not acceptable.
2018-08-04 00:26:02 +03:00
Oliver Sauder 6df4a746f1 Clarify doc strings 2018-07-06 15:02:37 +02:00
Oliver Sauder 108b0ea226 Add support to mirror non package installer files 2018-07-06 15:02:37 +02:00
aviau 814ac6c28c dep: use official uuid package 2018-06-21 16:12:45 -04:00
Andrey Smirnov b8c5303fdb Fix paths after repository transfer to aptly-dev 2018-04-18 21:19:43 +03:00
Andrey Smirnov 15618c8ea8 Use Go context to abort gracefully mirror updates
There are two fixes here:

1. Abort package download immediately as ^C is pressed.
2. Import all the already downloaded files into package pool,
so that next time mirror is updated, aptly won't download them
once again.
2017-11-30 00:49:37 +03:00
Oliver Sauder 5d301fb1b7 Prepare archive root when editing it 2017-11-27 11:08:31 +01:00
Felix Abecassis e682639b20 Handle SHA512 in Release files
Fix: #656
2017-11-08 11:54:19 +03:00
Harald Sitter 46c2182ade fix linting by using new maligned linter instead of aligncheck
upstream switched the alignment check backend and in doing so fails to run
if the old backend is defined in the config.

also skip alignment linting on a struct we use for byte decoding as we have
no choice in its member order.
2017-10-31 12:24:31 +01:00
Andrey Smirnov 0e9f966dd1 Fix up other code to support new GPG provider structure 2017-07-21 01:01:58 +03:00
Andrey Smirnov c13eb99925 Style fixups 2017-07-05 00:36:48 +03:00
Andrey Smirnov 12a6b0ceb8 Merge pull request #575 from smira/pgp-refactoring
Refactor GPG signer/verifier
2017-05-24 19:24:38 +03:00
Andrey Smirnov cafb89f30f Re-work the way checksum matching works against Release file
Break up URL into base part and relative path. Match checksum against relative path
and never against full URL.

This might be fixing security issue if aptly was incorrectly matching against
wrong part of Release file.
2017-05-23 03:00:15 +03:00
Andrey Smirnov 1be8d39105 Refactor GPG signer/verifier
Goal is to make it easier to plug in another implementation.
2017-05-23 02:54:56 +03:00
Andrey Smirnov 470165a419 Enable goconst & interfacer linters 2017-05-17 00:53:10 +03:00
Andrey Smirnov 51213899b7 More Go linters enabled, issues fixed
Ref: #528

Enables "staticcheck", "varcheck", "structcheck", "aligncheck"
2017-05-03 18:23:14 +03:00
Andrey Smirnov bae3f949b4 Enable gosimple and ineffasign linters 2017-04-27 18:34:30 +03:00
Andrey Smirnov 5dd11a2ec2 Pull original packages when skipping existing packages 2017-04-26 23:17:04 +03:00
Andrey Smirnov 10c096fbb6 Update all other pieces for the CheckumStorage and Verify 2017-04-26 23:17:04 +03:00
Andrey Smirnov 4eedb62418 Fix misspelling 2017-04-26 23:17:04 +03:00