aptly

mirror of https://github.com/aptly-dev/aptly.git synced 2026-06-02 04:50:49 +00:00

Author	SHA1	Message	Date
Ryan Gonzalez	19a705f80d	Split reflists to share their contents across snapshots In current aptly, each repository and snapshot has its own reflist in the database. This brings a few problems with it: - Given sufficiently large repositories and snapshots, these lists can get enormous, reaching >1MB. This is a problem for LevelDB's overall performance, as it tends to prefer values around the configured block size (defaults to just 4KiB). - When you take these large repositories and snapshot them, you have a full, new copy of the reflist, even if only a few packages changed. This means that having a lot of snapshots with a few changes causes the database to basically be full of largely duplicate reflists. - All the duplication also means that many of the same refs are being loaded repeatedly, which can cause some slowdown but, more notably, eats up huge amounts of memory. - Adding on more and more new repositories and snapshots will cause the time and memory spent on things like cleanup and publishing to grow roughly linearly. At the core, there are two problems here: - Reflists get very big because there are just a lot of packages. - Different reflists can tend to duplicate much of the same contents. Split reflists aim at solving this by separating reflists into 64 buckets. Package refs are sorted into individual buckets according to the following system: - Take the first 3 letters of the package name, after dropping a `lib` prefix. (Using only the first 3 letters will cause packages with similar prefixes to end up in the same bucket, under the assumption that packages with similar names tend to be updated together.) - Take the 64-bit xxhash of these letters. (xxhash was chosen because it has relatively good distribution across the individual bits, which is important for the next step.) - Use the first 6 bits of the hash (range [0:63]) as an index into the buckets. Once refs are placed in buckets, a sha256 digest of all the refs in the bucket is taken.
These buckets are then stored in the database, split into roughly block-sized segments, and all the repositories and snapshots simply store an array of bucket digests. This approach means that repositories and snapshots can share their reflist buckets. If a snapshot is taken of a repository, it will have the same contents, so its split reflist will point to the same buckets as the base repository, and only one copy of each bucket is stored in the database. When some packages in the repository change, only the buckets containing those packages will be modified; all the other buckets will remain unchanged, and thus their contents will still be shared. Later on, when these reflists are loaded, each bucket is only loaded once, short-cutting the loading of many megabytes of data. In effect, split reflists are essentially copy-on-write, with only the changed buckets stored individually. Changing the disk format means that a migration needs to take place, so that task is moved into the database cleanup step, which will migrate reflists over to split reflists, as well as delete any unused reflist buckets. All the reflist tests are also changed to additionally test out split reflists; although the internal logic is all shared (since buckets are, themselves, just normal reflists), some special additions are needed to have native versions of the various reflist helper methods. In our tests, we've observed the following improvements: - Memory usage during publish and database cleanup, with `GOMEMLIMIT=2GiB`, goes down from ~3.2GiB (larger than the memory limit!) to ~0.7GiB, a decrease of ~4.5x. - Database size decreases from 1.3GB to 367MB. In my local tests, publish times had also decreased down to mere seconds but the same effect wasn't observed on the server, with the times staying around the same. My suspicions are that this is due to I/O performance: my local system is an M1 MBP, which almost certainly has much faster disk speeds than our DigitalOcean block volumes.
Split reflists include a side effect of requiring more random accesses from reading all the buckets by their keys, so if your random I/O performance is slower, it might cancel out the benefits. That being said, even in that case, the memory usage and database size advantages still persist. Signed-off-by: Ryan Gonzalez <ryan.gonzalez@collabora.com>	2025-02-15 23:49:21 +01:00
André Roth	26c775ccfd	fix test flat repos may have architecture which is needed for filtering dependencies	2024-11-08 17:07:37 +01:00
André Roth	3a286ae07f	fix unit tests	2024-07-03 18:08:58 +02:00
André Roth	a93ccd4100	fix tests	2024-07-03 18:08:58 +02:00
André Roth	c1f7e5fe96	handle GpgDisableVerify and ignore-signatures consistently and be less verbose	2024-07-03 18:08:58 +02:00
Ryan Gonzalez	8cb1236a8c	Improve publish cleanup perf when sources share most of their packages The cleanup phase needs to list out all the files in each component in order to determine what's still in use. When there's a large number of sources (e.g. from having many snapshots), the time spent just loading the package information becomes substantial. However, in many cases, most of the packages being loaded are actually shared across the sources; if you're taking frequent snapshots, for instance, most of the packages in each snapshot will be the same as other snapshots. In these cases, re-reading the packages repeatedly is just a waste of time. To improve this, we maintain a list of refs that we know were processed for each component. When listing the refs from a source, only the ones that have not yet been processed will be examined. Some tests were also added specifically to check listing the files in a component. With this change, listing the files in components on a copy of our production database went from >10 minutes to ~10 seconds, and the newly added benchmark went from ~300ms to ~43ms. Signed-off-by: Ryan Gonzalez <ryan.gonzalez@collabora.com>	2024-04-24 16:46:16 +02:00
Markus Muellner	8e62195eb5	implement structured logging	2023-02-20 13:42:50 +01:00
Markus Muellner	352f4e8772	update golangci-lint and replace deprecated calls to io/ioutil	2022-12-12 10:21:39 +01:00
Andrey Smirnov	77d7c3871a	Consistently use transactions to update database For any action which is multi-step (requires updating more than 1 DB key), use transaction to make update atomic. Also pack big chunks of updates (importing packages for importing and mirror updates) into single transaction to improve aptly performance and get some isolation. Note that still layers up (Collections) provide some level of isolation, so this is going to shine with the future PRs to remove collection locks. Spin-off of #459	2019-08-11 00:11:53 +03:00
Andrey Smirnov	67e38955ae	Refactor database code to support standalone batches, transactions. This is spin-off of changes from #459. Transactions are not being used yet, but batches are updated to work with the new API. `database/` package was refactored to split abstract interfaces and implementation via goleveldb. This should make it easier to implement new database types.	2019-08-09 00:46:40 +03:00
Andrey Smirnov	f0a370db24	Rework HTTP downloader retry logic Apply retries as global, config-level option `downloadRetries` so that it can be applied to any aptly command which downloads objects. Unwrap `errors.Wrap` which is used in downloader. Unwrap `*url.Error` which should be the actual error returned from the HTTP client, catch more cases, be more specific around failures.	2019-08-07 20:23:05 +03:00
Oliver Sauder	e23e30eb44	Merge branch 'master' into with_installer	2018-09-21 13:26:15 +02:00
Andrey Smirnov	699323e2e0	Reimplement DB collections for mirrors, repos and snapshots See #765, #761 Collections were relying on keeping in-memory list of all the objects for any kind of operation which doesn't scale well with the number of objects in the database. With this rewrite, objects are loaded only on demand which might be a pessimization in some edge cases but should improve performance and memory footprint significantly.	2018-08-21 01:08:14 +03:00
Oliver Sauder	b1a2523ef0	Add unit test for remote and http	2018-07-06 15:02:37 +02:00
Oliver Sauder	108b0ea226	Add support to mirror non package installer files	2018-07-06 15:02:37 +02:00
Andrey Smirnov	b8c5303fdb	Fix paths after repository transfer to aptly-dev	2018-04-18 21:19:43 +03:00
Andrey Smirnov	0e9f966dd1	Fix up other code to support new GPG provider structure	2017-07-21 01:01:58 +03:00
Andrey Smirnov	211ac0501f	Rework the way database is open/re-open in aptly Allow database to be initialized without opening, unify all the open paths to retry on failure. In API router make sure open requests are matched with acks in explicit way. This also enables re-open attempts in all the aptly commands, so it should make running aptly CLI much easier now hopefully. Fix up system tests for oldoldstable ;)	2017-07-05 00:17:48 +03:00
Andrey Smirnov	12a6b0ceb8	Merge pull request #575 from smira/pgp-refactoring Refactor GPG signer/verifier	2017-05-24 19:24:38 +03:00
Andrey Smirnov	cafb89f30f	Re-work the way checksum matching works against `Release` file Break up URL into base part and relative path. Match checksum against relative path and never against full URL. This might be fixing security issue if aptly was incorrectly matching against wrong part of Release file.	2017-05-23 03:00:15 +03:00
Andrey Smirnov	1be8d39105	Refactor GPG signer/verifier Goal is to make it easier to plug in another implementation.	2017-05-23 02:54:56 +03:00
Andrey Smirnov	51213899b7	More Go linters enabled, issues fixed Ref: #528 Enables "staticcheck", "varcheck", "structcheck", "aligncheck"	2017-05-03 18:23:14 +03:00
Andrey Smirnov	186bb2dff0	Add flag to disable/enable support for legacy pool paths Legacy pool paths are enabled by default, but for new aptly installations (when aptly config is first generated), it would be disabled explicitly.	2017-04-26 23:37:31 +03:00
Andrey Smirnov	5dd11a2ec2	Pull original packages when skipping existing packages	2017-04-26 23:17:04 +03:00
Andrey Smirnov	10c096fbb6	Update all other pieces for the ChecksumStorage and Verify	2017-04-26 23:17:04 +03:00
Andrey Smirnov	c40025a335	Add progress bar on package saving progress	2017-04-26 23:17:03 +03:00
Andrey Smirnov	bc7903f86e	Rework mirror update (download packages) implementation `PackageDownloadTask` is just a reference to file now. Whole process was rewritten to follow pattern: download to temp location inside the pool, verify/update checksums, import into pool as final step. This removes a lot of edge cases when aptly internal state might be broken if updating from rogue mirror. Also this changes whole memory model: package list/files are kept in memory now during the duration of `mirror update` command and saved to disk only in the end.	2017-04-26 23:17:03 +03:00
Clemens Rabe	aa16899c60	Adaption of tests.	2017-03-24 06:25:46 +01:00
Clemens Rabe	16a0d0d428	Added option --skip-existing-packages to speed up mirror update.	2017-03-23 22:01:11 +01:00
Clemens Rabe	66f51d2b17	Added option --skip-existing-packages to speed up mirror update.	2017-03-23 21:55:22 +01:00
Andrey Smirnov	11d828b3b1	Add govet/golint into Travis CI build Fix current issues	2017-03-22 21:49:16 +03:00
Raphael Medaer	bfb9ffad1d	Added expected error on 'Packages.xz' for TestDownload[WithSources]Flat.	2017-03-16 22:41:25 +03:00
Oliver Sauder	f31b5ec3f8	Adjusted test with new maxTries param for download	2016-11-28 17:02:24 +01:00
Andrey Smirnov	aa53b8da15	Go 1.6.	2016-04-18 12:47:00 +03:00
Andrey Smirnov	7bb052ac37	Fix unit-tests. #324	2015-12-24 14:08:37 +03:00
Andrey Smirnov	84801bce78	Fix unit-tests Go 1.5 has different error message, randomize port number in test to avoid collisions.	2015-09-22 11:18:57 +03:00
Andrey Smirnov	8ca07d9acd	Fix unit tests. #71	2015-03-18 22:10:49 +03:00
Andrey Smirnov	8e20daa927	Refactor out IsClearSigned to separate method. #71	2015-03-13 18:42:34 +03:00
Andrey Smirnov	903d4cefba	gofmt -s	2015-02-22 14:29:09 +03:00
Chris Read	daf887e54f	Upgrade gocheck	2014-11-05 13:27:15 -06:00
Andrey Smirnov	7be2ef8b85	Don't fallback between compression methods available unless we get strictly HTTP 404. #129 #125 Prior to that, some real errors could have been masked away by that fallback.	2014-10-24 08:51:04 +04:00
Andrey Smirnov	a356f3dff9	Marking RemoteRepo as being updated, with worker PID, checking for locks. #45 #114	2014-10-03 01:32:19 +04:00
Andrey Smirnov	a0870f6726	Refactor mirror download code, split it into separate methods. #45 #114	2014-10-02 19:30:37 +04:00
Andrey Smirnov	7ad1bb387b	Support for .udeb downloads from remote mirrors. #108	2014-09-25 19:34:16 +04:00
Andrey Smirnov	ce1df9447d	Support for filters in RemoteRepo: filtering mirror contents by query. #62	2014-07-16 02:27:29 +04:00
Andrey Smirnov	10bbefeb25	Fix support for flat format repositories in subdirectories with common pool. #47	2014-05-10 16:56:50 +04:00
Andrey Smirnov	ff045f9a48	Fixups after renaming debian -> deb. #21	2014-04-07 21:22:58 +04:00
Andrey Smirnov	fd662c9275	Rename debian -> deb. #21	2014-04-07 21:15:13 +04:00

48 Commits
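
The bucketing scheme described in commit 19a705f80d can be illustrated with a minimal Go sketch. This is not aptly's implementation: the `ref` type and the `bucketIndex`, `bucketDigests` and `sharedBuckets` helpers are hypothetical names, FNV-1a from the standard library stands in for the 64-bit xxhash so the example has no external dependencies, and "first 6 bits" is assumed to mean the top 6 bits of the hash. It shows how a snapshot that differs from its repository by one package version still shares every other bucket digest.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"hash/fnv"
	"sort"
	"strings"
)

const numBuckets = 64

// ref is a hypothetical, simplified package reference (name plus version).
type ref struct{ name, version string }

// bucketIndex maps a package name to a bucket in [0, 63].
// Assumption: FNV-1a replaces the xxhash named in the commit message.
func bucketIndex(name string) int {
	key := strings.TrimPrefix(name, "lib") // drop a leading "lib"
	if len(key) > 3 {
		key = key[:3] // first 3 bytes; package names are assumed ASCII
	}
	h := fnv.New64a()
	h.Write([]byte(key))
	return int(h.Sum64() >> 58) // keep the top 6 of 64 bits
}

// bucketDigests sorts refs into buckets and returns one hex-encoded
// SHA-256 digest per bucket, computed over the bucket's sorted contents.
func bucketDigests(refs []ref) [numBuckets]string {
	var buckets [numBuckets][]string
	for _, r := range refs {
		i := bucketIndex(r.name)
		buckets[i] = append(buckets[i], r.name+" "+r.version)
	}
	var digests [numBuckets]string
	for i, b := range buckets {
		sort.Strings(b)
		sum := sha256.Sum256([]byte(strings.Join(b, "\n")))
		digests[i] = hex.EncodeToString(sum[:])
	}
	return digests
}

// sharedBuckets counts the positions at which two digest arrays agree,
// i.e. the buckets that would be stored only once.
func sharedBuckets(a, b [numBuckets]string) int {
	n := 0
	for i := range a {
		if a[i] == b[i] {
			n++
		}
	}
	return n
}

func main() {
	repo := []ref{
		{"libssl3", "3.0.11"}, {"openssl", "3.0.11"},
		{"curl", "7.88.1"}, {"libcurl4", "7.88.1"}, {"zlib1g", "1.2.13"},
	}
	// The snapshot differs from the repository in a single package version.
	snap := append([]ref(nil), repo...)
	snap[2].version = "7.88.2"

	fmt.Println(bucketIndex("libcurl4") == bucketIndex("curl")) // same 3-letter key "cur"
	fmt.Println(sharedBuckets(bucketDigests(repo), bucketDigests(snap)))
}
```

Only the bucket holding `curl` changes between the two digest arrays, so 63 of the 64 digests stay equal. A digest-keyed store would therefore need to write just one new bucket for the snapshot, which is the copy-on-write behaviour the commit message describes.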