updates with contents generation were super syscall-heavy. for each path
in a package (so at least 2-4, but ordinarily >4) we'd do a db.Put in
ContentsIndex which results in one syscall.Write. so, for every package in
a published repo we'd have to do *at least* 2 but ordinarily >4 syscalls.
this gets abysmally slow very quickly depending on the available system
specs.
instead, start a batch inside each package and finish it when we are done
with the package. this should keep the memory footprint negligible, but
reduce the write() calls from N to 1.
on one of KDE's servers I have seen update publishing of 7600 packages go
from ~28s to ~9s when using batch putting on an HDD.
on my local system the same set of packages go from ~14s to ~6s on an SSD.
(all inodes in cache in both cases)
presently there is no use case where we need this. on the other hand,
passing empty paths into any of the remove methods is indicative of a bug.
this is particularly dangerous as this can temporarily smash the publish
root but later restore it again when actually publishing. this makes
for super nasty and hard to track down problems.
to guard against this simply disallow root dir removal using empty
strings. should we find a use case for this in the future we can always
revisit this (FTR: I think very explicitly API should be used so everyone
knows what is going on and you can't accidentally run it)
the logic here was wrong.
if we managed to find the link target (the physical index file) pointed to
by our old symlink we want to remove it (this is basically "cleaning up old
index" logic).
previously we'd try to only delete it when the ReadLink came back with
error. which had two serious issues with it:
a) linkTarget was empty, so we basically called Remove("") which would
delete the storage -> root <- directory if the root is a symlink!
b) we'd leak old indexes as the cleanup logic only ran if there was en
error which would ordinarily never be
new code correctly cleans up unless there was an error.
this relates to a previous bugfix of readLink which incorrectly returned
absolute paths ultimately rendering the Remove call also broken.
previously it'd return an absolute path which makes the path absolutely
useless as all other functions of PublishedStorage need relative input
and will prepend them with the rootPath, so getting an absolute ReadLink
and then trying to remove that'd would ultimately try to remove the
absolute path `$root/AbsoluteRoot/LinkTarget` instead of `$root/LinkTarget`
add a unit test to actually verify readlink
According to https://tools.ietf.org/html/rfc7231#section-4.3.2 HEAD
must not have response body so the AWS error code NoSuchKey
cannot be received from S3 and we need to fallback to HTTP NotFound
error code.
- new publish calls can now enable AcquireByHash by right away (previously
one would have had to create a new publishing endpoint and then
explicitly switch it to AcquireByHash)
- all json marshals of PublishedRepo now contain AcquireByHash (allows
inspecting if a given endpoint has AcquireByHash enabled already; also
enables verification that a switch/update actually applied a
potential AcquireByHash change
- update all tests to reflect that default state of AcquireByHash
- update creation and switch testing to explicitly toggle AcquireByHash to
make sure state mutation works as expected
LoadComplete() modifies object, so it would cause issues if it runs
concurrently with other methods. Uprage mutex locks to write
locks when LoadComplete() is being used.