Commit Graph

9 Commits

Author SHA1 Message Date
Nick Bozhenko 463c34a38e Fix race conditions and improve etcd timeout handling
This commit addresses several critical race conditions and improves the reliability
of etcd operations through better timeout and retry handling.

## Race Condition Fixes

1. **Task Resource Management Bug**
   - Fixed incorrect variable usage in task/list.go:78
   - Was using completed task's resources instead of idle task's resources
   - This caused resource conflicts and potential deadlocks

2. **Database Channel Initialization**
   - Added sync.Once pattern to ensure thread-safe channel initialization
   - Prevents panics caused by concurrent access during startup
   - Created initDBRequests() function for safe initialization

3. **Published Storage Double-Checked Locking**
   - Implemented double-checked locking pattern in GetPublishedStorage
   - Reduces lock contention while preventing concurrent initialization
   - Improves performance for frequently accessed storage

4. **File Operation Synchronization**
   - Created FileLockRegistry in utils/filelock.go
   - Prevents concurrent file operations (create, rename, delete, link)
   - Implements deadlock prevention for multi-file operations
   - Critical for preventing file corruption during parallel publishes

5. **WaitGroup Miscount Prevention**
   - Added defer pattern to ensure Done() is always called
   - Protects against panics during task execution
   - Prevents "negative WaitGroup counter" errors
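
The synchronization patterns in items 2 to 5 can be sketched in Go as follows. This is a minimal, self-contained illustration under stated assumptions, not aptly's code. `initDBRequests`, `GetPublishedStorage` and `FileLockRegistry` reuse names from the commit message, but their signatures, the channel element type and buffer size, and the helper types `publishedStorage` and `storageRegistry` are hypothetical. Multi-file deadlock prevention is shown here as acquiring per-path locks in one fixed (sorted) order. That is one common technique; the commit does not say which one `utils/filelock.go` uses.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// Item 2: the request channel is created exactly once, no matter how many
// goroutines call initDBRequests concurrently during startup.
var (
	dbRequests     chan string
	dbRequestsOnce sync.Once
)

func initDBRequests() chan string {
	dbRequestsOnce.Do(func() { dbRequests = make(chan string, 16) })
	return dbRequests
}

// Item 3: double-checked locking around a lazily filled storage cache.
type publishedStorage struct{ name string }

type storageRegistry struct {
	mu      sync.RWMutex
	storage map[string]*publishedStorage // must be non-nil before use
}

func (r *storageRegistry) GetPublishedStorage(name string) *publishedStorage {
	r.mu.RLock() // first check: a shared lock is enough for the common cached case
	s, ok := r.storage[name]
	r.mu.RUnlock()
	if ok {
		return s
	}
	r.mu.Lock() // second check: another goroutine may have created it meanwhile
	defer r.mu.Unlock()
	if s, ok := r.storage[name]; ok {
		return s
	}
	s = &publishedStorage{name: name}
	r.storage[name] = s
	return s
}

// Item 4: one mutex per path (entries are never evicted in this sketch).
type FileLockRegistry struct {
	mu    sync.Mutex // guards the map only, never held while waiting on a file
	locks map[string]*sync.Mutex
}

// LockFiles takes each distinct path's lock in sorted order, so callers locking
// {a, b} and {b, a} cannot deadlock. It returns a function that releases them.
func (r *FileLockRegistry) LockFiles(paths ...string) func() {
	sorted := append([]string(nil), paths...)
	sort.Strings(sorted)
	var held []*sync.Mutex
	for i, p := range sorted {
		if i > 0 && p == sorted[i-1] {
			continue // locking the same path twice would self-deadlock
		}
		r.mu.Lock()
		if r.locks == nil {
			r.locks = make(map[string]*sync.Mutex)
		}
		m, ok := r.locks[p]
		if !ok {
			m = &sync.Mutex{}
			r.locks[p] = m
		}
		r.mu.Unlock()
		m.Lock()
		held = append(held, m)
	}
	return func() {
		for i := len(held) - 1; i >= 0; i-- {
			held[i].Unlock()
		}
	}
}

func main() {
	reg := storageRegistry{storage: map[string]*publishedStorage{}}
	fmt.Println(initDBRequests() == initDBRequests(),
		reg.GetPublishedStorage("s3") == reg.GetPublishedStorage("s3"))

	var files FileLockRegistry
	var wg sync.WaitGroup
	for _, pair := range [][]string{{"a", "b"}, {"b", "a"}} {
		wg.Add(1)
		go func(paths []string) {
			defer wg.Done() // item 5: deferred, so Done runs on every exit path
			unlock := files.LockFiles(paths...)
			defer unlock()
		}(pair)
	}
	wg.Wait()
}
```

Because every caller acquires file locks in the same global order, no cycle of waiting goroutines can form. The trade-off in this sketch is that the registry map grows with every distinct path, which a long-running implementation would need to bound.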

## etcd Improvements

1. **Timeout Protection**
   - Replaced global context.TODO() with per-operation timeout contexts
   - Default timeout: 60 seconds (configurable)
   - Prevents indefinite hangs when etcd is unresponsive

2. **Environment Variable Configuration**
   - APTLY_ETCD_TIMEOUT: Operation timeout (default: 60s)
   - APTLY_ETCD_DIAL_TIMEOUT: Connection timeout (default: 60s)
   - APTLY_ETCD_KEEPALIVE: Keep-alive timeout (default: 7200s)
   - APTLY_ETCD_MAX_MSG_SIZE: Max message size (default: 50MB)

3. **Retry Logic for Read Operations**
   - Get operations retry up to 3 times with exponential backoff
   - Only retries on temporary/network errors
   - Improves reliability without risking data inconsistency, since reads are idempotent

4. **Enhanced Error Logging**
   - All etcd errors now logged with operation context
   - Replaces silent failures with actionable error messages
   - Improves debugging and monitoring capabilities

5. **Increased Message Size Limits**
   - Default increased from 10MB to 50MB
   - Configurable via environment variable
   - Prevents "message too large" errors for large operations

## Testing

- Added comprehensive tests for etcd timeout functionality
- Tests verify context timeout, retry logic, and configuration
- All existing tests pass with the new implementation

## Documentation

- Updated README.rst with etcd configuration section
- Documented all environment variables and their defaults
- Added examples and feature descriptions

These changes significantly improve the reliability and debuggability of aptly
when using etcd as the database backend, while also fixing critical race
conditions that could cause data corruption or service crashes.
2025-07-10 10:05:49 -04:00
André Roth f7057a9517 go1.24: fix lint, unit and system tests
- development env: base on debian trixie with go1.24
- lint: run with default config
- fix lint errors
- fix unit tests
- fix system test
2025-04-26 13:29:50 +02:00
Mikel Olasagasti Uranga 7074fc8856 Switch to google/uuid module
The currently used github.com/pborman/uuid module hasn't seen any updates in years.

Signed-off-by: Mikel Olasagasti Uranga <mikel@olasagasti.info>
2025-01-11 23:18:50 +01:00
André Roth 0b3dd2709b apply PR feedback 2024-07-31 22:16:00 +02:00
André Roth 67771795ca etcd: implement transactions
- use temporary db for lookups in transactions
- use batch implementation to commit transaction
2024-07-31 22:16:00 +02:00
André Roth 7a01c9c62d etcd: implement batch operations
- cache the operations internally in a list
- Write() applies the list to etcd
2024-07-31 22:16:00 +02:00
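
The batching described above can be sketched with a plain map standing in for the etcd keyspace. The `batch` and `op` types and the `Put`/`Delete` method names are hypothetical. Only the idea of caching operations in a list and applying them in `Write()` comes from the commit. The sketch applies operations one by one and makes no claim about how, or how atomically, the real backend submits them to etcd.

```go
package main

import "fmt"

// op is one buffered mutation; del distinguishes deletes from puts.
type op struct {
	key, value string
	del        bool
}

// batch caches operations in order and applies them only on Write().
type batch struct {
	store map[string]string // stand-in for the etcd keyspace
	ops   []op
}

func (b *batch) Put(key, value string) { b.ops = append(b.ops, op{key: key, value: value}) }
func (b *batch) Delete(key string)     { b.ops = append(b.ops, op{key: key, del: true}) }

// Write applies the cached list in insertion order, then resets it.
func (b *batch) Write() error {
	for _, o := range b.ops {
		if o.del {
			delete(b.store, o.key)
		} else {
			b.store[o.key] = o.value
		}
	}
	b.ops = b.ops[:0]
	return nil
}

func main() {
	store := map[string]string{"old": "x"}
	b := &batch{store: store}
	b.Put("a", "1")
	b.Delete("old")
	fmt.Println(len(store)) // the store is unchanged until Write is called
	_ = b.Write()
	fmt.Println(store)
}
```
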
André Roth 9768ecef22 etcd: implement temporary db support
- temporary db support is implemented with a unique key prefix
- prevent closing etcd connection when closing temporary db
2024-07-31 22:16:00 +02:00
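
A temporary database implemented with a unique key prefix can be sketched as a thin view over a shared store. `sharedStore`, `tempDB` and `newTempDB` are hypothetical. Deleting the prefixed keys on `Close` is an assumption of this sketch. The part taken from the commit is that closing a temporary database must not close the shared etcd connection.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
	"sync/atomic"
)

// sharedStore stands in for the single etcd connection all handles share.
type sharedStore struct {
	mu     sync.Mutex
	data   map[string]string
	closed bool // only the main database handle would ever set this
}

// tempCounter makes each temporary prefix unique within this process.
var tempCounter atomic.Uint64

// tempDB is a view of the shared store with every key under its own prefix.
type tempDB struct {
	store  *sharedStore
	prefix string
}

func newTempDB(s *sharedStore) *tempDB {
	return &tempDB{store: s, prefix: fmt.Sprintf("tmp/%d/", tempCounter.Add(1))}
}

func (t *tempDB) Put(key, value string) {
	t.store.mu.Lock()
	defer t.store.mu.Unlock()
	t.store.data[t.prefix+key] = value
}

func (t *tempDB) Get(key string) (string, bool) {
	t.store.mu.Lock()
	defer t.store.mu.Unlock()
	v, ok := t.store.data[t.prefix+key]
	return v, ok
}

// Close drops this handle's keys (an assumption of the sketch) but leaves
// the shared connection open, since other databases still use it.
func (t *tempDB) Close() error {
	t.store.mu.Lock()
	defer t.store.mu.Unlock()
	for k := range t.store.data {
		if strings.HasPrefix(k, t.prefix) {
			delete(t.store.data, k)
		}
	}
	return nil
}

func main() {
	s := &sharedStore{data: map[string]string{"pkg": "main"}}
	a, b := newTempDB(s), newTempDB(s)
	a.Put("pkg", "temp A")
	b.Put("pkg", "temp B")
	_ = a.Close()
	vb, _ := b.Get("pkg")
	fmt.Println(s.data["pkg"], vb, s.closed)
}
```

A per-process counter keeps the example short. Several aptly processes sharing one etcd cluster would need a globally unique prefix, for example one built from a UUID.
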
André Roth 5b74f82edb etcd: fix int overflow
goxc fails with:

Error: database/etcddb/database.go:17:25: cannot use 2048 * 1024 * 1024 (untyped int constant 2147483648) as int value in struct literal (overflows)
2024-07-31 22:16:00 +02:00
hudeng 78172d11d7 feat: Add etcd database support
Improve concurrent access and high availability of aptly by taking advantage of etcd's characteristics.
2024-07-31 22:16:00 +02:00