Production fuzzing: Incorporating pre-production and production usage into the quality and testing effort
In the routine operation of production systems, anomalies such as segfaults turn up regularly, and scripts that watch the logs for them have been written countless times.
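A minimal sketch of such a log-watching script is shown here. It assumes log lines that resemble Linux kernel segfault messages (`name[pid]: segfault at ...`); the pattern and the function name `find_segfaults` are illustrative assumptions, not taken from any particular tool.

```python
import re

# Assumed log shape: "<process>[<pid>]: segfault at ...", as in Linux kernel
# messages. Adjust the pattern to whatever the monitored system really emits.
SEGFAULT_RE = re.compile(r"(?P<proc>[^\s\[]+)\[(?P<pid>\d+)\]: segfault at")


def find_segfaults(lines):
    """Yield (process_name, pid) for each log line that reports a segfault."""
    for line in lines:
        match = SEGFAULT_RE.search(line)
        if match:
            yield match.group("proc"), int(match.group("pid"))
```

In practice the `lines` argument could be an open log file or a stream read from a log shipper; the generator only needs an iterable of strings.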
Some more examples:
- For a website that takes orders, a check that verifies that whenever a successful-order event is published, the order ID can be found in all the appropriate tables.
- Running fsck continuously while working on a file system, to verify that it has not become corrupted.
- Leak detectors that find zombie processes or hung containers that have not exited.
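The order check in the first example could be sketched roughly as follows. The table names (`orders`, `order_items`, `payments`), the `order_id` column, and the event shape are all hypothetical, and an in-memory SQLite database stands in for the real store.

```python
import sqlite3

# Hypothetical tables that should each hold a row for every successful order.
REQUIRED_TABLES = ("orders", "order_items", "payments")


def tables_missing_order(conn, order_id, tables=REQUIRED_TABLES):
    """Return the names of the tables that have no row for order_id."""
    missing = []
    for table in tables:
        # Table names come from the fixed tuple above, never from user input.
        row = conn.execute(
            f"SELECT 1 FROM {table} WHERE order_id = ? LIMIT 1", (order_id,)
        ).fetchone()
        if row is None:
            missing.append(table)
    return missing


def on_order_succeeded(conn, event):
    """Check one hypothetical 'order succeeded' event; return a report or None."""
    missing = tables_missing_order(conn, event["order_id"])
    if missing:
        return {"order_id": event["order_id"], "missing_in": missing}
    return None


def make_demo_db():
    """Build a small in-memory database: order 1 is complete, order 2 is not."""
    conn = sqlite3.connect(":memory:")
    for table in REQUIRED_TABLES:
        conn.execute(f"CREATE TABLE {table} (order_id INTEGER)")
        conn.execute(f"INSERT INTO {table} VALUES (1)")
    conn.execute("INSERT INTO orders VALUES (2)")
    return conn
```

The check does not care what was ordered or by whom. It only asserts that a property holds for every order event the system happens to produce.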
Historically, many projects and organizations have built tools and monitoring to catch the obvious cases of these issues. However, this automation has been sporadic and has rarely been discussed formally as part of the software testing regimen.
Production fuzzing relies on users, testers, or any other systems that execute the code to provide input, rather than on a fixed set of inputs or on generated inputs. It monitors properties of the system and reports the issues and unexpected behaviors that this input exposes, so that engineers can fix them.
Key Characteristics of Production Fuzzing:
- Serves primarily as a way for engineers to identify and address issues, not just a tool for operations.
- Operates continuously, rather than being a one-time or limited-time activity associated with feature releases or rollouts, ensuring ongoing monitoring and improvement.
- Environment-agnostic, running in various environments to take advantage of their different usage patterns to detect different issues.
- Monitors the system's properties to detect categories of errors. This differs from smoke tests or unit-test-style checks, which verify that specific inputs produce predetermined outputs.
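To make the last characteristic concrete, here is a minimal sketch of a property monitor. Instead of asserting that one fixed input yields one fixed output, it checks general properties against whatever events the system produces. The `PropertyMonitor` class, the event fields (`items`, `total`, in cents), and the two properties are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class PropertyMonitor:
    """Checks named properties against every observed event, recording violations."""
    properties: dict[str, Callable[[dict], bool]]
    violations: list[dict] = field(default_factory=list)

    def observe(self, event):
        for name, holds in self.properties.items():
            detail = None
            try:
                ok = holds(event)
            except Exception as exc:  # a malformed event also counts as a violation
                ok = False
                detail = repr(exc)
            if not ok:
                self.violations.append(
                    {"property": name, "event": event, "detail": detail}
                )


# Hypothetical properties of an order event; amounts are integers in cents.
PROPERTIES = {
    "total_matches_items": lambda e: e["total"] == sum(e["items"]),
    "total_non_negative": lambda e: e["total"] >= 0,
}
```

A unit test would pin down a single case, such as a cart of 500 and 700 totalling 1200. The monitor instead reports any event, from any environment, for which a property fails, and the recorded violations are what engineers would then triage.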
While doing "shift-right testing", "observability engineering", or even simply "really good monitoring", one might have done production fuzzing, but there was no guarantee that it would occur.
Naming this pattern and defining its key characteristics makes it easier to incorporate into quality and testing efforts.