Autophone is an
automated system that executes Python test scripts on real
user hardware, that is, actual phones. It’s been an active project for about
a year now, and we’ve learned a lot about the difficulties
of performing automated performance measurements on hardware that was never
intended for automation. I’m documenting this story for posterity, since it
has been an interesting, if often frustrating, experience. If you want to
follow along, the source is on GitHub.
I’m going to divide this into a few posts to hopefully avoid tl;driness and to
ensure I actually finish at least some of this.
For correctness testing, Mozilla’s Fennec automation is largely done with
development boards, specifically Tegras and Pandas. These boards have
wired power and ethernet, perfect for rack mounting, and all of a given
type generally behave the same.
These boards are not, of course, consumer devices, and,
despite having chipsets similar to those in real phones, they have different
performance characteristics. To really see how Fennec performs in the real
world, we need to measure Fennec on devices that people are buying and using every day. Thus
was born the Autophone project.
At the moment, the existing Autophone production system runs tests on phones that range from somewhat to very “old.” This isn’t a limitation of Autophone; rather, it’s a
sampling of phones that were still very common last year, when the project
got truly underway. We will add newer phones as time progresses, especially
now that the system is very stable. The current system has at least one of
the following phones:
- HTC Google Nexus One
- LG Revolution
- Motorola Droid Pro
- Samsung Galaxy SII
- Samsung Galaxy SIII
- Samsung Google Nexus S
We have a few more phones waiting to be deployed in a second cluster.
A brief discussion of Autophone’s design may help in understanding the
problems in automation and performance measurements by providing some context.
Autophone consists of a main Python process with one worker process per phone.
We use processes rather than threads so that certain failures in one worker
are isolated from the others. The main process has separate threads for its TCP command
server and for a pulse listener.
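For the curious, the basic shape looks something like this. This is a minimal sketch, not Autophone’s actual code; the function and device names are made up for illustration:

```python
import multiprocessing

def worker_loop(device_name, queue):
    """One of these runs per phone, in its own process, so a crash or
    hang here cannot take down the other workers."""
    while True:
        cmd = queue.get()
        if cmd == "stop":
            break
        # ... otherwise, run the command against this device ...

def start_worker(device_name):
    q = multiprocessing.Queue()
    p = multiprocessing.Process(target=worker_loop, args=(device_name, q))
    p.start()
    return p, q

def main():
    # One worker process per phone; device names here are invented.
    workers = {name: start_worker(name) for name in ("nexus-one", "droid-pro")}
    # The real main process also starts its TCP command server and
    # pulse listener on threads at this point.
    for proc, queue in workers.values():
        queue.put("stop")
    for proc, queue in workers.values():
        proc.join()
    return workers

if __name__ == "__main__":
    main()
```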
The worker processes are each tied to a single device and are responsible
for controlling that device. The devices all have the SUTAgent and Watcher
installed, to which the processes talk via mozdevice. A worker is spawned
when the main process receives a SUT registration message on its command
port from an unknown device. Device info is cached in a JSON file, and
workers are also launched upon startup for any known device.
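The registration-and-cache flow can be sketched roughly like so. The file layout and fields here are invented, not Autophone’s actual format:

```python
import json
import os

def load_devices(cache_path):
    """Return the known-device map, or an empty one if no cache exists.
    At startup, a worker would be launched for each entry."""
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            return json.load(f)
    return {}

def register_device(cache_path, name, info):
    """Record a newly registered device so a worker can be relaunched
    for it on the next startup."""
    devices = load_devices(cache_path)
    devices[name] = info  # e.g. connection info for the device
    with open(cache_path, "w") as f:
        json.dump(devices, f)
    return devices
```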
The workers each listen to their own queue, which receives commands from the
main process. Commands come from users, via the command server, or are
triggered by a build notification from pulse. The workers also check for
jobs every 10 seconds (see below).
Autophone also includes a simple build cache server. This server handles
requests for builds from the workers, fetching them via FTP as necessary,
ensuring that only one copy of a particular build is downloaded at a
time, and keeping recently used builds around, subject to space
restrictions. (This part is actually common to several of our automation frameworks,
so it really should be extracted and put into its own module. Even better would
be to extend mozdownload to support Fennec and have the build cache server
use that to fetch builds. But I digress.)
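The build cache’s two guarantees can be illustrated with a toy version. The class and all of its names are invented, and the real server of course performs actual downloads and manages disk space rather than a dict:

```python
import threading
from collections import OrderedDict

class BuildCache:
    """Toy cache: one download per build even under concurrent
    requests, plus LRU-style eviction past a size limit."""

    def __init__(self, max_builds=5):
        self.max_builds = max_builds
        self.builds = OrderedDict()   # build URL -> local path
        self.locks = {}               # build URL -> download lock
        self.meta_lock = threading.Lock()

    def get(self, url, fetch):
        with self.meta_lock:
            lock = self.locks.setdefault(url, threading.Lock())
        with lock:  # serializes downloads of this particular build
            with self.meta_lock:
                if url in self.builds:
                    self.builds.move_to_end(url)  # mark recently used
                    return self.builds[url]
            path = fetch(url)  # the expensive part: the actual download
            with self.meta_lock:
                self.builds[url] = path
                while len(self.builds) > self.max_builds:
                    self.builds.popitem(last=False)  # evict oldest
            return path
```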
When a build notification is received from pulse, or when a user issues
a “triggerjobs” command, an entry containing the build URL and device name is inserted into a sqlite database for each device. A generic “job”
command is then issued to each worker. As mentioned above, the workers
poll this database every 10 seconds. They also poll it immediately after
executing a command, so the “job” command serves merely to trigger an immediate poll. This mechanism allows worker processes
to be restarted, since even if a “job” command is missed, the job itself
will be picked up from the database. In a similar vein, if the whole
system is shut down, the current test will be restarted, and any queued
tests will remain.
When a job successfully completes, the associated entry for that device and
build is deleted from the jobs database. The number of attempts for each job is also recorded, and a job is abandoned after too many attempts, in case there are
unrecoverable problems with a particular build or build/device combination.
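A stripped-down version of the jobs table and its polling logic might look like the following. The schema, column names, and attempt limit are illustrative, not Autophone’s actual schema:

```python
import sqlite3

MAX_ATTEMPTS = 3  # illustrative; a job is abandoned after this many tries

def init_db(conn):
    conn.execute("""CREATE TABLE IF NOT EXISTS jobs
                    (id INTEGER PRIMARY KEY, device TEXT,
                     build_url TEXT, attempts INTEGER DEFAULT 0)""")

def queue_job(conn, device, build_url):
    """One row per device per build, inserted on a pulse notification
    or a “triggerjobs” command."""
    conn.execute("INSERT INTO jobs (device, build_url) VALUES (?, ?)",
                 (device, build_url))

def next_job(conn, device):
    """Return the oldest pending job for a device, bumping its attempt
    count; jobs that have failed too often are abandoned."""
    conn.execute("DELETE FROM jobs WHERE device = ? AND attempts >= ?",
                 (device, MAX_ATTEMPTS))
    row = conn.execute("SELECT id, build_url FROM jobs WHERE device = ? "
                       "ORDER BY id LIMIT 1", (device,)).fetchone()
    if row:
        conn.execute("UPDATE jobs SET attempts = attempts + 1 "
                     "WHERE id = ?", (row[0],))
    return row

def finish_job(conn, job_id):
    """On success, the entry is deleted from the jobs database."""
    conn.execute("DELETE FROM jobs WHERE id = ?", (job_id,))
```

Because the queue lives in sqlite rather than in memory, a restarted worker simply picks up where the previous one left off on its next poll.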
Tests themselves are Python classes specified in a ManifestDestiny manifest
file. They are (for better or for worse) executed in the worker process,
i.e., not as a (further) separate process. Test classes are derived from a base
class, PhoneTest, and are pretty much free form, requiring only a runjob()
method that takes a dict of build metadata and the worker subprocess object.
The worker object can be used to manipulate the device as necessary, in
particular to attempt to recover a frozen or disappeared device. (This part
should probably be split into its own object, since a test shouldn’t be
messing with the worker process object directly.) The PhoneTest base class also provides some
convenience functions to push profiles, start fennec, and report status
messages to the main process.
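In spirit, a test looks something like this. This is an invented example, not the actual S1S2 test, and the base class here is just a stand-in for the real PhoneTest:

```python
class PhoneTest:
    """Stand-in for Autophone's PhoneTest base class; the real one also
    provides helpers to push profiles, start fennec, and report status."""
    def runjob(self, build_metadata, worker):
        raise NotImplementedError

class ExampleStartupTest(PhoneTest):
    """A free-form test: the only requirement is a runjob() method."""
    def runjob(self, build_metadata, worker):
        # build_metadata is a dict describing the build to test;
        # worker exposes device-recovery hooks in the real system.
        return "tested %s" % build_metadata["url"]
```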
At the moment, we have a single active test, S1S2, which measures Fennec
startup performance. We also have support
for a few unit tests (crashtests, reftests, JS reftests, mochitests, and
robocop), though these are currently disabled pending some stability fixes.
Next post I’ll discuss the goal of S1S2, its challenges, and our solutions.