Having used both Amazon Web Services[a] and Google Cloud Services[b] - and ported one project from the former to the latter - I have collected some lessons learned. Some are generally useful for any kind of distributed system; others are specific to running on rented hardware.
1. Company Policy is Leaky
All providers have a published API that is more or less documented, so in all likelihood you will get up and running. What they don't have is documentation for "their way of doing things", even though it permeates the service. Amazon, for example, has a way of "doing computing", and you will see it reflected in the services they offer and in how you are expected to interact with them. So does Google: the services they offer, and the way you are expected to use them for maximum effect, are closely tied to your understanding of the "Google way" of doing things.
In short, you are at least in spirit an employee at your IaaS provider.
This doesn't mean that you must keep your "spiritual boss" at the IaaS provider happy, but if you want to get things done you must decode and understand the underlying policies, priorities and attitudes of the business that you in turn are running your business on.
2. Abstractions are Leaky
On any IaaS provider, you will be running in a huge server farm, and things will break. Some providers try to offer a more stable environment[c], while others let you have all the chaos their server farm can produce[d], raw and unfiltered. Either way, running on an IaaS provider is different from having your own server, and the underlying uncertainties must be accounted for.
I don't know if they still do, but it used to be that sysadmins would compete to see who had the longest uptime. Anyway, in the IaaS world, that is all over. You will have a server uptime measured in hours, but a fantastic service uptime.
3. Abstract, Abstract, Abstract - and Then Abstract Some More
Assume that you will switch service providers. Even if you never do, knowing that all your interactions with the service go through your own API first gives you a clear delineation between what is yours and what is theirs.
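As a minimal sketch of what "your own API first" might look like, the Python below puts a small storage interface between application code and whatever the provider offers. The names `BlobStore`, `InMemoryBlobStore`, and `save_report` are hypothetical and chosen only for illustration; a provider-backed class would implement the same two methods using that provider's SDK.

```python
from abc import ABC, abstractmethod


class BlobStore(ABC):
    """Our own storage API (hypothetical). Application code depends only on this."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None:
        """Store `data` under `key`, overwriting any previous value."""

    @abstractmethod
    def get(self, key: str) -> bytes:
        """Return the bytes stored under `key`."""


class InMemoryBlobStore(BlobStore):
    """Stand-in backend; a provider-specific class would sit beside it."""

    def __init__(self) -> None:
        self._blobs = {}  # maps key (str) to stored bytes

    def put(self, key: str, data: bytes) -> None:
        self._blobs[key] = data

    def get(self, key: str) -> bytes:
        return self._blobs[key]


def save_report(store: BlobStore, name: str, text: str) -> str:
    """Application code: it never learns which provider is behind `store`."""
    key = f"reports/{name}.txt"
    store.put(key, text.encode("utf-8"))
    return key
```

Switching providers then means writing one new `BlobStore` subclass, while `save_report` and everything like it stays untouched.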
4. Use a High-Level SIMD Abstraction
The service provider typically provides a database service, something that looks like a file system, and a way to start and stop server instances. When coding against these, express your API in a batch-friendly way. For example, don't provide an API to delete one file. Provide an API to delete any file matching a pattern. This enables your implementation to batch requests to the service, reducing the number of roundtrips and the time spent waiting.
Every provider will have a slightly different way of batching requests, different guidelines for concurrent requests, and different ways in which requests can be combined into batches. Keeping the operation parameters at a high level enables you to translate the request into the format best suited for the service.
...and you know what? Providers change their guidelines every now and then. Having a single place to update to take advantage of recent optimizations is good.
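The pattern-based delete described in this section could be sketched roughly as follows. Everything here is an assumption made for illustration: `BatchingFileStore`, the two injected callables, and the `max_batch` limit are hypothetical, and a real provider will have its own listing call and its own batch size limit.

```python
import fnmatch
from typing import Callable, Iterator, List


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


class BatchingFileStore:
    """High-level file API that translates one request into provider batches."""

    def __init__(
        self,
        list_keys: Callable[[], List[str]],          # hypothetical provider listing call
        delete_batch: Callable[[List[str]], None],   # hypothetical provider batch delete
        max_batch: int,                              # provider-specific limit (assumed)
    ) -> None:
        self._list_keys = list_keys
        self._delete_batch = delete_batch
        self._max_batch = max_batch

    def delete_matching(self, pattern: str) -> int:
        """Delete every key matching a shell-style pattern; return the count."""
        keys = [k for k in self._list_keys() if fnmatch.fnmatchcase(k, pattern)]
        # One roundtrip per batch instead of one per file.
        for batch in chunked(keys, self._max_batch):
            self._delete_batch(batch)
        return len(keys)
```

Callers only ever say `store.delete_matching("logs/*.tmp")`. If the provider changes its batch limit or adds a cheaper bulk operation, `delete_matching` is the single place that needs updating.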
5. Integrate Your Build System
Just because the provider runs a certain operating system on certain hardware doesn't mean that your code will run on that VM. You may have to compile your code on that VM. Having a build system that can start up an instance and compile things on it means that you don't have to rely on maintaining a local build environment that you can never be sure really matches what you'll run on.
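One way such a build step might be structured is sketched below. The `BuildHost` protocol and `build_on_instance` function are hypothetical; they stand in for whatever your own provider abstraction exposes for starting an instance, running a command on it, and shutting it down.

```python
from typing import List, Protocol


class BuildHost(Protocol):
    """Hypothetical wrapper around the provider's instance API."""

    def start(self, image: str) -> str:
        """Start an instance from `image` and return its identifier."""
        ...

    def run(self, instance_id: str, command: str) -> int:
        """Run `command` on the instance and return its exit code."""
        ...

    def stop(self, instance_id: str) -> None:
        """Shut the instance down."""
        ...


def build_on_instance(host: BuildHost, image: str, commands: List[str]) -> bool:
    """Run build commands on a fresh instance; return True if all succeed."""
    instance_id = host.start(image)
    try:
        for command in commands:
            if host.run(instance_id, command) != 0:
                return False  # stop at the first failing step
        return True
    finally:
        # Always release the rented instance, even if a step fails or raises.
        host.stop(instance_id)
```

The `finally` clause matters on rented hardware: a build that crashes halfway should not leave an instance running and billing.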