Google Compute Engine vs. Amazon EC2
 

Last week I spent some time porting our system from Amazon's Elastic Compute Cloud[a] to Google's Compute Engine[b]. It was 57 interesting hours from the start of my free trial to having a running system. This is a summary of my experiences.

1. Executive Summary

Both services are very similar on a conceptual level. Things have different names, and some things are done in slightly different ways, but by and large you have access to the same things. A lot of the code written for EC2 could either be used as-is or be ported with minimal changes. In no case did I need to re-architect anything. EC2 is the stable system that provides you with something as close to "old style" server deployments as possible; GCE is the bleeding-edge system that provides you with endless scalability, provided that your system architecture can scale out.

2. Rock vs. Meteors

In the executive summary, I describe EC2 as "stable" and GCE as "bleeding-edge", and this can be seen in every aspect of the systems. EC2 has much more complete documentation covering more use cases, and hides more of the challenges of distributed systems behind the API. GCE's documentation is just about enough to get you going - parameter values are left undefined, corner cases are left unexplained, and which parameters are optional and which are required is something you will figure out by trial and error.

But even if EC2 looks like a stable rock and GCE looks like fragments in free fall: that rock is going nowhere fast, and if you look closely, the fragments are definitely flying along a common trajectory.

The general vibe is that EC2 is about where it will remain. It's not going anywhere. GCE is in constant flux. It's definitely going somewhere.

It's been less than one hundred hours, but my feelings are these: For us, who are actively developing our own system, as opposed to primarily running an off-the-shelf software suite developed by others, Google Compute Engine is our natural home.

3. Google API Client

With the introduction of their JSON API, Google has definitely gone full tilt into Stuff-as-a-Service. The unified API client is a joy to work with, with predictable structure and predictable ways of getting the things you want. This is good, and is something that Google now shares with Amazon, which also has a consistent API.

4. API Documentation

However, the documentation leaves a lot to be desired. Some parameters are not defined well enough. For example, when you create a new instance in GCE via the API and want to attach a disk: Do you need to specify the disk type? If so, which values can the parameter take? Googling around for the answer, you find that "PERSISTENT" is a valid option[1]. You also find mentions of "SCRATCH" being a valid option[2], but there is no real description of the difference between the two, or of when one is valid but not the other. Then you find out, via a blog post, that there is really no difference[e]. An enum with the possible values, and a description of each, would have been welcome.

In general, I wish for more use of enums to define options in cases where the API offers multiple-choice parameters. It minimizes the risk of spelling errors, provides a suitable place for documentation to describe the options, and lets the user evaluate the choices.
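As a minimal sketch of what I mean, here is roughly what such an enum could look like for the disk type. The two constants are the values mentioned above; the DiskType type itself, apiValue() and fromApiValue() are hypothetical names of my own and are not part of the Google API client.

```java
/**
 * Hypothetical enum for the disk "type" parameter.
 * NOT part of the Google API client; only the two string values
 * come from the text above.
 */
enum DiskType {
    /** A place where the API owner could document this option. */
    PERSISTENT("PERSISTENT"),

    /** A place where the API owner could document this option. */
    SCRATCH("SCRATCH");

    private final String apiValue;

    DiskType(String apiValue) {
        this.apiValue = apiValue;
    }

    /** The string that would be sent in the JSON request. */
    String apiValue() {
        return apiValue;
    }

    /** Parses a wire value and rejects anything unknown or misspelled. */
    static DiskType fromApiValue(String value) {
        for (DiskType type : values()) {
            if (type.apiValue.equals(value)) {
                return type;
            }
        }
        throw new IllegalArgumentException("Unknown disk type: " + value);
    }
}
```

With something like this, a misspelling such as DiskType.PERSISTANT fails at compile time, and a misspelled string fails in fromApiValue() instead of somewhere inside a remote call.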

5. Corner Cases

The API documentation is poor at describing corner cases, especially with regard to the different failure modes that are common in distributed systems. For example, when you batch requests:

  • What if the connection is broken midway through parsing the responses? Are the remaining callbacks called with errors?

  • The class promises that the object is "cleared" after every use:

    Calling execute() executes and clears the queued requests. This means that the BatchRequest object can be reused to queue([...]) and execute() requests again.

    BatchRequest 1.18.0-rc JavaDoc[f]

    But as you can see from reading the code, any IO exception will cause the method to terminate abruptly and leave the object in an indeterminate state. The requestInfos are only cleared on line 278, and only if the method returns normally:

    ...
        requestInfos.clear();
      }

    BatchRequest.java:278[g]

    If, for example, the call to batchRequest.execute() on line 241 throws an IOException, then line 278 is never reached:

    HttpResponse response = batchRequest.execute();

    BatchRequest:241[h]

  • How are requests retried? The code looks like it will retry requests, but will it retry the whole batch if one fails? What if, as in the example above, all requests are sent, and some responses are received, but the connection breaks before we find out what happened to all of them?
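Until these questions are answered, one defensive pattern is to treat a batch object as single-use and to track for yourself which requests have received a response. The following is only a sketch under my own assumptions: Batch and SafeBatchRunner are hypothetical stand-ins, not the Google client's BatchRequest, and re-sending a request whose outcome is unknown is only safe if that request is idempotent.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Supplier;

/** Hypothetical stand-in for a batch object; not the Google client's BatchRequest. */
interface Batch<R> {
    /** Queues a request; onDone is invoked once a response for it has been parsed. */
    void queue(R request, Runnable onDone);

    /** Sends everything queued; may throw after only some callbacks have run. */
    void execute() throws IOException;
}

final class SafeBatchRunner {
    private SafeBatchRunner() {}

    /**
     * Runs all requests, building a fresh batch for every attempt so that a
     * half-executed batch object is never reused. Returns the requests that
     * still had no response after maxAttempts. Equal requests are treated
     * as one, because pending requests are kept in a set.
     */
    static <R> List<R> runAll(Supplier<Batch<R>> newBatch, List<R> requests, int maxAttempts) {
        Set<R> pending = new LinkedHashSet<>(requests);
        for (int attempt = 0; attempt < maxAttempts && !pending.isEmpty(); attempt++) {
            Batch<R> batch = newBatch.get();
            // Queue from a copy, since callbacks remove entries from pending.
            for (R request : new ArrayList<>(pending)) {
                batch.queue(request, () -> pending.remove(request));
            }
            try {
                batch.execute();
            } catch (IOException e) {
                // The outcome of the remaining requests is unknown; they stay
                // pending and are re-queued on a brand new batch object.
            }
        }
        return new ArrayList<>(pending);
    }
}
```

The point of the design is that nothing here depends on whether the batch object clears itself after a failure: the object that failed is simply thrown away.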

Given that Google Compute Engine trades platform abstraction for performance - not hiding the underlying challenges of running distributed systems so you can code for it and scale out - and because of the huge performance gains you realize by batching, these cases must be well-defined. Even Google admits that TCP connections (and thus the HTTP that all their APIs rely on) aren't reliable in their clusters:

// Turn off stale checking. Our connections break all the time anyway,
// and it's not worth it to pay the penalty of checking every time.

ApacheHttpTransport:82[i]

It's perfectly OK for Google to not handle these errors and pass them up the stack to be handled by the caller. Robust distributed systems rely a lot on client failover (this being much more robust than server failover), and I don't expect anything else. What is not OK is to be ambiguous about what you handle and what you don't handle - and if you handle it, how it is handled.

I spent a lot of time reading the source code of the API client, and I think I understood most of it. But reading the source code provides no guarantees about which behavior is part of the API and can be relied upon, and which is purely coincidental and could change at any time.

6. Leaky JSON API Abstraction

The API is generally RESTful, meaning that it relies a lot on the underlying HTTP protocol to express the intent of a call and to describe the response. For example, if you try to GET information about a GCE instance that has terminated, you get a 404 (not found) error back. This appears in your code as a GoogleJsonResponseException[j], with a GoogleJsonError[k] details object, with the code 404:

// Find out if the instance exists
try {
    // Try to Get the instance
    compute
        .instances ()
        .get (project, zone, instanceName)
        .execute ();
        
    // If we got here, it exists
    // Do something with that fact
} catch (GoogleJsonResponseException ex) {
    if (ex.getDetails () != null && 
        ex.getDetails ().getCode () == 404) {
        // If we got here, it doesn't exist
        // Do something with that fact
    } else {
        // Some other error, rethrow
        throw ex;
    }
}

This is workable, but I would have preferred an InstanceNotFoundException, or even a general NotFoundException. After all, I'm coding against a Java API, not an HTTP API. I would like to have something like this, an exception hierarchy that translates the generic GoogleJsonResponseException into something that can be reliably detected and understood in Java-land:

/**
 * For all HTTP errors.
 */
class GoogleJsonResponseException 
    extends ... { ... }

/**
 * For HTTP 4xx errors (client's fault)
 */
class ClientException 
    extends GoogleJsonResponseException { ... }

/**
 * For HTTP 404 Not Found errors 
 */
class NotFoundException 
    extends ClientException { ... }

/**
 * For the Compute.Instances collection.
 */
class InstanceNotFoundException 
    extends NotFoundException { ... }

Some would argue that exceptions are exceptional, and that there's really nothing you can do except log them and retry; if your code uses them in the normal control flow you're doing it wrong. But there are two cases where exceptions really matter in this API:

  1. When differentiating between client errors and the rest: Client errors (4xx) mean that you should give up. The problem is with your request, not with the remote server or anything else, so you should immediately stop retrying the request.

  2. When finding out if an object exists: There are no methods to test for the existence of an object besides issuing a GET for that object and testing for the 404, delivered by exception, in case it doesn't. While you can issue a List request and look through the list, this is only really applicable when you are testing for the existence of multiple objects.
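The two cases above boil down to a small decision on the HTTP status code. Here is a rough sketch of that decision, assuming you have already pulled the code out of the exception's details object; HttpOutcome and its Action enum are hypothetical names of mine, and treating everything that is neither 2xx nor 4xx as retryable is a simplification.

```java
/** Hypothetical helper; works on plain HTTP status codes, nothing here is Google API. */
final class HttpOutcome {
    private HttpOutcome() {}

    enum Action {
        /** 2xx: the call worked. */
        SUCCESS,
        /** 404: the object does not exist, which may be a perfectly normal answer. */
        NOT_FOUND,
        /** Any other 4xx: the request itself is wrong, so stop retrying. */
        GIVE_UP,
        /** Everything else: not the client's fault, so a retry may help. */
        RETRY
    }

    static Action classify(int status) {
        if (status >= 200 && status < 300) {
            return Action.SUCCESS;
        }
        if (status == 404) {
            return Action.NOT_FOUND;
        }
        if (status >= 400 && status < 500) {
            return Action.GIVE_UP;
        }
        return Action.RETRY;
    }
}
```

In the catch block shown earlier, the value of ex.getDetails().getCode() would be what gets passed to classify(), and a switch on the result replaces the hand-written comparison against 404.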

Footnotes

2014-11-09, updated 2014-11-10