Friday, November 6, 2009

Google's new Closure JavaScript optimizer

I'm pretty excited to see today's release of Google's Closure JavaScript compiler.

Closure goes way beyond a simple JavaScript minifier. It can do things like unwind function calls, replacing each call with the body of the function (inlining). It also renames local variables to single characters.

You can either download the compiler to run locally, or use their web service (through the UI or via a REST API). Here's a sample of how aggressively Closure can reduce your code size:

function Foo(string)
{
 alert(string);
}

Foo("hello");

In Simple optimization mode this yields:

function Foo(a) {
 alert(a)
}
Foo("hello");

In Advanced mode this compresses to:

alert("hello");
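
The same outputs can be produced through the compiler's REST web service. Here is a minimal sketch of building such a request; the endpoint and parameter names are my recollection of the service's documented interface, so treat them as assumptions:

```python
from urllib.parse import urlencode

# Endpoint and parameter names assumed from the Closure Compiler
# service documentation -- verify against the current docs.
CLOSURE_URL = 'https://closure-compiler.appspot.com/compile'

def closure_request_body(source, level='SIMPLE_OPTIMIZATIONS'):
    """Build the form-encoded POST body for one compile request."""
    return urlencode({
        'js_code': source,            # the JavaScript to compile
        'compilation_level': level,   # or 'ADVANCED_OPTIMIZATIONS'
        'output_format': 'text',
        'output_info': 'compiled_code',
    })

# To actually compile (requires network access):
# from urllib.request import urlopen
# compiled = urlopen(CLOSURE_URL, closure_request_body(src).encode()).read()
```

This makes it easy to script compilation as part of a build step without installing the compiler jar locally.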

I'm still learning how to use Closure optimally for some of my code. For example, in Advanced mode, my JavaScript Namespace code is pretty severely compressed. First, Simple optimization yields:

Advanced optimization saves a few hundred more bytes, but mangles some variable names that should be left alone as external method names:

There is a tutorial on how to annotate your code so that Advanced optimization does not break it by renaming variables too aggressively.

To fix my Namespace code I add these lines:

// Export names
var p = Namespace.prototype;
p['Extend'] = p.Extend;
p['Define'] = p.Define;
p['Import'] = p.Import;
p['SGlobalName'] = p.SGlobalName;

which then adds the following lines to my function in optimized form, restoring the "exports" from my class library:

d = f.prototype;
d.Extend = d.d;
d.Define = d.c;
d.Import = d.f;
d.SGlobalName = d.g 

With all these fixes, I'm able to get a clean compile of the Namespace library that compresses down to:

Tuesday, August 4, 2009

How Does Google Count Absolute Unique Visitors?

As a test of the answer service Mahalo, I posed the following question:


How does Google Analytics calculate Absolute Unique Visitors?

I know that Google claims that they can report on the number of Absolute Unique Visitors over any time period. What I can't figure out is how they can be calculating this without doing very expensive database queries. I feel they must be making an approximation of some sort.

Otherwise, they would have to query the unique set of users who visited the site across a large time span, and remove duplicates in real time. They could not afford to do this for a site with millions of unique users.

I will award the tip to the person who best answers this question by providing a feasible solution to the technical problem, or by explaining how the reported value is approximated. Even better if it is backed by an authoritative explanation from Google developers.

Note that the crux of the problem is to avoid double-counting Returning Visitors across the time span of a report.


Unfortunately, even the best answerer did not understand the question. Perhaps the site did not have enough users, or users with the needed expertise, to figure out what I was asking.

I rescinded my $5 "tip", and actually got some Mahalo users mad at me for doing so. After giving the problem some more thought, this is what I came up with:


Here's how I would calculate Absolute Unique Visitors:

Data Collection

On a user's first visit of each day, record how many days have passed since their last visit (distinguishing the "returning" visitors from the "new" visitors).

Data Aggregation

When Analytics processes the raw data, it can maintain buckets of counters for the total number of visitors who:

  • Are new (never visited before)
  • Last visited 1 or more days ago (i.e., all "returning" visitors)
  • Last visited 2 or more days ago
  • Last visited 3 or more days ago
  • etc. (the number of buckets may be cut off at some reasonable maximum, which would cap the date ranges that can be reported accurately)

Note that these are cumulative numbers - each bucket counts no more users than the previous one.

Reporting

When the site owner asks for the Absolute Unique Users across a date range, the reporting engine can scan all the dates in the period and accumulate a sum as follows (pseudo-code):

Assume Data[DAY] contains the values:
  NEW - the number of new users who arrived that day
  RETURNING[N] - the number of users who arrived that day with a hiatus of N days or more

UNIQUE = 0
for DAY from 1 to N:
  UNIQUE += Data[DAY].NEW
  UNIQUE += Data[DAY].RETURNING[DAY]

UNIQUE is thus the sum of all NEW users reported on each day (who are always unique), plus only those returning users who were not counted on a prior day (since their last visit preceded the beginning of the reporting period).
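
The pseudo-code above can be turned into a small runnable sketch. The record layout (a new-visitor count plus the cumulative returning buckets) is my assumption about how such per-day data might be stored:

```python
def absolute_unique_visitors(days):
    """days: per-day records for the report period, oldest first.
    Each record holds 'new' (count of first-time visitors) and
    'returning', a cumulative list where returning[n-1] counts
    visitors whose last visit was n or more days earlier."""
    unique = 0
    for day_number, day in enumerate(days, start=1):
        unique += day['new']
        buckets = day['returning']
        # Only count returning visitors whose gap is at least day_number
        # days: their previous visit predates the report window, so they
        # were not already counted on an earlier day of this report.
        if day_number <= len(buckets):
            unique += buckets[day_number - 1]
    return unique
```

Note the report runs in time linear in the number of days, using only the pre-aggregated counters - no per-user query over the whole span is needed.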

Friday, July 10, 2009

JSComposer - A JavaScript Composition Utility for Google AppEngine (Python)

After developing a JavaScript namespace facility, I needed a simple way of merging and/or minifying the JavaScript source files for my AppEngine (Django) application. So I developed a simple Python module that can be used to:

  1. Merge multiple JavaScript files into one (for faster download).
  2. Minify your JavaScript on the server (storing the result in memcache for fast retrieval).
  3. Include JavaScript files individually on your test server for easy debugging.
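
The core merge-and-cache idea can be sketched as follows. The function and parameter names here are hypothetical (jscomposer's real API may differ), and a plain dict stands in for App Engine's memcache:

```python
_cache = {}  # stand-in for App Engine's memcache in this sketch

def composed_script(paths, read_file, minify=lambda src: src):
    """Merge the named script files into one string, minify the
    result, and cache it so repeated page loads are cheap."""
    key = '|'.join(paths)
    if key not in _cache:
        merged = '\n'.join(read_file(p) for p in paths)
        _cache[key] = minify(merged)
    return _cache[key]
```

On a debug server you would skip this path and emit one script tag per file instead, so browser error messages point at the original sources.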

Both the namespace library and jscomposer have been placed in the public domain, so feel free to use them as you see fit:

namespace.js
jscomposer.py

I would love to hear from you if you use either of these libraries.

-- Mike

Friday, June 19, 2009

JavaScript Namespaces

One thing JavaScript programmers have to deal with is corruption of the global namespace. Every time you define a function or other variable at the top level of a web page, the names you've chosen can come into conflict with names used by other developers or by libraries you are using. In the browser, all global variables become properties of the window object.

I've been dealing with this in an ad-hoc manner until recently. I would create a single global variable for all my code, and then define all my functions and variables within it, like this:

var PF = {
    global: value,
    ...,

    MyFunc: function(args)
    {
        ...
    },

    ...
};

I migrate code from one project to another quite frequently, so putting all the code in one namespace was becoming tedious: I had to edit the code to move it into different namespaces for different projects. Inspired by Python, I've developed a more general method of defining and importing namespaces across different modules of JavaScript code.

Here is a typical way to define a new namespace and import another namespace into it, so you can reference code from other libraries succinctly.

global_namespace.Define('startpad.base', function(ns) {
    var Other = ns.Import('startpad.other');

    ns.Extend(ns, {
        var1: value1,
        var2: value2,
        MyFunc: function(args)
        {
            ...Other.AFunction(args)...
        }
    });

    ns.ClassName = function(args)
    {
    };

    ns.ClassName.prototype = {
        constructor: ns.ClassName,
        var1: value1,

        Method1: function(args)
        {
        }
    };
});

The benefits of this approach are:

  • Isolation of code without polluting the global (window) namespace with multiple names. A single global name ('global_namespace') is added to the window object.
  • Easy importing of code from another namespace, assigning it a short local name (e.g., 'Other', above).
  • JavaScript code can be loaded into the browser without regard for execution order. Forward references (to a namespace that hasn't been loaded yet) work fine, as the Import function pre-creates a namespace object when it is referenced, and fills it in when the namespace is defined.
  • Unique long names can be assigned using a hierarchy similar to DNS names. E.g., since I own startpad.org, I claim the "startpad" name as a top-level global namespace, and can use names like "startpad.base" or "startpad.timer" for libraries that I am building.
  • Namespaces can be versioned simply by naming convention. For example, I could load namespaces for "startpad.timer.v1" and "startpad.timer.v2" in the same browser.

There still remains the problem of JavaScript composition. I don't like to include lots of different script files in the same web page, so you still have to combine the source code from multiple independent script files into one file. This can be done as part of a build process (along with JavaScript minification), or through a composition service running on your web server (I hope to write one of these in Python for my AppEngine projects).

I am placing namespace.js into the public domain. Let me know if you end up using it, or have suggestions for improvements.

Wednesday, October 15, 2008

Beware Mutable default value initializers in Python

As a new Python programmer, I was surprised by the behavior of default parameter expressions. I knew they were evaluated only once, when the function is first defined, but I hadn't realized the ramifications of using a mutable dictionary object as a default expression. This little gotcha hit me yesterday, and it took quite a while to figure out what was happening. Here's some sample code:

def Bad(dict={}):
    print "Bad: %s" % repr(dict)
    dict['p'] = 1

Bad()
Bad()

Which results in:

Bad: {}
Bad: {'p': 1}

Since the value of dict is mutable, it can be changed as a side effect of the function. So all subsequent calls will use a default value that has been modified by previous calls to the function! This can be (and was) a nightmare to track down in a large program.

I fixed this by changing to the following:

def Good(dict=None):
    if dict is None:
        dict = {}
    print "Good: %s" % repr(dict)
    dict['p'] = 1

Good()
Good()

Which results in:

Good: {}
Good: {}

The fix de-couples the effects of one call on subsequent calls (as was the intention of the original code).
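
The shared default is easy to observe directly: the function object stores a single dict, and every call without an argument mutates that same object. A small demonstration (in modern Python 3 syntax, unlike the Python 2 code above):

```python
def bad(d={}):  # the {} is evaluated once, when the def executes
    d['p'] = 1
    return d

first = bad()
second = bad()

assert first is second                   # both calls got the very same dict
assert bad.__defaults__ == ({'p': 1},)   # the stored default was mutated
```

Inspecting `__defaults__` like this is a quick way to confirm whether a function is carrying mutated default state.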

Wednesday, September 3, 2008

Securing REST APIs against "Drive-By" Requests

REST protocols are increasingly popular for sending API calls to web services. These requests can be generated from one server to another, or they can be embedded in client-side web pages. But client-side-only requests are inherently unsafe; any web page a user navigates to can make REST calls against a 3rd-party service without the user's knowledge.

I couldn't find a brief summary of common practices for securing REST APIs, so I did some investigation. I reviewed the del.icio.us API to see how they deal with these problems. They have a REST API that, among other things, can add and remove bookmarks for a given user. So there needs to be some security so that a 3rd party cannot execute this API on behalf of a given user without their permission.

For example, executing this "API" call:

https://api.del.icio.us/v1/posts/add?url=http://mckoss.com&description=security+test

will cause a bookmark to be created in your del.icio.us account. This could even be hidden in a web page as an embedded <script> tag, so you would not even be aware that it is happening. Note that POSTs are as easy to create in the client as GETs, so using POST over GET does not really add any security.

In order to prevent this, del.icio.us implements two countermeasures:

  1. All calls to their API that modify data require Basic Authentication.
  2. If there is a Referer value in the request header, they return a 403 Forbidden error.

Note that, even if I am logged in to del.icio.us, I will have an authentication cookie on my machine. But this is not sufficient to authenticate to the API, since the API uses a different form of authentication. It is possible, however, that a user has previously logged in to the del.icio.us API and cached their credentials in the browser. So this protection is a speed bump, but not fail-safe.

The second requirement protects users from "drive-by" API calls - the fact that I am visiting a web page should not allow that page to issue requests on my behalf. All mainstream browsers will send a Referer header whenever they make a request from another web page.
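
Together, the two countermeasures amount to a simple gate on each modifying request. As a sketch (the function name and return codes are my illustration, not del.icio.us's actual code):

```python
def api_gate(headers):
    """Decide whether a modifying API request may proceed, given the
    request headers as a dict."""
    # Countermeasure 2: any Referer means the request came from a web
    # page, which could be a drive-by -- refuse it outright.
    if 'Referer' in headers:
        return 403  # Forbidden
    # Countermeasure 1: modifying calls require Basic Authentication.
    if not headers.get('Authorization', '').startswith('Basic '):
        return 401  # Unauthorized
    return 200
```

The order matters: the Referer check runs first, so even an authenticated browser session cannot be driven by a hostile page.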

One work-around is to download an HTML page to your local machine and execute it from your file system. In this case there is no Referer, and the API request is executed. So a malicious site would have to trick a user into downloading a page and opening the local copy, which is much harder to do.

There is a more robust alternative to these measures, which more modern REST servers implement: require that some user-specific secret be transmitted along with every request. That way, a random malicious web page has no way of generating an API call for an unknown potential victim.

The FriendFeed API accomplishes this simply by generating a random RemoteKey for each user. Any service that wishes to make API calls on behalf of the user must include the current value of that user's RemoteKey. In case of a security breach, the user can reset the key (and give the new key to all the services he wishes to continue to use).

It's also not uncommon for web application frameworks to generate a hidden key within each form, unique to the session. That eliminates the ability of malicious web sites to cross-post forms.
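
A minimal sketch of such a per-session hidden token, using an HMAC of the session id under a server-side secret (this illustrates the scheme, not any particular framework's implementation):

```python
import hashlib
import hmac
import secrets

SERVER_SECRET = secrets.token_bytes(32)  # kept on the server only

def make_form_token(session_id):
    """Token embedded as a hidden field in each rendered form."""
    return hmac.new(SERVER_SECRET, session_id.encode(),
                    hashlib.sha256).hexdigest()

def check_form_token(session_id, token):
    """Verify the token on form submission. A malicious site that
    cannot read the victim's page cannot produce a valid token."""
    return hmac.compare_digest(make_form_token(session_id), token)
```

Deriving the token from the session id keeps the server stateless: nothing extra needs to be stored per form, and a token stolen from one session is useless in another.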

Monday, July 7, 2008

Back from PageForest Hiatus

That was a long time between posts. But StartPad.org is now up and running, and in the capable hands of Zach who's dealing with the day-to-day organization, so I'm getting to spend more time on PageForest.

I've also switched my application platform to Google AppEngine - I'm having a grand time learning Python and doing a bunch more JavaScript development.

I've spent quite a bit of time building on my JavaScript unit testing framework. I'm open-sourcing that code, and I've made some significant improvements during PageForest development, so I'll update the core project soon (hosted on Google Code).

I've been working so much in Firefox that I hadn't noticed my IE testing was getting stale. In general, the different browsers are very compatible when it comes to JavaScript - most of the platform differences are in HTML, CSS, and event models. The core language is very standardized. But I have hit one major gotcha in IE JavaScript (aka JScript): handling of empty elements at the end of an array definition.

As a (now) Python programmer, I find it very common to define an (open-ended) array as:

a = [1,2,3, ];

Note the dangling comma with no trailing element. In Firefox (and Python), this creates a 3-element array - the final comma is ignored. But in IE, this creates a 4-element array with undefined as the 4th element.

I had used this paradigm extensively, without really realizing it. So when I switched to testing on IE, it took me quite a while to track down all the subtle bugs caused by extra undefined elements added to my arrays.

JavaScript SHOULD be well enough defined that one of Firefox or IE is INCORRECT in their implementation of arrays (and I hope it's IE - since it's nice to be able to leave placeholders for future array elements in array definitions). Others have experienced this problem too; the Mozilla JavaScript documentation explicitly allows for it:

If you include a trailing comma at the end of the list of elements, the comma is ignored.

According to my reading of the ECMA-262 spec, IE is non-conforming. Section 11.1.4 states:

Whenever a comma in the element list is not preceded by an AssignmentExpression (i.e., a comma at the beginning or after another comma), the missing array element contributes to the length of the Array and increases the index of subsequent elements. Elided array elements are not defined.

So, this definition:

a = [1,,2,];

should be equivalent to:

a = [1,undefined,2]; // Firefox - correct

but NOT:

a = [1,undefined,2,undefined]; // IE - incorrect

Even worse, a trailing comma at the end of an Object initializer is fatal in IE. This statement:

o = {a:1, b:2,};

produces a Runtime Error ("Error: Expected identifier, string or number") and your code will not execute! Here, it may be that IE is actually behaving correctly (according to spec), but it's still rather annoying.

You can see how your browser performs by visiting dangle.htm.