Leveraging Fingerprinting with WebCenter Sites

Introduction

A common question from clients is “How do I create segments for anonymous visitors?” While tracking anonymous visitors and calculating appropriate segments is a deep subject worthy of its own blog post (check back later!), we must first answer “How do we uniquely identify a returning visitor who wants to remain anonymous?” before we can calculate segments for that individual. This question becomes even more complex when we acknowledge that visitors may turn off or delete cookies, that some mobile devices do not allow cookies, and that local or regional laws (e.g. in the European Union) may restrict cookie usage in the first place.

Main Article

To address the above, “fingerprinting” technology has recently come to the forefront. Basic web browser configuration information has long been used by various web analytics tools to help measure a visitor’s web traffic/behavior and to prevent various forms of click fraud. With client-side JavaScript, even more esoteric device characteristics can be collected. Hashing such information into a single value makes for a useful device fingerprint.

The Panopticlick project demonstrates that browsers can be fingerprinted with a high degree of precision by using Flash or Java plugins. Others have demonstrated that plugins are not even necessary for general tracking, as each browser’s font list can be detected directly, and the list is browser-independent for both Windows and Mac OS. To quote Panopticlick’s findings (see PDF link below):

“In this sample of privacy-conscious users, 83.6% of the browsers seen had an instantaneously unique fingerprint, and a further 5.3% had an anonymity set of size 2. Among visiting browsers that had either Adobe Flash or a Java Virtual Machine enabled, 94.2% exhibited instantaneously unique fingerprints and a further 4.8% had fingerprints that were seen exactly twice. Only 1.0% of browsers with Flash or Java had anonymity sets larger than two. Overall, we were able to place a lower bound on the fingerprint distribution entropy of 18.1 bits, meaning that if we pick a browser at random, at best only one in 286,777 other browsers will share its fingerprint. Our results are presented in further detail in Section 4.”

As you can see, such easily obtained “uniqueness” can be leveraged to provide a richer visitor experience without authentication, even in situations where cookies are not allowed or not available.
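To relate the entropy figure in the quote to its “one in N” reading: a fingerprint that carries b bits of identifying information is expected to be shared by roughly one in 2^b browsers. The following is only a small sketch of that arithmetic; the function names are made up for illustration and do not come from the Panopticlick paper.

```javascript
// Convert between "bits of identifying information" and "one in N browsers".
// Function names are illustrative assumptions, not part of any library.
function bitsToOneInN(bits) {
  return Math.pow(2, bits);
}

function oneInNToBits(n) {
  return Math.log2(n);
}

// The quoted "one in 286,777" corresponds to a little over 18.1 bits.
console.log(oneInNToBits(286777).toFixed(2));

// Each additional bit halves the share of browsers with the same fingerprint.
console.log(bitsToOneInN(10)); // 1024
```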

While fingerprinting technology is fairly nascent, there are already some sites using the technology to identify anonymous visitors. The basic idea is straightforward: there are various characteristics of each computer that taken as a whole tend to make it unique. And these device characteristics are generally available via any modern browser.

Examples of such characteristics include (but are not limited to):

  • System fonts: Mac, Linux, Windows, and mobile devices each ship with their own set of fonts. When you add your own fonts, you make your device “stand out from the rest” even further.
  • Operating system: Mac, Linux, Windows, Android, iOS, etc.
  • Cookies enabled?: Inferred in HTTP, logged by server.
  • Graphics capability: Different hardware supports different graphics standards.

IP details:

  • IP Address: Address mapped to location.
  • City Name: Geographic name of the city.
  • Country Name: Geographic name of the country.
  • Connection Speed: Internet connection speeds or bandwidths (high, medium, low).
  • Connection Type: Describes the data connection between the device or LAN and the internet. See the Connection Type mapping.
  • IP Routing Type: Tells how the user is routed to the internet.
  • Carrier Name: The name of the entity that manages the ASN entry.
  • ASN: Globally unique number assigned to a network or group of networks that is managed by a single entity.
  • Top-level Domain: The top-level domain of the URL. For example, .com in www.oracle.com. This is mapped through the Quova reference file.
  • Second-level Domain: The second-level domain of the URL.
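As a rough illustration of how a few of the browser-side characteristics above might be gathered, here is a minimal sketch. The function name and the shape of the returned object are assumptions made for this example; the navigator and screen properties it reads are standard browser properties. The IP-derived details are not covered here, since they would come from the server side.

```javascript
// Gather a few device characteristics from navigator-like and screen-like
// objects. Passing them in as parameters (rather than reading globals
// directly) keeps the function testable outside a browser.
// collectCharacteristics is a hypothetical helper, not a WCS API.
function collectCharacteristics(nav, scr, now) {
  nav = nav || {};
  scr = scr || {};
  now = now || new Date();
  return {
    userAgent: nav.userAgent || "",
    platform: nav.platform || "",
    language: nav.language || "",
    cookiesEnabled: nav.cookieEnabled === true,
    screen: [scr.width, scr.height, scr.colorDepth].join("x"),
    timezoneOffsetMinutes: now.getTimezoneOffset()
  };
}

// In a browser, the real globals would be passed in.
if (typeof window !== "undefined" && typeof screen !== "undefined") {
  console.log(collectCharacteristics(window.navigator, window.screen));
}
```

Font lists and plugin details need more involved detection code and are left out of this sketch.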

If either Java or Flash is installed, an even higher probability of uniqueness can be obtained. If we then generate a hash string/vector per visitor based on these characteristics, we have in essence identified the visitor (or, at a minimum, we have identified with a high degree of probability a unique computing device, which may not be the same as an individual, since devices are sometimes shared among several users). Once we have identified a unique, returning visitor/device, we are free to do all sorts of other things, like record current history and store/retrieve past behaviors that can drive which segments we calculate for this visitor.
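One way to turn a set of characteristics into a hash string is sketched below. It assumes the characteristics arrive as a plain object, serializes them in a stable key order, and hashes the result with 32-bit FNV-1a. The hash choice and all function names are assumptions for illustration only; a 32-bit hash is far too short for a large visitor population, and a longer hash would likely be the better choice in practice.

```javascript
// FNV-1a, 32-bit, computed over the UTF-16 code units of the input string.
// For plain ASCII input this matches the standard byte-wise FNV-1a.
function fnv1a32(text) {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // multiply by the FNV prime
  }
  return hash.toString(16).padStart(8, "0");
}

// Serialize characteristics in a stable key order so that the same device
// always produces the same string regardless of property order.
function canonicalize(characteristics) {
  return Object.keys(characteristics)
    .sort()
    .map((key) => key + "=" + String(characteristics[key]))
    .join("|");
}

// Hypothetical helper: one fingerprint string per set of characteristics.
function fingerprint(characteristics) {
  return fnv1a32(canonicalize(characteristics));
}

console.log(fingerprint({ platform: "MacIntel", language: "en-US" }));
```

Note that an exact hash like this changes completely when any single characteristic changes, so it only recognizes devices whose characteristics are identical between visits.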

A quote from the Forbes article linked below should convince you of the inevitable widespread adoption of this technology: “The head of online advertising for a major company said the decay of cookies over time, the growth of mobile phones and different kinds of portable devices, and Apple’s default settings all make fingerprinting the key for future online advertising.”

The typical fingerprinting solution for WCS (or any application, for that matter) would implement a few lines of static HTML in the wrapper code to include a Flash shared object and/or image tags that collect additional device characteristics. The Flash code then makes an internal call to the application server, thereby uploading the device characteristics. The database schema that stores such visitor/device data should allow for binding a visitor to multiple devices (and conversely, allow for multiple known “visitors” per device) whenever they explicitly authenticate, for example when they log in. Additionally, if cookies are allowed, we can combine device characteristics with cookies to stitch sessions together, even across devices.

However, even very simple client-side code can reveal all sorts of things about your device. Adding the following calls to ordinary JavaScript libraries in your template might be enough to capture the data needed to calculate a fairly high degree of uniqueness. (Note the use of an Oracle-supplied JavaScript library, “deployJava.js”.)

<script src="resources/jquery-1.3.2.min.js" type="text/javascript"></script>
<script src="resources/plugin-detect-0.6.3.js" type="text/javascript"></script>
<script src="resources/deployJava.js" type="text/javascript"></script>
<script src="resources/jquery.flash.js" type="text/javascript"></script>

How do you determine which characteristics are viable and appropriate for the visitors to your website/web application? That part is fairly easy: add some simple JavaScript collection code to every page, recording the device characteristics for later analysis (the recording to a db table can be done in batches executed every five minutes if performance is a concern). Once you have collected enough data, you should be able to identify which device characteristics provide enough uniqueness. Once this key part is done, you can implement the tracking portion using session vars to keep tabs on the current visitor without having to recalculate on each page. Once the tracking portion is implemented, adding segment calculation on top of it is relatively straightforward and is basically identical to whatever you would do for any authenticated visitor.
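One possible way to judge which collected characteristics “provide enough uniqueness” is to compute the Shannon entropy of each characteristic across the recorded samples: the more bits, the better that characteristic separates devices. The sketch below assumes the samples have already been loaded into an array of plain objects; the function names and sample data are hypothetical.

```javascript
// Shannon entropy (in bits) of the values observed for one characteristic.
function entropyBits(values) {
  const counts = new Map();
  for (const v of values) {
    counts.set(v, (counts.get(v) || 0) + 1);
  }
  let bits = 0;
  for (const count of counts.values()) {
    const p = count / values.length;
    bits -= p * Math.log2(p);
  }
  return bits;
}

// Rank characteristics from most to least distinguishing.
// `samples` is an array of objects that all share the same keys.
function rankCharacteristics(samples) {
  const keys = Object.keys(samples[0] || {});
  return keys
    .map((key) => ({
      key: key,
      bits: entropyBits(samples.map((s) => String(s[key])))
    }))
    .sort((a, b) => b.bits - a.bits);
}

// Made-up samples: every device reports the same OS but different fonts,
// so "fonts" ranks above "os".
const exampleSamples = [
  { os: "Mac", fonts: "Arial,Helvetica" },
  { os: "Mac", fonts: "Arial,Futura" }
];
console.log(rankCharacteristics(exampleSamples));
```

Characteristics near the bottom of the ranking are candidates to drop, which also helps keep the collection code light.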

The flow might look like the following:

  • A new user arrives with a device that doesn’t match any other in our db, so we add a record to the device tracking table, perhaps using a hash/vector as a key, where each device characteristic is assigned a value contributing to the n-dimensional vector. In this way, if a device changes just one characteristic (for example, by adding a new font), the new vector will be “close” to the old vector (testable via a similarity measure such as a normalized dot product), and we have a chance of binding the two records together later in a data-mining/data-cleaning batch process. However, such precision may not be necessary for a given audience!
  • We observe this device’s various site behaviors (e.g. this device appears to like women’s clothes) and calculate a segment for this device, storing it in the device tracking table (and in a cookie if allowed).
  • If a visitor explicitly authenticates (assuming our site has that ability), we can loosely bind the device fingerprint id to the visitor id (recognizing that a known visitor might use multiple devices to access our site).
  • Over time, our data should show how many visitors use each device and, conversely, how many devices are used by each visitor. As such, we will likely want at least two tables that can be joined together for various kinds of reports and analysis.
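The flow above can be sketched roughly as follows. This is only an illustration under several assumptions: the two tables are replaced by in-memory maps, each device is already encoded as a numeric vector of equal length, “closeness” is measured with cosine similarity, and near matches are merged immediately rather than in a later batch process. All names and the threshold value are hypothetical.

```javascript
// In-memory stand-ins for the two tables; a real deployment would use the
// database. All names here are hypothetical.
const devices = new Map();        // deviceId -> { vector, segment }
const visitorDevices = new Map(); // visitorId -> Set of deviceIds

// Cosine similarity: 1 means same direction, 0 means nothing in common.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the id of the closest known device if it is similar enough,
// otherwise register the vector as a new device.
function findOrAddDevice(vector, threshold) {
  let bestId = null, bestScore = -Infinity;
  for (const [id, device] of devices) {
    const score = cosineSimilarity(vector, device.vector);
    if (score > bestScore) { bestId = id; bestScore = score; }
  }
  if (bestId !== null && bestScore >= threshold) return bestId;
  const newId = "device-" + (devices.size + 1);
  devices.set(newId, { vector: vector.slice(), segment: null });
  return newId;
}

// Store the segment calculated from the device's observed behavior.
function recordSegment(deviceId, segment) {
  devices.get(deviceId).segment = segment;
}

// Loosely bind a device to a visitor once the visitor authenticates.
function bindVisitor(visitorId, deviceId) {
  if (!visitorDevices.has(visitorId)) visitorDevices.set(visitorId, new Set());
  visitorDevices.get(visitorId).add(deviceId);
}
```

A usage pass might call findOrAddDevice on each new session, recordSegment after observing behavior, and bindVisitor at login. The right similarity threshold depends entirely on how the vector is encoded and would have to be tuned against real data.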

A session variable that identifies the visitor for the duration of the session can then be used by subsequent page visits, so that the fingerprinting code only needs to be executed once per session. Note that there is a practical limit to evaluating potentially endless device characteristics: we don’t want to create significant latency while such evaluations/calculations are being made. As such, the goal is always to implement the lightest-weight code that will get the job done, which is not necessarily an easy task.
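The once-per-session idea can be sketched as below. The session is modeled here as a plain object and the fingerprinting step as a function passed in by the caller; both are assumptions for illustration, and the actual session mechanism would depend on the application server.

```javascript
// `session` is any per-visitor key/value store (a plain object here);
// `computeFingerprint` is whatever function performs the costly evaluation.
// getVisitorId is a hypothetical helper, not a WCS API.
function getVisitorId(session, computeFingerprint) {
  if (!session.visitorId) {
    session.visitorId = computeFingerprint();
  }
  return session.visitorId;
}

// Example: the expensive function runs only on the first page view.
let evaluations = 0;
const demoSession = {};
const slowFingerprint = function () {
  evaluations += 1;
  return "fp-demo";
};
getVisitorId(demoSession, slowFingerprint);
getVisitorId(demoSession, slowFingerprint);
console.log(evaluations); // 1
```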

Notwithstanding, I believe that every demo of any Oracle product that purports to be about Experience Management should ship with, at the very minimum, a lightweight demo of this fingerprinting technique that developers could then extend to match their clients’ specific requirements.

I encourage you to explore the links at the end of this blog post. Some of these links demonstrate just how unique your device is (somewhat scary when you think about it)!

Related links:
