Missed by filter list: tracking detection

In our study, we propose a tracking detection method based on analyzing the behavior of invisible pixels. By crawling 84,658 webpages from 8,744 domains, we find that third-party invisible pixels are widely deployed: they are present on more than 94.51% of domains and constitute 35.66% of all third-party images. From this analysis of invisible pixels, we derive a fine-grained behavioral classification of tracking. We use this classification to detect new categories of tracking and uncover new collaborations between domains on the full dataset of 4,216,454 third-party requests.

We demonstrate that two popular methods to detect tracking, based on the EasyList & EasyPrivacy lists and on the Disconnect list, miss 25.22% and 30.34% of the trackers that we detect, respectively. Moreover, we find that even if we combine all three lists, 379,245 requests originating from 8,744 domains still track users on 68.70% of websites.

Read the paper »

Authors: Imane Fouad, Nataliia Bielova, Arnaud Legout (Inria, Université Côte d'Azur, France) and Natasa Sarafijanovic-Djukic (RIS Technology Group, Barcelona, Spain)

Classification of tracking behaviors

We performed a stateful crawl of the Alexa top 10,000 domains in February 2019 from France. Of these 10,000 domains, we successfully crawled 8,744, for a total of 84,658 pages.

Findings

In total, we detected 747,816 third-party requests leading to invisible images. By analyzing these requests, we detected 6 different categories of tracking behavior in 636,053 (85.05%) of the requests that lead to these invisible images.

After defining our classification using the invisible pixels dataset, we applied it to the full dataset. Out of 8,744 crawled domains, we identified at least one form of tracking on 91.92% of domains.

First-to-third-party cookie syncing

The Figure demonstrates cookie syncing of the first-party cookie. The first-party domain site.com includes content from A.com?id=abcd, where A.com is a third party and abcd is the first-party identifier of the user set for site.com. A.com receives the first-party cookie abcd in the URL parameters and then redirects the request to B.com. As part of the request redirected to B.com, A.com includes the first-party identifier. B.com sets its own identifier 1234 in the user's browser. Using these two identifiers (the first party's identifier abcd received in the URL parameter and its own identifier 1234 sent in the cookie), B.com can create a matching table that links both identifiers to the same user.

The first-party cookie can also be shared directly by the first-party service (imagine the Figure without A.com). In that case, site.com includes content from B.com and, as part of the request sent to B.com, site.com sends the first-party identifier abcd. B.com sets its own identifier 1234 in the user's browser. B.com can now link the two identifiers to the same user.
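The flow described above can be illustrated with a minimal Python sketch. It is not the detection code from the paper: the function names, the exact-value matching rule, and the example URLs and parameter names are assumptions made for illustration. It flags third-party requests whose URL parameters carry a first-party cookie value, then builds the kind of matching table B.com could keep.

```python
from urllib.parse import urlsplit, parse_qsl


def find_first_party_id_leaks(first_party_cookies, request_urls, first_party_host):
    """Return (third_party_host, param_name, value) for every URL parameter
    whose value equals a first-party cookie value (hypothetical matching rule)."""
    ids = set(first_party_cookies.values())
    leaks = []
    for url in request_urls:
        parts = urlsplit(url)
        host = parts.hostname or ""
        # Skip requests to the first party itself or its subdomains.
        if host == first_party_host or host.endswith("." + first_party_host):
            continue
        for name, value in parse_qsl(parts.query):
            if value in ids:
                leaks.append((host, name, value))
    return leaks


def build_matching_table(leaks, third_party_ids):
    """For each third party that both received a first-party identifier and
    set its own identifier, map the first-party id to the third party's id."""
    table = {}
    for host, _name, value in leaks:
        if host in third_party_ids:
            table.setdefault(host, {})[value] = third_party_ids[host]
    return table


# Example mirroring the text: site.com's identifier is abcd, B.com's is 1234.
cookies = {"uid": "abcd"}
requests = [
    "https://a.com/pixel?id=abcd",
    "https://b.com/sync?partner_uid=abcd",
    "https://site.com/page?id=abcd",
]
leaks = find_first_party_id_leaks(cookies, requests, "site.com")
table = build_matching_table(leaks, {"b.com": "1234"})
```

Exact string equality is the simplest possible rule; a real measurement would also need to handle encoded or hashed identifiers and filter out short, non-unique values.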

We detected first-to-third-party cookie syncing in 67.96% of visited domains. To read more about the tracking behaviors, please check the paper.

Content used for tracking

We found that not all the tracking detected is based on invisible pixels. We extracted the type of the content served by the tracking requests using the HTTP header Content-Type.

The Table presents the top 5 types of content used for tracking. Out of the 2,724,020 requests involved in at least one tracking behavior in the full dataset, the top content delivered by tracking requests is scripts (34.36%), while the second most common content is invisible images (23.34%). We also detected other content used for tracking purposes such as visible images.
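As a rough illustration of how requests can be grouped by the Content-Type header, here is a small Python sketch. The category names and the MIME-type mapping are assumptions, not the grouping used in the paper. Note that the header alone cannot separate invisible from visible images, since it carries no dimensions; that distinction needs the image size as well.

```python
from collections import Counter

# Hypothetical set of MIME types treated as scripts.
SCRIPT_TYPES = {
    "application/javascript",
    "application/x-javascript",
    "text/javascript",
}


def content_category(content_type):
    """Map a raw Content-Type header value to a coarse content category."""
    # Drop parameters such as "; charset=utf-8" and normalize case.
    mime = content_type.split(";", 1)[0].strip().lower()
    if mime in SCRIPT_TYPES:
        return "script"
    if mime.startswith("image/"):
        return "image"
    if mime == "text/css":
        return "stylesheet"
    if mime.startswith("font/"):
        return "font"
    if mime == "text/html":
        return "html"
    return "other"


def content_shares(content_types):
    """Return the percentage of requests per category, rounded to 2 decimals."""
    counts = Counter(content_category(ct) for ct in content_types)
    total = sum(counts.values())
    return {cat: round(100.0 * n / total, 2) for cat, n in counts.items()}


shares = content_shares([
    "text/javascript",
    "image/gif",
    "application/javascript; charset=utf-8",
    "text/css",
])
```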

Third parties combine privacy-invasive tracking and analytics behaviors

We analyzed the most prevalent domains involved in cross-site tracking, analytics, or both behaviors. The Figure demonstrates that a third-party domain may exhibit several behaviors. For example, we detect that google-analytics.com exhibits both cross-site tracking and analytics behavior.

This variance of behaviors is either chosen by the developer, as is the case for cookie syncing and analytics behaviors, or caused by other partners, as is the case for cookie forwarding. In the latter case, google-analytics.com is included by another third party, and the developer is not necessarily aware of this practice.

Most state-of-the-art works that aim to measure trackers at large scale rely on filter lists. In particular, the EasyList, EasyPrivacy and Disconnect lists have become the de facto approach to detect third-party tracking requests in the privacy and measurement communities.
Therefore, it is interesting to evaluate how effective filter lists are at detecting trackers, how many trackers are missed by the research community in their studies, and whether filter lists should still be used as the default tools to detect trackers at scale.

Comparison to the Disconnect list

We apply the filter lists to requests to determine which requests are blocked by the lists. We classify a request as blocked if it meets one of the following conditions: (1) the request directly matches the list, (2) the request is a consequence of a redirection chain in which an earlier request was blocked by the list, or (3) the request is loaded in third-party content (an iframe) that was blocked by the list (we detect this case by analyzing the referrer header).
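The three conditions can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Request` record, its field names, and the `matches_list` callback are hypothetical, and the toy substring matcher stands in for a real filter-list engine.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Request:
    url: str
    # URLs of earlier requests in the redirection chain that led to this one.
    redirect_chain: List[str] = field(default_factory=list)
    # Value of the referrer header, if any.
    referrer: Optional[str] = None


def is_blocked(req: Request, matches_list: Callable[[str], bool]) -> bool:
    """Apply the three conditions from the text to a single request."""
    # (1) The request itself matches the list.
    if matches_list(req.url):
        return True
    # (2) An earlier request in the redirection chain matches the list.
    if any(matches_list(url) for url in req.redirect_chain):
        return True
    # (3) The request was loaded inside blocked third-party content,
    #     approximated here by checking the referrer against the list.
    if req.referrer is not None and matches_list(req.referrer):
        return True
    return False


def toy_matcher(url: str) -> bool:
    """Toy stand-in for a filter list: blocks one hypothetical domain."""
    return "tracker.example" in url


direct = Request("https://tracker.example/pixel.gif")
redirected = Request("https://cdn.example/p.gif",
                     redirect_chain=["https://tracker.example/r"])
in_iframe = Request("https://img.example/a.png",
                    referrer="https://tracker.example/frame.html")
clean = Request("https://img.example/logo.png",
                referrer="https://site.example/")
```

Passing the matcher as a callback keeps the classification logic independent of any particular list, so the same function could be run once per list to compare coverage.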

We found that 16.30% of tracking requests were missed by Disconnect. The number of third-party domains involved in the tracking detected only by PixelTrack is 6,189. PixelTrack detects all kinds of trackers, including the less popular ones that fall below the detection threshold of filter lists. Because less popular trackers are less prevalent, they generate fewer requests and therefore remain unnoticed by filter lists. This is why we detect a large fraction of the domains responsible for tracking.

Full list of tracking requests missed by Disconnect

Tracking enabled by useful content

We analyzed the type of content provided by the remaining tracking requests. The Table presents the top content types used for tracking and not blocked by the filter lists. We refer to images with dimensions larger than 50×50 pixels as Big images. These kinds of images, texts, fonts and even stylesheets are used for tracking.

These types of content are essential for the proper functioning of the website, which makes it impossible for the filter lists to block the responsible requests. In fact, the lists explicitly allow content from some of these trackers to avoid breaking the website, as is the case for cse.google.com.

Comparison to browser extensions

We analyzed how effective the popular privacy protection extensions are at blocking the privacy leaks detected by PixelTrack. We studied the following extensions: Adblock, Ghostery, Disconnect, and Privacy Badger.

Our results show that Ghostery is the most effective among them. However, it still fails to block 26.72% of the tracking requests. All extensions miss trackers in the three classes. However, Disconnect and Privacy Badger have an effective Analytics blocking mechanism: they miss Analytics behavior on only 0.31% and 0.18% of the pages, respectively. Most tracking requests missed by the extensions perform Explicit tracking.

Imane Fouad	`imane.fouad@inria.fr`
Nataliia Bielova	`nataliia.bielova@inria.fr`
