Understanding Metadata

Earlier on the site, I cited a statistic that 87% of the web is encrypted. This means that when you visit, say, Facebook, your Internet Service Provider (ISP) can see that you visited and how long you hung out for, but they can’t see your login credentials (username and password) or which exact pages you went to. This is done with Transport Layer Security, or TLS, a powerful and increasingly popular encryption protocol used online. It’s quite effective and difficult to break.
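
To make that split concrete, here is a minimal Python sketch of which parts of a web address an observer on the network can typically see and which parts TLS hides. It is only an illustration: the function name `observer_view` and the example URL are made up, and it assumes the site's hostname leaks through ordinary DNS lookups and the unencrypted part of the TLS handshake, which is the common case unless you use something like encrypted DNS.

```python
from urllib.parse import urlsplit

def observer_view(url):
    """Split an HTTPS URL into what a network observer can and cannot see.

    Assumption: the hostname is exposed by DNS lookups and the TLS
    handshake, while the path and query travel inside the encrypted tunnel.
    """
    parts = urlsplit(url)
    visible = {"hostname": parts.hostname, "port": parts.port or 443}
    hidden = {"path": parts.path, "query": parts.query}
    return visible, hidden

# Hypothetical URL, used only for illustration.
visible, hidden = observer_view("https://www.example.com/messages/inbox?thread=42")
print("Visible to the ISP:", visible)
print("Hidden by TLS:", hidden)
```

The ISP also sees when the connection happened and roughly how much data moved, which is exactly the kind of information the rest of this page is about.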

So in effect even the average person has - generally speaking - a basic level of powerful security in their online lives (which is why I listed installing HTTPS Everywhere as "Most Important"). This raises the question that privacy enthusiasts everywhere have come to despise like nails on a chalkboard: “why should I care?” If your sensitive details such as password and credit card number are safely encrypted, who cares if your ISP or the Starbucks IT guy can see what websites you visit? (Spoiler alert: the introduction.)

For starters, because TLS breaks down at the end point. When you connect to Amazon, your ISP can see that you visited Amazon, but not what you bought or your card number. Amazon, however, can see it all without restriction. But more importantly, often you don't need to see the content itself to start making powerful and dangerous inferences.

What is Metadata?

This information in question is called “metadata,” sometimes described as “data about the data.” Maybe I can’t see exactly what you said in your email, but I can see who you emailed, what time, and the size of the email. And on the surface it doesn’t seem so bad. Who cares if you know that I emailed my mom at 7pm and the email was 7KB?
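
As a rough sketch of what that looks like in practice, the Python below pulls the "who, when, and how big" out of an email without ever reading the body. The message, the addresses, and the function name `extract_metadata` are all made up for illustration; the parsing uses Python's standard `email` module.

```python
from email import message_from_string
from email.utils import parseaddr, parsedate_to_datetime

# Hypothetical message; imagine the body is encrypted and unreadable.
RAW = """\
From: Me <me@example.com>
To: Mom <mom@example.org>
Date: Tue, 05 Mar 2024 19:00:00 +0000
Subject: hello

(encrypted body the observer cannot read)
"""

def extract_metadata(raw):
    """Return sender, recipient, send time, and size without using the body."""
    msg = message_from_string(raw)
    return {
        "sender": parseaddr(msg["From"])[1],
        "recipient": parseaddr(msg["To"])[1],
        "sent_at": parsedate_to_datetime(msg["Date"]),
        "size_bytes": len(raw.encode("utf-8")),
    }

for field, value in extract_metadata(RAW).items():
    print(field, "=", value)
```

Nothing in that function touches the body, yet it already answers who emailed whom, at what time, and how much they sent.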

As is the case with most privacy and security concerns in the modern era, the problem isn’t so much what’s collected but rather how it has the potential to be used. Take this excellent article from the Electronic Frontier Foundation, for example. A couple of the examples they list of metadata that has the potential to be too revealing include:

  • They know you called a gynecologist, spoke for a half hour, and then called the local Planned Parenthood's number later that day. But nobody knows what you spoke about.
  • They know you got an email from an HIV testing service, then called your doctor, then visited an HIV support group website in the same hour. But they don't know what was in the email or what you talked about on the phone.
  • They know you called the suicide prevention hotline from the Golden Gate Bridge. But the topic of the call remains a secret.

    (Lifted directly from EFF's Surveillance Self Defense page)

As you can see, metadata has the potential to be just as revealing as content itself, and therefore should be protected just as much as the actual data. You might say to yourself, “You said potential abuse, do you really think that’s likely?” The answer is absolutely, 100% without a doubt, not-just-being-paranoid: "yes." China is already notorious for their incredibly invasive, 1984-like “Social Credit System.” The United States is starting to implement the use of your social network in insurance industries. Oh, and the United States is working on their own “Social Credit System” too. So yeah, metadata is an important part of your attack surface that you need to consider as you protect your privacy and security.

So What to Do?

There's no surefire or one-size-fits-all solution to protecting your metadata. It depends, as with most things on this site, on what you're using and who you're trying to hide from. It's safe to assume that any digital action creates metadata, so if your threat level is high enough, don't trust any digital medium. This is one reason that Edward Snowden chose to deliver his documents in person. However, if it is safe or necessary to use digital communications, there are two general methods of handling metadata: keeping it ephemeral and obfuscating it.

Ephemeral metadata refers to metadata that is not logged and therefore - in theory - goes away after a certain period of time. For example, reputable VPN providers and messengers delete metadata very quickly and only use it as needed to make the service work. This is desirable but should not always be trusted. For example, a sophisticated enough adversary can watch your traffic in real time and record the metadata before it even reaches the service provider, or log the metadata the provider collects before it disappears. This is unlikely unless your threat level is very high, but it is possible. Instead, ephemeral metadata should be used in conjunction with obfuscation of metadata.
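
Here is a minimal sketch of the ephemeral idea in Python: a record store that forgets each entry once it is older than a set lifetime. This is my own illustration and not how any particular VPN or messenger actually works; the class name `EphemeralLog`, the 60-second lifetime, and the record strings are all hypothetical.

```python
import time

class EphemeralLog:
    """Keeps connection records for ttl_seconds, then forgets them."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock      # injectable so the example can fake time
        self._records = []      # list of (timestamp, record) pairs

    def add(self, record):
        self._records.append((self.clock(), record))

    def purge(self):
        # Drop everything that is at least ttl_seconds old.
        cutoff = self.clock() - self.ttl
        self._records = [(t, r) for t, r in self._records if t > cutoff]

    def records(self):
        self.purge()
        return [r for _, r in self._records]

# Simulated clock so the example runs instantly (hypothetical values).
now = [0.0]
log = EphemeralLog(ttl_seconds=60, clock=lambda: now[0])
log.add("user A connected")
now[0] = 30.0
log.add("user B connected")
now[0] = 61.0
print(log.records())  # only the record that is still within its lifetime
```

Notice that every record still exists for a while before it is purged. That window is exactly what an adversary watching in real time can take advantage of, which is why ephemerality alone isn't enough.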

Obfuscation of metadata refers to changing metadata so that it gives off false or misleading information. A good example is using a VPN or Tor browser to access a website: the website now thinks your IP address is that of the VPN provider or exit node. However, this is actually much trickier than it first appears and requires a broader knowledge of the types of metadata collected. For example, some apps and sites might collect your MAC address. On computers these are fairly easy to randomize and manipulate. On phones, not so much. So even if you use a VPN on your phone, your phone's IMEI - a unique number that can't be changed, similar to a serial number or MAC address - is often still collected by multiple apps, thereby identifying your phone across each service. This is one reason I encourage using your phone as little as possible. You also have to consider other permanent identifiers, such as usernames. If two usernames are repeatedly communicating with each other on a service, even if the IP and content change, that log of communication can still be revealing. This is where ephemeral metadata comes back into the picture: a service that doesn't keep logs won't have records of the two users communicating. Again, this is all very complicated and requires a lot of thought.
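
The username problem is easy to demonstrate. The Python sketch below takes a made-up service log where the IP address is different on every line (as it would be behind a VPN) and counts how often each pair of usernames talks. The log entries, the usernames, and the function name `contact_counts` are all hypothetical.

```python
from collections import Counter

# Hypothetical log entries: (source IP, sender username, recipient username).
# The IPs change every time, as they would behind a VPN or Tor.
LOG = [
    ("198.51.100.7", "bluejay", "nightowl"),
    ("203.0.113.20", "bluejay", "nightowl"),
    ("192.0.2.55", "nightowl", "bluejay"),
    ("198.51.100.9", "bluejay", "robin"),
]

def contact_counts(log):
    """Count messages per pair of usernames, ignoring IP and direction."""
    counts = Counter()
    for _ip, sender, recipient in log:
        counts[frozenset((sender, recipient))] += 1
    return counts

for pair, n in contact_counts(LOG).most_common():
    print(sorted(pair), n)
```

The changing IPs never enter into the count. The stable usernames alone reveal who talks to whom and how often.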

Most of us probably don’t need to be 100% anonymous for any reason, but it's a good idea for us to protect our metadata just as much as our actual communications whenever possible. I wish I had some concrete advice, but instead it simply comes down to asking yourself “what metadata am I giving up and to whom?” Using a VPN means you’re transferring a considerable amount of your metadata away from your ISP and over to your VPN provider. Assuming you use a reputable, trustworthy VPN provider, that’s a good strategy. Encrypted email works the same way. Many of these companies will surrender what they can if given a warrant, but reputable companies rarely have much to turn over aside from a few login locations and times. It’s a multi-layered approach but it’s one worth considering until technology can catch up to protect our metadata by default.

