How We Combat Scraping

5 years ago

By Mike Clark, Product Management Director

Last week, we shared more details about a public database, containing information about people on Facebook, that appeared online and generated a lot of conversation around data scraping. Because similar stories have since emerged about public datasets containing information obtained from a number of other companies, including LinkedIn and Clubhouse, we’d like to explain what scraping is, how it works and what we’re doing to prevent it and protect people’s information.

What Is Scraping?

Scraping is the automated collection of data from a website or app, and it can be either authorized or unauthorized. Every time you use a search engine, for example, you are likely using data that was collected in automated ways with the consent of the website or app. This form of scraping is known as crawling, and it’s what helps make the internet searchable.

Using automation to get data from Facebook without our permission is a violation of our terms: scrapers may not access or collect data from our products using automated means without our prior permission. The data itself is not necessarily off-limits; scraped data is often information that ordinary people can widely access in their everyday use of the website or app.

Scrapers commonly try to blend in with others and can hide what they’re doing by mimicking the ways people would normally use a product. As a result, it can be difficult to detect them. We do, however, have a number of methods to distinguish unauthorized, automated activity from legitimate usage, which we explain below.

What We’re Doing About Scraping

We devote substantial resources to combating unauthorized scraping on Facebook products. We have a dedicated External Data Misuse (EDM) team made up of more than 100 people, including data scientists, analysts and engineers focused on our efforts to detect, block and deter scraping.

Because scrapers mimic the ways that people legitimately use our products, we will never be able to prevent all scraping without harming people’s ability to use our apps and websites the way they enjoy. That means we have to strike the right balance and rely on a variety of approaches to address scraping. Since scraping is both a common and a complex challenge, we take a holistic approach to staying ahead of it. In short, we aim to make it harder for scrapers to acquire data from our services in the first place and harder to capitalize on it if they do.

The first way we aim to make scraping more difficult is through the use of rate limits and data limits. Rate limits cap the number of times anyone can interact with our products in a given amount of time, while data limits keep people from getting more data than they should need to use our products normally.
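To make the two kinds of limits concrete, here is a minimal sketch in Python. It is an illustration only, not a description of how Facebook implements its limits: the class name `SlidingWindowLimiter`, the function `apply_data_limit`, and all thresholds are hypothetical, and a sliding time window is just one common way to build a rate limit.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Hypothetical rate limit: at most `max_requests` per client
    within any span of `window_seconds`."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # client_id -> timestamps of requests that were allowed
        self._events = defaultdict(deque)

    def allow(self, client_id, now):
        """Return True if the request at time `now` (in seconds) is allowed."""
        events = self._events[client_id]
        # Discard timestamps that have aged out of the window.
        while events and now - events[0] >= self.window_seconds:
            events.popleft()
        if len(events) >= self.max_requests:
            return False  # over the limit: reject and do not record
        events.append(now)
        return True


def apply_data_limit(records, max_records):
    """Hypothetical data limit: return at most `max_records` items,
    however many the underlying query matched."""
    return records[:max_records]
```

In this sketch, a limiter created as `SlidingWindowLimiter(3, 60)` allows a client's first three requests inside a minute and rejects a fourth until the oldest one is at least 60 seconds old, while `apply_data_limit` caps how much any single response can return. The current time is passed in explicitly so the behavior is easy to test.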

Limits are only a first layer of protection, and we know that scrapers are determined to find new ways to get data. That’s why we’ve also focused on developing other methods of identifying and deterring scraping. We won’t go into all of them because we don’t want to give a roadmap to scrapers seeking to evade our defenses, but one example is that we look for patterns in activity and behavior that are typically associated with automated computer activity, and we stop that activity.
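As a generic illustration of what a pattern associated with automated activity can look like, the sketch below flags request timing that is unusually regular, since scripts often fire at near-constant intervals while people do not. This is a textbook heuristic chosen for the example, not one of the methods referred to above; the function name `looks_automated` and its thresholds are assumptions.

```python
from statistics import mean, pstdev


def looks_automated(timestamps, min_events=5, max_cv=0.1):
    """Hypothetical heuristic: flag activity whose request spacing is
    nearly constant.

    `timestamps` are request times in seconds. The coefficient of
    variation (standard deviation / mean) of the gaps between requests
    is compared against `max_cv`.
    """
    if len(timestamps) < min_events:
        return False  # too little activity to judge
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    average_gap = mean(gaps)
    if average_gap == 0:
        return True  # every request arrived at the same instant
    return pstdev(gaps) / average_gap < max_cv
```

A real system would combine many signals and weigh them against the risk of blocking legitimate use; a single timing check like this one is easy to evade by adding random delays, which is part of why limits and detection are layered.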

Our EDM team also investigates suspected scrapers to learn more about what they’re doing and make our systems stronger. We’ve taken a variety of actions against data misuse. These can include sending cease and desist letters, disabling accounts, filing lawsuits against scrapers engaging in egregious behavior and asking companies that host scraped data to take it down. This is also why it’s important for governments to do more to investigate and take action against unlawful scraping activity.

Scrapers who improperly collect data from Facebook sometimes make this data available in online forums such as the one that was reported on last week. The EDM team tries to keep that data from being shared online by engaging with threat intelligence researchers to look for examples of these datasets being shared and by working with responsible hosting vendors to get them taken offline.

What You Can Do to Help Keep Your Data Safe

In addition to the steps we take to protect your data, we also want to empower the people who use our services to make it harder for their information to be misused. That’s why our existing user privacy controls allow people to adjust settings such as which of their information is public and who can look them up by their phone number. We also recently launched a dedicated page in our Help Center to inform people about scraping and what they can do to protect their information. For example, our Privacy Checkup feature helps walk people through their privacy and security settings, including Who Can See What You Share and How People Can Find You on Facebook. We recommend people review their privacy settings regularly to ensure they align with their current preferences. In addition, our Privacy Matters page provides more insight into our privacy initiatives, and we plan to publish more about our approach to scraping and what we find.