API Scraping

2024-08-19 12:09:56

Executive Summary

API Scraping is used to extract structured data from a website’s backend APIs directly. It’s essential to differentiate between legitimate API usage and scraping, as the latter can lead to unintended data exposure. API scraping raises security concerns, particularly when APIs lack proper authentication or are exposed to malicious attacks. Real-world incidents, such as the LinkedIn, Uber, and Clubhouse scraping cases, underscore the severe risks and consequences of inadequate API security. These breaches can expose valuable business information, leaving organizations vulnerable to data theft, competitive disadvantage, and reputational damage.

To mitigate these risks, investing in robust security solutions is essential. Codesealer offers advanced protection by encrypting API communications and ensuring the integrity and confidentiality of data.

What is API Scraping?

API scraping refers to the process of extracting data from an API (Application Programming Interface) instead of a website’s HTML content. For many applications, APIs have become the central channel through which data flows from the backend to the user interface. APIs often have broad access to sensitive data and run with high permissions. Through API scraping, attackers reach the backend of the service or platform directly and retrieve structured, valuable business data.

How API Scraping Works

API Access: Many services and websites offer APIs that developers can access. These APIs typically require an API key or authentication token to ensure that only authorized users can access the data. However, it is not uncommon for APIs to lack proper authentication and authorization controls, leaving room for unauthorized third-party clients to scrape your data. By examining a website’s traffic in the Network tab of Chrome’s DevTools, an attacker can see the API endpoints the site calls and analyze the structure of the requests.

Developer console of JuiceShop website without Codesealer

Here is a close view of one of the endpoints:

The endpoint we discovered through the Network tab reveals various parameters, query options, and payload values.
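The same inspection can be done in bulk when auditing your own site. DevTools can export the recorded traffic as a HAR file, which is plain JSON. The sketch below lists the distinct endpoints and query parameter names found in such an export. The sample data, the shop.example host, and the /api/items path are made up for illustration; only the log.entries[].request layout comes from the HAR format.

```python
import json
from urllib.parse import urlsplit, parse_qs


def list_endpoints(har_text):
    """Return sorted (method, path, query parameter names) tuples from a HAR export."""
    har = json.loads(har_text)
    seen = set()
    for entry in har["log"]["entries"]:
        request = entry["request"]
        parts = urlsplit(request["url"])
        # Keep only the parameter names; the values differ per request.
        params = tuple(sorted(parse_qs(parts.query)))
        seen.add((request["method"], parts.path, params))
    return sorted(seen)


# Tiny hand-written HAR fragment standing in for a real export (hypothetical data).
SAMPLE_HAR = json.dumps({
    "log": {"entries": [
        {"request": {"method": "GET",
                     "url": "https://shop.example/rest/products/search?q=orange%20juice"}},
        {"request": {"method": "GET",
                     "url": "https://shop.example/rest/products/search?q=apple"}},
        {"request": {"method": "GET", "url": "https://shop.example/api/items"}},
    ]}
})

endpoints = list_endpoints(SAMPLE_HAR)
for method, path, params in endpoints:
    print(method, path, list(params))
```

Running this against an export of your own application gives a quick inventory of what an outside observer can learn from the browser alone.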

With this information, a malicious user can start constructing custom requests to interact with the APIs directly. As an example, an attacker can retrieve a list of all items in the inventory, including product details, stock levels, pricing and more. This can then be used to undercut prices, identify high-demand products, outcompete on SEO or even launch targeted attacks against the supply chain. This kind of unauthorized access could have serious implications for the business, including financial losses, reputational damage, and potential legal consequences if sensitive data is exposed.

Sending Requests: To scrape data using an API, a developer sends HTTP requests to the API’s endpoints. These requests often include parameters that specify the desired data. In this case, we will focus on making an API request to search for orange juice.

curl -X GET "https://juice-direct-ewee8wae.codesealer.com/rest/products/search?q=orange%20juice"

Some API endpoints also accept a limit option that caps the number of results returned per request. This keeps responses from growing unbounded and reduces the risk of denial of service.
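As a rough illustration of how such a request URL is put together, the sketch below builds the search URL from its parts using Python's standard library. The q parameter and the path come from the curl example above. The limit parameter name is an assumption for illustration, and this endpoint may not support it.

```python
from urllib.parse import urlencode, quote


def build_search_url(base_url, query, limit=None):
    """Build a product search URL; `limit` is a hypothetical optional parameter."""
    params = {"q": query}
    if limit is not None:
        params["limit"] = limit  # assumed name, not confirmed for this API
    # quote_via=quote encodes spaces as %20, matching the curl example.
    return f"{base_url}/rest/products/search?{urlencode(params, quote_via=quote)}"


url = build_search_url("https://juice-direct-ewee8wae.codesealer.com", "orange juice")
print(url)
```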

Receiving Responses: The API returns data in a structured format, usually in JSON or XML. This format is typically well-organized, making it easier to parse and manipulate compared to scraping raw HTML.

curl -X GET "https://juice-direct-ewee8wae.codesealer.com/rest/products/search?q=orange%20juice"

{"status":"success","data":[{"id":2,"name":"Orange Juice (1000ml)","description":"Made from oranges hand-picked by Uncle Dittmeyer.","price":2.99,"deluxePrice":2.49,"image":"orange_juice.jpg","createdAt":"2024-08-13 12:47:02.457 +00:00","updatedAt":"2024-08-13 12:47:02.457 +00:00","deletedAt":null}]}
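To show how little work the structured format leaves for the client, here is a minimal sketch that parses the response above with Python's json module and keeps a few fields. The function name and the choice of fields are illustrative; the response body is the one shown above.

```python
import json

# Response body copied from the curl output above.
RESPONSE = (
    '{"status":"success","data":[{"id":2,"name":"Orange Juice (1000ml)",'
    '"description":"Made from oranges hand-picked by Uncle Dittmeyer.",'
    '"price":2.99,"deluxePrice":2.49,"image":"orange_juice.jpg",'
    '"createdAt":"2024-08-13 12:47:02.457 +00:00",'
    '"updatedAt":"2024-08-13 12:47:02.457 +00:00","deletedAt":null}]}'
)


def extract_products(body):
    """Parse a search response and return id, name, and price for each product."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError("unexpected response status")
    return [
        {"id": item["id"], "name": item["name"], "price": item["price"]}
        for item in payload["data"]
    ]


products = extract_products(RESPONSE)
print(products)
```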

Parsing and Storing Data: Once the data is received, it can be parsed and stored in a database or file, or used directly within an application. To save the response body to a file, use the -o option. Leave out -i here, because it would also write the HTTP response headers into the file and output.json would no longer be valid JSON:

curl -X GET "https://juice-direct-ewee8wae.codesealer.com/rest/products/search?q=orange%20juice" -o output.json
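Storing the parsed records in a database takes only a few more lines. The sketch below uses an in-memory SQLite database from Python's standard library. The table layout and function name are assumptions for illustration, and the single record mirrors the orange juice result shown earlier.

```python
import sqlite3


def store_products(conn, products):
    """Insert or update product rows in a simple three-column table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS products "
        "(id INTEGER PRIMARY KEY, name TEXT, price REAL)"
    )
    # Named placeholders take their values from each dict in the list.
    conn.executemany(
        "INSERT OR REPLACE INTO products (id, name, price) "
        "VALUES (:id, :name, :price)",
        products,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # use a file path instead to persist the data
store_products(conn, [{"id": 2, "name": "Orange Juice (1000ml)", "price": 2.99}])
rows = conn.execute("SELECT id, name, price FROM products").fetchall()
print(rows)
```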

API Scraping Use Cases

APIs are the backbone of our digital world, enabling activities like tweeting, checking the weather, liking Instagram posts, and making bank transactions. In fact, API calls accounted for 71% of web traffic in 2023, according to Imperva’s The State of API Security in 2024 report, well ahead of conventional web traffic. The data accessed through these APIs is particularly valuable, making them prime targets for both third-party clients and malicious users.

API scraping can be used for various purposes, such as analytics, marketing, data analysis, etc. You can use it to access stock prices, historical financial data, or cryptocurrency information via financial APIs, or retrieve weather data from services like OpenWeatherMap or WeatherAPI. However, scraping remains a gray area when it comes to legality. Typically, scraping violates the terms of use of a website, especially if the scraping is done with the intention to sell the data or use it maliciously.

Security Issues and Risks Associated with API Scraping

API scraping poses significant security risks to targeted entities, especially when carried out with malicious intent. Unauthorized access to business data—such as pricing, inventory, or proprietary content—can lead to the loss of a competitive advantage and other serious consequences. Additionally, poorly configured API endpoints that lack robust authentication and authorization controls expose systems to the risk of data breaches. Sensitive information may be leaked, and unauthorized transactions could occur, leading to financial loss and damage to the company’s reputation.
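As a minimal sketch of the two controls mentioned above, the example below checks an API key (authentication) and then a per-key scope (authorization) before a request would be served. The key store, key value, and scope names are hypothetical. A real deployment would store hashed keys and typically rely on an established framework or gateway.

```python
import hmac

# Hypothetical key store: API key -> scopes granted to that client.
API_KEYS = {"demo-key-123": {"products:read"}}


def is_authorized(presented_key, required_scope):
    """Return True only if the key is known and carries the required scope."""
    if not presented_key:
        return False
    for known_key, scopes in API_KEYS.items():
        # compare_digest avoids leaking key contents through timing differences.
        if hmac.compare_digest(presented_key.encode(), known_key.encode()):
            return required_scope in scopes
    return False


print(is_authorized("demo-key-123", "products:read"))
print(is_authorized("demo-key-123", "inventory:read"))
```

The second call is refused even though the key is valid, which is the difference between authentication and authorization.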

For instance, consider a web shop that sells various consumer electronics. This shop uses APIs to manage its inventory, display product details, and update stock information. While the front end of the website only shows limited details, such as the products currently available for purchase, the underlying APIs may expose much more information. An attacker could use the Network tab in Chrome’s DevTools to monitor the traffic between the web shop’s front end and its APIs. By examining these API requests and responses, the attacker might discover endpoints that expose sensitive data not visible on the public-facing site. For example, the attacker could:

Scrape Inventory Data: By querying the inventory API, the attacker could retrieve a complete list of all products in the inventory, including those not yet released or listed on the website. This could reveal unreleased products or discontinued items that the company hasn’t made public yet.
Access Pricing Information: The attacker might also gain access to wholesale pricing, discounts, or other confidential pricing details that are not intended to be visible to customers. This information could be used by competitors to undercut prices or by malicious actors to manipulate purchasing behaviors.
Monitor Stock Levels: The attacker could track stock levels for various products in real-time, identifying trends such as popular items that are running low on stock. This information could be exploited to launch automated purchasing bots that buy up high-demand products before legitimate customers have a chance.
Uncover Proprietary Information: The APIs might expose details about upcoming product launches, special promotions, or other strategic business decisions. Competitors could use this information to preemptively launch similar products or marketing campaigns.
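One common way to limit this kind of exposure is to have the API return only an explicit allowlist of fields instead of the full internal record. The sketch below illustrates the idea. The field names, including wholesalePrice, stockLevel, and releaseDate, are hypothetical and stand in for whatever the backend stores.

```python
# Fields the storefront actually needs; everything else stays server-side.
PUBLIC_FIELDS = ("id", "name", "description", "price", "image")


def to_public_view(record):
    """Copy only allowlisted fields from an internal record."""
    return {field: record[field] for field in PUBLIC_FIELDS if field in record}


# Hypothetical internal record containing data customers should never see.
internal_record = {
    "id": 2,
    "name": "Orange Juice (1000ml)",
    "price": 2.99,
    "wholesalePrice": 1.10,
    "stockLevel": 37,
    "releaseDate": None,
}

public_record = to_public_view(internal_record)
print(public_record)
```

An allowlist fails safe: a field added to the internal record later stays hidden until someone deliberately exposes it.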

Real-World Attacks Enabled by API Scraping

Uber’s API Data Leak (2016)

In 2016, a vulnerability in Uber’s API allowed unauthorized users to access sensitive data. Attackers were able to scrape trip details, including driver and passenger information, by exploiting a lack of proper authentication on certain endpoints. The issue was discovered when developers noticed that private data could be accessed through the API without proper authorization, exposing personal details of millions of users.

LinkedIn Data Scraping (2017 & 2021)

LinkedIn has been a frequent target of scraping. In 2017, LinkedIn sent a cease-and-desist letter to hiQ Labs, a company that used bots to scrape data from LinkedIn’s public profiles, and the dispute led to years of litigation between the two companies. In 2021, another massive scraping event occurred in which the data of 700 million LinkedIn users (over 90% of the user base) was scraped and offered for sale online. The scraped data included personal information such as email addresses, phone numbers, and job details.

Clubhouse API Scraping (2021)

In early 2021, the social audio app Clubhouse experienced an API scraping incident where a third-party developer created a website that allowed users to listen to conversations happening on the platform. The developer used Clubhouse’s APIs to scrape data about users, rooms, and conversations, which was then made accessible on the website without the app’s permission.

Investing in the Right Security Solutions

API scraping can serve both as a tool for better data insights and as a method for potential attacks. Given that many attacks target API traffic, investing in a proper API security solution is crucial for protecting against such threats.

To close potential security gaps in your application, Codesealer offers a solution that prevents reconnaissance of the attack surface. By encrypting all APIs, Codesealer hides potentially valuable information from attackers, preventing them from accessing the APIs directly and seeing payload structures and responses.

Developer console for JuiceShop website with Codesealer

Looking further into one of the API endpoints, we do not see any parameters that would help us construct a custom call. All the endpoints have a generic structure like ~bl/x/.

One of the API endpoints with Codesealer

Payloads are also encrypted, giving the attacker no useful information:

By securing the communication channel from the browser to the backend, we ensure the integrity and confidentiality of data throughout its journey. This approach significantly mitigates the risk of API scraping, as all API endpoints are encrypted, preventing attackers from intercepting or deciphering the data. Without access to the decrypted API traffic, attackers are unable to construct valid API payloads or query the endpoints for sensitive information, effectively rendering API scraping attempts pointless.

Codesealer’s solution involves multiple layers of security. Our client-side Bootloader verifies the integrity of the application code before it is executed, ensuring that no unauthorized modifications have occurred. Once the application is running, it establishes a secure end-to-end (E2E) tunnel that encrypts all data, rendering it inaccessible to attackers. This approach not only protects against API attacks but also enhances overall security by ensuring that both the application code and data remain secure.

With Codesealer in place, the Coinbase bug caused by a missing logic validation check would not have been exploited, as the APIs would have been encrypted, making them useless to attackers. Similarly, Instagram would not have put the privacy of its users at risk through unprotected APIs. Imagine how many more attacks could be prevented with Codesealer in place.

Check out our product video to see how Codesealer encrypts APIs and removes the attack surface in one click.