The Web Is Full of Data You Cannot Use
The internet is the world's largest database, but most of its data is locked inside web pages designed for human reading, not machine processing. You can see a table of product specifications, a list of real estate listings, or a directory of business contacts, but getting that data into a spreadsheet or database requires tedious manual copying, unreliable browser extensions, or expensive third-party scraping services that demand technical expertise to configure.
Tensor's data extraction agent changes this equation completely. You navigate to a page, describe what data you want to extract in plain English, and Tensor handles everything: parsing the page structure, identifying the data elements, handling pagination, and exporting the results in your preferred format. No coding, no CSS selectors, no XPath expressions.
Your First Extraction
Let's start with a simple example. You are on a webpage that displays a table of product specifications for laptops. Open the Tensor sidepanel and type:
"Extract all the laptop data from this table into a structured format. Include product name, price, processor, RAM, storage, and display size."
Tensor reads the page, identifies the table, maps the columns to your requested fields, and extracts every row. Within seconds, you have a clean, structured dataset ready to export. The agent handles common table complications automatically: merged cells, nested headers, footnote references, and inconsistent formatting across rows.
Extracting Lists and Cards
Not all data lives in tables. E-commerce sites display products as cards in a grid. Real estate sites show listings with images and details. Job boards present opportunities in list format. Tensor handles all of these layouts because it understands page structure visually rather than relying on specific HTML patterns.
For a product catalog page with card-style layouts, you might say:
"Extract all products on this page. For each product, get the name, price, rating, number of reviews, and whether it's in stock."
Tensor identifies each product card as a repeating unit, extracts the specified fields from each one, and assembles the results into a uniform dataset. If some products are missing certain fields (no rating yet, or "price on request"), those cells are left empty rather than filled with incorrect data.
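The assembly step above can be sketched in a few lines. This is an illustrative sketch, not Tensor's internal code: each extracted card is mapped onto the same schema, and any field a card lacks stays empty (`None`) rather than being guessed.

```python
# Illustrative sketch: assembling repeated card data into a uniform
# dataset, leaving missing fields empty rather than fabricating values.

FIELDS = ["name", "price", "rating", "reviews", "in_stock"]

def assemble_rows(raw_cards):
    """Map each extracted card onto the same schema; absent fields stay None."""
    return [{field: card.get(field) for field in FIELDS} for card in raw_cards]

cards = [
    {"name": "Laptop A", "price": 999.0, "rating": 4.5,
     "reviews": 120, "in_stock": True},
    {"name": "Laptop B", "price": 1299.0, "in_stock": False},  # no rating yet
]

rows = assemble_rows(cards)
# rows[1]["rating"] is None -- left empty, not filled with incorrect data
```

Because every row shares the same keys, the result exports cleanly to a spreadsheet even when the source cards are uneven.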
Handling Pagination
Most data you want to extract spans multiple pages. A product catalog might have 20 pages; a directory might have hundreds. Manually clicking through each page and extracting data separately would defeat the purpose of automation. Tensor handles pagination natively:
- Extract data from the current page.
- Detect the "Next" button, page numbers, or infinite scroll mechanism.
- Navigate to the next page and extract data from it.
- Repeat until all pages have been processed or a maximum page count is reached.
- Combine all extracted data into a single unified dataset.
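The steps above amount to a simple loop. The sketch below shows the shape of that loop; `extract_page` and `find_next_page` are hypothetical stand-ins for Tensor's page-reading and "Next"-detection steps, driven here by a simulated three-page site.

```python
# Sketch of the pagination loop described above. extract_page and
# find_next_page are hypothetical helpers, not a real Tensor API.

def extract_all_pages(start_url, extract_page, find_next_page, max_pages=50):
    """Extract from each page in turn until there is no next page or the
    page limit is reached, then return one combined dataset."""
    dataset, url, pages_done = [], start_url, 0
    while url and pages_done < max_pages:
        dataset.extend(extract_page(url))   # extract the current page
        url = find_next_page(url)           # detect and follow "Next"
        pages_done += 1
    return dataset                          # single unified dataset

# Simulated site: each "page" holds (records, next_page_or_None).
PAGES = {
    "p1": (["a", "b"], "p2"),
    "p2": (["c"], "p3"),
    "p3": (["d"], None),
}

data = extract_all_pages(
    "p1",
    extract_page=lambda url: PAGES[url][0],
    find_next_page=lambda url: PAGES[url][1],
)
# data == ["a", "b", "c", "d"]
```

The `max_pages` argument plays the same role as the maximum page limit described below: it bounds the loop even if the site's "Next" links never run out.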
You can set a maximum page limit to control how much data you collect. For large datasets, Tensor shows a progress indicator so you can track how far through the pages it has gotten. If you need to stop early, you can pause the extraction and keep the data collected so far.
Tensor supports all common pagination patterns: numbered page links, "Load More" buttons, infinite scroll with lazy loading, and "Next/Previous" navigation. It even handles AJAX-based pagination where clicking a page number updates the content without changing the URL.
Complex Extraction Patterns
Some extraction tasks require navigating into detail pages. For example, a directory might list companies with basic information on the listing page but detailed contact information on each company's individual profile page. Tensor handles this with a two-level extraction pattern:
- Extract links and basic info from the listing page.
- Visit each detail page to collect additional fields.
- Merge the listing-level and detail-level data into a single row per entity.
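The merge in the last step is a simple record union. Here is a minimal sketch of the pattern, where `fetch_detail` is a hypothetical stand-in for visiting each profile page:

```python
# Sketch of the two-level (listing -> detail) extraction pattern.
# fetch_detail stands in for visiting each entity's detail page.

def two_level_extract(listing_rows, fetch_detail):
    """Merge listing-level fields with detail-page fields, one row per entity."""
    merged = []
    for row in listing_rows:
        detail = fetch_detail(row["url"])   # visit the detail page
        merged.append({**row, **detail})    # detail fields extend the listing row
    return merged

listings = [{"name": "Cafe Uno", "url": "/cafe-uno"}]
details = {"/cafe-uno": {"phone": "555-0100", "hours": "8-18"}}

rows = two_level_extract(listings, fetch_detail=details.get)
# rows[0] holds name, url, phone, and hours in a single record
```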
This nested extraction is powerful for building comprehensive datasets. You might extract a list of restaurants from a review site, then visit each restaurant's page to collect their full address, phone number, hours of operation, and menu highlights. The final dataset contains everything, assembled from multiple page levels automatically.
Data Cleaning and Normalization
Raw web data is messy. Prices might include currency symbols, commas, and varying decimal formats. Dates appear in dozens of formats. Phone numbers include dashes, parentheses, or spaces inconsistently. Tensor applies intelligent normalization during extraction:
- Prices: Strips currency symbols and normalizes to decimal format. "$1,299.99" becomes 1299.99 with the currency stored separately.
- Dates: Converts all date formats to ISO 8601 (YYYY-MM-DD) for consistent sorting and filtering.
- Phone numbers: Normalizes to a consistent format with country code.
- Text fields: Strips excessive whitespace, removes invisible characters, and normalizes Unicode.
- Boolean fields: Converts "Yes/No", "In Stock/Out of Stock", checkmarks, and similar indicators to true/false values.
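To make the rules above concrete, here are illustrative normalizers for three of the field types. These are simplified sketches; real-world parsing (multiple locales, currencies, and date formats) is considerably more involved, and the date helper assumes a single known input format.

```python
# Illustrative normalizers matching the rules above -- simplified sketches,
# not Tensor's actual normalization engine.
import re
from datetime import datetime

def normalize_price(text):
    """'$1,299.99' -> (1299.99, '$'); the currency is stored separately."""
    currency = "".join(ch for ch in text if ch in "$€£¥") or None
    number = re.sub(r"[^\d.]", "", text)  # strip symbols and thousands commas
    return float(number), currency

def normalize_date(text, fmt="%m/%d/%Y"):
    """Convert one assumed input format to ISO 8601 (YYYY-MM-DD)."""
    return datetime.strptime(text, fmt).date().isoformat()

def normalize_bool(text):
    """Map common truthy indicators to True/False."""
    return text.strip().lower() in {"yes", "true", "in stock", "✓"}

normalize_price("$1,299.99")   # -> (1299.99, '$')
normalize_date("03/07/2024")   # -> '2024-03-07'
normalize_bool("In Stock")     # -> True
```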
You can customize the normalization rules or disable them entirely if you prefer the raw data. Some users want prices as they appear on the page for screenshot comparison; others want clean numbers for spreadsheet analysis.
Exporting to JSON and CSV
Once extraction is complete, Tensor offers multiple export formats. Click Export in the extraction results panel and choose your format:
- CSV: The universal spreadsheet format. Open directly in Excel, Google Sheets, or any data analysis tool. Column headers match the field names you specified during extraction.
- JSON: Structured data format ideal for developers and applications. Each row becomes a JSON object with typed values (numbers as numbers, booleans as booleans).
- Clipboard: Copy the data directly to your clipboard as a tab-separated table, ready to paste into any spreadsheet application.
For large datasets, Tensor can also export incrementally, writing data to a file as it extracts rather than holding everything in memory. This is useful when extracting thousands of records across hundreds of pages.
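The difference between the formats is easy to see with the standard library. In this sketch, CSV headers come from the field names, JSON keeps typed values, and the row-by-row CSV loop is the same shape as the incremental export described above (writing to an in-memory buffer here for illustration; a real export would write to a file):

```python
# Sketch of CSV and JSON export using only the standard library.
import csv
import io
import json

rows = [
    {"name": "Laptop A", "price": 999.0, "in_stock": True},
    {"name": "Laptop B", "price": 1299.0, "in_stock": False},
]

# CSV: column headers match the field names.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["name", "price", "in_stock"])
writer.writeheader()
for row in rows:          # incremental: each row is written as it arrives
    writer.writerow(row)
csv_text = buf.getvalue()

# JSON: typed values survive (numbers as numbers, booleans as booleans).
json_text = json.dumps(rows, indent=2)
```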
Real-World Use Cases
The applications for structured data extraction are nearly limitless. Here are some of the most common workflows our users employ:
- Market research: Extract competitor product catalogs with prices, specifications, and availability to build comparison spreadsheets.
- Lead generation: Pull business directories, conference attendee lists, or industry association memberships into a CRM-ready format.
- Academic data collection: Gather statistics, survey results, or publication metadata from research databases and institutional websites.
- Content aggregation: Extract headlines, summaries, and metadata from news sites or blog aggregators for content analysis.
- Real estate analysis: Collect listing data including prices, square footage, location, and features from property listing sites.
From Web Pages to Spreadsheets in Seconds
Data extraction used to require either manual labor or programming skills. Tensor removes both barriers. If you can describe the data you want in plain English, you can extract it. The combination of visual page understanding, automatic pagination handling, intelligent normalization, and flexible export options makes Tensor the fastest way to turn any website into a usable dataset.