Web scraping (also known as web harvesting
or web data extraction)
involves the automated extraction of data (numbers or text) or
files from websites. It usually involves writing
a program which reads a website's HTML code and then extracts the
information from that code.
Pulling data from websites and placing it into spreadsheets
can be a very time-consuming task if done manually for large amounts of
data. Consultants Ltd can
provide web scraping services which greatly speed up such tasks.
This would involve writing a program to do the work. Sometimes the
program would have several stages, e.g. a first stage that finds the
URLs of webpages on a website, and a second stage that extracts the
information from those webpages. In general, the program:
- Downloads the HTML code of a webpage.
- Picks the relevant information from the HTML code.
- Repeats the two steps above if there are other webpages.
- Saves the relevant information into a spreadsheet (or a similar data file).
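The steps above can be sketched in Python using only the standard library. This is a minimal illustration rather than a production scraper: the choice of `<h2>` headings as the "relevant information", the `url` and `title` column names, and all function names are assumptions made for the example. A real site would need its own extraction rules.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.request import urlopen


class HeadingParser(HTMLParser):
    """Collects the text of every <h2> element (a hypothetical target)."""

    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False

    def handle_data(self, data):
        if self.in_h2:
            self.titles[-1] += data


def download_html(url):
    # Download the HTML code of a webpage.
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_titles(html):
    # Pick the relevant information from the HTML code.
    parser = HeadingParser()
    parser.feed(html)
    return [title.strip() for title in parser.titles]


def scrape(urls, fetch=download_html):
    # Repeat the download and extraction for every webpage.
    rows = []
    for url in urls:
        for title in extract_titles(fetch(url)):
            rows.append({"url": url, "title": title})
    return rows


def to_csv_text(rows):
    # Save the results in CSV form, which any spreadsheet program can open.
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["url", "title"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Passing `fetch` as a parameter lets the extraction logic be tried on saved HTML before any live website is contacted. To write a file instead of a string, the same `csv.DictWriter` calls can be used with `open(path, "w", newline="")`.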
Some website scraping examples:
Directories – It may be desirable to extract business listings
from one or more business directory sites, and store the data in a
spreadsheet. If the
data is stored as a
spreadsheet, it would be easier to sort and append notes to (e.g. which
businesses have been considered, contacted, classified etc).
A business listings spreadsheet would usually
have one business per row, and the types of information about the
business separated by columns (information such as company name, phone number,
website, description, and type). If more
than one directory is scraped, it may be possible to create a
combined "super" directory.
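As a rough sketch of how listings scraped from more than one directory might be combined, the Python function below merges rows that refer to the same business. The matching rule (company names compared after lower-casing and collapsing whitespace) and the `company_name` column name are assumptions for illustration. Real directories often need fuzzier matching.

```python
def normalise(name):
    # Lower-case and collapse whitespace so "Acme  Ltd" matches "acme ltd".
    return " ".join(name.lower().split())


def merge_directories(*directories):
    """Merge lists of listing rows, one row (dict) per business."""
    merged = {}
    for listings in directories:
        for row in listings:
            key = normalise(row["company_name"])
            combined = merged.setdefault(key, {})
            for field, value in row.items():
                # Keep the first non-empty value seen for each column.
                if value and not combined.get(field):
                    combined[field] = value
    return list(merged.values())
```

Keeping the first non-empty value per column means a phone number missing from one directory can be filled in from another.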
Ratings and Reviews Data – Some websites contain many
ratings and reviews. Ratings
data has its uses in academic and market research.
If the data is stored in a spreadsheet, it
is easier to analyse with statistical software (or simply
to summarise within the spreadsheet itself).
Product Lists – Many online stores (and
‘bricks-and-mortar’ businesses that provide
information about their products
online) have many products listed, but those listings often lack the
flexibility for the product data to be analysed thoroughly. A spreadsheet of product
data could include
information such as the product name, product code/number, product
price, description, product webpage URL, the parent company or brand
etc. The scraping
of multiple sites could make it
much easier to make price comparisons and give an indication of pricing.
Such lists could be used for business intelligence (e.g. gaining information
on competitors’ prices) or procurement.
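A price comparison across several scraped sites could look something like the Python sketch below. It assumes each scraped row carries hypothetical `site`, `product_code` and `price` fields, and that the same product code identifies the same product on every site. In practice this often needs checking.

```python
def compare_prices(*site_lists):
    """Return one summary row per product code across all scraped sites."""
    # Build a mapping: product code -> {site name: price}.
    by_code = {}
    for rows in site_lists:
        for row in rows:
            by_code.setdefault(row["product_code"], {})[row["site"]] = row["price"]

    comparison = []
    for code, prices in sorted(by_code.items()):
        cheapest_site = min(prices, key=prices.get)
        comparison.append({
            "product_code": code,
            "cheapest_site": cheapest_site,
            "lowest_price": prices[cheapest_site],
            "highest_price": max(prices.values()),
        })
    return comparison
```

Each summary row can then be written to a spreadsheet alongside the original product columns.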
After the data has been extracted, further programming might
be performed to enrich the data set, e.g. adding a binary variable stating
whether or not an entry contains a particular keyword.
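A keyword flag of this kind could be added with a few lines of Python. The sketch below assumes each entry is a dictionary with a text column (named `description` here purely for illustration) and uses a simple case-insensitive substring match. Whole-word matching would need a different test.

```python
def add_keyword_flag(rows, keyword, text_field="description"):
    """Return copies of the rows with a 0/1 column for the keyword."""
    needle = keyword.lower()
    flag_name = f"contains_{needle}"
    enriched = []
    for row in rows:
        new_row = dict(row)  # copy, so the original data is left untouched
        text = row.get(text_field, "")
        new_row[flag_name] = 1 if needle in text.lower() else 0
        enriched.append(new_row)
    return enriched
```

For example, `add_keyword_flag(rows, "organic")` would add a `contains_organic` column to every entry.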
See the Statistical
Programming / Data Processing Services page for more
information.
Some websites may have a large number of
files available for
download. To download all of the
files, it may be more efficient to automate the process rather than
download each file manually.
The downloading could be automated by having a program
written which takes the following steps:
- Download the HTML code of the webpages where the files are listed.
- Search the HTML code, and record the URLs where the files are located.
- Download the files.
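A minimal Python sketch of these steps, using only the standard library, is shown below. The file extensions, the `downloads` folder name and the function names are assumptions for illustration. Error handling, rate limiting and checks on a site's terms of use are omitted.

```python
import os
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen, urlretrieve


class LinkParser(HTMLParser):
    """Records the href of every <a> element on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def find_file_urls(html, page_url, extensions=(".pdf", ".csv")):
    # Search the HTML code and record the URLs that point at files.
    parser = LinkParser()
    parser.feed(html)
    urls = []
    for href in parser.hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        if urlparse(absolute).path.lower().endswith(extensions):
            urls.append(absolute)
    return urls


def download_files(page_url, folder="downloads"):
    # Download the HTML code of the page that lists the files.
    with urlopen(page_url) as response:
        html = response.read().decode("utf-8", errors="replace")
    os.makedirs(folder, exist_ok=True)
    # Download each file into the chosen folder.
    for file_url in find_file_urls(html, page_url):
        filename = os.path.basename(urlparse(file_url).path)
        urlretrieve(file_url, os.path.join(folder, filename))
```

Separating `find_file_urls` from the downloading makes it possible to review the list of URLs before any files are fetched.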