26. Web Scraping

Web scraping is a general term for techniques that automate the gathering of data from a website.

Web scraping examples

  • Downloading images
  • Copying information from a page

Rules of Web Scraping

  • Always try to get permission before scraping.
  • If you make too many scraping attempts or requests, your IP address may get blocked!
  • Some sites automatically block scraping software.
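One common way to follow these rules is to check a site's robots.txt and to pause between requests. The sketch below uses only the standard library. The robots.txt text, the URLs, the user agent name "my-scraper" and the helper names are all made up for illustration, and a real script would download robots.txt from the site instead of inlining it.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, inlined so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(parser, urls, user_agent="my-scraper", delay_seconds=1.0):
    """Return the URLs robots.txt permits, pausing after each permitted one."""
    allowed = []
    for url in urls:
        if parser.can_fetch(user_agent, url):
            allowed.append(url)  # a real script would request the page here
            time.sleep(delay_seconds)  # wait so the site is not flooded
    return allowed


parser = build_parser(ROBOTS_TXT)
urls = [
    "https://example.com/index.html",
    "https://example.com/private/data.html",
]
print(allowed_urls(parser, urls, delay_seconds=0.0))
```

A robots.txt file is not the same thing as permission from the site owner, so this check complements the first rule rather than replacing it.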

Limitations of Web Scraping

  • In general, every website is unique, which means every web scraping script has to be written for one specific site.
  • A slight change or update to a website may completely break your web scraping script.
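Because a layout change can break a script, it may help to check that an element was actually found before using it. The sketch below assumes the BeautifulSoup library (introduced later in this section) with Python's built-in "html.parser". The two page strings and the helper name extract_heading are hypothetical.

```python
from bs4 import BeautifulSoup


def extract_heading(html):
    """Return the text of the first <h1>, or None if the page has no <h1>."""
    soup = BeautifulSoup(html, "html.parser")
    heading = soup.select_one("h1")  # select_one returns None when nothing matches
    if heading is None:
        return None
    return heading.get_text(strip=True)


# Hypothetical page before and after a redesign that swapped <h1> for <h2>.
old_page = "<html><body><h1> Website Header </h1></body></html>"
new_page = "<html><body><h2> Website Header </h2></body></html>"

print(extract_heading(old_page))
print(extract_heading(new_page))
```

Returning None instead of raising an error lets the calling code decide whether to skip the page or report that the site has changed.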

How a website works

When a browser loads a website, the user gets to see the "front-end" of the website.
The browser converts the HTML code into readable content and displays it.

The main front-end components of a website are:
* HTML
* CSS
* JavaScript

HTML is used to create the basic structure and content of a webpage.
CSS is used for the design and style of a webpage: where elements are placed and how they look.
JavaScript is used to define the interactive elements of a webpage.

Python can view these HTML and CSS elements programmatically, and then extract information from the website.

HTML

  • HTML stands for HyperText Markup Language and is present on virtually every website on the internet.
  • We can right-click on a page in the browser and select 'View page source' to get the HTML code of that page.

Example

<!DOCTYPE html>
<html>
  <head>
    <title> Title on Browser Tab </title>
  </head>
  <body>
    <h1> Website Header </h1>
    <p> Some paragraph </p>
  </body>
</html>
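To illustrate how Python can read this structure, here is a minimal sketch that feeds the example page to the standard library's html.parser module and collects the text found inside each tag. This is not the BeautifulSoup approach used later in this section. The class name TextByTag is made up, and the sketch ignores self-closing tags such as <br>.

```python
from html.parser import HTMLParser

PAGE = """<!DOCTYPE html>
<html>
  <head>
    <title> Title on Browser Tab </title>
  </head>
  <body>
    <h1> Website Header </h1>
    <p> Some paragraph </p>
  </body>
</html>"""


class TextByTag(HTMLParser):
    """Collect the text that appears directly inside each tag."""

    def __init__(self):
        super().__init__()
        self.open_tags = []    # stack of tags that are currently open
        self.text_by_tag = {}  # tag name -> list of text fragments

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.open_tags:
            # Store the text under the innermost open tag.
            self.text_by_tag.setdefault(self.open_tags[-1], []).append(text)


parser = TextByTag()
parser.feed(PAGE)
parser.close()
print(parser.text_by_tag)
```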

CSS

  • CSS stands for Cascading Style Sheets.
  • CSS gives 'style' to a website, such as changing colors and fonts.
  • CSS uses selectors (tag names, ids and classes) to define which HTML elements will be styled.

Example

<!DOCTYPE html>
<html>
  <head>
    <link rel="stylesheet" href="styles.css">
    <title> Some title </title>
  </head>
  <body>
    <p id='para2'> Some Text </p>
  </body>
</html>

CSS File (styles.css)

#para2 {
  color: red;
}

Example

<!DOCTYPE html>
<html>
  <head>
    <link rel="stylesheet" href="styles.css">
    <title> Some title </title>
  </head>
  <body>
    <p class='cool'> Some Text </p>
  </body>
</html>

CSS File (styles.css)

.cool {
  color: red;
  font-family: verdana;
}
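The same id and class names that CSS uses for styling can be used to locate elements from Python. The sketch below assumes the BeautifulSoup library with Python's built-in "html.parser". It combines the two paragraphs from the examples above into one hypothetical page, and the text 'Other Text' is changed only so the two results can be told apart.

```python
from bs4 import BeautifulSoup

# Hypothetical page that merges the id example and the class example.
HTML = """
<html>
  <body>
    <p id='para2'> Some Text </p>
    <p class='cool'> Other Text </p>
  </body>
</html>
"""

soup = BeautifulSoup(HTML, "html.parser")

# '#' selects by id and '.' selects by class, as in the CSS rules.
id_text = [tag.get_text(strip=True) for tag in soup.select("#para2")]
class_text = [tag.get_text(strip=True) for tag in soup.select(".cool")]

print(id_text)
print(class_text)
```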

CSS File

p {
  color: red;
  font-family: courier;
  font-size: 160%;
}

.someclass {
  color: green;
  font-family: verdana;
  font-size: 200%;
}

#someid {
  color: blue;
}

Note

HTML contains the information.
CSS contains the styling.
We can use HTML tags and CSS selectors (ids and classes) to locate specific information on a page.

  • To web scrape with Python we can use the BeautifulSoup and requests libraries.
  • These are external libraries, not part of the Python standard library, so we need to install them with either conda or pip at the command line:

      pip install requests
      pip install lxml
      pip install bs4
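Once the libraries are installed, a typical script downloads a page with requests and hands the HTML text to BeautifulSoup. The following is a minimal sketch, not a complete scraper. The function names fetch_html and page_title are made up, and it uses the built-in "html.parser" as an assumption so that it also runs without lxml ("lxml" can be passed instead when it is installed).

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10.0):
    """Download a page and return its HTML as text."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise an error for 4xx/5xx responses
    return response.text


def page_title(html):
    """Return the stripped <title> text, or None if the page has no title."""
    soup = BeautifulSoup(html, "html.parser")
    if soup.title is None or soup.title.string is None:
        return None
    return soup.title.string.strip()
```

For example, page_title(fetch_html("https://example.com")) would download that page and return the text of its title tag. Keeping the download and the parsing in separate functions means the parsing step can be tried on a saved HTML string without touching the network.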

Web Scraping syntax

Syntax                        Match Results
soup.select('div')            All elements with the 'div' tag
soup.select('#some_id')       Elements with id='some_id'
soup.select('.some_class')    Elements with class='some_class'
soup.select('div span')       Any span element anywhere inside a div element
soup.select('div > span')     Any span element directly inside a div element, with nothing in between
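The selectors in the table can be tried on a small page. The sketch below assumes BeautifulSoup with Python's built-in "html.parser". The HTML snippet and the helper name texts are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical page: one span directly inside the div, one nested deeper
# inside a <p>, and one outside the div.
HTML = """
<div>
  <span id="some_id">first</span>
  <p><span class="some_class">second</span></p>
</div>
<span>third</span>
"""

soup = BeautifulSoup(HTML, "html.parser")


def texts(selector):
    """Return the stripped text of every element matching the CSS selector."""
    return [tag.get_text(strip=True) for tag in soup.select(selector)]


print(len(soup.select("div")))  # number of div elements on the page
for selector in ["#some_id", ".some_class", "div span", "div > span", "span"]:
    print(selector, texts(selector))
```

The difference between the last two rows of the table shows up here: 'div span' matches both spans inside the div, while 'div > span' matches only the one that is a direct child.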