26.web scraping
Web scraping is a general term for techniques involving automating the gathering of data from a website.
Web scraping examples
Downloading Images
Copying the Information
Rules of Web Scraping
- Always try to get permission before scrapping.
- If you make too many scraping attempts or requests your IP address may get blocked!
- Some sites automatically block scraping software.
Limitations of web Scraping
- In general every website is unique, which means every web scraping script is unique.
- A slight change or update to a website may completely break your web scraping script.
How website works??
When a browser loads a website, the gets to see the "front-end" of the website.
Browser will be converting the html code to the readable information and displays it.
Main front end components of a website are
* HTML
* CSS
* JAVA SCRIPT
HTML is used to create the basic structure and content of a webpage.
CSS is used for the design and style of a webpage, where elements are placed and how it works.
JavaScript is used to define the interactive elements of a webpage.
Python can view these HTML and CSS elements programatically, and then extract information from the website.
HTML
- HTML is Hypertext Markup Language and is present on every website on the internet.
- We can right-click on the website and select 'View page source' to get the HTML code of that page.
Example
<!DOCTYPE html>
<html>
<head>
<title> Title on Browser Tab </title>
</head>
<body>
<h1> Website Header </h1>
<p> Some paragraph </p>
</body>
</html>
CSS
- CSS stands for Cascading style sheets.
- CSS gives 'style' to a website, such as changing colors and fonts
- CSS uses tags to define what html elements will be styled.
Example
<!DOCTYPE html>
<html>
<head>
<link rel= "stylesheet" href="styles.css">
<title> Some title </title>
</head>
<body>
<p id='para2'> Some Text </p>
</body>
</html>
#para2 {
color: red;
}
Example
<!DOCTYPE html>
<html>
<head>
<link rel= "stylesheet" href="styles.css">
<title> Some title </title>
</head>
<body>
<p class='cool'> Some Text </p>
</body>
</html>
.cool {
color: red;
font-family: verdana;
}
CSS File
p {
color: red;
font-family: courier;
font-size: 160%
}
.someclass {
color: green;
font-family: verdana;
font-size: 200%
}
#someid {
color: blue;
}
Note
HTML contains the information.
CSS contains the styling.
We can use HTML and CSS tags to locate specific information on a page.
- To Web Scrape with python we can use the BeautifulSoup and requests libraries.
- These are external libraries outside of python so we need to install them with either conda or pip at the command line.
pip install requests
pip install lxml
pip install bs4
Web Scraping syntax
Syntax | Match Results |
---|---|
soup.select('div') | All elements with 'div' tag |
soup.select('#some_id') | Elements containing id='some_id' |
soup.select('.some_class') | Elements containing class='some_class' |
soup.select('div span') | Any element named span within a division element |
soup.select('div > span) | Any elements named span directly within a div element, with nothing in between. |