How to Web Scrape with a Python Function?
- tinyytopic.com
- on Feb 16, 2023
Install the following module(s) if you haven’t installed them already:
pip install requests
Ready-to-use Python function to scrape the content of a website:
def webscrap(url, TimeoutSec=5, verify_ssl=True):
    # Scrape any permissible webpage by sending its URL
    try:
        page = requests.get(
            url,
            timeout=TimeoutSec,
            verify=verify_ssl,
            headers={"User-Agent": "Mozilla/5.0 (X11; CrOS x86_64 12871.102.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.141 Safari/537.36"},
        )
    except requests.exceptions.RequestException:
        return 'Webpage is not reachable!'
    # Extract the HTML text
    HtmlTxt = page.text
    # Get the HTTP status code
    status = page.status_code
    return HtmlTxt, status
Call the function from your main code, for example:
import requests
print(webscrap("https://www.digikey.com/en/products/detail/vishay-dale/CRCW1206100RFKEA/1176530"))
The output of the code is (truncated here):
('<!DOCTYPE html><html lang="en-us" dir="ltr"><head><meta charSet="utf-8"/><meta name="theme-color" content="#CC0000"/><meta name="generator" content="Digi-Key Search Engine"/><link rel="icon" type="image/x-icon" href="/favicon.ico"/><link rel="stylesheet" type="text/css" href="//www.digikey.com/-/media/Designer/Global/fonts/fonts.css?la=en-US&ts=2943f40b-f61e-49aa-952b-963240340aa8"/><link rel="stylesheet" type="text/css" href="//www.digikey.com/-/digit/global.css"/><link rel="stylesheet" type="text/css" href="//www.digikey.com/-/media/Designer/Global/EnavHeaderMVC/CSS/empty.css?la=en-US&ts=2aecf11d-87f3-4f54-a433-8e8d1acdb795"/><link rel="stylesheet" type="text/css" href="//www.digikey.com/-/media/Designer/Header/ENav2021/CSS/combined.css?la=en-US&ts=506e0306-ad96-41ba-8648-a3458e01664f"/><link rel="stylesheet" type="text/css" href="//www.digikey.com/-/media/Designer/Header/ENav2021/CSS/banner.css?la=en-US&ts=498625d8-0dfd-41a2-920b-debb017ca9ff"/><link rel="stylesheet" type="text/css" href="//www.digikey.com/-/media/Designer/Header/ENav2021/CSS/cookie-notice.css?la=en-US&ts=c04caf01-0bbe-4f42-a7a2-a55af095fcd0"/><link rel="stylesheet" type="text/css" href="//www.digikey.com/-/media/Designer/Header/ENav2021/CSS/modal.css?la=en-US&ts=1db0b6d2-b2df-405a-8100-203d26a93771"/><link rel="stylesheet" type="text/css" href="//www.digikey.com/-/media/Designer/Misc/SuggestionSearchBar/CSS/searchsuggest.css?la=en-US&ts=41f98d6f-b221-4c88-a872-a91f60fe5338"/><link rel="stylesheet" type="text/css" href="//www.digikey.com/-/media/Designer/Footer/Footer%20Redesign/MVC/CSS/cobrowse.css?la=en-US&ts=6087909a-ef4a-4385-9dfd-3c8415fcba01"/><link rel="stylesheet" type="text/css" href="//www.digikey.com/-/media/Designer/Footer/Footer%20Redesign/MVC/CSS/footer.css?la=en-US&ts=c5d5e528-2e9d-49ab-a2a0-72e2f91b2e86"/><link rel="stylesheet" type="text/css" 
href="//www.digikey.com/-/media/Designer/Footer/Footer%20Redesign/MVC/CSS/intl-country-select-popup.css?la=en-US&ts=0fb63111-2531-4d25-98ac-31bca9089fe2"/><link rel="stylesheet" type="text/css" href="//www.digikey.com/-/media/Designer/Footer/Footer%20Redesign/MVC/CSS/needHelp.css?la=en-US&ts=95faf60f-8e96-41c1-a4b3-5d908c0628d1"/><script type="text/javascript">window[\'__DK_STORE__\'] = window[\'__DK_STORE__\'] || {\n PRICING_REQUEST_TIMEOUT:6000,\n FEATURE_FLAG_MOSAIC_CART:undefined\n };</script><script type="text/javascript">var sdkInstance="appInsightsSDK";window[sdkInstance]="appInsights";var
How does the Python function work?
This is a Python function that performs web scraping on a specified URL using the Requests library. Here’s how the function works:
- The function webscrap takes three arguments: url (the URL of the webpage to scrape), TimeoutSec (the maximum time, in seconds, to wait for a response before timing out; defaults to 5), and verify_ssl (a boolean that indicates whether SSL certificates should be verified; defaults to True).
- The function uses a try-except block to catch any requests.exceptions.RequestException raised while sending the request, such as a connection error or a timeout. If the request fails in this way, the function returns the message 'Webpage is not reachable!'.
- If the request succeeds, the function reads the HTML content of the webpage from the .text attribute of the Response object returned by the get function of the requests library. A response with an error status such as 404 does not raise an exception, so its body is returned as well.
- The function also retrieves the HTTP status code of the response from the .status_code attribute of the Response object.
- Finally, the function returns a tuple containing the HTML content of the webpage and the HTTP status code.
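Because the function returns a plain string on failure and a tuple on success, the caller has to tell the two cases apart before using the result. The following is only a minimal sketch of one way to do that; the helper names unpack_result and is_ok are hypothetical and not part of the original function, and the sample values stand in for real return values so that no network access is needed.

```python
def unpack_result(result):
    """Split a webscrap() return value into (html, status, error).

    Hypothetical helper: webscrap() returns a (html, status) tuple on
    success and an error string when the request raised an exception.
    """
    if isinstance(result, tuple):
        html, status = result
        return html, status, None
    # Anything that is not a tuple is treated as the error message.
    return None, None, result


def is_ok(status):
    """Return True only for 2xx HTTP status codes."""
    return status is not None and 200 <= status < 300


# Simulated return values, so the sketch runs without a network connection.
success = ("<html>ok</html>", 200)
not_found = ("<html>missing</html>", 404)
failure = "Webpage is not reachable!"

for result in (success, not_found, failure):
    html, status, error = unpack_result(result)
    if error is not None:
        print("Request failed:", error)
    elif is_ok(status):
        print("Got", len(html), "characters with status", status)
    else:
        print("Server answered with status", status)
```

In real use, the value passed to unpack_result would be the return value of webscrap(url).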
Note that the function also sends a custom User-Agent header that imitates a web browser. Some servers reject requests that carry the default User-Agent header of the requests library, so the custom header reduces the chance of the request being blocked.
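To see what the custom header replaces, the sketch below compares the default User-Agent of a requests Session with the one that would be sent once the header is overridden. It only prepares the request and never sends it; the URL https://example.com/ is a placeholder chosen for illustration, not one used in the article.

```python
import requests

# The browser-like User-Agent string used by the webscrap function.
custom_ua = (
    "Mozilla/5.0 (X11; CrOS x86_64 12871.102.0) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/81.0.4044.141 Safari/537.36"
)

session = requests.Session()

# A new Session carries the library default, of the form "python-requests/<version>".
default_ua = session.headers["User-Agent"]
print("Default:", default_ua)

# Prepare (but do not send) a request with the custom header.
# Placeholder URL, assumed for illustration only.
request = requests.Request(
    "GET", "https://example.com/", headers={"User-Agent": custom_ua}
)
prepared = session.prepare_request(request)

# Headers given on the request take precedence over the session defaults.
sent_ua = prepared.headers["User-Agent"]
print("Sent:   ", sent_ua)
```

Whether a given server actually blocks the default value depends on that server, so this only shows which header would be sent.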