How to Web Scrape with a Python Function?


Install the requests module if you haven’t installed it already:

pip install requests

Ready-to-use Python function to scrape the content of a web page:

import requests

def webscrap(url, TimeoutSec=5, verify_ssl=True):
    """Fetch a web page that you are permitted to scrape.

    Returns (html_text, status_code) when the request completes, or an
    error string when it raises (timeout, connection failure, etc.).
    """
    try:
        page = requests.get(url, timeout=TimeoutSec, verify=verify_ssl, headers={"User-Agent": "Mozilla/5.0 (X11; CrOS x86_64 12871.102.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.141 Safari/537.36"})
    except requests.exceptions.RequestException:
        return 'Webpage is not reachable!'

    # Extract the HTML text of the response
    HtmlTxt = page.text

    # Get the HTTP status code (200 means success)
    status = page.status_code

    return HtmlTxt, status

Call the function from your main code, for example:

import requests

print(webscrap("https://www.digikey.com/en/products/detail/vishay-dale/CRCW1206100RFKEA/1176530"))

The output of the code is (truncated):

('<!DOCTYPE html><html lang="en-us" dir="ltr"><head><meta charSet="utf-8"/><meta name="theme-color" content="#CC0000"/><meta name="generator" content="Digi-Key Search Engine"/><link rel="icon" type="image/x-icon" href="/favicon.ico"/><link rel="stylesheet" type="text/css" href="//www.digikey.com/-/media/Designer/Global/fonts/fonts.css?la=en-US&amp;ts=2943f40b-f61e-49aa-952b-963240340aa8"/><link rel="stylesheet" type="text/css" href="//www.digikey.com/-/digit/global.css"/><link rel="stylesheet" type="text/css" href="//www.digikey.com/-/media/Designer/Global/EnavHeaderMVC/CSS/empty.css?la=en-US&amp;ts=2aecf11d-87f3-4f54-a433-8e8d1acdb795"/><link rel="stylesheet" type="text/css" href="//www.digikey.com/-/media/Designer/Header/ENav2021/CSS/combined.css?la=en-US&amp;ts=506e0306-ad96-41ba-8648-a3458e01664f"/><link rel="stylesheet" type="text/css" href="//www.digikey.com/-/media/Designer/Header/ENav2021/CSS/banner.css?la=en-US&amp;ts=498625d8-0dfd-41a2-920b-debb017ca9ff"/><link rel="stylesheet" type="text/css" href="//www.digikey.com/-/media/Designer/Header/ENav2021/CSS/cookie-notice.css?la=en-US&amp;ts=c04caf01-0bbe-4f42-a7a2-a55af095fcd0"/><link rel="stylesheet" type="text/css" href="//www.digikey.com/-/media/Designer/Header/ENav2021/CSS/modal.css?la=en-US&amp;ts=1db0b6d2-b2df-405a-8100-203d26a93771"/><link rel="stylesheet" type="text/css" href="//www.digikey.com/-/media/Designer/Misc/SuggestionSearchBar/CSS/searchsuggest.css?la=en-US&amp;ts=41f98d6f-b221-4c88-a872-a91f60fe5338"/><link rel="stylesheet" type="text/css" href="//www.digikey.com/-/media/Designer/Footer/Footer%20Redesign/MVC/CSS/cobrowse.css?la=en-US&amp;ts=6087909a-ef4a-4385-9dfd-3c8415fcba01"/><link rel="stylesheet" type="text/css" href="//www.digikey.com/-/media/Designer/Footer/Footer%20Redesign/MVC/CSS/footer.css?la=en-US&amp;ts=c5d5e528-2e9d-49ab-a2a0-72e2f91b2e86"/><link rel="stylesheet" type="text/css" 
href="//www.digikey.com/-/media/Designer/Footer/Footer%20Redesign/MVC/CSS/intl-country-select-popup.css?la=en-US&amp;ts=0fb63111-2531-4d25-98ac-31bca9089fe2"/><link rel="stylesheet" type="text/css" href="//www.digikey.com/-/media/Designer/Footer/Footer%20Redesign/MVC/CSS/needHelp.css?la=en-US&amp;ts=95faf60f-8e96-41c1-a4b3-5d908c0628d1"/><script type="text/javascript">window[\'__DK_STORE__\'] = window[\'__DK_STORE__\'] || {\n    PRICING_REQUEST_TIMEOUT:6000,\n    FEATURE_FLAG_MOSAIC_CART:undefined\n  };</script><script type="text/javascript">var sdkInstance="appInsightsSDK";window[sdkInstance]="appInsights";var 
How does the Python function work?

The function fetches the HTML of a given URL using the Requests library. Here’s how it works:

  1. The function webscrap takes three arguments: url (the URL of the webpage to scrape), TimeoutSec (the maximum time the function should wait for a response before timing out, which defaults to 5 seconds), and verify_ssl (a boolean value that indicates whether SSL certificates should be verified, which defaults to True).
  2. The function wraps the request in a try-except block. If the requests library raises an exception (for example a timeout, a connection failure, or an SSL error), the function returns the message ‘Webpage is not reachable!’. An HTTP error response such as 404 does not raise an exception, so it is returned like any other page, together with its status code.
  3. If no exception is raised, the function reads the HTML content of the webpage from the .text attribute of the Response object returned by requests.get.
  4. The function also reads the HTTP status code of the response from the .status_code attribute of the Response object.
  5. Finally, the function returns a tuple containing the HTML content of the webpage and the HTTP status code.
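
Because webscrap returns a plain string on failure and a tuple on success, the calling code has to check which of the two it received. One way to do that is sketched below. The helper names unpack_result and is_success, and the three-value return shape, are assumptions made for illustration and are not part of the original function.

```python
def unpack_result(result):
    """Normalize webscrap's return value into (html, status, error).

    Hypothetical helper: webscrap returns a (html, status) tuple when the
    request completes and a plain error string when it raises.
    """
    if isinstance(result, tuple):
        html, status = result
        return html, status, None
    # Anything that is not a tuple is treated as the error message
    return None, None, result


def is_success(status):
    """Return True only for 2xx HTTP status codes."""
    return status is not None and 200 <= status < 300


# Simulated return values, so the sketch runs without network access
for result in [("<html></html>", 200), ("<html>Not found</html>", 404),
               "Webpage is not reachable!"]:
    html, status, error = unpack_result(result)
    if error is not None:
        print("Request failed:", error)
    elif is_success(status):
        print("OK, received", len(html), "characters")
    else:
        print("Server answered with status", status)
```

In real use, pass the value returned by webscrap(url) to unpack_result instead of the simulated values.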

Note that the function also sends a custom User-Agent header that mimics a web browser. Some servers reject requests that carry the default User-Agent of the requests library, so this header makes blocking less likely, although it does not guarantee access.
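
To see what the custom header changes, the sketch below compares the library's default User-Agent with a browser-like one set on a Session. It only prepares a request and sends nothing over the network. The BROWSER_UA constant name and the example.com URL are placeholders chosen for illustration.

```python
import requests

# Same browser-like User-Agent string used in webscrap
# (the constant name is illustrative)
BROWSER_UA = (
    "Mozilla/5.0 (X11; CrOS x86_64 12871.102.0) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/81.0.4044.141 Safari/537.36"
)

# Value that requests sends when no User-Agent is supplied
default_ua = requests.utils.default_user_agent()
print("Default:", default_ua)

# A Session applies its headers to every request made through it
session = requests.Session()
session.headers.update({"User-Agent": BROWSER_UA})

# Prepare (but do not send) a request to inspect the final headers
prepared = session.prepare_request(requests.Request("GET", "https://example.com/"))
print("Custom: ", prepared.headers["User-Agent"])
```

Setting the header once on a Session can be convenient when many pages are fetched from the same site, since each request made through the session carries it automatically.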
