Find a string within HTML text


How do you find a string within HTML text using the ready-to-use function?

Ready-to-use Python function to find a string in an HTML text:

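def find_string_for_webscrap(HtmlTxt, first_char, last_char, start_location=1):
    # Return the text found between first_char and last_char in HtmlTxt,
    # together with the index at which last_char was found.

    output_char = 'Not found!'
    EndLoc = 0

    # Locate the start marker; InitLoc points just past it
    InitLoc = HtmlTxt.find(first_char, start_location) + len(first_char)
    # find() returns -1 on failure, so InitLoc < len(first_char) means "not found"
    if InitLoc < len(first_char): return output_char, EndLoc

    # Locate the end marker, searching from just past the start marker
    EndLoc = HtmlTxt.find(last_char, InitLoc)
    # find() returns -1 on failure; EndLoc is then returned as -1
    if EndLoc < 1: return output_char, EndLoc

    output_char = HtmlTxt[InitLoc:EndLoc]
    output_char = ' '.join(output_char.split()) # collapse consecutive whitespace into a single space

    return output_char, EndLoc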

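Call the function from your main code, for example as below. Here webscrap is a separate helper, not defined in this article, that is assumed to return the page's HTML text and the HTTP status code: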

HtmlTxt, status_code = webscrap("https://www.digikey.com/en/products/detail/vishay-dale/CRCW1206100RFKEA/1176530", 30)
txt, EndLoc = find_string_for_webscrap(HtmlTxt, 'ref_part_description=', ';', 1)
print(txt)
txt, EndLoc = find_string_for_webscrap(HtmlTxt, 'ref_part_available', ';', EndLoc)
print(txt)

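The output of the code is shown below. The second line is just "=" because the marker 'ref_part_available' does not include the trailing '=', so that character is returned as part of the extracted text: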

RES SMD 100 OHM 1% 1/4W 1206
=

How does the function work?

The Python function find_string_for_webscrap takes four parameters:

  1. HtmlTxt – the string containing the HTML text to search
  2. first_char – the marker string (one or more characters) that immediately precedes the text you want
  3. last_char – the marker string (one or more characters) that immediately follows the text you want
  4. start_location (optional) – an integer index at which the search starts. If not specified, the default value is 1. Because Python string indices start at 0, the default skips the very first character of HtmlTxt; pass 0 to search from the very beginning.
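
To make the effect of start_location concrete, here is a small sketch using only the built-in str.find() method. The html string is made up for illustration:

```python
# A made-up snippet; not taken from any real page
html = "<b>bold</b> and <i>italic</i>"

# str.find(sub, start) returns the 0-based index of the first match
# at or after `start`, or -1 if there is no match
print(html.find("<b>", 0))  # 0  -> the match sits at the very first character
print(html.find("<b>", 1))  # -1 -> starting at 1 skips the match at index 0
print(html.find("<i>", 1))  # 16 -> later matches are found as usual
```

In practice the text you want is rarely at index 0 of an HTML page, so the default of 1 seldom matters, but it is worth knowing when testing on short strings.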

The function first initializes the output string to ‘Not found!’ and the end location to 0. It then looks for first_char with the find() method, starting at start_location, and adds the length of first_char to the result to get the initial location, which is the index just past the start marker. Because find() returns -1 when nothing matches, an initial location smaller than the length of first_char means the start marker was not found, and the function returns ‘Not found!’ with an end location of 0.
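
The start-marker check can be sketched in isolation as follows. The helper name locate_start and the sample text are hypothetical and only serve to show the arithmetic:

```python
def locate_start(text, first_char, start_location=1):
    # Hypothetical helper: same arithmetic as the first step of the function
    init_loc = text.find(first_char, start_location) + len(first_char)
    # find() gives -1 on failure, so a failed search yields len(first_char) - 1
    found = init_loc >= len(first_char)
    return init_loc, found

sample = "x name=Alice;"

print(locate_start(sample, "name="))  # (7, True): 'name=' starts at 2, plus 5
print(locate_start(sample, "age="))   # (3, False): -1 plus 4 is less than 4
```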

If the start marker is found, the function searches for last_char with the find() method, starting from the initial location. If the result is less than 1, the end marker was not found and the function returns ‘Not found!’. Note that in this case the returned end location is the raw find() result, which is -1 rather than 0.
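
One consequence worth checking before chaining calls: as the code is written, a missing end marker leaves the raw find() result, -1, in the end location. The sketch below uses only str.find() on a made-up string to show why reusing that value as a start index is risky. The fallback to 0 at the end is just one possible guard, not part of the original function:

```python
# Made-up text with a start marker but no ';' terminator
text = "key=value with no terminator"

init_loc = text.find("key=") + len("key=")
end_loc = text.find(";", init_loc)
print(end_loc)  # -1 -> the end marker is missing

# A negative start index counts from the end of the string, so a search
# started at -1 only looks at the last character and misses earlier matches
print(text.find("key=", end_loc))  # -1, although 'key=' exists at index 0

# One possible guard (an assumption): fall back to 0 on failure
next_start = end_loc if end_loc > 0 else 0
print(text.find("key=", next_start))  # 0
```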

If both markers are found, the function extracts the substring between the initial and end locations using slicing. Finally, it collapses every run of consecutive whitespace (spaces, tabs, line breaks) into a single space and trims leading and trailing whitespace using the split() and join() methods, then returns the cleaned string together with the end location.
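
The whitespace clean-up step can be tried on its own. The raw string here is an invented example of the kind of text that comes out of indented HTML:

```python
# Invented example with leading/trailing spaces, a newline and a tab
raw = "  RES SMD\n\t100 OHM   1%  "

# split() with no argument splits on any run of whitespace and drops
# empty pieces; join() then puts exactly one space between the words
clean = " ".join(raw.split())

print(clean)  # RES SMD 100 OHM 1%
```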

Overall, this function returns the text located between two specified marker strings in the HTML text, together with the end location, which can be passed as start_location to a later call to continue searching from that point.
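
If you want to try the function without fetching a live page, the following minimal sketch runs it on an inline string. The HTML snippet is made up for illustration and is not DigiKey's real markup; the function body repeats the one given at the top of this article so the example runs on its own:

```python
def find_string_for_webscrap(HtmlTxt, first_char, last_char, start_location=1):
    # Same logic as the function at the top of the article
    output_char = 'Not found!'
    EndLoc = 0

    InitLoc = HtmlTxt.find(first_char, start_location) + len(first_char)
    if InitLoc < len(first_char): return output_char, EndLoc

    EndLoc = HtmlTxt.find(last_char, InitLoc)
    if EndLoc < 1: return output_char, EndLoc

    output_char = ' '.join(HtmlTxt[InitLoc:EndLoc].split())
    return output_char, EndLoc


# Made-up stand-in for a downloaded page
html = "<script>var ref_part_description=RES SMD   100 OHM;ref_part_available=250;</script>"

# First value: search from the default start location
desc, end = find_string_for_webscrap(html, 'ref_part_description=', ';')
print(desc)  # RES SMD 100 OHM

# Second value: continue from where the first search ended
stock, end = find_string_for_webscrap(html, 'ref_part_available=', ';', end)
print(stock)  # 250

# A marker that does not occur in the text
missing, loc = find_string_for_webscrap(html, 'ref_part_price=', ';')
print(missing, loc)  # Not found! 0
```

Including the '=' in the second marker keeps it out of the returned text.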
