How to get and parse HTML pages in Python?

This page explains how to download an HTML page from its URL with Python and extract its metadata, title and links. You can try this example online on Replit.

The following Python code uses the packages BeautifulSoup and requests.

from bs4 import BeautifulSoup
import requests

Get HTML from URL

The first part gets the HTML source code from any URL.

url = ''  # set this to the address of the page to download
r = requests.get(url, allow_redirects=True)
print(r.text)
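
A request can fail (invalid URL, network error, HTTP error status), so it can help to wrap the download in a small function. The following is only a sketch: the function name fetch_html and the 10 second timeout are assumptions chosen for illustration, not part of the original example.

```python
import requests

def fetch_html(url, timeout=10):
    """Return the HTML source of url as text, or None if the request fails.

    fetch_html and the default timeout are hypothetical choices for this sketch.
    """
    try:
        r = requests.get(url, allow_redirects=True, timeout=timeout)
        # Turn HTTP error statuses (4xx, 5xx) into exceptions
        r.raise_for_status()
    except requests.exceptions.RequestException as error:
        # Covers invalid URLs, connection errors, timeouts and HTTP errors
        print(f"Could not download {url!r}: {error}")
        return None
    return r.text
```

With an empty URL, as in the snippet above, requests raises an exception before sending anything, so fetch_html('') returns None instead of stopping the script.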

Parse with BeautifulSoup

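To parse the HTML, you first need to convert the HTML source to a BeautifulSoup object, which represents the document as a nested data structure. The html5lib parser used here is a separate package that must be installed alongside BeautifulSoup: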

soup = BeautifulSoup(r.text, features="html5lib")
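
To see what the nested data structure looks like without downloading anything, here is a minimal sketch that parses a hard-coded HTML string. The sample document is made up for illustration, and the parser is swapped for html.parser (shipped with Python) so that the sketch does not depend on html5lib.

```python
from bs4 import BeautifulSoup

# Made-up sample document, kept on one line so that no whitespace
# text nodes appear between the tags
sample_html = "<html><head><title>Sample page</title></head><body><h1>Hello</h1><p>Text</p></body></html>"

# Assumption: html.parser is used here instead of html5lib
sample_soup = BeautifulSoup(sample_html, features="html.parser")

# Tags can be reached by walking down the tree
head_title = sample_soup.head.title.string

# Names of the direct children of the body tag
body_children = [child.name for child in sample_soup.body.children]

print(head_title)
print(body_children)
```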

Get header page title

To get the page title declared in the document head (part of the metadata), look for the title tag:

# Get header page title
title = soup.find('title')
print(title.string)

Get page title (h1 tag)

To get the main heading of the page (heading 1), look for the h1 tag:

title = soup.find('h1')
print(title.string)
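
Two cases are worth handling when reading a title or heading: find returns None when the tag is missing, and the string attribute is None when the tag contains nested tags. The following sketch shows one possible way to cover both; the helper name tag_text and the sample HTML are assumptions for illustration.

```python
from bs4 import BeautifulSoup

def tag_text(soup, tag_name):
    """Return the text of the first tag_name element, or None if it is absent.

    tag_text is a hypothetical helper written for this sketch.
    """
    tag = soup.find(tag_name)
    if tag is None:
        return None
    # get_text() joins all nested strings, so it also works when the tag
    # contains child tags (a case where .string is None)
    return tag.get_text(" ", strip=True)

# Made-up sample document; html.parser is used to avoid extra dependencies
sample_html = "<html><head><title>Sample page</title></head><body><h1>Hello <b>world</b></h1></body></html>"
sample_soup = BeautifulSoup(sample_html, features="html.parser")

print(tag_text(sample_soup, "title"))
print(tag_text(sample_soup, "h1"))
print(tag_text(sample_soup, "h2"))
```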

Get page description

To get the page description (metadata), look for the meta tag whose name attribute is description:

description = soup.find("meta", attrs={'name': 'description'})
if description is not None:
  print(description["content"])

To get the Open Graph description, search on the property attribute with the value og:description instead of name with the value description:

description = soup.find("meta", property="og:description")
if description is not None:
  print(description["content"])
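
Since a page may provide only one of the two descriptions, the two lookups can be combined with a fallback. This is a sketch under assumptions: the function name get_description, the lookup order and the sample HTML are illustrative choices.

```python
from bs4 import BeautifulSoup

def get_description(soup):
    """Return the standard meta description, else the Open Graph one, else None.

    get_description is a hypothetical helper written for this sketch.
    """
    for attrs in ({"name": "description"}, {"property": "og:description"}):
        tag = soup.find("meta", attrs=attrs)
        # Skip missing tags and tags without a usable content attribute
        if tag is not None and tag.get("content"):
            return tag["content"].strip()
    return None

# Made-up sample document that only provides the Open Graph description
sample_html = '<html><head><meta property="og:description" content=" Open Graph text "></head><body></body></html>'
sample_soup = BeautifulSoup(sample_html, features="html.parser")

print(get_description(sample_soup))
```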

Get links

To get all the links in the page, look for the a tags and read their href attribute:

for link in soup.find_all('a'):
  print(link.get('href'))
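
The href values are returned exactly as written in the page, so they may be relative, and a tags without an href yield None. The following sketch resolves each link against the address of the page with urljoin from the standard library; the function name get_absolute_links, the sample HTML and the example.com addresses are assumptions for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

def get_absolute_links(soup, base_url):
    """Return the absolute URL of every a tag that has an href attribute.

    get_absolute_links is a hypothetical helper written for this sketch.
    """
    links = []
    # href=True keeps only the a tags that actually carry an href attribute
    for link in soup.find_all("a", href=True):
        links.append(urljoin(base_url, link["href"]))
    return links

# Made-up sample document with a relative link, an anchor without href
# and an absolute link
sample_html = '<a href="/about">About</a><a name="top">Top</a><a href="https://example.org/page">Other</a>'
sample_soup = BeautifulSoup(sample_html, features="html.parser")

absolute_links = get_absolute_links(sample_soup, "https://example.com/docs/index.html")
print(absolute_links)
```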

Last update : 05/23/2022