This page explains how to download an HTML page from its URL with Python and extract its metadata, title, and links. You can try this example online on Replit.
The following Python code uses the packages BeautifulSoup and requests.
from bs4 import BeautifulSoup
import requests
The first part gets the HTML source code from any URL.
url = 'https://url-example.com/'
r = requests.get(url, allow_redirects=True)
print(r.text)
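Before parsing the response, it can be worth checking that the request actually succeeded. A minimal sketch using helpers that requests already provides, reusing the same placeholder URL as above:

import requests

url = 'https://url-example.com/'
r = requests.get(url, allow_redirects=True)

# Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page
r.raise_for_status()
print(r.status_code)                   # e.g. 200
print(r.headers.get('Content-Type'))   # usually something like 'text/html; charset=utf-8'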
To parse the HTML, you first need to convert the HTML source to a BeautifulSoup object, which represents the document as a nested data structure:
soup = BeautifulSoup(r.text, features="html5lib")
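html5lib is a very lenient parser but must be installed separately; if you would rather avoid the extra dependency, the parser built into the standard library works too (a sketch, where r is the response object from above; results may differ slightly on badly formed HTML):

from bs4 import BeautifulSoup

# "html.parser" ships with Python; "lxml" is another common choice if it is installed
soup = BeautifulSoup(r.text, features="html.parser")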
To get the header page title (in the metadata), you have to look for the title tag:
# Get header page title
title = soup.find('title')
print(title.string)
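Note that find() returns None when the tag is missing, so calling .string on the result would raise an AttributeError on a page without a title tag. A more defensive sketch:

# Guard against pages that have no <title> tag
title = soup.find('title')
if title is not None and title.string is not None:
    print(title.string.strip())
else:
    print('No title found')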
To get the page title (heading 1), you have to look for the h1 tag:
title = soup.find('h1')
print(title.string)
To get the page description (metadata), you have to look for the meta tag whose name is description:
description = soup.find("meta", attrs={'name':'description'})
print(description["content"])
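The same caution applies here: many pages have no description meta tag, and indexing a None result raises a TypeError. A hedged version:

# Not every page defines <meta name="description" content="...">
description = soup.find('meta', attrs={'name': 'description'})
if description is not None and description.has_attr('content'):
    print(description['content'])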
To get the Open Graph description, look for the meta tag whose property attribute is og:description instead of looking it up by name:
description = soup.find("meta", property="og:description")
if description is not None:
    print(description["content"])
To get all the links on the page, you have to look for the a tags and read their href attribute:
for link in soup.find_all('a'):
    print(link.get('href'))
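The href values are often relative (for example /about or ../index.html). If you need absolute URLs, the standard library's urljoin can resolve them against the page URL, as in this sketch:

from urllib.parse import urljoin

# Resolve relative links against the page URL to produce absolute URLs
for link in soup.find_all('a'):
    href = link.get('href')
    if href:  # some <a> tags have no href attribute
        print(urljoin(url, href))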