How I Built a Backlink Checker Using ChatGPT and Google Colab

I recently started a new website, and to rank in the search engines I hired an SEO expert. He suggested several activities, one of which was building lots of backlinks. We agreed on a budget and a timeframe in which he delivers 200 links a month. These links are a mix of social bookmarks, profile pages, and guest articles. After he delivers the links, I have to verify the work and give feedback. The process of checking a URL containing the backlink is as follows:

  • Open a URL from a list of URLs in the sheet
  • Search for "mydomain.com" in the source code or click the visible link pointing to my domain
  • In the sheet, mark it as LINK FOUND or LINK NOT FOUND

If only it were this easy.

The link-building project will go on for several months, and he will be building several hundred links, so I need a tool to do the checking for me. So I came up with the idea of asking ChatGPT to write the code in Python. Here is my prompt:

Prompt

Give me a Python code that scrapes all given URLs and looks for any link that contains "mydomain.com" and prints out the URL and the link it found

ChatGPT reply: 🤖

import requests
from bs4 import BeautifulSoup

# list of URLs to scrape
urls = ['https://example.com', 'https://example.net', 'https://example.org']

# loop through each URL
for url in urls:
    # send a request to the URL
    response = requests.get(url)

    # parse the HTML content with BeautifulSoup
    soup = BeautifulSoup(response.content, 'html.parser')

    # find all links in the page
    links = soup.find_all('a')

    # loop through each link
    for link in links:
        # check if the link contains "mydomain.com"
        if 'mydomain.com' in link.get('href', ''):
            # print the URL and link found
            print(f"{url} - {link.get('href')}")

Choosing Google Colab as the Web-Based Python Interface

Now I have to run this code in a web interface. The interface should have an input box and an output box.

I found Google Colab to be the best option for doing this quickly and easily. But I needed the code to read the list of URLs from an input widget, so I returned to ChatGPT and described my environment, along with many improvisations. There were several trials and errors.

Here are some of the prompts:

Improvisation Prompt 1: 👩‍💻

Also add the following

Display domains that are duplicate
Display unique list of domains in which the string was not found
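
The duplicate check this produced boils down to collecting each URL's domain and flagging repeats. A minimal sketch of the idea, assuming urls is the list of URL strings (the final script below does the same thing with two lists):

from urllib.parse import urlparse

seen = set()
duplicates = set()
for url in urls:
    domain = urlparse(url).netloc
    # a domain we have already seen is a duplicate
    if domain in seen:
        duplicates.add(domain)
    seen.add(domain)

print("Duplicate domains:", len(duplicates))
print(sorted(duplicates))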

Improvisation Prompt 2: 👩‍💻

I got this error 

ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:992)
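
The fix that made it into the final script is to skip certificate verification on the request and silence the warning this triggers. That is acceptable for a quick link audit, but not for anything security-sensitive. A sketch, assuming url is one of the URLs from the sheet:

import requests

# silence the InsecureRequestWarning that verify=False triggers
requests.packages.urllib3.disable_warnings()

# verify=False skips certificate validation, which avoids the
# CERTIFICATE_VERIFY_FAILED error on hosts with broken chains
response = requests.get(url, verify=False)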

Improvisation Prompt 3: 👩‍💻

Check for Redirection, if the URL redirects, print "E:REDIRECTED" and skip iteration
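
One detail worth knowing here: requests follows redirects by default, so checking response.is_redirect on the final response will not catch them. The reliable signal is response.history, which is non-empty whenever the final page was reached via a redirect. A sketch of the check, with url as above:

# requests follows redirects by default; response.history holds
# the intermediate responses, so non-empty means a redirect happened
response = requests.get(url, verify=False)
if response.history:
    print("E:REDIRECTED")
    # and skip to the next URL in the loop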

Improvisation Prompt 4: 👩‍💻

I got a mod_security error in request.get, how can I fix it
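
It turns out mod_security on many shared hosts blocks the default python-requests User-Agent. Presenting a browser-like User-Agent header, which is exactly what the final script does, usually gets past it:

# mod_security tends to reject the default 'python-requests/x.y'
# User-Agent string, so present a browser-like one instead
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/58.0.3029.110 Safari/537.3'
}
response = requests.get(url, headers=headers, verify=False)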

Improvisation Prompt 5: 👩‍💻

Add a try catch block around request and beautiful soup

Improvisation Prompt 6: 👩‍💻

If there are no Links found, print "E:ZERO LINKS" and skip iteration

Improvisation Prompt 7: 👩‍💻

The list of URLs will come from a google collab input box can you make the change

And there were many more prompts to achieve the final result. But since I am a Python coder, I could exit the back-and-forth with ChatGPT and change the code my way.

ERROR/STATUS CODES

The error and status codes are explained below.

Errors found in the URLs given in the sheet

  • UNRESOLVED – The URL in the sheet is malformed
  • DUPLICATE DOMAIN – There are multiple URLs from the same domain
  • REDIRECTED – The URL redirected to another URL; if this happens, ask the SEO analyst to put the final URL in the sheet

Statuses for the links found in the source code of the URL

  • FOUND – Our domain backlink was found
  • NOT FOUND – Our domain backlink was not found
  • BAD LINK – A link was found, but its href has no resolvable domain (for example, an empty or relative link)
  • ZERO LINKS – No links were found in the source code

I begin each error code with 'E:' to easily identify them in the sheet during the conditional formatting step.
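
In Google Sheets, for example, a single rule can handle all of them: select the status column, open conditional formatting, choose "Custom formula is", and use something like =LEFT(B1,2)="E:" (assuming the status lands in column B after the split) to highlight every error row.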

So here is the final code:

The Code

This goes in the first code cell of Google Colab

from IPython.display import display
import ipywidgets as widgets

url_box = widgets.Textarea(
    placeholder='Enter URLs here',
    description='URLs:',
    layout=widgets.Layout(width='70%')
)

# display the text box widget
display(url_box)

This goes in the second code cell of Google Colab


import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse

# suppress the warnings raised when SSL verification is skipped below
requests.packages.urllib3.disable_warnings()

# get the input URLs from the widget as a list
urls = url_box.value.split()

# track domains so duplicates can be reported
unique_domains = []
duplicate_domains = []

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}

# first pass: collect the duplicate domains
for url in urls:
    parsed_url = urlparse(url)
    domain = parsed_url.netloc
    # add the domain to the list of unique domains
    if domain not in unique_domains:
        unique_domains.append(domain)
    # otherwise record it as a duplicate
    elif domain not in duplicate_domains:
        duplicate_domains.append(domain)

print("Duplicate domains:", len(duplicate_domains))
print(duplicate_domains)
print()

# second pass: loop through each URL and check if the backlink exists
for url in urls:
    inputstring = ""
    parsed_url = urlparse(url)
    domain = parsed_url.netloc
    if not domain:
        # the URL is malformed, no domain could be parsed from it
        print('E:UNRESOLVED', ',', url)
        continue

    if domain in duplicate_domains:
        print("E:DUPLICATE DOMAIN")
        continue

    # send a request to the URL; on failure, report and skip it
    try:
        response = requests.get(url, headers=headers, verify=False)
    except Exception as e:
        print('REQ:', str(e))
        continue

    # requests follows redirects by default, so a non-empty
    # history means the original URL redirected elsewhere
    if response.history:
        print("E:REDIRECTED", ',', domain)
        continue

    # parse the HTML content with BeautifulSoup
    try:
        soup = BeautifulSoup(response.content, 'html.parser')
    except Exception as e:
        print('BS:', str(e))
        continue

    # find all links in the page
    links = soup.find_all('a')

    # if no links were found, report and skip
    if len(links) == 0:
        print('E:ZERO LINKS', ',', domain)
        continue
    
    # loop through each link on the page
    for link in links:
        # get the domain name from the link's href
        parsed_url = urlparse(link.get('href', ''))
        domain_name = parsed_url.netloc
        if domain_name:
            # check whether the link points to "mydomain.com"
            if 'mydomain.com' in domain_name:
                inputstring = "FOUND"
                break
            else:
                inputstring = "E:NOT FOUND"
        else:
            # the href is empty or relative, so there is no domain to check
            inputstring = "E:BAD LINK"

    # print the status and the domain, comma separated for the sheet
    print(inputstring, ',', domain)

See the cell setup in the image. Press Play in the first cell and you will get a URL input box. Paste your URLs into it.

Input Box:

https://sketchfab.tld/mydomain 
https://30seconds.tld/mydomain/
https://speakerdeck.tld/mydomainus
https://www.ted.tld/profiles/<some page>/about
https://dzone.tld/users/mydomainindia.html
https://www.reddit.tld/user/mydomainusa
https://medium.tld/@mydomainusa/about
https://www.pinterest.tld/mydomainusa/
https://www.intensedebate.tld/people/mydomainusa
https://www.growkudos.tld/profile/<some page>
https://www.universe.tld/users/<some page>
https://www.dostally.tld/post/<some page>
https://www.socialbookmarkzone.info/<some page>
https://app.raindrop.io/my/-1/item/<somepage>/web
https://www.tamaiaz.tld/posts/<somepage>
https://www.socialbookmarkzone.info/<some page>/
https://gab.tld/mydomain/posts/<some page>

Now press Play in the second cell and watch the output panel.

Output:

Duplicate domains: 5
['www.socialbookmarkzone.tld', 'www.reddit.tld', 'www.instapaper.tld', 'www.wibki.tld', 'diigo.tld']

FOUND , sketchfab.tld
E:BAD LINK , 30seconds.tld
FOUND , speakerdeck.tld
E:BAD LINK , www.ted.tld
FOUND , dzone.tld
E:DUPLICATE DOMAIN
FOUND , medium.tld
FOUND , www.pinterest.tld
FOUND , www.intensedebate.tld
FOUND , www.growkudos.tld
E:ZERO LINKS , www.universe.tld
FOUND , www.dostally.tld
E:DUPLICATE DOMAIN
E:ZERO LINKS , app.raindrop.io
FOUND , www.tamaiaz.tld
E:DUPLICATE DOMAIN
E:NOT FOUND , gab.tld

Input box code [Google Colab]

Google Colab code cell setup

Paste the output into your SEO tracker sheet on the same rows as the URLs and apply Split Text to Columns (the output is comma separated)

Steps to apply conditional formatting

Final output

Based on the above output, the SEO analyst can rework the links or drop those sites completely.

If you like the code, leave a comment. I am available on Upwork for prompt engineering and AI art jobs. I use ChatGPT, Midjourney, Python, and many more tools for my client work.

My Upwork profile is https://www.upwork.com/freelancers/~018645334d3b757e4d


πŸ‘©β€πŸ’» Recommended: 7 Effective Prompting Tricks for ChatGPT