I recently started a new website, and to rank higher in the search engines I hired an SEO expert. He suggested several activities, one of which was building lots of backlinks. We agreed on a budget and a timeframe in which he delivers 200 links a month. These links are a mix of social bookmarks, profile pages, and guest articles. After he delivers the links, I have to verify the work and give feedback. The process of checking a URL containing the backlink is as follows:
- Open a URL from a list of URLs in the sheet
- Search for "mydomain.com" in the source code, or click on the visible link pointing to my domain
- In the sheet, mark it as LINK FOUND or LINK NOT FOUND
If only it were this easy.
The link-building project will go on for several months, and he will be building several hundred links, so I need a tool to do the checking for me. So I came up with the idea of asking ChatGPT to write the code in Python. Here is my prompt:
Prompt
Give me a Python code that scrapes all given URLs and looks for any link that contains "mydomain.com" and prints out the URL and the link it found
ChatGPT reply: 🤖
import requests
from bs4 import BeautifulSoup

# list of URLs to scrape
urls = ['https://example.com', 'https://example.net', 'https://example.org']

# loop through each URL
for url in urls:
    # send a request to the URL
    response = requests.get(url)
    # parse the HTML content with BeautifulSoup
    soup = BeautifulSoup(response.content, 'html.parser')
    # find all links in the page
    links = soup.find_all('a')
    # loop through each link
    for link in links:
        # check if the link contains "mydomain.com"
        if 'mydomain.com' in link.get('href', ''):
            # print the URL and the link found
            print(f"{url} - {link.get('href')}")
Choosing Google Colab as a Web-Based Python Interface
Now I have to run this code in a web interface. The interface should have an input box and an output box.
I find Google Colab to be the best option to do this quickly and easily. But I need the code to read the list of URLs from an input widget box. So I went back to ChatGPT, described my environment, and iterated through many improvisations. There was plenty of trial and error.
Here are some of the prompts:
Improvisation Prompt 1: 👩‍💻
Also add the following
Display domains that are duplicate
Display unique list of domains in which the string was not found
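The duplicate-domain check this prompt asks for eventually boils down to grouping URLs by their `netloc`. A minimal standard-library sketch of that logic (the URLs here are placeholders, not from my actual sheet):

```python
from urllib.parse import urlparse

def find_duplicate_domains(urls):
    """Return domains that appear more than once in the URL list."""
    seen, duplicates = set(), []
    for url in urls:
        domain = urlparse(url).netloc
        if domain in seen and domain not in duplicates:
            duplicates.append(domain)
        seen.add(domain)
    return duplicates

urls = [
    'https://example.com/page1',
    'https://example.com/page2',
    'https://example.net/page1',
]
print(find_duplicate_domains(urls))  # ['example.com']
```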
Improvisation Prompt 2: 👩‍💻
I got this error
ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:992)
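The workaround that ended up in the final code is to skip certificate verification in requests and silence the accompanying warnings. A hedged sketch of that pattern (note this trades away security, so it is only reasonable for this kind of read-only link checking):

```python
import requests
import urllib3

# suppress the InsecureRequestWarning that verify=False triggers
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

session = requests.Session()
session.verify = False  # skip SSL certificate verification for all requests

# every request made through this session now ignores certificate errors:
# response = session.get('https://some-site-with-a-bad-cert.example')
```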
Improvisation Prompt 3: 👩‍💻
Check for Redirection, if the URL redirects, print "E:REDIRECTED" and skip iteration
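One subtlety here: requests follows redirects by default, so checking `is_redirect` on the final response never fires. One way to detect a redirect, assuming the requests library, is to disable auto-following and classify the status code yourself (the helper names are mine, not from the generated code):

```python
import requests

def classify_status(status_code):
    # 3xx status codes indicate a redirect
    return "E:REDIRECTED" if 300 <= status_code < 400 else None

def check_url(url):
    # allow_redirects=False returns the first response untouched,
    # so a 301/302 is visible instead of being silently followed
    response = requests.get(url, allow_redirects=False, timeout=10)
    return classify_status(response.status_code)

print(classify_status(301))  # E:REDIRECTED
```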
Improvisation Prompt 4: 👩‍💻
I got a mod_security error in request.get, how can I fix it
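mod_security rules on many shared hosts block requests that arrive with the default `python-requests/x.y` User-Agent. The usual fix, which also made it into the final code, is to send a browser-like User-Agent header:

```python
import requests

# a browser-like User-Agent string; many mod_security rules reject
# the default "python-requests" agent outright
headers = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/58.0.3029.110 Safari/537.3')
}

# pass the headers with every request:
# response = requests.get(url, headers=headers)
```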
Improvisation Prompt 5: 👩‍💻
Add a try catch block around request and beautiful soup
Improvisation Prompt 6: 👩‍💻
If there are no Links found, print "E:ZERO LINKS" and skip iteration
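The zero-links guard is a one-line check on the results of BeautifulSoup's `find_all('a')`. A self-contained sketch against static HTML snippets (the sample pages are made up for illustration):

```python
from bs4 import BeautifulSoup

def count_links(html):
    """Return the number of <a> tags in an HTML document."""
    soup = BeautifulSoup(html, 'html.parser')
    return len(soup.find_all('a'))

page_without_links = '<html><body><p>No anchors here.</p></body></html>'
page_with_links = '<html><body><a href="https://mydomain.com">home</a></body></html>'

if count_links(page_without_links) == 0:
    print('E:ZERO LINKS')  # prints: E:ZERO LINKS
print(count_links(page_with_links))  # 1
```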
Improvisation Prompt 7: 👩‍💻
The list of URLs will come from a google collab input box can you make the change
And there were many more prompts to reach the final result. But since I am a Python coder, I could exit the back-and-forth with ChatGPT and change the code my way.
ERROR/STATUS CODES
The error/status codes are explained as follows.
Errors in the URL given in the sheet:
- UNRESOLVED: the URL in the sheet is malformed
- DUPLICATE DOMAIN: multiple URLs in the sheet come from the same domain
- REDIRECTED: the URL redirected to another URL; if this happens, ask the SEO analyst to put the final URL in the sheet
Statuses for the links found in the source code of the URL:
- FOUND: our domain backlink was found
- NOT FOUND: links were found, but none of them point to our domain
- BAD LINK: a link had no resolvable domain (e.g. an empty or relative href)
- ZERO LINKS: no links at all were found in the source code
I prefix each error code with "E:" to easily identify them in the sheet during the conditional-formatting step.
So here is the final code:
The Code
This goes in the first code cell of Google Colab
from IPython.display import display
import ipywidgets as widgets

# a multi-line text area to paste the URLs into
url_box = widgets.Textarea(
    placeholder='Enter URLs here',
    description='URLs:',
    layout=widgets.Layout(width='70%')
)

# display the text box widget
display(url_box)
This goes in the second code cell of Google Colab
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse

# silence the warnings triggered by skipping SSL verification
requests.packages.urllib3.disable_warnings()

# get the input URLs from the widget as a list
urls = url_box.value.split()

# lists to track unique and duplicate domains
unique_domains = []
duplicate_domains = []

# browser-like User-Agent to get past mod_security
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}

# first pass: collect duplicate domains
for url in urls:
    domain = urlparse(url).netloc
    if domain not in unique_domains:
        unique_domains.append(domain)
    elif domain not in duplicate_domains:
        duplicate_domains.append(domain)

print("Duplicate domains:", len(duplicate_domains))
print(duplicate_domains)
print()

# second pass: check each URL for the backlink
for url in urls:
    inputstring = ""
    domain = urlparse(url).netloc
    if not domain:
        print('E:UNRESOLVED', ',', domain)
        continue
    if domain in duplicate_domains:
        print("E:DUPLICATE DOMAIN")
        continue
    # send a request to the URL
    try:
        response = requests.get(url, headers=headers, verify=False)
    except Exception as e:
        print('REQ:', str(e))
        continue
    # requests follows redirects by default, so a non-empty
    # redirect history means the URL redirected somewhere else
    if response.history:
        print("E:REDIRECTED", ',', domain)
        continue
    # parse the HTML content with BeautifulSoup
    try:
        soup = BeautifulSoup(response.content, 'html.parser')
    except Exception as e:
        print('BS:', str(e))
        continue
    # find all links in the page
    links = soup.find_all('a')
    if len(links) == 0:
        print('E:ZERO LINKS', ',', domain)
        continue
    # loop through each link on the page
    for link in links:
        # get the domain name from the link's href
        domain_name = urlparse(link.get('href', '')).netloc
        if domain_name:
            # check if the link points to "mydomain.com"
            if 'mydomain.com' in domain_name:
                inputstring = "FOUND"
                break
            else:
                inputstring = "E:NOT FOUND"
        else:
            # href with no resolvable domain (empty or relative)
            inputstring = "E:BAD LINK"
    print(inputstring, ',', domain)
See the cell setup in the image. Press Play in the first cell. You will get a URL input box. Paste your URLs into it.
Input Box:
https://sketchfab.tld/mydomain
https://30seconds.tld/mydomain/
https://speakerdeck.tld/mydomainus
https://www.ted.tld/profiles/<some page>/about
https://dzone.tld/users/mydomainindia.html
https://www.reddit.tld/user/mydomainusa
https://medium.tld/@mydomainusa/about
https://www.pinterest.tld/mydomainusa/
https://www.intensedebate.tld/people/mydomainusa
https://www.growkudos.tld/profile/<some page>
https://www.universe.tld/users/<some page>
https://www.dostally.tld/post/<some page>
https://www.socialbookmarkzone.info/<some page>
https://app.raindrop.io/my/-1/item/<somepage>/web
https://www.tamaiaz.tld/posts/<somepage>
https://www.socialbookmarkzone.info/<some page>/
https://gab.tld/mydomain/posts/<some page>
Now press Play in the second cell and watch the output panel.
Output:
Duplicate domains: 5
['www.socialbookmarkzone.tld', 'www.reddit.tld', 'www.instapaper.tld', 'www.wibki.tld', 'diigo.tld']
FOUND , sketchfab.tld
E:BAD LINK , 30seconds.tld
FOUND , speakerdeck.tld
E:BAD LINK , www.ted.tld
FOUND , dzone.tld
E:DUPLICATE DOMAIN
FOUND , medium.tld
FOUND , www.pinterest.tld
FOUND , www.intensedebate.tld
FOUND , www.growkudos.tld
E:ZERO LINKS , www.universe.tld
FOUND , www.dostally.tld
E:DUPLICATE DOMAIN
E:ZERO LINKS , app.raindrop.io
FOUND , www.tamaiaz.tld
E:DUPLICATE DOMAIN
E:NOT FOUND , gab.tld
INPUT BOX CODE [GOOGLE COLAB]

GOOGLE COLAB CODE CELL SETUP

PASTE THE OUTPUT INTO YOUR SEO TRACKER SHEET on the same lines as the URLs & APPLY SPLIT TEXT TO COLUMNS

STEPS TO APPLY CONDITIONAL FORMATTING

FINAL OUTPUT

Based on the above output, the SEO analyst can rework the links or drop those sites completely.
If you like the code, leave a comment. I am available on Upwork for Prompt Engineering and AI Art jobs. I use ChatGPT, Midjourney, Python, and many other tools for my client work.
My Upwork profile is https://www.upwork.com/freelancers/~018645334d3b757e4d
👩‍💻 Recommended: 7 Effective Prompting Tricks for ChatGPT