Scraping free proxy listing website

It is generally best to use a parser such as BeautifulSoup to extract data from HTML rather than regular expressions, because it is very difficult to match BeautifulSoup's accuracy with regex alone; however, you can try this with pure regex:

import re
import requests

url = ''
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Safari/537.36'}
source = requests.get(url, headers=headers, timeout=10).text
# findall with alternation returns (group1, group2) tuples; exactly one side is
# non-empty per match, so filter(None, ...) keeps the text of the matching cell
data = [list(filter(None, i))[0] for i in re.findall('<td class="hm">(.*?)</td>|<td>(.*?)</td>', source)]
# every four consecutive cells describe one proxy record
groupings = [dict(zip(['ip', 'port', 'code', 'using_anonymous'], data[i:i+4])) for i in range(0, len(data), 4)]

Sample output (actual length is 300):

[{'ip': '', 'port': '80', 'code': 'SG', 'using_anonymous': 'anonymous'}, {'ip': '', 'port': '80', 'code': 'SG', 'using_anonymous': 'elite proxy'}, {'ip': '', 'port': '54566', 'code': 'PL', 'using_anonymous': 'anonymous'}, {'ip': '', 'port': '8080', 'code': 'GB', 'using_anonymous': 'elite proxy'}, {'ip': '', 'port': '3128', 'code': 'BR', 'using_anonymous': 'anonymous'}, {'ip': '', 'port': '8080', 'code': 'PH', 'using_anonymous': 'anonymous'}, {'ip': '', 'port': '80', 'code': 'US', 'using_anonymous': 'anonymous'}, {'ip': '', 'port': '3128', 'code': 'NL', 'using_anonymous': 'elite proxy'}, {'ip': '', 'port': '3128', 'code': 'GB', 'using_anonymous': 'elite proxy'}, {'ip': '', 'port': '8080', 'code': 'RU', 'using_anonymous': 'anonymous'}]

Edit: to concatenate the ip and the port, iterate over each grouping and use string formatting:

final_groupings = [{'full_ip':"{ip}:{port}".format(**i)} for i in groupings]


[{'full_ip': ''}, {'full_ip': ''}, {'full_ip': ''}, {'full_ip': ''}, {'full_ip': ''}, {'full_ip': ''}, {'full_ip': ''}, {'full_ip': ''}, {'full_ip': ''}, {'full_ip': ''}]

If you would rather use BeautifulSoup instead of regex, you can do something like this as well:

import requests
from bs4 import BeautifulSoup

res = requests.get('', headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(res.text, "lxml")
for items in"#proxylisttable tbody tr"):
    # first two cells of each row are the address and the port
    proxy_list = ':'.join([item.text for item in"td")[:2]])
    print(proxy_list)

Partial output: