I am trying to scrape information from this website: https://www.heiminfo.ch/institutionen. The HTML that holds the information I am looking for looks like this:

    <article class="institution card pushed" data-name="HOF SPEICHER AG - (di Gallo)" data-institution-type="HIALTER HIEB CVAPPENZELLALTER" data-subscription="SILBER" data-zoom="15" data-track-content="" data-content-target="Huta5R8" data-lng="9.441113" data-group="Kurt di Gallo Holding AG" data-content-piece="Huta5R8" data-content-name="Institution View List" data-lat="47.41353" style="height: 249.95px;" data-ol-has-click-handler="">
      <a href="/institution/hof-speicher-ag/Huta5R8" data-remote-url="" data-id="Huta5R8" data-ol-has-click-handler="">
        <div class="img-container">
          <img class=" lazyloaded" width="450" src="/filesystem/clientadditionportrait/2018/02/698FA7D4-F5A4-89B4-8CE87700B6C2D216/images/fit/Hof-Speicher1-w-450-hc19BB84B3.jpg" data-src="/filesystem/clientadditionportrait/2018/02/698FA7D4-F5A4-89B4-8CE87700B6C2D216/images/fit/Hof-Speicher1-w-450-hc19BB84B3.jpg" alt="HOF SPEICHER AG">
        </div>
        <div class="text-container" style="height: 114.99px;">
          <div class="name-and-addition">
            <h2 style=""><font style="vertical-align: inherit;"><font style="vertical-align: inherit;">HOF SPEICHER AG </font></font></h2>
            <p class="addition" style=""><font style="vertical-align: inherit;"><font style="vertical-align: inherit;">(di Gallo)</font></font></p>
          </div>
          <p class="location">
            <span class="canton"><font style="vertical-align: inherit;"><font style="vertical-align: inherit;">AR </font></font></span>
            <span class="plz"><font style="vertical-align: inherit;"><font style="vertical-align: inherit;">9042 </font></font></span>
            <span class="city"><font style="vertical-align: inherit;"><font style="vertical-align: inherit;">memory</font></font></span>
          </p>
        </div>
      </a>
    </article>

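Going by the markup above, each card carries its name in `data-name` and the path of its detail page in the `href` of the inner `<a data-id="...">`. The following is a minimal sketch of pulling those two values out of the page source, using only the standard library. `BASE_URL`, `InstitutionLinkParser` and `extract_links` are names made up for this illustration, and it assumes every card follows the same structure as the one shown.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "https://www.heiminfo.ch"  # assumed base for the relative hrefs


class InstitutionLinkParser(HTMLParser):
    """Collect (data-name, absolute detail URL) for each institution card."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current_name = None  # data-name of the <article> being parsed

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "article" and "institution" in classes:
            self._current_name = attrs.get("data-name")
        elif (tag == "a" and self._current_name is not None
              and attrs.get("data-id") and attrs.get("href")):
            # First <a data-id=...> inside the card is taken as the detail link.
            self.links.append((self._current_name, urljoin(BASE_URL, attrs["href"])))
            self._current_name = None

    def handle_endtag(self, tag):
        if tag == "article":
            self._current_name = None


def extract_links(page_source):
    """Return a list of (name, url) tuples found in the given HTML string."""
    parser = InstitutionLinkParser()
    parser.feed(page_source)
    return parser.links


# Trimmed copy of the card shown above, used as sample input.
sample = """
<article class="institution card pushed" data-name="HOF SPEICHER AG - (di Gallo)">
  <a href="/institution/hof-speicher-ag/Huta5R8" data-remote-url="" data-id="Huta5R8"></a>
</article>
"""
print(extract_links(sample))
```

With Selenium, `extract_links(driver.page_source)` could be called once all cards have been loaded.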
I've been able to obtain the first 500 institution names, city, plz and location using this code (courtesy of Arundeep Chohan):

    import requests
    import time
    import pandas as pd
    import csv
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.common.by import By
    from time import sleep
    from random import randint
    from bs4 import BeautifulSoup
    from selenium import webdriver as wb

    # Selenium 3 style; with Selenium 4.6+ call wb.Chrome() with no path instead
    driver = wb.Chrome('chromedriver.exe')
    driver.maximize_window()
    driver.get('https://www.heiminfo.ch/institutionen')
    # Click the third button in the search form before reading the list
    driver.find_element(By.XPATH, '/html/body/div[1]/main/div/section/form/div[1]/div[3]/div/button[3]').click()

    wait = WebDriverWait(driver, 5)
    total = 500
    h = []

    while True:
        try:
            soup = BeautifulSoup(driver.page_source, 'html.parser')
            item = soup.find(class_='institutions')
            #item=driver.find_element_by_class_name('institutions')
            lsh = item.find_all(class_="name-and-addition")
            #lsh=item.find_element_by_class_name('name-and-addition')
            if len(lsh) >= total:
                # Enough cards are loaded: collect the names and stop
                for e in lsh[:total]:
                    h.append(e.text.strip())
                data = pd.DataFrame(h, columns=['Adult Homes'])
                print(data)
                break
            # Otherwise load the next batch and give the page time to render
            wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, ".next.btn"))).click()
            time.sleep(5)
        except Exception as e:
            print(e)
            break

The remaining information is the phone number, which is hidden behind the "<a href=...>" link of each institution: I have to click the link and open the page to see the telephone number. There are 1589 of these links in total. How can I write a scraper that iterates through all of them and obtains the hidden telephone number? The links look like this:
[<a href="/institution/hof-speicher-ag/Huta5R8" data-remote-url="" data-id="Huta5R8" data-ol-has-click-handler="">][1]
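
To make the goal concrete, here is a rough sketch of the loop being asked about: visit each detail path, then pull a phone number out of the returned HTML. It rests on an unverified assumption that the detail page exposes the number as an `<a href="tel:...">` link; the real markup may differ. `TelLinkParser`, `fetch_html` and `scrape_phones` are hypothetical names, and the number in the sample is a made-up placeholder.

```python
import time
from html.parser import HTMLParser
from urllib.parse import unquote, urljoin
from urllib.request import urlopen

BASE_URL = "https://www.heiminfo.ch"  # assumed base for the relative hrefs


class TelLinkParser(HTMLParser):
    """Collect numbers from <a href="tel:..."> links (assumed markup)."""

    def __init__(self):
        super().__init__()
        self.numbers = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href") or ""
        if tag == "a" and href.lower().startswith("tel:"):
            self.numbers.append(unquote(href[4:]).strip())


def fetch_html(url):
    """Download one page; only helps if the number is in the static HTML."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape_phones(paths, fetch=fetch_html, delay=1.0):
    """Map each detail path to its first tel: number, or None if absent."""
    phones = {}
    for path in paths:
        parser = TelLinkParser()
        parser.feed(fetch(urljoin(BASE_URL, path)))
        phones[path] = parser.numbers[0] if parser.numbers else None
        time.sleep(delay)  # pause between requests to go easy on the server
    return phones


# Offline demonstration with a fake page instead of a real request.
fake_page = '<a href="tel:+41000000000">Telefon</a>'  # placeholder number
result = scrape_phones(
    ["/institution/hof-speicher-ag/Huta5R8"],
    fetch=lambda url: fake_page,
    delay=0,
)
print(result)
```

If the number only appears after JavaScript runs, `fetch` could instead be a small function that calls `driver.get(url)` and returns `driver.page_source`.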