I am looking to extract the strings text_i_wantA, text_i_wantB, and text_i_wantC from each of the 3 children within each of the 10 <div class="col-12"> elements. For readability, I have only included two similarly structured divs here. If the output does not currently include the actual .content[0], I can always parse that later.
Below is the complete code:
from bs4 import BeautifulSoup as bs
from selenium.webdriver.common.action_chains import ActionChains
import time

# driver is an already-initialised Selenium webdriver
title, date, name, number = [], [], [], []
while True:
    soup = bs(driver.page_source, 'html5lib')
    # Collect the ad titles; stop paging once none are found
    ads = soup.find_all('a', attrs={'title': 'ad i'})
    if not ads:
        break
    for div in ads:
        titl = div.get_text(strip=True)
        title.append(titl)
    # Every second col-12 div holds the three values I want
    for col in soup.find_all('div', attrs={'class': 'col-12'})[1::2]:
        row = []
        for entry in col.select('div.row div'):
            # Only the direct text nodes of each div (recursive=False)
            target = entry.find_all(text=True, recursive=False)
            row.append(target[0].strip())
        name.append(row[0])
        date.append(row[1])
        number.append(row[2])
    # Click through to the next page if there is one
    next_btn = driver.find_elements_by_css_selector(".page-next button")
    if next_btn:
        actions = ActionChains(driver)
        actions.move_to_element(next_btn[0]).click().perform()
        time.sleep(4)
    else:
        break
driver.close()
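Note that I am on an older Selenium here; Selenium 4 removed the find_elements_by_css_selector helper, so on a newer driver the paging lookup would instead be something like:

from selenium.webdriver.common.by import By

# Equivalent lookup under the Selenium 4 locator API
next_btn = driver.find_elements(By.CSS_SELECTOR, ".page-next button")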
Expected results are as follows:
title = ["text_i_already_have1", "text_i_already_have2", ...]
date = ["text_i_wantA", "text_i_wantAA", ...]
name = ["text_i_wantB", "text_i_wantBB", ...]
number = ["text_i_wantC", "text_i_wantCC", ...]
Issue: with the slice [1::2], the current output is:
title = ["text_i_already_have1", "text_i_already_have2", ...]
date = ['text_i_wantA', 'text_i_wantAA', ...]
name = ['', '', '', '', '', '', '', '', '', '']
number = ['', '', '', '', '', '', '', '', '', '']
Is the problem related to my CSS selection or the loop itself?
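For reference, here is a minimal sketch with made-up HTML (not the real page) showing how find_all(text=True, recursive=False) behaves when the value sits inside a nested tag: the enclosing div's only direct text nodes are whitespace, which matches the empty strings I am seeing:

from bs4 import BeautifulSoup

html = """
<div class="row">
  <div>text_i_wantA</div>
  <div>
    <span>text_i_wantB</span>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html5lib")
for entry in soup.select("div.row div"):
    # Direct text of the first div is the value itself; the second div
    # only has whitespace as direct text, so .strip() yields ''
    target = entry.find_all(text=True, recursive=False)
    print(repr(target[0].strip()) if target else "no direct text")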
The selection itself works: print(soup.find_all('div', attrs={'class':'col-12'})) without a slice gives me a list of the div elements containing the strings I want to extract (text_i_want):
... (omitted for brevity) ...
The unwanted string text_i_dont_want is consistently found inside the <span class="processlink"> element, which is always the last child within one of the 3 <div class="row"> elements contained within each of the 10 <div class="col-12"> divs.
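One way I could imagine dropping that span before extracting (a sketch assuming the structure above; col is one of the col-12 divs from the loop) would be:

# Detach each processlink span from the tree in place,
# so its text never reaches the find_all call
for span in col.select('span.processlink'):
    span.extract()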