Automating the Browser: A Guided Adventure with Selenium
Published: 2019-05-11

Prerequisites: Have Python and Selenium installed. See the previous post if you're not familiar with them. The full code for this post is included at the end. You might find it fun to first run the entire script and watch how it works before jumping in and following along with the post. And don't forget to use your browser's Developer Tools to see how the XPaths were chosen. Best of luck.

One morning Todd, a loyal and industrious employee, was hard at work watching cat videos when suddenly his boss, Mr. Peabody, burst into his cubicle.

“Todd, the CEO is sick of paying everyone so much money. Figure out where in the Southwest US the wages are the lowest. We’re relocating there ASAP!”

This wasn’t the first time Todd was asked to look up something like this. Sometimes they want to know where wages are the lowest in the Northeast, or Midwest, or in individual states, and Todd is fed up with it. So many cat videos, but so little time to watch them. If only there was a way to automate it…

With a sudden flash of inspiration, Todd realizes the Selenium package for Python can solve his problem! He just needs to write up a quick script, and then he can kiss these lengthy wage look-ups goodbye.

To start, Todd writes the code to import the packages he thinks he'll need, open up a browser, and navigate to the Bureau of Labor Statistics (BLS) data page.

import re
from selenium import webdriver

## remember to input your own file path to your chrome driver here if you're using chrome
browser = webdriver.Chrome(executable_path=r"C:\Users\gstanton\Downloads\chromedriver.exe")
browser.maximize_window()
url = 'http://www.bls.gov/data/'
browser.get(url)

Todd wants to use the “Multi-Screen Data Search” tool in the “Employment” section to get earnings (aka wages) data for US metro areas. To do this, he identifies a viable XPath expression for the tool’s button with Chrome’s Developer Tools and then writes the code to select and click that element.

multi_screen_data_search = browser.find_element_by_xpath("//img[@title='Multi Screen Data Search for CES State and Metro Area']")
multi_screen_data_search.click()
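
To see what an attribute predicate such as `[@title='...']` selects without launching a browser, here is a minimal sketch using Python's standard-library `xml.etree.ElementTree`, which understands a small subset of XPath. The markup is a made-up stand-in for the BLS page, not its real HTML, and the checkbox values `S` and `U` are assumptions. Note that ElementTree wants a leading `.` (`.//img`) where Selenium accepts `//img`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the page markup (not the real BLS HTML).
SAMPLE_PAGE = """
<html>
  <body>
    <a href="/one-screen"><img title="One Screen Data Search" /></a>
    <a href="/multi-screen"><img title="Multi Screen Data Search for CES State and Metro Area" /></a>
    <input type="checkbox" name="seasonal" value="S" />
    <input type="checkbox" name="seasonal" value="U" />
  </body>
</html>
"""

root = ET.fromstring(SAMPLE_PAGE)

# [@attr='value'] keeps only the elements whose attribute equals the given text.
button = root.find(".//img[@title='Multi Screen Data Search for CES State and Metro Area']")
print(button.get("title"))

# findall() returns every match in document order, much like find_elements_by_xpath,
# so the second checkbox is at index 1.
checkboxes = root.findall(".//input[@name='seasonal']")
print(len(checkboxes), checkboxes[1].get("value"))
```

The same idea explains why an XPath that matches several elements has to be indexed: the expression alone cannot tell the two `seasonal` checkboxes apart.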

This brings Todd to a page with a couple of checkboxes, one to get data that is seasonally adjusted and one to get data that isn’t. Todd decides he wants the non-adjusted data. He finds the checkbox element, clicks it, then finds and clicks on the “Next form” button to proceed with the query.

not_seasonally_adjusted = browser.find_elements_by_xpath("//input[@name='seasonal']")
not_seasonally_adjusted[1].click()
next_form = browser.find_element_by_xpath("//input[@value='Next form']")
next_form.click()

Todd wants the average hourly earnings of all employees in dollars, and selects the appropriate options.

average_hourly_earnings = browser.find_element_by_xpath("//select/option[@value='03']")
average_hourly_earnings.click()
next_form = browser.find_element_by_xpath("//input[@value='Next form']")
next_form.click()

Now Todd needs to select the Southwest US region. He knows the region is typically defined as Arizona, California, Colorado, Nevada, New Mexico, and Utah, and selects the states accordingly.

## selected states: AZ, CA, CO, NV, NM, UT
state_values = ['04', '06', '08', '32', '35', '49']
for value in state_values:
    state_element = browser.find_element_by_xpath("//select/option[@value='{}']".format(value))
    state_element.click()

next_form = browser.find_element_by_xpath("//input[@value='Next form']")
next_form.click()
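
The option values used above coincide with the FIPS state codes for these six states, so one way to make the region easier to change is to key the codes by state abbreviation. The sketch below is only an illustration: the names `SOUTHWEST_FIPS` and `option_xpaths` are hypothetical, and it builds the XPath strings without touching a browser.

```python
# FIPS state codes for the six Southwest states named in the post.
SOUTHWEST_FIPS = {
    "AZ": "04",
    "CA": "06",
    "CO": "08",
    "NV": "32",
    "NM": "35",
    "UT": "49",
}


def option_xpaths(state_abbreviations, fips_lookup=SOUTHWEST_FIPS):
    """Build one option XPath per state abbreviation, in the order given."""
    return [
        "//select/option[@value='{}']".format(fips_lookup[abbr])
        for abbr in state_abbreviations
    ]


# Same list the script hard-codes, derived from the lookup table.
state_values = list(SOUTHWEST_FIPS.values())
print(state_values)
print(option_xpaths(["NV", "NM"]))
```

Swapping in a different dictionary would be enough to point the same loop at another region.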

Todd wants all the metros in these states, and so selects all the metro options, making sure to exclude the “Statewide” option.

all_metro_elements = browser.find_elements_by_xpath("//select[@name='area_code']/option")
for metro_element in all_metro_elements:
    metro_element.click()

## de-select the statewide option at the top, we just want metros
statewide_option = browser.find_element_by_xpath("//option[@value='00000']")
statewide_option.click()
next_form = browser.find_element_by_xpath("//input[@value='Next form']")
next_form.click()
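
An alternative to clicking everything and then un-clicking "Statewide" is to filter that option out before clicking. The sketch below shows only the filtering step, again on made-up markup parsed with the standard library; the metro names and the values `10001` and `10002` are invented for illustration, and only `00000` comes from the script above.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the metro selection form (not the real BLS HTML).
SAMPLE_FORM = """
<form>
  <select name="area_code" multiple="multiple">
    <option value="00000">Statewide</option>
    <option value="10001">Example Metro One</option>
    <option value="10002">Example Metro Two</option>
  </select>
</form>
"""

form = ET.fromstring(SAMPLE_FORM)
options = form.findall(".//select[@name='area_code']/option")

# Keep every option except the statewide one, identified by its value.
metro_options = [option for option in options if option.get("value") != "00000"]
metro_values = [option.get("value") for option in metro_options]
print(metro_values)
```

With Selenium elements, the equivalent check would read the value through `get_attribute("value")` inside the loop and skip the click when it equals `'00000'`.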

There are then a couple of screens where Todd specifies that he wants just private-sector wages.

total_private_option = browser.find_element_by_xpath("//option[@value='05']")
total_private_option.click()
next_form = browser.find_element_by_xpath("//input[@value='Next form']")
next_form.click()

total_private_option = browser.find_element_by_xpath("//option[@value='05000000']")
total_private_option.click()
next_form = browser.find_element_by_xpath("//input[@value='Next form']")
next_form.click()
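
Every screen so far ends with the same find-and-click on "Next form", so the repetition could be folded into a small helper. This is a sketch rather than part of the original script: `click_xpath` and `choose_and_continue` are hypothetical names, the helper assumes the Selenium 3 style `find_element_by_xpath` method used throughout this post, and the `Fake*` classes exist only so the example runs without a real browser.

```python
NEXT_FORM_XPATH = "//input[@value='Next form']"


def click_xpath(browser, xpath):
    """Find the first element matching xpath and click it."""
    element = browser.find_element_by_xpath(xpath)
    element.click()
    return element


def choose_and_continue(browser, option_xpaths):
    """Click each option on the current screen, then press 'Next form'."""
    for xpath in option_xpaths:
        click_xpath(browser, xpath)
    click_xpath(browser, NEXT_FORM_XPATH)


# Tiny stand-in objects that record clicks instead of driving a browser.
class FakeElement:
    def __init__(self, xpath, log):
        self.xpath = xpath
        self.log = log

    def click(self):
        self.log.append(self.xpath)


class FakeBrowser:
    def __init__(self):
        self.clicked = []

    def find_element_by_xpath(self, xpath):
        return FakeElement(xpath, self.clicked)


fake = FakeBrowser()
choose_and_continue(fake, ["//option[@value='05']"])
print(fake.clicked)
```

With a real `webdriver.Chrome` instance in place of `FakeBrowser`, the two private-sector screens would shrink to two `choose_and_continue` calls. On newer Selenium 4 releases, where the `find_element_by_*` helpers were removed, the lookup inside `click_xpath` would become `browser.find_element(By.XPATH, xpath)`.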

And finally, Todd gets to retrieve his precious data.

retrieve_data = browser.find_element_by_xpath("//input[@value='Retrieve data']")
retrieve_data.click()

There’s just one problem: the format the data is in. Todd sees that it is displayed in tables on the page, and there are also links, one for each metro, to download spreadsheets of the wage data. Todd doesn’t want to go to the trouble of downloading all these spreadsheets, and he certainly doesn’t want to copy and paste all the most recent wage numbers into a spreadsheet by hand, so instead he comes up with a way to quickly grab the data from the web page and determine which metro has the lowest average earnings.

## all the most recent wage figures have a '(P)' (for 'Preliminary') after the number
## using this fact, one can easily grab all those most recent figures
recent_hourly_earnings = browser.find_elements_by_xpath("//table/tbody/tr/td[contains(text(), '(P)')]")
cleaned_recent_earnings = []
for wage in recent_hourly_earnings:
    ## use regex to exclude the '(P)' and grab just the wage figure
    cleaned_wage = re.search(r'\d+\.\d+', wage.text).group()
    ## we want to convert to floats for finding the minimum later on
    cleaned_recent_earnings.append(float(cleaned_wage))
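
The cleaning step can be tried on its own, without a browser. The sketch below applies the same kind of pattern, one or more digits, a literal dot, then one or more digits, to a few invented cell strings; the sample values and the `clean_wage` helper are hypothetical, not taken from the BLS page.

```python
import re

# One or more digits, a literal dot, then one or more digits.
WAGE_PATTERN = re.compile(r"\d+\.\d+")


def clean_wage(cell_text):
    """Return the first decimal number found in cell_text as a float."""
    match = WAGE_PATTERN.search(cell_text)
    if match is None:
        raise ValueError("no wage figure in {!r}".format(cell_text))
    return float(match.group())


# Made-up cell contents in the 'number followed by (P)' shape described above.
sample_cells = ["23.45(P)", " 27.10(P) ", "31.02(P)"]
cleaned = [clean_wage(text) for text in sample_cells]
print(cleaned)
```

Raising an error on a cell with no number is a design choice for the sketch: calling `.group()` directly on a failed `re.search` would instead fail with an `AttributeError` on `None`, which is harder to trace back to the offending cell.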

Great, now Todd has the wages. But he also wants the names of the metros to pair with the wages so he knows what wages go with what metros. The only problem is that the metro name is lumped into the same element as several other pieces of info in the text above each table. But Todd sees a way to extract it.

## get all the groups of text above each table
all_table_texts = browser.find_elements_by_xpath("//table/caption/pre")
metros = []
for table_text in all_table_texts:
    ## use regex to pick out just the metro name from all the text
    ## the name starts with a word character, followed by any combo of word characters and spaces,
    ## has a comma and a space and ends with two uppercase letters (the state abbreviation)
    metro_name = re.search(r'\w[\w ]+, [A-Z][A-Z]', table_text.text).group()
    metros.append(metro_name)
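
Here is a small, browser-free sketch of how that pattern behaves. The caption strings are invented stand-ins for the text above each table, so their layout is an assumption. The second example shows a limitation worth knowing: because the character class allows only word characters and spaces, a hyphenated metro name is matched only from its last hyphen onward. `HYPHEN_AWARE_PATTERN` is a suggested variant, not something from the original script.

```python
import re

# The pattern from the script: word characters and spaces, then ', XX'.
METRO_PATTERN = re.compile(r"\w[\w ]+, [A-Z][A-Z]")

# Assumed variant that also allows periods and hyphens inside the name.
HYPHEN_AWARE_PATTERN = re.compile(r"\w[\w .\-]+, [A-Z][A-Z]")

# Hypothetical caption text; the real page layout may differ.
simple_caption = "State: Arizona\nArea: Flagstaff, AZ\nSupersector: Total Private"
hyphen_caption = "State: Arizona\nArea: Lake Havasu City-Kingman, AZ\nSupersector: Total Private"

simple_name = METRO_PATTERN.search(simple_caption).group()
truncated_name = METRO_PATTERN.search(hyphen_caption).group()
full_name = HYPHEN_AWARE_PATTERN.search(hyphen_caption).group()

print(simple_name)
print(truncated_name)
print(full_name)
```

Whether the truncation matters depends on the use: the shortened name still identifies the table, but it will not match the official metro name if the result is joined against other data.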

With a few final lines of code, Todd zips together the metro names and wage data and computes the minimum wage to return the name of the metro in the US Southwest with the lowest average hourly earnings.

metro_earnings_dict = dict(zip(metros, cleaned_recent_earnings))
metro_to_move_to = min(metro_earnings_dict, key=metro_earnings_dict.get)
print(metro_to_move_to)
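
To make the pairing step concrete, here is a minimal sketch with invented numbers; the wage figures below are placeholders, not real BLS data. It also adds a length check, since `zip` silently stops at the shorter list and a missed table caption or wage cell would otherwise misalign names and wages without any error.

```python
# Hypothetical inputs standing in for the two scraped lists.
metros = ["Flagstaff, AZ", "Yuma, AZ", "Pueblo, CO"]
cleaned_recent_earnings = [24.10, 19.85, 21.30]  # made-up wages

# zip() truncates to the shorter input, so check the lengths first.
if len(metros) != len(cleaned_recent_earnings):
    raise ValueError("metro names and wages are out of step")

metro_earnings = dict(zip(metros, cleaned_recent_earnings))

# min() over a dict iterates its keys; key=.get compares them by their wage.
lowest = min(metro_earnings, key=metro_earnings.get)

# Sorting the items by wage gives the full ranking, lowest first.
ranked = sorted(metro_earnings.items(), key=lambda item: item[1])

print(lowest)
print(ranked)
```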

Triumphantly, Todd runs the script, shoots his boss a quick email, and goes back to watching cat videos. He gets several similar requests throughout the day, but with his script’s help Todd blessedly incurs only minor disruptions to his cat-watching regimen.

import re
from selenium import webdriver

## remember to input your own file path to your chrome driver here if you're using chrome
browser = webdriver.Chrome(executable_path=r"C:\Users\gstanton\Downloads\chromedriver.exe")
browser.maximize_window()
url = 'http://www.bls.gov/data/'
browser.get(url)

multi_screen_data_search = browser.find_element_by_xpath("//img[@title='Multi Screen Data Search for CES State and Metro Area']")
multi_screen_data_search.click()

not_seasonally_adjusted = browser.find_elements_by_xpath("//input[@name='seasonal']")
not_seasonally_adjusted[1].click()
next_form = browser.find_element_by_xpath("//input[@value='Next form']")
next_form.click()

average_hourly_earnings = browser.find_element_by_xpath("//select/option[@value='03']")
average_hourly_earnings.click()
next_form = browser.find_element_by_xpath("//input[@value='Next form']")
next_form.click()

## selected states: AZ, CA, CO, NV, NM, UT
state_values = ['04', '06', '08', '32', '35', '49']
for value in state_values:
    state_element = browser.find_element_by_xpath("//select/option[@value='{}']".format(value))
    state_element.click()

next_form = browser.find_element_by_xpath("//input[@value='Next form']")
next_form.click()

all_metro_elements = browser.find_elements_by_xpath("//select[@name='area_code']/option")
for metro_element in all_metro_elements:
    metro_element.click()

## de-select the statewide option at the top, we just want metros
statewide_option = browser.find_element_by_xpath("//option[@value='00000']")
statewide_option.click()
next_form = browser.find_element_by_xpath("//input[@value='Next form']")
next_form.click()

total_private_option = browser.find_element_by_xpath("//option[@value='05']")
total_private_option.click()
next_form = browser.find_element_by_xpath("//input[@value='Next form']")
next_form.click()

total_private_option = browser.find_element_by_xpath("//option[@value='05000000']")
total_private_option.click()
next_form = browser.find_element_by_xpath("//input[@value='Next form']")
next_form.click()

retrieve_data = browser.find_element_by_xpath("//input[@value='Retrieve data']")
retrieve_data.click()

## all the most recent wage figures have a '(P)' (for 'Preliminary') after the number
## using this fact, one can easily grab all those most recent figures
recent_hourly_earnings = browser.find_elements_by_xpath("//table/tbody/tr/td[contains(text(), '(P)')]")
cleaned_recent_earnings = []
for wage in recent_hourly_earnings:
    ## use regex to exclude the '(P)' and grab just the wage figure
    cleaned_wage = re.search(r'\d+\.\d+', wage.text).group()
    ## we want to convert to floats for finding the minimum
    cleaned_recent_earnings.append(float(cleaned_wage))

## get all the groups of text above each table
all_table_texts = browser.find_elements_by_xpath("//table/caption/pre")
metros = []
for table_text in all_table_texts:
    ## use regex to pick out just the metro name
    ## it starts with a word character, followed by any combo of word characters and spaces...
    ## has a comma and a space and ends with two uppercase letters (the state abbreviation)
    metro_name = re.search(r'\w[\w ]+, [A-Z][A-Z]', table_text.text).group()
    metros.append(metro_name)

metro_earnings_dict = dict(zip(metros, cleaned_recent_earnings))
metro_to_move_to = min(metro_earnings_dict, key=metro_earnings_dict.get)
print(metro_to_move_to)

Reposted from: http://ueqwd.baihongyu.com/
