2015-11-04

football2sms

Crawler

A crawler that pushes football scores

Tags (space-separated): crawler

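The GitHub repository is here.
The main trigger was that I blew through my mobile data allowance just as the fixture list got crowded, so I wanted a small crawler that pushes the day's results and match reports. My first target was Dongqiudi (懂球帝), but its content turned out to be loaded dynamically. Eventually I found NetEase Sports (网易体育), a site that is very kind to anyone writing a crawler. Now that we have a site, we can start analysing it.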

 <tr>
<td>11</td>
<td>02:45</td>
<td>完场</td>
<td><span class="c1"><a href="/58/team/7431.html">马尔默</a><img src="http://imgsize.ph.126.net/?imgurl=http://goal.sports.163.com/teamlogo/7431.png_20x20x1.jpg" width="20" height="20" onerror="this.src='http://img1.cache.netease.com/sports/2009/goal/logo/default_team_20.gif'" /></span></td>#home team
<td><span class="c3"><a href="/58/match/stat/2015/1574739.html" target="_blank">1-0</a></span></td>#score
<td><span class="c2"><img src="http://imgsize.ph.126.net/?imgurl=http://goal.sports.163.com/teamlogo/6457.png_20x20x1.jpg" width="20" height="20" onerror="this.src='http://img1.cache.netease.com/sports/2009/goal/logo/default_team_20.gif'" /><a href="/58/team/6457.html">顿涅茨克矿工</a></span></td>#away team
<td>&nbsp;</td>
<td class="bg2 bg7">
	<a href="/58/match/stat/2015/1574739.html" target="_blank">统计</a> | <span class="cur_hand"  id="check_1574739_58_2015" style="cursor: pointer;" >查看详细</span> <img src="http://img1.cache.netease.com/sports/2009/goal/slbg33.gif" width="5" height="13" /> | 
	
<a href="http://caipiao.163.com/order/preBet_jczqspfp.html&&t=2325#from=sj1" target="_blank">投注</a>
 </td>
 </tr>

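It is easy to see from this snippet that the text of the span with class='c1' is the home team, the span with class='c2' is the away team, and the span with class='c3' is the score. From there the code more or less writes itself!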

# -*- coding: utf-8 -*-
import urllib2

from bs4 import BeautifulSoup


def get_bifeng(mytime):
    # mytime is a date string such as '20151101'
    date = mytime
    goal_url = 'http://goal.sports.163.com/schedule/' + date + '.html'  # build the schedule URL
    response = urllib2.urlopen(goal_url)
    page = response.read()
    soup = BeautifulSoup(page)  # parse the page with BeautifulSoup
    tag_zhudui = soup.find_all('span', 'c1')  # home teams
    tag_kedui = soup.find_all('span', 'c2')  # away teams
    tag_bifeng = soup.find_all('span', 'c3')  # scores
    temp_bifeng = []
    bifeng = ' '

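This code easily yields the lists of home teams, away teams and scores. To merge the three lists and format the output, I went with the fairly simple approach of merging them two at a time and then building the output string. The code is as follows: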

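    # still inside get_bifeng: pair each home team with its score first
    for (i, j) in zip(tag_zhudui, tag_bifeng):
        temp_bifeng.append(i.get_text().encode('utf-8') + ' ' + j.get_text().encode('utf-8') + ' ')
    # then append the away team to each "home score" string
    for (i, j) in zip(tag_kedui, temp_bifeng):
        bifeng += (j + ' ' + i.get_text().encode('utf-8') + '\n')
    return bifeng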

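This function has a flaw: now and then some of the day's matches have not finished yet, which leaves a home team and an away team with no matching score entry, so the three lists no longer line up. I'll fix this bug later.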
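
One possible way around this misalignment is to read the table row by row instead of building three separate lists. The following is only a sketch under a few assumptions: it is written for Python 3 and uses the standard library's `html.parser` rather than BeautifulSoup so that it runs without extra packages, the names `ScheduleParser` and `format_scores` are made up for illustration, and the second row of the sample (a match with no `c3` span) is invented to show the unfinished case.

```python
from html.parser import HTMLParser


class ScheduleParser(HTMLParser):
    """Collect home team, score and away team for each <tr> separately."""

    # span class -> field name, following the c1/c2/c3 convention of the page
    FIELDS = {"c1": "home", "c3": "score", "c2": "away"}

    def __init__(self):
        super().__init__()
        self.rows = []       # one dict per match row
        self.current = None  # dict for the row currently being parsed
        self.field = None    # field fed by the span we are inside, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.current = {"home": "", "score": "", "away": ""}
        elif tag == "span" and self.current is not None:
            self.field = self.FIELDS.get(dict(attrs).get("class"))

    def handle_endtag(self, tag):
        if tag == "span":
            self.field = None
        elif tag == "tr" and self.current is not None:
            # keep only rows that really describe a match
            if self.current["home"] and self.current["away"]:
                self.rows.append(self.current)
            self.current = None

    def handle_data(self, data):
        if self.current is not None and self.field:
            self.current[self.field] += data.strip()


def format_scores(html):
    """Return one 'home score away' line per match; 'vs' marks a missing score."""
    parser = ScheduleParser()
    parser.feed(html)
    return "\n".join(
        "{} {} {}".format(r["home"], r["score"] or "vs", r["away"])
        for r in parser.rows
    )


# First row is taken from the page excerpt above; the second row is invented.
SAMPLE = """
<table>
<tr><td><span class="c1"><a href="/58/team/7431.html">马尔默</a></span></td>
<td><span class="c3"><a href="/58/match/stat/2015/1574739.html">1-0</a></span></td>
<td><span class="c2"><a href="/58/team/6457.html">顿涅茨克矿工</a></span></td></tr>
<tr><td><span class="c1"><a href="#">Home B</a></span></td>
<td></td>
<td><span class="c2"><a href="#">Away B</a></span></td></tr>
</table>
"""

print(format_scores(SAMPLE))
```

Because every row carries its own three fields, a missing score only affects that one line and cannot shift the scores of the matches after it.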

Next up is fetching the match reports. As before, let's look at the page source first:

<tr>
                    <td>11</td>
                    <td>02:45</td>
                    <td>完场</td>
                    <td><span class="c1"><a href="/58/team/6409.html">巴黎圣日耳曼</a><img src="http://imgsize.ph.126.net/?imgurl=http://goal.sports.163.com/teamlogo/6409.png_20x20x1.jpg" width="20" height="20" onerror="this.src='http://img1.cache.netease.com/sports/2009/goal/logo/default_team_20.gif'" /></span></td>
                    <td><span class="c3"><a href="/58/match/stat/2015/1574752.html" target="_blank">0-0</a></span></td>
                    <td><span class="c2"><img src="http://imgsize.ph.126.net/?imgurl=http://goal.sports.163.com/teamlogo/6171.png_20x20x1.jpg" width="20" height="20" onerror="this.src='http://img1.cache.netease.com/sports/2009/goal/logo/default_team_20.gif'" /><a href="/58/team/6171.html">皇家马德里</a></span></td>
                    <td>&nbsp;</td>
                    <td class="bg2 bg7">#the tag for the match report is here!
                        <a href="/58/match/report/2015/1574752.html" target="_blank">战报</a> | <span class="cur_hand"  id="check_1574752_58_2015" style="cursor: pointer;" >查看详细</span> <img src="http://img1.cache.netease.com/sports/2009/goal/slbg33.gif" width="5" height="13" /> |

It is easy to see that the match report link sits in the td with class='bg2 bg7', which leads to this code:

def get_zhanbao(mytime):
    # mytime is a date string such as '20151101'
    date = mytime
    goal_url = 'http://goal.sports.163.com/schedule/' + date + '.html'  # build the schedule URL
    response = urllib2.urlopen(goal_url)
    page = response.read()
    soup = BeautifulSoup(page)  # parse the page with BeautifulSoup
    tag = soup.find_all(class_='bg2 bg7')  # find the cells that hold the links
    zhanbao = []
    mail_content = ''
    for i in tag:
        # Check whether the keyword is '统计' (statistics) or '战报' (match report).
        # Note that one Chinese character takes three bytes in UTF-8.
        if i.get_text().encode('utf-8')[7:13] == '战报':
            zhanbao.append('http://goal.sports.163.com' + i.find('a').get('href'))  # store every match report link
    for url in zhanbao:
        response = urllib2.urlopen(url)
        page = response.read()
        soup = BeautifulSoup(page)
        # On the report page the <b> (bold) tags hold the shots on goal, which is exactly what I need.
        tag = soup.find_all('b')
        if not tag:  # find_all returns a list, so comparing it with '' never matched
            continue
        for b in tag:
            mail_content += b.get_text().encode('utf-8') + '|'  # build the mail body
            print b  # debug output
    return mail_content
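
The `[7:13]` byte slice only matches while the cell text starts with exactly seven bytes before the keyword. As a hedged illustration of that dependency, here is a small Python 3 sketch that compares the fixed-offset check with one that simply ignores leading whitespace. The function names and the sample strings are made up for this example, and the seven leading spaces are an assumption rather than something taken from the real page.

```python
REPORT = "战报"  # link text that marks a match report on the schedule page


def is_report_by_slice(cell_text, offset=7):
    """Fixed-offset check: compare six UTF-8 bytes starting at `offset`."""
    raw = cell_text.encode("utf-8")
    # two Chinese characters occupy 2 * 3 = 6 bytes in UTF-8
    return raw[offset:offset + 6] == REPORT.encode("utf-8")


def is_report_by_prefix(cell_text):
    """Whitespace-tolerant check: strip leading whitespace, compare characters."""
    return cell_text.lstrip().startswith(REPORT)


seven_spaces = " " * 7 + "战报 | 查看详细"
three_spaces = " " * 3 + "战报 | 查看详细"

for text in (seven_spaces, three_spaces):
    print(is_report_by_slice(text), is_report_by_prefix(text))
```

If the site ever changes the indentation inside the cell, the fixed-offset version silently stops finding reports, while the prefix version keeps working.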

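Then finish off the remaining function code, and finally add the script to crontab and that's it. As for the email part, there are plenty of ways to send mail from Python described online, so I won't ramble on about it.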
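
For readers who want a starting point anyway, the glue code might look roughly like the sketch below. It is written for Python 3 with only the standard library, and everything specific in it is a placeholder: the function names, the addresses, the SMTP host and port, and the crontab line in the comment are assumptions for illustration, not taken from the actual project.

```python
import datetime
import smtplib
from email.mime.text import MIMEText


def schedule_date(day=None):
    """Return the YYYYMMDD string that the schedule URL expects."""
    day = day or datetime.date.today()
    return day.strftime("%Y%m%d")


def build_message(subject, body, sender, recipient):
    """Wrap the text in a UTF-8 plain-text mail."""
    msg = MIMEText(body, "plain", "utf-8")
    msg["Subject"] = subject
    msg["From"] = sender
    msg["To"] = recipient
    return msg


def send_message(msg, host="localhost", port=25):
    """Hand the mail to an SMTP server; host and port are placeholders."""
    server = smtplib.SMTP(host, port)
    try:
        server.sendmail(msg["From"], [msg["To"]], msg.as_string())
    finally:
        server.quit()


# A hypothetical crontab entry that runs the script every morning at 08:00:
#   0 8 * * * python /path/to/football2sms.py
if __name__ == "__main__":
    date = schedule_date()
    body = "scores and reports for " + date  # would come from the two functions above
    message = build_message("football " + date, body,
                            "me@example.com", "me@example.com")
    print(message["Subject"])
    # send_message(message)  # enable once a real SMTP server is configured
```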

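Things that could be improved:

Maybe later push a ready-made HTML page directly, which could carry more data in a more intuitive form (and would look a lot classier)
Fix the inexplicable extra spaces in the formatted output
See whether each match report can go out as its own email, which would look much better once it arrives as an SMS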
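
As a rough idea for the last point, the report text could be collected per match instead of being appended to one long string, and then turned into one subject and body per report. This is only a sketch in Python 3: `messages_per_report` and its input format (one list of text snippets per report) are hypothetical, and stripping each snippet is just one possible way to keep stray spaces out of the output.

```python
def messages_per_report(reports):
    """Turn per-report snippet lists into (subject, body) pairs, one per mail.

    `reports` is a list of lists: each inner list holds the text snippets
    taken from one match report page. Empty reports are skipped.
    """
    non_empty = [snippets for snippets in reports if snippets]
    total = len(non_empty)
    messages = []
    for number, snippets in enumerate(non_empty, start=1):
        subject = "report {}/{}".format(number, total)
        # strip each snippet so page whitespace does not end up in the SMS
        body = "|".join(s.strip() for s in snippets)
        messages.append((subject, body))
    return messages


for subject, body in messages_per_report([["first half ", " second half"], [], ["summary"]]):
    print(subject, body)
```

Each pair can then be sent as its own mail, so every SMS covers exactly one match.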