2015-08-19

hipdacrawler

Crawler

Scraping post bodies from the D board with a Python crawler

Tags (space-separated): crawler

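There's honestly not much new this time; the tools are much the same as in the previous post. The main work is still processing the page source.
Two things changed:
1. Processing the thread list
2. Getting the post body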

1. Processing the thread list to get the URLs:

<td class="folder">
<a href="viewthread.php?tid=1686592&amp;extra=page%3D1" title="新窗口打开" target="_blank">
<img src="images/default/folder_new.gif" /></a>
</td>
<td class="icon">
&nbsp;</td>
<th class="subject new">
<label>&nbsp;</label>
 <span id="thread_1686592"><a href="viewthread.php?tid=1686592&amp;extra=page%3D1">金庸群侠传X for ios free</a></span>
<img src="images/attachicons/image_s.gif" alt="图片附件" class="attach" />
<span class="threadpages">&nbsp;...<a href="viewthread.php?tid=1686592&amp;extra=page%3D1&amp;page=2">2</a><a href="viewthread.php?tid=1686592&amp;extra=page%3D1&amp;page=3">3</a><a href="viewthread.php?tid=1686592&amp;extra=page%3D1&amp;page=4">4</a><a href="viewthread.php?tid=1686592&amp;extra=page%3D1&amp;page=5">5</a></span>
</th>

From this we can see that the thread URL sits in the tag

<a href="viewthread.php?tid=1686592&extra=page%3D1">

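However, this link appears twice for each thread. At first I tried to single it out with title="新窗口打开", but that didn't work, so in the end I just added a counter and kept every other link, which did the job. The code: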

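# `urls` (the <a> tags from the thread list) and `num` (the task number)
# are assumed to be defined earlier in the script.
counter = 2
with open('temp_' + str(num) + '.txt', 'a+') as f:
    for i in urls:
        # Each thread link shows up twice, so keep only every other one
        if counter % 2 == 0:
            f.write('http://www.hi-pda.com/forum/' + str(i.get('href')) + '\n')
        counter += 1
print('get url task ' + str(num) + ' done!!!')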
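As an alternative to the counter, here is a minimal sketch that picks out only the title link inside `<span id="thread_...">`, which occurs once per thread in the markup shown above. It swaps BeautifulSoup for a plain regular expression so that it needs nothing beyond the Python 3 standard library. The names `BASE_URL`, `THREAD_LINK` and `thread_urls` are made up for this illustration, and the pattern assumes the list page always looks exactly like the sample.

```python
import re
from html import unescape
from urllib.parse import urljoin

# Forum root used in the original code
BASE_URL = 'http://www.hi-pda.com/forum/'

# Hypothetical pattern: the title link directly inside <span id="thread_NNN">
THREAD_LINK = re.compile(r'<span id="thread_\d+"><a href="([^"]+)"')


def thread_urls(list_html):
    """Return one absolute URL per thread, in page order."""
    # unescape() turns &amp; back into & before the URL is joined
    return [urljoin(BASE_URL, unescape(href))
            for href in THREAD_LINK.findall(list_html)]


# Trimmed copy of the thread-list markup shown above
sample = (
    '<td class="folder">'
    '<a href="viewthread.php?tid=1686592&amp;extra=page%3D1" '
    'title="新窗口打开" target="_blank">'
    '<img src="images/default/folder_new.gif" /></a></td>'
    '<th class="subject new">'
    '<span id="thread_1686592">'
    '<a href="viewthread.php?tid=1686592&amp;extra=page%3D1">'
    '金庸群侠传X for ios free</a></span></th>'
)

print(thread_urls(sample))
```

Because the folder-icon link and the pagination links are not inside the `thread_` span, they are skipped without any counting.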

2. Getting the post body:
This part felt fairly simple to me.

<div class="t_msgfontfix">
<table cellspacing="0" cellpadding="0"><tr><td class="t_msgfont" id="postmessage_32370210">家里是电信10年老用户，升级到100M光纤了，家里有iphone，ipad，笔记本，小米盒子等需要连接无线路由器，求推荐个符合宽带和使用需求的路由器，目前是几十块的，感觉不行，看动画片都卡</td></tr></table>
</div>

From this it's easy to see that the key to match on is id="postmessage_" followed by any digits, and that filter can be built directly in BeautifulSoup.

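import re
from bs4 import BeautifulSoup

# `opener` and `num` are assumed to be defined earlier in the script.
with open('temp_' + str(num) + '.txt', 'r') as f:
    for line in f:
        # Strip the trailing newline before requesting the URL
        response = opener.open(line.strip())
        page = BeautifulSoup(response)
        # Every post body has an id that contains "postmessage_"
        content = page.find_all(id=re.compile('postmessage_'))
        with open('temp_content.txt', 'a+') as fcontent:
            for i in content:
                # Python 2: write the text out as UTF-8 bytes
                fcontent.write(i.get_text().encode('utf-8'))
print('task ' + str(num) + ' done!!!')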
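To make the id filter a little more concrete, here is a small sketch of how the pattern behaves on its own, outside BeautifulSoup. As far as I know, BeautifulSoup tests a compiled pattern against each attribute value with `search()`, so `re.compile('postmessage_')` accepts any id that merely contains that text. The `strict` pattern, the `matching` helper and the sample ids below are illustrative additions, not part of the original script.

```python
import re

# Pattern used in the post: any id that contains "postmessage_"
loose = re.compile('postmessage_')

# Stricter variant (an assumption): the whole id must be postmessage_<digits>
strict = re.compile(r'^postmessage_\d+$')

# Made-up ids; None stands for a tag that has no id attribute
ids = ['postmessage_32370210', 'thread_1686592', 'postmessage_abc', None]


def matching(pattern, candidates):
    """Keep the ids in which the pattern finds a match, skipping missing ids."""
    return [c for c in candidates if c is not None and pattern.search(c)]


print(matching(loose, ids))
print(matching(strict, ids))
```

For this forum the loose pattern is probably enough; the strict one only matters if some other element happens to reuse the `postmessage_` prefix.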

Final result:
[image]

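Figure 3 also shows a problem: replies that quote another post raise the duplication rate, which is bound to affect keyword analysis later on, and so far I haven't come up with a good way to remove them. The next practice project is scraping the erotic stories on cl!!!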
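On the quoted-reply duplication, one possible direction, offered only as a sketch: if the forum wraps quoted text in a `<blockquote>` element (an assumption about the markup that the samples above do not confirm), the quote could be skipped while the post text is collected. The version below uses the Python 3 standard library `html.parser` instead of BeautifulSoup, and `PostTextExtractor`, `text_without_quotes` and the sample HTML are all hypothetical.

```python
from html.parser import HTMLParser


class PostTextExtractor(HTMLParser):
    """Collect text, ignoring anything nested inside a <blockquote>."""

    def __init__(self):
        super().__init__()
        self.quote_depth = 0  # how many <blockquote> elements we are inside
        self.parts = []       # text fragments found outside any quote

    def handle_starttag(self, tag, attrs):
        if tag == 'blockquote':
            self.quote_depth += 1

    def handle_endtag(self, tag):
        if tag == 'blockquote' and self.quote_depth > 0:
            self.quote_depth -= 1

    def handle_data(self, data):
        if self.quote_depth == 0:
            self.parts.append(data)


def text_without_quotes(post_html):
    """Return the post text with quoted replies left out."""
    parser = PostTextExtractor()
    parser.feed(post_html)
    parser.close()
    return ''.join(parser.parts).strip()


# Made-up post: a quoted earlier reply followed by the poster's own words
sample = (
    '<td class="t_msgfont" id="postmessage_1">'
    '<div class="quote"><blockquote>earlier post being quoted</blockquote></div>'
    'my own reply</td>'
)

print(text_without_quotes(sample))
```

Whether this helps depends entirely on how the real pages mark up quotes, so the tag name would need checking against an actual thread first.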