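Introduction to goquery for Go

goquery: a powerful tool for web scraping in Go
github.com/PuerkitoBio/goquery
goquery is built on Go's net/html package and the CSS selector library cascadia.
Because the net/html parser returns DOM nodes rather than a full DOM tree, jQuery's stateful manipulation functions (such as height(), css() and detach()) are not implemented.
goquery only supports UTF-8 input; documents in other encodings must be converted to UTF-8 first.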
package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	html := `<html>
<body><h1 id="title">春晓</h1>
<p>春眠不觉晓
处处闻啼鸟
夜来风雨声
花落知多少</p></body></html>
`
	dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
	if err != nil {
		log.Fatalln(err)
	}
	dom.Find("p").Each(func(i int, selection *goquery.Selection) {
		fmt.Println(selection.Text())
	})
}
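NewDocumentFromReader() returns a *Document and an error. The Document represents the HTML document to be manipulated.
Find() returns the descendants of each element in the current set of matched elements, filtered by a selector.
"p" is an element selector; it matches every p tag.
Each() is an iterator over the selected nodes. Its argument is an anonymous function that takes two parameters:
i is the index of the element, and selection is a Selection holding the element matched at that index.
Text() returns the combined text content of the matched elements.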

Common goquery selectors

Selectors based on HTML elements
The syntax is dom.Find("p"), which matches every p tag in the document.

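ID selector
When an element carries a unique id, the ID selector pinpoints it exactly.
An ID selector starts with # followed by the element's id value: dom.Find("#title")
matches the content whose id is "title" anywhere in the document.
dom.Find("p#title") matches only the p tag whose id is title.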

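Class selector
A class selector starts with . followed by the element's class value: dom.Find(".content")
matches every element in the document whose class is content.
To restrict the match to one tag, use dom.Find("div.content").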

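Attribute selector
Filters elements by their attributes and attribute values.
dom.Find("p[class=content]") matches every p tag in the document whose class attribute is content.
Custom attributes work as well.
Selector                 Description
Find("div[my]")          div elements that have a my attribute
Find("div[my=zh]")       div elements whose my attribute is zh
Find("div[my!=zh]")      div elements whose my attribute is not equal to zh
Find("div[my|=zh]")      div elements whose my attribute is zh or starts with zh-
Find("div[my*=zh]")      div elements whose my attribute contains the string zh
Find("div[my~=zh]")      div elements whose my attribute contains the word zh (words are separated by spaces)
Find("div[my$=zh]")      div elements whose my attribute ends with zh (case-sensitive)
Find("div[my^=zh]")      div elements whose my attribute starts with zh (case-sensitive)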

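parent > child selector
Selects the child elements directly under a given element.
Join the two selectors with >: dom.Find("div>p") selects the p tags that are direct children of a div tag.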
package main

import (
	"fmt"
	"log"
	"strings"
	"testing"

	"github.com/PuerkitoBio/goquery"
)

func TestFirst(t *testing.T) {
	html := `<body>
   <p>P1</p>
   <div>DIV1</div>
   <div>DIV2</div>
   <div>DIV3</div>
   <span><div>DIV4</div></span>
   <p>P2</p>
</body>`
	dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
	if err != nil {
		log.Fatalln(err)
	}
	dom.Find("body>div").Each(func(i int, selection *goquery.Selection) {
		fmt.Println(selection.Text())
	})
}
Output:
DIV1
DIV2
DIV3
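This selects the div elements that are direct children of the parent element body.
The result is the three elements DIV1, DIV2 and DIV3. DIV4 is also a descendant of body, but it is not a first-level child, so it is not selected.
To select DIV4 as well,
that is, every div under body whether at the first, second or Nth level,
just replace the greater-than sign (>) in the selector with a space: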
dom.Find("body div").Each(func(i int, selection *goquery.Selection) {
	fmt.Println(selection.Text())
})

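Adjacent selector: element + next
When the elements you want follow no pattern of their own but the element just before them does, use this next-adjacent selector.
<div>
<p my="a">a</p>
<p>b</p>
<p>c</p>
</div>
To select the tag that contains b:
dom.Find("p[my=a]+p") selects the p tag immediately following the p tag whose my attribute is a.
The syntax is ("prev+next"): a plus sign (+) in the middle, with a selector on each side of it.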

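Sibling selector: element~next
To select tags under the same parent that are not necessarily adjacent, use the sibling selector.
To select the tags that contain b and c in the fragment above:
dom.Find("p[my=a]~p") selects the sibling p tags that follow the p tag whose my attribute is a.

dom.Find("div[lang=zh]~p").Each(func(i int, selection *goquery.Selection) {
	fmt.Println(selection.Text())
}) // P2 is selected as well, because P2, P1 and DIV1 (assumed here to be the div with lang="zh") are all siblings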

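goquery filters

Filters narrow down the results that a selector has matched.
:contains filter
The selected elements must contain the given text, for example p tags that contain a.
dom.Find("p:contains(a)") selects the p tags whose content contains a.
Find(":has(selector)") is similar to :contains, except that what must be contained is an element node matching selector.
Find(":empty") selects only elements that have no children at all (text nodes included).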

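:first-child and :first-of-type filters
The selected element must be the first child of its parent; otherwise it is not selected.
Find("p:first-child") selects the p tags that are the first child of their parent.
The :first-child selector is strict: the element has to be the very first child.
dom.Find("div:first-child").Each(func(i int, selection *goquery.Selection) {
	inner, _ := selection.Html() // Html() returns (string, error)
	fmt.Println(inner)
})
If other elements come before it, :first-child no longer matches. That is where :first-of-type comes in: it only requires the element to be the first of its type.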
package main

import (
	"fmt"
	"log"
	"strings"
	"testing"

	"github.com/PuerkitoBio/goquery"
)

func TestFirst(t *testing.T) {
	html := `<body>
   <div>DIV1</div>
   <p>P1</p>
   <div>DIV2</div>
   <div>DIV3</div>
   <span>
       <p>P2</p>
       <div>DIV5</div>
   </span>
   <div>DIV6</div>
</body>`
	dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
	if err != nil {
		log.Fatalln(err)
	}
	dom.Find("div:first-of-type").Each(func(i int, selection *goquery.Selection) {
		inner, _ := selection.Html() // Html() returns (string, error)
		fmt.Println(inner)
	})
}
Compared with the earlier document, the original DIV4 has been replaced by P2.
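With :first-child,
DIV5 would not be selected, because it is not the first child: P2 comes before it.
With :first-of-type it is selected, because that filter only requires the first element of the same type. DIV5 is the first div under its parent, and P2, not being a div, is ignored.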

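:last-child and :last-of-type filters
These are the mirror images of :first-child and :first-of-type above: they match the last child, or the last element of its type, instead of the first.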

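:nth-child(n) filter
The selected element is the nth child of its parent, with n starting at 1, so :first-child and :nth-child(1) are equivalent. Choose n to select the element you need.
dom.Find("div:nth-child(3)").Each(func(i int, selection *goquery.Selection) {
	inner, _ := selection.Html() // Html() returns (string, error)
	fmt.Println(inner)
}) // selects DIV2, because DIV2 is the third child of its parent body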

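:nth-of-type(n) filter
:nth-of-type(n) is similar to :nth-child(n) but counts only elements of the same type, so :nth-of-type(1) and :first-of-type are equivalent.

:nth-last-child(n) and :nth-last-of-type(n) filters
These two work like the ones above, except that counting starts from the end: the last element is treated as the first.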

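:only-child and :only-of-type filters
:only-child selects an element that is the sole child of its parent (an "only child", so to speak).
html := `<body>
   <div>DIV1</div>
   <span>
       <div>DIV5</div>
   </span>
</body>
`
dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
if err != nil {
	log.Fatalln(err)
}
dom.Find("div:only-child").Each(func(i int, selection *goquery.Selection) {
	inner, _ := selection.Html() // Html() returns (string, error)
	fmt.Println(inner)
}) // DIV5 is selected because it is the only child of its parent span
dom.Find("div:only-of-type").Each(func(i int, selection *goquery.Selection) {
	inner, _ := selection.Html()
	fmt.Println(inner)
}) // an element only has to be the sole one of its type under its parent: DIV1 and DIV5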

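goquery selector OR (|) operation

To select div, span and other elements at the same time,
combine several selectors separated by commas: Find("selector1, selector2, selectorN").
An element is selected as long as it satisfies any one of the selectors, which makes this the OR (|) operation on selectors.
package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	html := `<body>
   <div>DIV1</div>
   <span>
       <div>DIV5</div>
   </span>
</body>
`
	dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
	if err != nil {
		log.Fatalln(err)
	}
	dom.Find("div,span").Each(func(i int, selection *goquery.Selection) {
		inner, _ := selection.Html() // Html() returns (string, error)
		fmt.Println(inner)
	})
}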

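Common goquery methods

Positional operations, similar to the jQuery functions
Find(selector string) *Selection //finds the descendant nodes that match the selector
Eq(index int) *Selection //reduces the selection to the node at the given index
First() *Selection //reduces the selection to its first node
Last() *Selection //reduces the selection to its last node
Next() *Selection //gets the next sibling of each node
NextAll() *Selection //gets all following siblings of each node
Prev() *Selection //gets the previous sibling of each node
Get(index int) *html.Node //gets the node at the given index
Index() int //returns the position of the first node in the selection relative to its siblings
Slice(start, end int) *Selection //reduces the selection to the nodes in the range [start, end)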

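Iterating over the selected nodes
Each(f func(int, *Selection)) *Selection //iterates over every node
EachWithBreak(f func(int, *Selection) bool) *Selection //iteration that stops when f returns false
Map(f func(int, *Selection) string) (result []string) //returns a slice of strings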

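Checking or getting node attribute values
Attr(), RemoveAttr(), SetAttr() //get, remove and set an attribute value
AddClass(), HasClass(), RemoveClass(), ToggleClass() //add, test, remove and toggle a class
Html() //gets the inner HTML of the first node in the selection
Length() //returns the number of elements in the Selection
Text() //gets the combined text of the selected nodes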
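
Moving around the document tree (common node-finding methods)
Children() //returns the child elements of each node in the selection
Contents() //returns the children of each node, including text and comment nodes
Find() //finds the descendants of the current nodes that match a selector
Next() //the next sibling element
Prev() //the previous sibling element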

Summary
goquery is a powerful tool for parsing HTML pages; with goquery selectors, scraping work gets done with half the effort.
