Shell Script - A Simple Crawler

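Published: May 23, 2021
  1. Why I wrote this crawler: every time I downloaded a TV series, I had to click through many pages by hand before Thunder (Xunlei) could download it, which was tedious. Thunder monitors the clipboard, so once the torrent links are copied to the clipboard it starts up and downloads the files automatically.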

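  2. Algorithm:

a. Use curl to fetch the target URL and write the output to an HTML source file [thunder.html]. This file is deleted when the script finishes.

b. Parse thunder.html, extract all torrent links, and append them to a result file [thunder.thdresult]. This file is also deleted when the script finishes. The parsing steps are:

  • Use grep to extract the tag strings that contain a torrent link value.

  • Use sed to extract the torrent link value from each tag string.

c. Read the result file and copy its contents to the clipboard (my machine runs macOS).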
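
The steps above can be sketched as one small pipeline. This is only a minimal illustration, not the script from this post: the function name `extract_links`, the sample HTML, the `thunder://EXAMPLE-…` links, and the `[^"]*` capture are all assumptions made for the demo; only the `class="down_url"` and `value` attribute names come from the configuration in this post.

```shell
#!/bin/bash
# Hypothetical helper: read HTML on stdin, print one link per line.
extract_links() {
    # Step b.1: keep only the class/value attributes of matching tags.
    grep -Eo 'class="down_url" value="[^<>"]*"' |
        # Step b.2: strip everything around the quoted value.
        sed 's/.*value="\([^"]*\)".*/\1/'
}

# Made-up sample page standing in for the file that step a (curl) would produce.
sample_html='<li><input name="down_url_list_0" class="down_url" value="thunder://EXAMPLE-LINK-1" file_name="ep1.mp4"/></li>
<li><p>no link on this line</p></li>
<li><input name="down_url_list_0" class="down_url" value="thunder://EXAMPLE-LINK-2" file_name="ep2.mp4"/></li>'

links=$(printf '%s\n' "$sample_html" | extract_links)
printf '%s\n' "$links"

# Step c: on macOS the result could then be sent to the clipboard, e.g.
#   printf '%s\n' "$links" | pbcopy
```

Running the sketch prints the two example links, one per line.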


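  3. Prerequisites

a. Shell basics, such as if tests, variable assignment, and sourcing external files.

b. Extracting substrings with grep regular expressions.

c. Extracting substrings with sed regular expressions.

d. Reading and writing files.

e. Using the clipboard.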
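
For readers new to these tools, the short sketch below touches each prerequisite except the clipboard. The sample line, URL, and variable names are made up for illustration.

```shell
#!/bin/bash
line='<a href="https://example.com/a.torrent">episode 1</a>'

# grep -E enables extended regular expressions; -o prints only the matched part.
attr=$(printf '%s\n' "$line" | grep -Eo 'href="[^"]*"')

# sed: \( \) captures a group, and \1 replays it in the replacement.
url=$(printf '%s\n' "$attr" | sed 's/href="\(.*\)"/\1/')

# if + test -n: true when the string is non-empty.
if test -n "$url"; then
    out_file=$(mktemp)
    # >> appends to a file, whereas > would overwrite it.
    echo "$url" >> "$out_file"
    # Read the file back line by line.
    while IFS= read -r saved; do
        echo "saved: $saved"
    done < "$out_file"
    rm -f "$out_file"
fi
```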

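  4. Code (I write code comments in English to practice the language)

a. The code uses a page from meijutt (美剧天天) as the example.

b. File layout

  • Configuration file [.meijutt]: holds the settings for a particular site or use case.

  • Shared script [source_from_url_common.sh]: implements all of the functionality, including downloading the page to a file, parsing the file, writing the result file, reading it into the clipboard, and deleting the two intermediate files.

  • Entry script [meijutt.sh]: the script that is actually run.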

  5. Code walkthrough

a. Entry script [meijutt.sh]

SHELL_FOLDER=$(dirname "$0") # get the directory of the running script

source "$SHELL_FOLDER/.meijutt" # load the configuration file

source "$SHELL_FOLDER/source_from_url_common.sh" # load the shared script, which does all the work
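
One detail worth spelling out: a file loaded with `source` runs inside the calling shell, so it sees the caller's variables and, when `source` is given no extra arguments, the caller's positional parameters. That is presumably how a URL passed to meijutt.sh reaches the `$1` check in the shared script. The sketch below demonstrates the behavior with hypothetical file names (`demo.conf`, `common.sh`, `run.sh`) in a temporary directory.

```shell
#!/bin/bash
demo_dir=$(mktemp -d)

# Hypothetical config file: only variable assignments.
cat > "$demo_dir/demo.conf" <<'EOF'
GREETING="hello from config"
EOF

# Hypothetical shared script: uses the config variable and the caller's $1.
cat > "$demo_dir/common.sh" <<'EOF'
echo "$GREETING, url=$1"
EOF

# Hypothetical entry script, mirroring the structure of meijutt.sh.
cat > "$demo_dir/run.sh" <<'EOF'
#!/bin/bash
SHELL_FOLDER=$(dirname "$0")
source "$SHELL_FOLDER/demo.conf"
source "$SHELL_FOLDER/common.sh"
EOF

output=$(bash "$demo_dir/run.sh" "https://example.com/page.html")
echo "$output"
rm -rf "$demo_dir"
```

The entry script never mentions the URL itself, yet the sourced file prints it.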


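b. Configuration file [.meijutt]

# Thunder's data directory, used to store the files produced while the script runs

THUNDER_DATA_DIR=$THUNDER_ROOT_DIR

# regular expression for grep to extract the tag string; the value must not contain < or >

PATTEN_TO_GREP='name="down_url_list_0" class="down_url" value="[^<>]*" file_name='

# regular expression for sed to extract the torrent link

PATTEN_TO_SED='.*value="\(.*\)".*'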
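
Note that `.*` in the sed pattern is greedy. Applied to a whole tag, the capture would run to the last double quote on the line; it returns the bare link only when its input is a single `value="..."` token. In the script this works because the `for` loop splits grep's output on whitespace before calling sed. The sketch below shows the difference on a made-up tag; `STRICT_PATTERN` is a hypothetical alternative, not part of the original configuration.

```shell
#!/bin/bash
PATTEN_TO_SED='.*value="\(.*\)".*'        # pattern from the config file
STRICT_PATTERN='.*value="\([^"]*\)".*'    # hypothetical stricter variant

# Made-up inputs: a whole tag fragment and a single whitespace-free token.
tag='class="down_url" value="thunder://EXAMPLE" file_name="demo.mp4"'
token='value="thunder://EXAMPLE"'

greedy_on_tag=$(echo "$tag" | sed "s/$PATTEN_TO_SED/\1/")
greedy_on_token=$(echo "$token" | sed "s/$PATTEN_TO_SED/\1/")
strict_on_tag=$(echo "$tag" | sed "s/$STRICT_PATTERN/\1/")

echo "greedy on tag:   $greedy_on_tag"
echo "greedy on token: $greedy_on_token"
echo "strict on tag:   $strict_on_tag"
```

Only the first result still carries the trailing `file_name` attribute.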


c. Shared script [source_from_url_common.sh]. Only excerpts are shown here; the full code is at the bottom.

  • Fetch the URL content and write it to a file

thd_file_name="thunder"
temp_file="$root_dir/$thd_file_name.html"
curl "$url" -o "$temp_file"
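
A plain `curl URL -o FILE` also saves error pages and does not follow redirects. If that matters, the download could be wrapped as in the sketch below. `fetch_page` is a hypothetical helper name, and the flags are a suggestion rather than what the original script uses.

```shell
#!/bin/bash
# Hypothetical helper: download a page to a file and report failures.
fetch_page() {
    page_url=$1
    page_file=$2
    if [ -z "$page_url" ] || [ -z "$page_file" ]; then
        echo "usage: fetch_page URL OUTPUT_FILE" >&2
        return 2
    fi
    # -f: exit non-zero on HTTP errors instead of saving the error page
    # -sS: hide the progress meter but still print error messages
    # -L: follow redirects
    curl -fsSL "$page_url" -o "$page_file"
}
```

In the shared script it might be called as `fetch_page "$url" "$temp_file" || exit 1`.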
  • Parse the HTML file and append the parsed results to the result file

while read line
do
    ### extract the substring from the source data with the grep pattern
    result=` echo $line | grep -Eo "$patten1" `
    ### test whether the line matched the grep pattern
    if test -n "$result" ; then
        # the result may hold multiple values, so traverse it
        for r_item in $result; do
            # parse the torrent value from the result
            result_sed=` echo $r_item | sed "s/$patten2/\1/g" `
            # The sed command has two possible outcomes: on a successful match
            # it returns the torrent link value, and on a failed match it
            # returns the source string unchanged.
            # So we need to test whether the match succeeded:
            # if the two strings are not equal, the match was successful.
            if [ "$result_sed" != "$r_item" ] ; then
                #echo $result_sed;
                #exit;
                # write the torrent link value into the file
                echo $result_sed >> "$target_file_path";
            fi
        done
    fi
done < "$temp_file"
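
The equality check in the loop exists because `sed 's/…/\1/'` prints its input unchanged when nothing matches. One alternative, sketched below, is `sed -n` with the `p` flag, which prints only the lines where the substitution succeeded. This is a suggestion rather than the method used in this post; the function name `values_only` and the sample tokens are made up.

```shell
#!/bin/bash
PATTEN_TO_SED='.*value="\(.*\)".*'   # pattern from the config file

# Hypothetical helper: read space-separated tokens on stdin,
# put each token on its own line, and print only the captured values.
values_only() {
    tr ' ' '\n' | sed -n "s/$PATTEN_TO_SED/\1/p"
}

# Made-up grep result with the same shape as the one the loop iterates over.
tokens='name="down_url_list_0" class="down_url" value="thunder://EXAMPLE" file_name='
result=$(echo "$tokens" | values_only)
echo "$result"
```

Tokens that do not match the pattern produce no output at all, so no comparison is needed.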
  • Read the file and copy it to the clipboard

# copy the file to the clipboard
cat "$target_file_path" | pbcopy
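
`pbcopy` ships only with macOS. For readers on other systems, the sketch below picks whichever clipboard tool is installed. The helper names are hypothetical, and `xclip` and `wl-copy` are assumptions about what a Linux machine might have; neither appears in the original script.

```shell
#!/bin/bash
# Hypothetical helper: print the name of the first clipboard tool found.
pick_clipboard_tool() {
    for tool in pbcopy xclip wl-copy; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            return 0
        fi
    done
    return 1
}

# Hypothetical helper: copy stdin to the clipboard with that tool.
copy_to_clipboard() {
    case "$(pick_clipboard_tool)" in
        pbcopy)  pbcopy ;;                       # macOS
        xclip)   xclip -selection clipboard ;;   # Linux, X11
        wl-copy) wl-copy ;;                      # Linux, Wayland
        *)       echo "no clipboard tool found" >&2; return 1 ;;
    esac
}
```

The last step of the script would then read `copy_to_clipboard < "$target_file_path"`.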

All done!

  6. Full code:

  • .meijutt

## the data directory path of Thunder; a system environment variable is used in this sample
THUNDER_DATA_DIR=$THUNDER_ROOT_DIR
## the regular expression used when the [grep] command parses the substring
PATTEN_TO_GREP='name="down_url_list_0" class="down_url" value="[^<>]*" file_name='
## the expression used when the [sed] command parses the substring
PATTEN_TO_SED='.*value="\(.*\)".*'
  • meijutt.sh

#!/bin/bash
# derive the url from https://www.meijutt.com
echo " <<<<<<<<<<<<<<<<<<< source from https://www.meijutt.com >>>>>>>>>>>>>>>>>>>>>>>>>>>"
# derive the root directory path
SHELL_FOLDER=$(dirname "$0")
source "$SHELL_FOLDER/.meijutt"
source "$SHELL_FOLDER/source_from_url_common.sh"
  • source_from_url_common.sh

#!/bin/bash
# derive the url from https://www.loldytt.com

# set the source file, then parse the urls into a file
# if test -z "$1" ; then
#   echo "please input the html source file !!"
#   exit 0
# fi
# html_file=$1

# # test if the file exists
# if ! test -f "$html_file" ; then
#   echo "the file [$html_file] does not exist, please input a valid path !!!"
#   exit 0
# fi

# get file data from url
if test -z "$1" ; then
    echo "please input url !!"
    exit 1
fi
url=$1

# get the file then output into a temp file
# root_dir="/Users/jackbai/Downloads"
root_dir=$THUNDER_DATA_DIR

# test if the root directory is valid
if ! test -d "$root_dir" ; then
    echo "the data root directory path is not valid, please set the variable [ THUNDER_DATA_DIR ] !!!!"
    exit 1
fi
# echo $root_dir;exit;
thd_file_name="thunder"
temp_file="$root_dir/$thd_file_name.html"
curl "$url" -o "$temp_file"
# echo "temp file: $temp_file"

# read the gb2312 format data
export LC_CTYPE='C'
export LC_COLLATE='C'

# get the file name and use it to name the target file
target_file_path="$root_dir/$thd_file_name.thdresult"
# touch $root_dir/$thd_file_name.thdresult
echo "" > "$target_file_path"

# use the regex to parse the url
patten1=$PATTEN_TO_GREP
patten2=$PATTEN_TO_SED

# read the file line by line
while read line
do
    ### extract the substring from the source data with the grep pattern
    result=` echo $line | grep -Eo "$patten1" `
    ### test whether the line matched the grep pattern
    if test -n "$result" ; then
        # echo "result: "$result;
        # the result may hold multiple values, so traverse it
        for r_item in $result; do
            #echo "results: "$r_item
            # parse the torrent value from the result
            result_sed=` echo $r_item | sed "s/$patten2/\1/g" `
            # The sed command has two possible outcomes: on a successful match
            # it returns the torrent link value, and on a failed match it
            # returns the source string unchanged.
            # So we need to test whether the match succeeded:
            # if the two strings are not equal, the match was successful.
            if [ "$result_sed" != "$r_item" ] ; then
                #echo $result_sed;
                #exit;
                # write the torrent link value into the file
                echo $result_sed >> "$target_file_path";
            fi
        done
        # exit;
        # write into file
        # echo ': '$line ; echo ':: '$result; echo $result | sed "s/$patten2/\1/g"; exit
        # echo $result >> $target_file_path;
        # exit;
    fi
done < "$temp_file"

# copy the file to the clipboard
cat "$target_file_path" | pbcopy

rm -f "$temp_file"
rm -f "$target_file_path"
echo "parse ok !!! ";
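
The script removes its two intermediate files in its final lines, so they would be left behind if it stopped earlier. A common alternative is `mktemp` combined with a `trap` on EXIT. The sketch below only illustrates that idea under made-up names (`run_with_cleanup`, `work_file`); it is not a change made in the original script.

```shell
#!/bin/bash
# Hypothetical demo: the job runs in a subshell whose temp file is removed
# on exit, even when the job fails part-way through.
run_with_cleanup() (
    work_file=$(mktemp)
    trap 'rm -f "$work_file"' EXIT
    echo "$work_file"                        # report the path to the caller
    echo "intermediate data" > "$work_file"
    exit 1                                   # simulate an early failure
)

leftover=$(run_with_cleanup) || echo "job failed with status $?"
if [ -e "$leftover" ]; then
    echo "temp file still exists"
else
    echo "temp file was cleaned up"
fi
```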

