Shell Script - A Simple Crawler

追风的少年


Published: May 23, 2021

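Why I wrote this crawler: every time I download a TV series, I have to click through many links by hand before Xunlei (Thunder) can download the whole series, which is tedious. Xunlei automatically monitors the clipboard, so once the torrent link addresses are on the clipboard, Xunlei starts up and downloads the files.
Algorithm: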

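a. Use curl to fetch the target URL and write the response to an HTML source file [thunder.html]. This file is deleted when the script finishes.

b. Parse thunder.html, extract all the torrent links, and append them to the result file [thunder.thdresult]. This file is also deleted when the script finishes. The parsing steps are:

Use grep to extract the tag strings that contain a torrent link value.
Use sed to extract the torrent link value from each tag string.

c. Read the result file and copy its contents to the clipboard (my machine runs macOS).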
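The parsing in step b can be tried on a small inline sample instead of a downloaded page. This is only a minimal sketch: the sample HTML, the `thunder://AAA` and `thunder://BBB` links, and the temporary directory are made up for illustration, and the simplified `value="..."` pattern is an assumption, not the pattern from the configuration file.

```shell
#!/bin/bash
# Hypothetical sample standing in for thunder.html (not real page content).
work_dir=$(mktemp -d)
html_file="$work_dir/thunder.html"
result_file="$work_dir/thunder.thdresult"

cat > "$html_file" <<'EOF'
<li><input name="down_url_list_0" class="down_url" value="thunder://AAA" file_name="ep01"></li>
<li><span>no link on this line</span></li>
<li><input name="down_url_list_0" class="down_url" value="thunder://BBB" file_name="ep02"></li>
EOF

# Step 1: grep -o prints only the matched part of each line, one match per line.
# Step 2: sed keeps only the text captured by the \( \) group between the quotes.
grep -o 'value="[^"]*"' "$html_file" \
  | sed 's/value="\(.*\)"/\1/' > "$result_file"

# Keep the result in a variable, then remove the intermediate files.
links=$(cat "$result_file")
rm -rf "$work_dir"

echo "$links"
```

Lines without a `value="..."` attribute produce no grep output, so they never reach sed or the result file.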

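Prerequisites

a. Shell basics, such as if tests, variable assignment, and sourcing external files.

b. Extracting substrings with grep regular expressions.

c. Extracting substrings with sed regular expressions.

d. Reading and writing files.

e. Using the clipboard.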

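Code (I write comments in English to practice the language; if a comment is unclear, please use Google Translate)

a. The code uses a page from Meijutt (美剧天天) as the example;

b. File structure

Configuration file [.meijutt]: stores the settings for a particular use case.
Common shell script [source_from_url_common.sh]: implements all the functionality, including downloading the page to a file, parsing the file, writing the results to a file, reading that file into the clipboard, and deleting the two intermediate files.
Entry shell script [meijutt.sh].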

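Code walkthrough

a. Entry shell script [meijutt.sh]

SHELL_FOLDER=$(dirname "$0") # directory that contains the executing script

source "$SHELL_FOLDER/.meijutt" # load the configuration file

source "$SHELL_FOLDER/source_from_url_common.sh" # load the common shell script, which runs all the steps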
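Because `source` runs a file inside the current shell, the sourced file sees the variables set by earlier lines and, when `source` is given no extra arguments, the caller's positional parameters such as `$1`. The sketch below illustrates this with throwaway files; the names `.demo_config`, `common.sh`, `entry.sh` and the variable `GREETING` are hypothetical and only mirror the layout described above.

```shell
#!/bin/bash
# Hypothetical demo files in a temporary directory.
demo_dir=$(mktemp -d)

# A "config" file that only assigns a variable.
cat > "$demo_dir/.demo_config" <<'EOF'
GREETING="hello"
EOF

# A "common" file that uses that variable and the caller's first argument.
cat > "$demo_dir/common.sh" <<'EOF'
echo "$GREETING $1"
EOF

# The entry script sources both files, in the same way meijutt.sh does.
cat > "$demo_dir/entry.sh" <<'EOF'
SHELL_FOLDER=$(dirname "$0")
source "$SHELL_FOLDER/.demo_config"
source "$SHELL_FOLDER/common.sh"
EOF

# Run the entry script with one argument and capture what it prints.
output=$(bash "$demo_dir/entry.sh" world)
rm -rf "$demo_dir"

echo "$output"   # prints "hello world"
```

This is why the URL given on the command line of the entry script is available as `$1` inside the common script.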

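b. Configuration file [.meijutt]

# Xunlei data directory, used to store the files produced while the script runs

THUNDER_DATA_DIR=$THUNDER_ROOT_DIR

# Regular expression that grep uses to extract the tag string; the value attribute must not contain < or >

PATTEN_TO_GREP='name="down_url_list_0" class="down_url" value="[^<>]*" file_name='

# Regular expression that sed uses to extract the torrent link value

PATTEN_TO_SED='.*value="\(.*\)".*'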
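To see what the two patterns do, they can be run against a single tag. The sketch below uses the sed pattern in the form it has in the full listing, with the `\(...\)` capture group. The sample tag, including the `type="checkbox"` attribute, the `thunder://EXAMPLE` link and the file name, is hypothetical; the real page markup may differ.

```shell
#!/bin/bash
PATTEN_TO_GREP='name="down_url_list_0" class="down_url" value="[^<>]*" file_name='
PATTEN_TO_SED='.*value="\(.*\)".*'

# Hypothetical tag used only for illustration.
sample='<input type="checkbox" name="down_url_list_0" class="down_url" value="thunder://EXAMPLE" file_name="ep01.mp4">'

# grep -E enables extended regular expressions; -o prints only the matched part.
matched=$(printf '%s\n' "$sample" | grep -Eo "$PATTEN_TO_GREP")

# sed replaces the whole string with the text captured between the quotes after value=.
link=$(printf '%s\n' "$matched" | sed "s/$PATTEN_TO_SED/\1/")

echo "grep: $matched"
echo "sed:  $link"
```

The grep pattern narrows the line down to the attributes around the link, and the sed pattern then discards everything except the captured value.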

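c. Common shell script [source_from_url_common.sh]; only excerpts are shown here, the full code is at the end of the article

Fetch the URL content and write it to a file

thd_file_name="thunder"
temp_file="$root_dir/$thd_file_name.html"
curl "$url" -o "$temp_file"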


Parse the HTML file and append the results to the result file

while IFS= read -r line
do
    # extract the matching tag string from the line with the grep pattern
    result=$(printf '%s\n' "$line" | grep -Eo "$patten1")
    # continue only if the line matched the grep pattern
    if test -n "$result"; then
        # $result is deliberately unquoted: the shell splits it on whitespace
        # into attribute tokens (and into several matches, if there are any)
        for r_item in $result
        do
            # extract the torrent link value from the token
            result_sed=$(printf '%s\n' "$r_item" | sed "s/$patten2/\1/g")
            # sed prints the captured link when the token matches and the
            # unchanged token when it does not, so a difference between the
            # two strings means the match succeeded
            if [ "$result_sed" != "$r_item" ]; then
                # append the torrent link value to the result file
                printf '%s\n' "$result_sed" >> "$target_file_path"
            fi
        done
    fi
done < "$temp_file"
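The comparison between the sed output and its input is easier to follow when the loop is run on one grep result. In this sketch the value of `result` is a hypothetical grep match with a made-up link, and the variable `links` is added only to collect the output.

```shell
#!/bin/bash
PATTEN_TO_SED='.*value="\(.*\)".*'

# Hypothetical grep result; the link value is made up.
result='name="down_url_list_0" class="down_url" value="thunder://EXAMPLE" file_name='

links=""
# $result is left unquoted on purpose so the shell splits it on whitespace
# into four tokens: name=..., class=..., value=..., file_name=
for r_item in $result; do
    result_sed=$(printf '%s\n' "$r_item" | sed "s/$PATTEN_TO_SED/\1/")
    if [ "$result_sed" != "$r_item" ]; then
        # only the value="..." token is changed by sed
        echo "match:    $r_item -> $result_sed"
        links="$links$result_sed"
    else
        # sed leaves tokens without value="..." untouched
        echo "no match: $r_item"
    fi
done
```

One consequence of relying on word splitting is that a link containing a space would be cut into several tokens, so this approach assumes links without whitespace.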


Read the file and copy it to the clipboard

# copy the result file to the clipboard
pbcopy < "$target_file_path"
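`pbcopy` ships with macOS only. On other systems a different clipboard tool would be needed; the sketch below picks one if it is installed. The helper name `clipboard_cmd` is hypothetical, and the choice of `xclip` as the Linux fallback is an assumption, not something the original script does.

```shell
#!/bin/bash
# Print the clipboard command available on this machine, or fail if none is found.
clipboard_cmd() {
    if command -v pbcopy >/dev/null 2>&1; then
        echo "pbcopy"                       # macOS
    elif command -v xclip >/dev/null 2>&1; then
        echo "xclip -selection clipboard"   # Linux with X11 (assumption)
    else
        return 1
    fi
}

if cmd=$(clipboard_cmd); then
    # Only report the command here; the real script would pipe the result file into it.
    echo "clipboard command: $cmd"
else
    echo "no clipboard tool found"
fi
```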


Done!

Full code:

.meijutt

## the data directory path of Xunlei (Thunder); this sample uses a system environment variable
THUNDER_DATA_DIR=$THUNDER_ROOT_DIR
## the regular expression the [grep] command uses to extract the tag string
PATTEN_TO_GREP='name="down_url_list_0" class="down_url" value="[^<>]*" file_name='
## the regular expression the [sed] command uses to extract the link value
PATTEN_TO_SED='.*value="\(.*\)".*'


meijutt.sh

#!/bin/bash
# derive the url from https://www.meijutt.com
echo " <<<<<<<<<<<<<<<<<<< source from https://www.meijutt.com >>>>>>>>>>>>>>>>>>>>>>>>>>>"
# derive the directory that contains this script
SHELL_FOLDER=$(dirname "$0")
source "$SHELL_FOLDER/.meijutt"
source "$SHELL_FOLDER/source_from_url_common.sh"


source_from_url_common.sh

#!/bin/bash
# derive the url from https://www.loldytt.com

# set the source file, then parse the urls into a file
# if test -z "$1" ; then
#  echo "please input the html source file !!"
#  exit 0
# fi
# html_file=$1

# test if the file exists
# if ! test -f "$html_file" ; then
#  echo "the file [$html_file] does not exist, please input a valid path !!!"
#  exit 0
# fi

# the url to fetch is the first argument
if test -z "$1" ; then
    echo "please input url !!"
    exit 1
fi
url=$1

# directory that holds the temp files
#root_dir="/Users/jackbai/Downloads"
root_dir=$THUNDER_DATA_DIR

# test if the root directory is valid
if ! test -d "$root_dir" ; then
    echo "the data root directory path is not valid, please set the variable [ THUNDER_DATA_DIR ] !!!!"
    exit 1
fi

# download the page into a temp file
thd_file_name="thunder"
temp_file="$root_dir/$thd_file_name.html"
curl "$url" -o "$temp_file"
# echo "temp file: $temp_file"

# read the gb2312 format data
export LC_CTYPE='C'
export LC_COLLATE='C'

# name the result file after the temp file and start with an empty file
target_file_path="$root_dir/$thd_file_name.thdresult"
: > "$target_file_path"

# regular expressions used to parse the links
patten1=$PATTEN_TO_GREP
patten2=$PATTEN_TO_SED

# read the file line by line
while IFS= read -r line
do
    # extract the matching tag string from the line with the grep pattern
    result=$(printf '%s\n' "$line" | grep -Eo "$patten1")
    # continue only if the line matched the grep pattern
    if test -n "$result"; then
        # echo "result: $result"
        # $result is deliberately unquoted: the shell splits it on whitespace
        # into attribute tokens (and into several matches, if there are any)
        for r_item in $result
        do
            # echo "results: $r_item"
            # extract the torrent link value from the token
            result_sed=$(printf '%s\n' "$r_item" | sed "s/$patten2/\1/g")
            # sed prints the captured link when the token matches and the
            # unchanged token when it does not, so a difference between the
            # two strings means the match succeeded
            if [ "$result_sed" != "$r_item" ]; then
                # append the torrent link value to the result file
                printf '%s\n' "$result_sed" >> "$target_file_path"
            fi
        done
    fi
done < "$temp_file"

# copy the result file to the clipboard
pbcopy < "$target_file_path"

# delete the two intermediate files
rm -f "$temp_file"
rm -f "$target_file_path"
echo "parse ok !!! "

