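Motivation for this crawler: downloading a TV series with Thunder (迅雷) takes many manual clicks per series, which is tedious. Thunder watches the clipboard, however: once the torrent links are copied to the clipboard, Thunder starts automatically and downloads the files.
Algorithm:
a. Use curl to fetch the target page and write it to an HTML source file [thunder.html]; this file is deleted when the script finishes.
b. Parse thunder.html, extract all torrent links, and append them to a result file [thunder.thdresult]; this file is also deleted when the script finishes. The parsing itself (grep first, then sed) is covered in the code walkthrough below.
c. Read the result file and copy its contents to the clipboard (the author's system is macOS).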
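The three steps can be sketched end to end as follows. This is a minimal illustration rather than the script described in this post: the function names fetch_page, parse_links, copy_links and crawl are hypothetical, the grep/sed patterns are simplified to match any value="..." attribute, and a mktemp directory stands in for the Thunder data directory.

```shell
#!/bin/bash
# Hypothetical helper names and simplified patterns (assumptions, not the post's config).

# step a: download the page into an HTML file
fetch_page() {
    curl -s "$1" -o "$2"
}

# step b: pull every value="..." attribute out of the HTML file
parse_links() {
    grep -Eo 'value="[^"<>]*"' "$1" | sed 's/^value="\(.*\)"$/\1/' > "$2"
}

# step c: put the result file on the clipboard (pbcopy is macOS-only)
copy_links() {
    pbcopy < "$1"
}

crawl() {
    local url="$1" work_dir status
    work_dir=$(mktemp -d) || return 1
    fetch_page "$url" "$work_dir/thunder.html" &&
        parse_links "$work_dir/thunder.html" "$work_dir/thunder.thdresult" &&
        copy_links "$work_dir/thunder.thdresult"
    status=$?
    # both intermediate files are removed whether or not a step failed
    rm -rf "$work_dir"
    return "$status"
}

# usage: crawl "<page url>"
```

Keeping each step in its own function makes it possible to try the parsing without a network connection or a macOS clipboard, by swapping out fetch_page and copy_links.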
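Prerequisites
a. Shell basics: if tests, variable assignment, and sourcing external files.
b. Extracting substrings with grep regular expressions.
c. Extracting substrings with sed regular expressions.
d. Reading and writing files.
e. Using the clipboard.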
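Code (the comments are written in English, which I use to practice the language)
a. The code uses a Meijutt (美剧天天, https://www.meijutt.com) page as the example.
b. File layout
Config file [.meijutt]: holds the settings for a particular use case.
Shared shell script [source_from_url_common.sh]: implements all the functionality, that is, downloading the page to a file, parsing the file, writing the result file, reading the result file into the clipboard, and deleting the two intermediate files.
Entry-point shell script [meijutt.sh].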
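Code walkthrough
a. Entry-point shell script [meijutt.sh]
SHELL_FOLDER=$(dirname "$0") # directory that contains the running script
source "$SHELL_FOLDER/.meijutt" # load the config file
source "$SHELL_FOLDER/source_from_url_common.sh" # load the shared script, which does all the work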
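Because source runs a file inside the current shell rather than in a child process, the shared script sees the variables set by the config file. It also sees the caller's positional parameters, which is how a url passed to meijutt.sh arrives as $1 in the shared script. The sketch below imitates this with two throwaway files; demo.conf, demo_common.sh, GREETING, summary and run_demo are hypothetical names, and `.` is used as the portable spelling of source.

```shell
#!/bin/bash
# All names here are hypothetical stand-ins for the real files.
run_demo() {
    local demo_dir
    demo_dir=$(mktemp -d) || return 1
    # stand-in for the config file: it only sets a variable
    printf '%s\n' 'GREETING="config loaded"' > "$demo_dir/demo.conf"
    # stand-in for the shared script: it uses that variable and the caller's $1
    printf '%s\n' 'summary="$GREETING, url=$1"' > "$demo_dir/demo_common.sh"
    . "$demo_dir/demo.conf"
    . "$demo_dir/demo_common.sh"
    rm -rf "$demo_dir"
    printf '%s\n' "$summary"
}

run_demo "https://www.meijutt.com"
# prints: config loaded, url=https://www.meijutt.com
```

Since the shared script reads the url from $1, the entry-point script would presumably be started with the page url as its only argument.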
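b. Config file [.meijutt]
# Thunder data directory, used for the files created while the script runs
THUNDER_DATA_DIR=$THUNDER_ROOT_DIR
# Regular expression grep uses to pick out the tag fragment; the value must not contain < or >
PATTEN_TO_GREP='name="down_url_list_0" class="down_url" value="[^<>]*" file_name='
# Regular expression sed uses to extract the torrent link
PATTEN_TO_SED='.*value="\(.*\)".*'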
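To see what the two patterns do, they can be tried on a single line of HTML. The sample line below is made up for illustration (the real page markup may differ), and unlike the shared script, sed is applied here to the whole grep match rather than word by word.

```shell
#!/bin/bash
PATTEN_TO_GREP='name="down_url_list_0" class="down_url" value="[^<>]*" file_name='
PATTEN_TO_SED='.*value="\(.*\)".*'

# hypothetical sample line, not copied from the real page
sample='<li><input type="checkbox" name="down_url_list_0" class="down_url" value="ed2k://|file|example.mkv|123|ABC|/" file_name="example.mkv" /></li>'

# grep -o keeps only the part of the line that matches the pattern
fragment=$(printf '%s\n' "$sample" | grep -Eo "$PATTEN_TO_GREP")
# sed replaces the whole fragment with the captured value
link=$(printf '%s\n' "$fragment" | sed "s/$PATTEN_TO_SED/\1/")

printf 'fragment: %s\nlink:     %s\n' "$fragment" "$link"
```

The grep step narrows the line down to the attributes of the download checkbox, and the sed step then strips everything except the text between the quotes of value.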
c. Shared shell script [source_from_url_common.sh]; only excerpts are shown here, the full code is at the bottom
thd_file_name="thunder"
temp_file="$root_dir/$thd_file_name.html"
curl "$url" -o "$temp_file"
while read -r line
do
    # extract the matching fragment from the line with the grep pattern
    result=$(echo "$line" | grep -Eo "$patten1")
    # continue only if the line contains the grep pattern
    if test -n "$result" ; then
        # the result may hold several values (the unquoted expansion splits it
        # on whitespace), so iterate over them
        for r_item in $result
        do
            # extract the torrent link from the item
            result_sed=$(echo "$r_item" | sed "s/$patten2/\1/g")
            # sed prints the torrent link when the pattern matches and the
            # unchanged input when it does not, so a result that differs from
            # the input means the match succeeded
            if [ "$result_sed" != "$r_item" ] ; then
                #echo $result_sed; #exit;
                # append the torrent link to the result file
                echo "$result_sed" >> "$target_file_path"
            fi
        done
    fi
done < "$temp_file"
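The comparison against the original string is needed because sed echoes unmatched input unchanged. One alternative, offered here only as a sketch and not as this post's method, is to let sed print a line only when the substitution succeeds (-n together with the p flag) and to feed it each grep match as a whole line, which also avoids the word splitting. extract_links is a hypothetical helper name.

```shell
#!/bin/bash
PATTEN_TO_GREP='name="down_url_list_0" class="down_url" value="[^<>]*" file_name='
PATTEN_TO_SED='.*value="\(.*\)".*'

# extract_links FILE: print one link per line (hypothetical helper)
extract_links() {
    # grep -o emits each match on its own line;
    # sed -n with the p flag prints only lines where the substitution happened
    grep -Eo "$PATTEN_TO_GREP" "$1" | sed -n "s/$PATTEN_TO_SED/\1/p"
}

# usage: extract_links "$temp_file" >> "$target_file_path"
```

This variant would also cope with link values that contain spaces, which the word-by-word loop cannot handle.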
# copy the result file to the clipboard
cat "$target_file_path" | pbcopy
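pbcopy exists only on macOS. If the script had to run elsewhere, the clipboard step could pick whichever tool is installed. The sketch below assumes xclip as the tool on Linux with X11; copy_file_to_clipboard is a hypothetical helper name and is not part of the original script.

```shell
#!/bin/bash
# copy_file_to_clipboard FILE (hypothetical helper)
copy_file_to_clipboard() {
    if command -v pbcopy >/dev/null 2>&1; then
        pbcopy < "$1"                      # macOS
    elif command -v xclip >/dev/null 2>&1; then
        xclip -selection clipboard < "$1"  # Linux with X11 (assumption)
    else
        echo "no clipboard tool found (tried pbcopy, xclip)" >&2
        return 1
    fi
}

# usage: copy_file_to_clipboard "$target_file_path"
```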
All done!
Full code (the config file [.meijutt], then [meijutt.sh], then [source_from_url_common.sh]):
## Thunder data directory; this sample takes it from a system environment variable
THUNDER_DATA_DIR=$THUNDER_ROOT_DIR
## Regular expression the [grep] command uses to extract the substring
PATTEN_TO_GREP='name="down_url_list_0" class="down_url" value="[^<>]*" file_name='
## Regular expression the [sed] command uses to extract the substring
PATTEN_TO_SED='.*value="\(.*\)".*'
#!/bin/bash
#derive the url from https://www.meijutt.com
echo " <<<<<<<<<<<<<<<<<<< source from https://www.meijutt.com >>>>>>>>>>>>>>>>>>>>>>>>>>>"
# derive the root directory path
SHELL_FOLDER=$(dirname "$0")
source "$SHELL_FOLDER/.meijutt"
source "$SHELL_FOLDER/source_from_url_common.sh"
#!/bin/bash
#derive the url from https://www.loldytt.com
# (disabled) alternative input: parse a local html source file instead of a url
# if test -z "$1" ; then
# echo "please input the html source file !!"
# exit 0
# fi
#html_file=$1
# # test whether the file exists
# if ! test -f "$html_file" ; then
# echo "the file [$html_file] does not exist, please input a valid path !!!"
# exit 0
# fi
# the page url is passed as the first argument
if test -z "$1" ; then
    echo "please input url !!"
    exit 1
fi
url=$1
# fetch the page and write it to a temp file
#root_dir="/Users/jackbai/Downloads"
root_dir=$THUNDER_DATA_DIR
# test whether the root directory is valid
if ! test -d "$root_dir" ; then
    echo "the data root directory path is not valid, please set the variable [ THUNDER_DATA_DIR ] !!!!"
    exit 1
fi
# echo $root_dir;exit;
thd_file_name="thunder"
temp_file="$root_dir/$thd_file_name.html"
curl "$url" -o "$temp_file" #exit;
# echo "temp file: $temp_file"
# treat the input as raw bytes so that grep and sed accept gb2312-encoded pages
export LC_CTYPE='C'
export LC_COLLATE='C'
# build the result file path from the base file name
target_file_path="$root_dir/$thd_file_name.thdresult"
# touch $root_dir/$thd_file_name.thdresult
echo "" > "$target_file_path"
# patterns used to parse the links
patten1=$PATTEN_TO_GREP
patten2=$PATTEN_TO_SED
# read the file line by line
while read -r line
do
    # extract the matching fragment from the line with the grep pattern
    result=$(echo "$line" | grep -Eo "$patten1")
    # continue only if the line contains the grep pattern
    if test -n "$result" ; then
        # echo "result: "$result;
        # the result may hold several values (the unquoted expansion splits it
        # on whitespace), so iterate over them
        for r_item in $result
        do
            #echo "results: "$r_item
            # extract the torrent link from the item
            result_sed=$(echo "$r_item" | sed "s/$patten2/\1/g")
            # sed prints the torrent link when the pattern matches and the
            # unchanged input when it does not, so a result that differs from
            # the input means the match succeeded
            if [ "$result_sed" != "$r_item" ] ; then
                #echo $result_sed; #exit;
                # append the torrent link to the result file
                echo "$result_sed" >> "$target_file_path"
            fi
        done
        # exit;
        # write into file
        # echo ': '$line ; echo ':: '$result; echo $result | sed "s/$patten2/\1/g"; exit
        # echo $result >> $target_file_path;
        # exit;
    fi
done < "$temp_file"
# copy the result file to the clipboard
cat "$target_file_path" | pbcopy
# remove the two intermediate files
rm -f "$temp_file"
rm -f "$target_file_path"
echo "parse ok !!! "