
[Data Analysis in Practice] - Text Analysis - U.S. Patent Phrase-1

Author: 浩波的笔记
  • June 15, 2022


Data Background

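The U.S. Patent and Trademark Office (USPTO) offers one of the largest repositories of scientific, technical, and commercial information in the world through its Open Data Portal. A patent is a form of intellectual property granted in exchange for the public disclosure of a new and useful invention. Because patents go through a rigorous examination process before they are granted, and because the history of U.S. innovation spans two centuries and 11 million patents, the U.S. patent archive is a rare combination of data volume, quality, and diversity.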


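"The USPTO serves the American innovation machine by granting patents, registering trademarks, and promoting intellectual property around the globe. From light bulbs to quantum computers, the USPTO has shared more than 200 years of human ingenuity with the world. Combined with the creativity of the data science community, USPTO datasets carry unbounded potential to empower AI and ML models that will benefit the progress of science and society at large."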


Data Description

Dataset source: https://www.kaggle.com/competitions/us-patent-phrase-to-phrase-matching/data


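  • id - a unique identifier for a pair of phrases

  • anchor - the first phrase

  • target - the second phrase

  • context - the CPC classification (version 2021.05), which indicates the subject within which the similarity is to be scored

  • score - the similarity. This is sourced from a combination of one or more manual expert ratings.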




import pandas as pd
from termcolor import colored
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from wordcloud import WordCloud, STOPWORDS
import numpy as np
import bq_helper
from bq_helper import BigQueryHelper

import warnings
warnings.filterwarnings("ignore")


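The task is to train models on a novel semantic similarity dataset in order to extract relevant information by matching key phrases in patent documents. During patent search and examination, determining the semantic similarity between phrases is critical for deciding whether an invention has been described before.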


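For example, if one invention claims a "television set" and a prior publication describes a "TV set", a model would ideally recognize that these are the same thing and help a patent attorney or examiner retrieve the relevant documents.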
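To see why purely lexical matching falls short on this example, here is a minimal sketch of a token-overlap (Jaccard) score. It is not part of the original notebook; the function name `jaccard_similarity` is introduced here only for illustration.

```python
def jaccard_similarity(phrase_a: str, phrase_b: str) -> float:
    """Token-level Jaccard overlap between two phrases (case-insensitive)."""
    tokens_a = set(phrase_a.lower().split())
    tokens_b = set(phrase_b.lower().split())
    if not tokens_a and not tokens_b:
        # Two empty phrases are treated as identical by convention.
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Only "set" is shared, so the overlap is 1 shared token out of 3 distinct tokens.
print(jaccard_similarity("television set", "TV set"))
# Identical phrases overlap completely.
print(jaccard_similarity("television set", "television set"))
```

The two phrases mean the same thing, yet they share only one of three distinct tokens. That gap between lexical overlap and meaning is what a semantic similarity model is expected to close.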


train_df = pd.read_csv("../input/us-patent-phrase-to-phrase-matching/train.csv")
test_df = pd.read_csv("../input/us-patent-phrase-to-phrase-matching/test.csv")

print(f"Number of observations in TRAIN: {colored(train_df.shape, 'yellow')}")
print(f"Number of observations in TEST: {colored(test_df.shape, 'yellow')}")

# Number of observations in TRAIN: (36473, 5)
# Number of observations in TEST: (36, 4)


Number of observations in TRAIN: (36473, 5)
Number of observations in TEST: (36, 4)


# Look at 10 randomly sampled rows of the training set.
train_df.sample(10)


ANCHOR COLUMN

print(f"Number of unique values in ANCHOR column: {colored(train_df.anchor.nunique(), 'yellow')}")
# Number of unique values in ANCHOR column: 733

train_df.anchor.value_counts().head(20)



pattern = 'base'
mask = train_df['target'].str.contains(pattern, case=False, na=False)
# Combine both conditions in one boolean mask so the indexes stay aligned.
train_df[(train_df['anchor'] == 'component composite coating') & mask]



anchor_desc = train_df[train_df.anchor.notnull()].anchor.values
stopwords = set(STOPWORDS)

wordcloud = WordCloud(
    width=800,
    height=800,
    background_color='white',
    min_font_size=10,
    stopwords=stopwords,
).generate(' '.join(anchor_desc))

# Plot the word cloud image
plt.figure(figsize=(8, 8), facecolor=None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()



# Number of whitespace-separated words in each anchor
train_df['anchor_len'] = train_df['anchor'].str.split().str.len()

print(f"Anchors with maximum length of 5: \n{colored(train_df.query('anchor_len == 5')['anchor'].unique(), 'yellow')}")
print(f"\nAnchors with length of 4: \n{colored(train_df.query('anchor_len == 4')['anchor'].unique(), 'green')}")


Anchors with maximum length of 5:
['make of high density polyethylene'
 'produce by recombinant dna technology'
 'reflection type liquid crystal display'
 'rotate on its longitudinal axis']


Anchors with length of 4:
['align with input shaft' 'apply to anode electrode'
 'average power ratio reduction' 'coat with conducting layer'
 'combine with optical elements' 'connect to common conductor'
 'connect to electrode structure' 'consist of oxalic acid'
 'disk type recording medium' 'disperse in plastic material'
 'dissolve in solvent system' 'engage in guide slot'
 'extend from groove bottom' 'fall to low value'
 'high gradient magnetic separators' 'operate internal combustion engine'
 'peripheral nervous system stimulation' 'pulse width modulated control'
 'recover from reaction product' 'reflect by reflection mirror'
 'remain below threshold value' 'send to control node'
 'show in chemical formula' 'transparent liquid crystal display'
 'use as cooling fluid' 'use physically unclonable functions']
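The word-count feature used above can be tried without the competition files. The sketch below builds a small toy frame (the three phrases are taken from this article, the frame itself is hypothetical) and looks up the longest phrases from the data instead of hard-coding the length 5.

```python
import pandas as pd

# Toy stand-in for train_df; only the "anchor" column is needed here.
toy = pd.DataFrame(
    {"anchor": ["television set", "rotate on its longitudinal axis", "use as cooling fluid"]}
)

# Split on whitespace and count the resulting tokens per row.
toy["anchor_len"] = toy["anchor"].str.split().str.len()

# Select the phrases whose word count equals the observed maximum.
max_len = toy["anchor_len"].max()
longest = toy.loc[toy["anchor_len"] == max_len, "anchor"].tolist()

print(toy)
print(longest)
```

Deriving `max_len` from the column keeps the query correct if the data changes, whereas a literal such as `anchor_len == 5` has to be updated by hand.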


train_df.anchor_len.hist(orientation='horizontal', color='#FFCF56')



# Check for digits in the anchor column
pattern = '[0-9]'
mask = train_df['anchor'].str.contains(pattern, na=False)
train_df['num_anchor'] = mask
train_df[mask]['anchor'].value_counts()
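As a small illustration of how the digit check behaves, the sketch below applies the same `str.contains` call to a toy column. The column name `phrase` and its values are made up for the example; the `None` entry shows what `na=False` does.

```python
import pandas as pd

# Toy column with one missing value.
toy = pd.DataFrame({"phrase": ["1 multiplexer", "tv set", None, "h2o product"]})

# The pattern is treated as a regular expression by default;
# na=False maps missing values to False instead of NaN,
# so the result can be used directly as a boolean mask.
has_digit = toy["phrase"].str.contains("[0-9]", na=False)

print(toy[has_digit])
```

Without `na=False`, the missing row would yield a missing value in the mask, and indexing the frame with that mask would fail.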


TARGET COLUMN

print(f"Number of unique values in TARGET column: {colored(train_df.target.nunique(), 'yellow')}")
# Number of unique values in TARGET column: 29340

train_df.target.value_counts().head(20)



target_desc = train_df[train_df.target.notnull()].target.values
stopwords = set(STOPWORDS)

wordcloud = WordCloud(
    width=800,
    height=800,
    background_color='white',
    min_font_size=10,
    stopwords=stopwords,
).generate(' '.join(target_desc))

# Plot the word cloud image
plt.figure(figsize=(8, 8), facecolor=None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()



# Number of whitespace-separated words in each target
train_df['target_len'] = train_df['target'].str.split().str.len()
train_df.target_len.value_counts()



print(f"Targets with maximum length of 11: \n{colored(train_df.query('target_len == 11')['target'].unique(), 'yellow')}")
print(f"\nTargets with length of 10: \n{colored(train_df.query('target_len == 10')['target'].unique(), 'green')}")
print(f"\nTargets with length of 9: \n{colored(train_df.query('target_len == 9')['target'].unique(), 'yellow')}")
print(f"\nTargets with length of 8: \n{colored(train_df.query('target_len == 8')['target'].unique(), 'green')}")


Targets with maximum length of 11:
['n 9 fluorenylmethyloxycarbonyl 3 amino 3 45 dimethoxy 2 nitrophenylpropionic acid']


Targets with length of 10:
['a substance used as a reagent in a rocket engine'
 'heating calcium oxide and aluminium oxide together at high temperatures'
 'a quadric surface that has exactly one axis of symmetry']


Targets with length of 9:
['testing the life of a leakage current protection device'
 'a quadric surface that has no center of symmetry'
 'machine that converts the kinetic energy of a fluid']


Targets with length of 8:
['loading sequence of a breech loading naval gun'
 'loading sequence of a breech loading small arm'
 'gearbox has two clutches but no clutch pedal'
 'partial displacement of a bone from its joint'
 'inflatable curtain module for use in a vehicle'
 'conveyors are used to convey larger sized items'
 'ability of an article to withstand prolonged wear']


# Check for digits in the target column
pattern = '[0-9]'
mask = train_df['target'].str.contains(pattern, na=False)
train_df['num_target'] = mask
train_df[mask]['target'].value_counts()



pattern = '1 multiplexer'
mask = train_df['target'].str.contains(pattern, na=False)
train_df[mask]


CONTEXT COLUMN

Source: https://en.wikipedia.org/wiki/Cooperative_Patent_Classification


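The first letter is the "section symbol": a letter from "A" ("Human Necessities") to "H" ("Electricity"), or "Y" for emerging cross-sectional technologies. It is followed by a two-digit number that gives the "class symbol" (for example, "A01" stands for "Agriculture; forestry; animal husbandry; trapping; fishing").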


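  • A: Human Necessities

  • B: Operations and Transport

  • C: Chemistry and Metallurgy

  • D: Textiles

  • E: Fixed Constructions

  • F: Mechanical Engineering

  • G: Physics

  • H: Electricity

  • Y: Emerging Cross-Sectional Technologies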
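The structure described above (one section letter followed by a two-digit class) can be sketched as a small parser. The helper name `parse_context` and the strict three-character check are assumptions made for this illustration; the section names are the ones listed in this article.

```python
# Section letters and names as listed in this article.
CPC_SECTIONS = {
    "A": "Human Necessities",
    "B": "Operations and Transport",
    "C": "Chemistry and Metallurgy",
    "D": "Textiles",
    "E": "Fixed Constructions",
    "F": "Mechanical Engineering",
    "G": "Physics",
    "H": "Electricity",
    "Y": "Emerging Cross-Sectional Technologies",
}


def parse_context(code: str) -> tuple:
    """Split a context code such as 'A01' into (section, class digits, section name).

    Assumes the code is exactly one section letter plus two digits.
    """
    code = code.strip().upper()
    if len(code) != 3 or code[0] not in CPC_SECTIONS or not code[1:].isdigit():
        raise ValueError(f"Unexpected context code: {code!r}")
    return code[0], code[1:], CPC_SECTIONS[code[0]]


print(parse_context("A01"))
```

Raising an error on anything that does not fit the letter-plus-two-digits shape makes malformed codes visible early, rather than silently producing a wrong section.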


print(f"Number of unique values in CONTEXT column: {colored(train_df.context.nunique(), 'yellow')}")
# Number of unique values in CONTEXT column: 106

train_df.context.value_counts().head(20)



# First character is the section letter, the rest is the class number
train_df['section'] = train_df['context'].astype(str).str[0]
train_df['classes'] = train_df['context'].astype(str).str[1:]
train_df.head(10)



print(f"Number of unique SECTIONS: {colored(train_df.section.nunique(), 'yellow')}")
print(f"Number of unique CLASSES: {colored(train_df.classes.nunique(), 'yellow')}")
# Number of unique SECTIONS: 8
# Number of unique CLASSES: 44

# Map section letters to readable labels for the histogram
di = {
    "A": "A - Human Necessities",
    "B": "B - Operations and Transport",
    "C": "C - Chemistry and Metallurgy",
    "D": "D - Textiles",
    "E": "E - Fixed Constructions",
    "F": "F - Mechanical Engineering",
    "G": "G - Physics",
    "H": "H - Electricity",
    "Y": "Y - Emerging Cross-Sectional Technologies",
}

train_df.replace({"section": di}).section.hist(orientation='horizontal', color='#FFCF56')



train_df.classes.value_counts().head(15)


SCORE COLUMN

train_df.score.hist(color='#FFCF56')
train_df.score.value_counts()



# Pairs rated as fully similar
train_df[['anchor', 'target', 'section', 'classes', 'score']].replace({"section": di}).query('score==1.0')



# Pairs rated as not similar at all
train_df[['anchor', 'target', 'section', 'classes', 'score']].replace({"section": di}).query('score==0.0')


