Nougat: Combining Optical Neural Networks to Lead Intelligent Parsing of Academic PDF Documents and Unlock the Value of Academic Paper PDFs

2023-12-13
Zhejiang

This is the official repository for Nougat, an academic document PDF parser that understands LaTeX math and tables.

Project page: https://facebookresearch.github.io/nougat/

1. Installation

From pip:

pip install nougat-ocr

From repository:

pip install git+https://github.com/facebookresearch/nougat

Note, on Windows: If you want to utilize a GPU, make sure you first install the correct PyTorch version. Follow the installation instructions on the PyTorch website.

If you want to call the model from an API or generate a dataset, there are extra dependencies. Install them via

pip install "nougat-ocr[api]"

or

pip install "nougat-ocr[dataset]"
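
After installing, it can help to confirm what actually landed in the environment. The following is a minimal sketch, not part of Nougat: the helper names `installed_version` and `gpu_available` are hypothetical, and the GPU check assumes PyTorch is the backend, as the Windows note suggests.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def gpu_available():
    """Return True/False if PyTorch is importable, or None if it is not installed."""
    try:
        import torch  # imported lazily so the check also works without PyTorch
    except ImportError:
        return None
    return torch.cuda.is_available()


if __name__ == "__main__":
    print("nougat-ocr version:", installed_version("nougat-ocr"))
    print("CUDA available:", gpu_available())
```

If `gpu_available()` reports `False` on a machine with a GPU, the installed PyTorch build is probably a CPU-only one.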

1.2 Get predictions for a PDF

1.2.1 CLI

To get predictions for a PDF run

$ nougat path/to/file.pdf -o output_directory

A path to a directory, or to a file in which each line is the path to a PDF, can also be passed as the positional argument

$ nougat path/to/directory -o output_directory
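
A list file of the kind described above (one PDF path per line) can be produced with a few lines of Python. This is only a sketch: the function name `write_pdf_list` and the file name `pdfs.txt` are hypothetical choices, not something Nougat prescribes.

```python
from pathlib import Path


def write_pdf_list(pdf_dir, list_file):
    """Write the path of every PDF under pdf_dir to list_file, one per line."""
    pdf_paths = sorted(
        p for p in Path(pdf_dir).rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
    Path(list_file).write_text(
        "".join(f"{p}\n" for p in pdf_paths), encoding="utf-8"
    )
    return pdf_paths


if __name__ == "__main__":
    found = write_pdf_list("path/to/directory", "pdfs.txt")
    print(f"listed {len(found)} PDFs in pdfs.txt")
```

The resulting file could then be passed in place of a single PDF path, for example `nougat pdfs.txt -o output_directory`.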

usage: nougat [-h] [--batchsize BATCHSIZE] [--checkpoint CHECKPOINT] [--model MODEL] [--out OUT]
              [--recompute] [--markdown] [--no-skipping] pdf [pdf ...]

positional arguments:
  pdf                   PDF(s) to process.

options:
  -h, --help            show this help message and exit
  --batchsize BATCHSIZE, -b BATCHSIZE
                        Batch size to use.
  --checkpoint CHECKPOINT, -c CHECKPOINT
                        Path to checkpoint directory.
  --model MODEL_TAG, -m MODEL_TAG
                        Model tag to use.
  --out OUT, -o OUT     Output directory.
  --recompute           Recompute already computed PDF, discarding previous predictions.
  --full-precision      Use float32 instead of bfloat16. Can speed up CPU conversion for some setups.
  --no-markdown         Do not add postprocessing step for markdown compatibility.
  --markdown            Add postprocessing step for markdown compatibility (default).
  --no-skipping         Don't apply failure detection heuristic.
  --pages PAGES, -p PAGES
                        Provide page numbers like '1-4,7' for pages 1 through 4 and page 7. Only works for single PDFs.
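
The `--pages` format is easiest to understand by expanding it. The sketch below illustrates how a spec such as '1-4,7' maps to page numbers; `parse_page_spec` is a hypothetical helper written for this article and is not Nougat's own parser, which may treat edge cases differently.

```python
def parse_page_spec(spec):
    """Expand a spec such as '1-4,7' into a sorted list of 1-based page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            first, last = part.split("-", 1)
            first, last = int(first), int(last)
            if first > last:
                raise ValueError(f"descending range: {part!r}")
            pages.update(range(first, last + 1))  # ranges are inclusive
        else:
            pages.add(int(part))
    return sorted(pages)


if __name__ == "__main__":
    print(parse_page_spec("1-4,7"))  # prints [1, 2, 3, 4, 7]
```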

The default model tag is 0.1.0-small. If you want to use the base model, use 0.1.0-base.

$ nougat path/to/file.pdf -o output_directory -m 0.1.0-base
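
When many documents have to be converted, the command line can be assembled from Python. This is a minimal sketch under the assumption that the `nougat` executable is on the PATH; `build_nougat_command` and `run_nougat` are hypothetical wrappers, and they only use flags listed in the help text above (`-o`, `-m`, `-p`, `--no-skipping`).

```python
import subprocess


def build_nougat_command(pdf, out_dir, model=None, pages=None, no_skipping=False):
    """Assemble the argument list for one nougat CLI call."""
    command = ["nougat", str(pdf), "-o", str(out_dir)]
    if model is not None:
        command += ["-m", model]  # e.g. "0.1.0-base"
    if pages is not None:
        command += ["-p", pages]  # e.g. "1-4,7", single PDFs only
    if no_skipping:
        command.append("--no-skipping")
    return command


def run_nougat(pdf, out_dir, **options):
    """Run the CLI and raise CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_nougat_command(pdf, out_dir, **options), check=True)


if __name__ == "__main__":
    print(build_nougat_command("path/to/file.pdf", "output_directory", model="0.1.0-base"))
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.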

In the output directory, every PDF is saved as a .mmd file, a lightweight markup language that is mostly compatible with Mathpix Markdown (we make use of the LaTeX tables).

Note: On some devices the failure detection heuristic is not working properly. If you experience a lot of [MISSING_PAGE] responses, try to run with the --no-skipping flag. Related: #11, #67
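
To decide whether a rerun with `--no-skipping` is worth trying, the output directory can be scanned for skipped pages. The sketch below is not part of Nougat, and it rests on an assumption: that skipped pages appear in the .mmd text as markers beginning with `[MISSING_PAGE`. Adjust the marker if your output looks different.

```python
from pathlib import Path

# Assumption: every skipped page leaves a marker that starts with this prefix.
MISSING_MARKER = "[MISSING_PAGE"


def count_missing_pages(mmd_text):
    """Count the missing-page markers in the text of one .mmd file."""
    return mmd_text.count(MISSING_MARKER)


def files_with_missing_pages(out_dir):
    """Map each .mmd file under out_dir that has missing pages to its count."""
    report = {}
    for mmd_path in sorted(Path(out_dir).rglob("*.mmd")):
        missing = count_missing_pages(mmd_path.read_text(encoding="utf-8"))
        if missing:
            report[mmd_path] = missing
    return report


if __name__ == "__main__":
    for path, missing in files_with_missing_pages("output_directory").items():
        print(f"{path}: {missing} missing page(s)")
```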

1.2.2 API

With the extra dependencies installed, you can use app.py to start an API. Call

$ nougat_api

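Send a POST request to http://127.0.0.1:8503/predict/ to get a prediction for a PDF file. The endpoint also accepts the parameters start and stop, which limit the computation to the selected page numbers (boundaries included).

The response is a string containing the markup text of the document.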

curl -X 'POST' \
  'http://127.0.0.1:8503/predict/' \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -F 'file=@<PDFFILE.pdf>;type=application/pdf'

To limit the conversion to pages 1 to 5, use the start/stop parameters in the request URL: http://127.0.0.1:8503/predict/?start=1&stop=5
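
The same request can be sent from Python with only the standard library. This is a sketch under stated assumptions rather than an official client: the helper names are hypothetical, the form field name `file` and the headers are taken from the curl call above, and the body is returned as raw text because the exact response encoding is not specified here (if it arrives as a JSON-encoded string, decode it with `json.loads`).

```python
import os
import urllib.request
import uuid
from urllib.parse import urlencode

BASE_URL = "http://127.0.0.1:8503/predict/"


def build_predict_url(start=None, stop=None, base_url=BASE_URL):
    """Append the optional start/stop page limits as query parameters."""
    params = {}
    if start is not None:
        params["start"] = start
    if stop is not None:
        params["stop"] = stop
    return base_url + ("?" + urlencode(params) if params else "")


def encode_pdf_upload(filename, pdf_bytes, boundary):
    """Build a multipart/form-data body with a single 'file' field."""
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + pdf_bytes + tail


def predict(pdf_path, start=None, stop=None):
    """POST one PDF to a running nougat_api server and return the response text."""
    boundary = uuid.uuid4().hex
    with open(pdf_path, "rb") as fh:
        body = encode_pdf_upload(os.path.basename(pdf_path), fh.read(), boundary)
    request = urllib.request.Request(
        build_predict_url(start, stop),
        data=body,
        headers={
            "accept": "application/json",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Calling `predict("paper.pdf", start=1, stop=5)` would convert pages 1 to 5 of a hypothetical `paper.pdf`, provided the server started by `nougat_api` is running.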

2. Dataset

2.1 Generate a dataset

To generate a dataset you need

A directory containing the PDFs
A directory containing the .html files (processed .tex files by LaTeXML) with the same folder structure
A binary file of pdffigures2 and a corresponding environment variable export PDFFIGURES_PATH="/path/to/binary.jar"
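
The three prerequisites above can be checked before starting a long run. This is a minimal sketch, not part of Nougat: `check_dataset_inputs` is a hypothetical helper, and it assumes that "the same folder structure" means each PDF has an .html file at the same relative path with the same file stem.

```python
import os
from pathlib import Path


def check_dataset_inputs(pdf_root, html_root, environ=os.environ):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    pdf_root, html_root = Path(pdf_root), Path(html_root)
    for label, root in (("PDF", pdf_root), ("HTML", html_root)):
        if not root.is_dir():
            problems.append(f"{label} directory not found: {root}")
    jar = environ.get("PDFFIGURES_PATH")
    if not jar:
        problems.append("PDFFIGURES_PATH is not set")
    elif not Path(jar).is_file():
        problems.append(f"PDFFIGURES_PATH does not point to a file: {jar}")
    if pdf_root.is_dir() and html_root.is_dir():
        for pdf in sorted(pdf_root.rglob("*.pdf")):
            # Assumption: each PDF pairs with an .html file at the same relative path.
            html = html_root / pdf.relative_to(pdf_root).with_suffix(".html")
            if not html.is_file():
                problems.append(f"no matching HTML for {pdf}")
    return problems


if __name__ == "__main__":
    for problem in check_dataset_inputs("path/pdf/root", "path/html/root"):
        print("problem:", problem)
```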

Next run

python -m nougat.dataset.split_htmls_to_pages --html path/html/root --pdfs path/pdf/root --out path/paired/output --figure path/pdffigures/outputs

Additional arguments include

Finally create a jsonl file that contains all the image paths, markdown text and meta information.

python -m nougat.dataset.create_index --dir path/paired/output --out index.jsonl

For each jsonl file you also need to generate a seek map for faster data loading:

python -m nougat.dataset.gen_seek file.jsonl
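
To see why a seek map speeds up loading, consider the general idea: store the byte offset at which every line of the jsonl file starts, so that any record can be read with a single seek instead of a scan. The sketch below illustrates that idea only. It is not Nougat's `gen_seek` implementation and makes no claim about the format of the `.seek.map` file; both function names are hypothetical.

```python
import json


def build_seek_map(jsonl_path):
    """Return the byte offset at which each line of the file starts."""
    offsets = []
    with open(jsonl_path, "rb") as fh:
        while True:
            position = fh.tell()
            if not fh.readline():
                break  # end of file reached
            offsets.append(position)
    return offsets


def read_record(jsonl_path, offsets, index):
    """Jump straight to record `index` and parse it, without scanning earlier lines."""
    with open(jsonl_path, "rb") as fh:
        fh.seek(offsets[index])
        return json.loads(fh.readline())
```

With the offsets in memory, random access to record `i` costs one seek and one line read, which matters when a data loader samples records in shuffled order.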

The resulting directory structure can look as follows:

root/
├── images
├── train.jsonl
├── train.seek.map
├── test.jsonl
├── test.seek.map
├── validation.jsonl
└── validation.seek.map

Note that the .mmd and .json files in path/paired/output (here images) are no longer required. Removing them halves the number of files, which can be useful when pushing to an S3 bucket.

2.2 Training

To train or fine-tune a Nougat model, run

python train.py --config config/train_nougat.yaml

2.3 Evaluation

Run

python test.py --checkpoint path/to/checkpoint --dataset path/to/test.jsonl --save_path path/to/results.json

To get the results for the different text modalities, run

python -m nougat.metrics path/to/results.json

2.4 FAQ

Why am I only getting [MISSING_PAGE]?
Nougat was trained on scientific papers found on arXiv and PMC. Is the document you're processing similar to that? What language is the document in? Nougat works best with English papers; other Latin-based languages might work. Chinese, Russian, Japanese etc. will not work. If these requirements are fulfilled, it might be because of false positives in the failure detection when computing on CPU or older GPUs (#11). Try passing the --no-skipping flag for now.
Where can I download the model checkpoint from?
They are uploaded here on GitHub in the release section. You can also download them during the first execution of the program. Choose the preferred model by passing --model 0.1.0-{base,small}

Reference: https://github.com/facebookresearch/nougat


Original link: http://xie.infoq.cn/article/0adcda172ae4b5dd970469ddf

汀丶人工智能
