原文 http://weikeqin.com/2017/04/11/neo4j-load-csv/
Neo4j load csv 使用
牛刀小试
下面的例子是 把 http://data.neo4j.com/examples/person.csv 的csv文件导入到neo4j数据库里,指定的csv文件分隔符是,
, 只导入第一个字段不为null的数据,根据id为唯一主键导入,导入时设置name为第二个字段,每10000条提交一次。
USING PERIODIC COMMIT 10000
LOAD CSV FROM "http://data.neo4j.com/examples/person.csv" AS line
fieldterminator ','
WHERE line[0] IS NOT NULL
MERGE (n:Person {id: toInt(line[0])})
SET n.name = line[1]
RETURN n
1、 LOAD CSV FROM "http://data.neo4j.com/examples/person.csv"
是读取文件,文件路径可以是url、可以是本地相对路径、本地绝对路径
2、 fieldterminator
是设置分隔符
3、 MERGE
是插入节点 (不存在则插入,存在则更新)
<br>
导入步骤
1、处理数据,确保数据是UTF8无BOM编码,确保数据不存在字符转义问题,确保不存在脏数据
2、检查数据一共有多少行。 如果行数和实际的有差别,检查数据或者cypher语句
3、抽样检查数据。 返回前10条。看格式是否正确,字段是否对应。是否有乱码。
4、导入数据。
如果特别熟悉了,可以直接进行第4步导入数据,如果不太确定,建议还是1-4步都执行一遍。
<br>
<!--more-->
(1) 导入前注意事项
1、 csv一定要用 UTF-8无BOM编码,否则入库时可能乱码
2、 默认使用,
作为分隔符,如果想自定义,可以通过 fieldterminator
设置
3、 导入数据前,一定要校验数据,都是坑呀,总结出的经验
4、 建议使用绝对路径,使用绝对路径时需要把 neo4j.conf
配置文件里的dbms.directories.import=import
注释掉(在前面加个#号)
<br>
(2) 检查csv一共多少行
注意文件路径
(2.1) windows下相对路径方式
Test.csv 放在 ${neo4j_home}/import/Test.csv
LOAD CSV FROM "file:/Test.csv"
AS line
RETURN COUNT(*);
(2.2) windows下绝对路径方式
文件放在 C:/User/wdb/2017-04-06_test.csv
LOAD CSV FROM "file:///C:/User/wdb/2017-04-06_test.csv"
AS line
RETURN COUNT(*);
(2.3) linux下相对路径格式
Test.csv 放在 ${neo4j_home}/import/Test.csv
LOAD CSV FROM "file:/2017-04-06_test.csv"
AS line
RETURN COUNT(*);
(2.4) linux下绝对路径格式
LOAD CSV FROM "file:/home/wkq/databases/data/2017-04-06_test.csv"
AS line
RETURN COUNT(*);
<br>
(3) 校验数据
注意文件路径
文件使用UTF-8编码,否则可能导致乱码
(3.1) 检查数据 不带标题
LOAD CSV FROM "file:///C:/User/wdb/Test.csv"
AS line
WITH line
RETURN line
LIMIT 5
(3.2) 检查数据 带标题
LOAD CSV WITH HEADERS FROM "file:///C:/User/wdb/Test.csv"
AS line
WITH line
RETURN line
LIMIT 5
<br>
(4) 真正导入数据
<!--
// 第一次字段用汉字,发现很慢
load csv with headers from "file:/Test.csv" as line with line create (:Person {姓名:line.姓名, 序号:toInt(line.序号), 联系方式:toInt(line.联系方式), 性别:line.性别, 班级:line.班级}) ;
// 第二次 优化后 没有用批量提交,也有点慢
load csv with headers from "file:/Test.csv" as line with line create (:Person {name:line.姓名, number:toInt(line.序号), tel:toInt(line.联系方式), sex:line.性别, class:line.班级}) ;
-->
(4.1) 导入数据 csv文件带文件头
注意WITH HEADERS的写法和不用HEADERS的写法,一个用的是 line.name 一个用的是 line[0]
导入时可以使用函数 toInt('1') toFloat('1.0')
toInteger(), toFloat(), split()
LOAD CSV WITH HEADERS FROM "file:///C:/User/wdb/Test.csv"
AS line
WITH line
CREATE (:Person {name:line.name, number:toInt(line.number), tel:toInt(line.tel), sex:line.sex, class:line.classmate}) ;
(4.2) 导入数据 csv文件不带文件头
LOAD CSV FROM "file:///C:/User/wdb/Test.csv"
AS line
WITH line
CREATE (:Person {name:line[0], number:toInt(line[1]), tel:toInt(line[2]), sex:line[3], class:line[4]}) ;
<br>
(5) 注事事项
(5.1) 批量提交
数据量大了以后可以使用批量提交
using periodic commit 10000
表示 每10000行进行一次事务提交
USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM 'file:///data.csv' AS row
MERGE (pet:Pet {petId: row.PetId})
MERGE (owner:Owner {ownerId: row.OwnerId})
ON CREATE SET owner.name = row.OwnerName
MERGE (pet)-[r:OWNED_BY]->(owner)
(5.2) 处理空值
处理空值
//skip null values
LOAD CSV WITH HEADERS FROM 'file:///data.csv' AS row
WITH row WHERE row.Company IS NOT NULL
MERGE (c:Company {companyId: row.Id})
//set default for null values
LOAD CSV WITH HEADERS FROM 'file:///data.csv' AS row
MERGE (c:Company {companyId: row.Id, hqLocation: coalesce(row.Location, "Unknown")})
//change empty strings to null values (not stored)
LOAD CSV WITH HEADERS FROM 'file:///data.csv' AS row
MERGE (c:Company {companyId: row.Id})
SET c.emailAddress = CASE trim(row.Email) WHEN "" THEN null ELSE row.Email END
(5.3) 使用split()
//split string of employee skills into separate nodes
LOAD CSV FROM 'file:///data.csv' AS row
MERGE (e:Employee {employeeId: row.Id})
UNWIND split(row.skills, ',') AS skill
MERGE (s:Skill {name: skill})
MERGE (e)-[r:HAS_EXPERIENCE]->(s);
(5.4) 使用CASE
//set businessType property based on shortened value in CSV
LOAD CSV WITH HEADERS FROM 'file:///data.csv' AS row
WITH row,
(CASE row.BusinessType
WHEN 'P' THEN 'Public'
WHEN 'R' THEN 'Private'
WHEN 'G' THEN 'Government'
ELSE 'Other' END) AS type
MERGE (c:Company {companyId: row.CompanyId})
SET c.businessType = type
RETURN *
(5.5) MERGE relation
为获得最佳性能,请始终在具有索引主键属性的单个标签上使用MATCH和MERGE。
您还应该将节点和关系创建分离为单独的语句。
以下两种写法都没问题,第一种相对第二种更占内存,第二种在内存有限的情况下性能会好一点
MERGE (e:Employee {employeeId: row.employeeId})
MERGE (c:Company {companyId: row.companyId})
MERGE (e)-[r:WORKS_FOR]->(c)
官方推荐在具体的Label的属性上建立索引使用MATCH和MERGE,并且把创建节点和创建关系分开写
LOAD CSV WITH HEADERS FROM 'file:///data.csv' AS row
MERGE (e:Employee {employeeId: row.employeeId})
RETURN count(e);
LOAD CSV WITH HEADERS FROM 'file:///data.csv' AS row
MERGE (c:Company {companyId: row.companyId})
RETURN count(c);
LOAD CSV WITH HEADERS FROM 'file:///data.csv' AS row
MATCH (e:Employee {employeeId: row.employeeId})
MATCH (c:Company {companyId: row.companyId})
MERGE (e)-[r:WORKS_FOR]->(c)
RETURN count(*);
评论