use LWP::Simple; #用来得到网页 use HTML::HeadParser; #对部分进行解析,得到标题、编码、关键词等 use HTML::FormatText::WithLinks; #将html变成文本可读样式 use Encode qw/encode decode/; #解决不同网页的编码问题
还试用了把latin1的表修改字符,然后试图用convert转码 update cdb_members set username=CONVERT(username USING utf8),也没成功。
后来当我一筹莫展准备缴械的时候,想起了另外两个功能 SELECT INTO OUTFILE 和 LOAD DATA INFILE 是另外一种导入导出的方法。
贴出使用的sql:
1 2 3 4 5 6 7 8 9 10 11 12
SELECT uid,nickname,site,alipay, icq,qq,yahoo,msn,taobao, location,customstatus, medals,avatar,avatarwidth, avatarheight,bio,signature, sightml,ignorepm,groupterms, authstr INTO OUTFILE 'd:/cdb_memberfields.txt' FIELDS TERMINATED BY "," OPTIONALLY ENCLOSED BY """" LINES TERMINATED BY "\r\n" FROM gench2.cdb_memberfields;
LOAD DATA INFILE 'd:/cdb_memberfields.txt' INTO TABLE cdb_memberfields FIELDS TERMINATED BY "," OPTIONALLY ENCLOSED BY """" LINES TERMINATED BY "\r\n";
set global KEY_BUFFER_SIZE=256217728; alter table tablename disable keys;
如何load数据里面带反斜杠(backslash)”' 的数据
由于如果你没有指定FIELDS子句,则默认值为假设您写下如下语句时的值: FIELDS TERMINATED BY '\t' ENCLOSED BY " ESCAPED BY '\\' 所以,如果你的数据里面有反斜杠(backslash)”\”的时候,数据会出现被截断的问题。出现这种问题,只要写上如下的fields子句即可 FIELDS TERMINATED BY '\t' ENCLOSED BY " ESCAPED BY "
如何load不同编码的数据
原来用的4.X的mysql,我是select INTO OUTFILE ,只后用iconv,或者其他软件来做。可以参考这里,但是由于这次数据大,用ultraedit等软件打开都要半天。好在新版的mysql可以增加一个新的参数 CHARACTER SET gbk 我的文本数据是GBK的编码,数据表是utf8的,用这种方法测试成功。
如何load的时候只insert特定的列
比如表里面有比元数据多的列。可以在load的时候指定要插入的字段名字。
示例的代码如下:
1 2 3 4 5 6
LOAD DATA INFILE '~/data.txt' INTO TABLE fb0505 CHARACTER SET gbk FIELDS TERMINATED BY '\t' ENCLOSED BY " ESCAPED BY " LINES TERMINATED BY '\n' STARTING BY " (seller_id,fb_type,fb_date,item_url);