2016-08-31 Using Nutch & Solr

Versions

nutch: apache-nutch-1.12
solr: solr-6.2.0

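I searched Baidu and Google at length, but everything I found covered older versions, and some of the configuration files and directories they mention no longer match. Nutch and Solr differ considerably from one version to the next.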

Troubleshooting

Problem 1:

...
Indexing 1/1 documents
Deleting 0 documents
Indexing 1/1 documents
Deleting 0 documents
Indexer: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:228)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:237)

Running nutch solrindex fails with "Job failed". That message alone says very little; the detailed log is in apache-nutch-1.12/logs/hadoop.log
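One quick way to pull the relevant stack trace out of that log is sketched below. The helper name show_log_errors and the default of 20 context lines are illustrative choices, not part of Nutch, and the default path assumes the command is run from the directory that contains apache-nutch-1.12.

```shell
#!/bin/sh
# Print every line of a Nutch hadoop.log that mentions an exception,
# followed by a few lines of context (usually the stack trace).
# show_log_errors is a hypothetical helper name, not a Nutch command.
show_log_errors() {
    log_file="${1:-apache-nutch-1.12/logs/hadoop.log}"
    context="${2:-20}"
    if [ ! -f "$log_file" ]; then
        echo "log file not found: $log_file" >&2
        return 1
    fi
    # -n prefixes line numbers, -A prints the lines after each match.
    grep -n -A "$context" 'Exception' "$log_file"
}

# Example:
#   show_log_errors apache-nutch-1.12/logs/hadoop.log 30
```

The last exception in the output is usually the one raised by the most recent run.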

Error 1:

java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr: Expected mime type application/octet-stream but got text/html. <html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<title>Error 404 Not Found</title>
</head>
<body><h2>HTTP ERROR 404</h2>
<p>Problem accessing /solr/update. Reason:
<pre>    Not Found</pre></p><hr><i><small>Powered by Jetty://</small></i><hr/>

</body>
</html>

        at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr: Expected mime type application/octet-stream but got text/html. <html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<title>Error 404 Not Found</title>
</head>
<body><h2>HTTP ERROR 404</h2>
<p>Problem accessing /solr/update. Reason:
<pre>    Not Found</pre></p><hr><i><small>Powered by Jetty://</small></i><hr/>

</body>

Cause 1: no core (standalone mode) or collection (cloud mode) has been configured, so the HTTP request fails.
Cause 2: the arguments passed to nutch solrindex are wrong.
Incorrect:

nutch solrindex http://localhost:8983/solr crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/20160829173924 -filter -normalize

Correct:

nutch solrindex http://localhost:8983/solr/new_core crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/20160829173924 -filter -normalize

Although the core's page in the Solr admin UI is at http://localhost:8983/solr/#/new_core, the URL to use here is http://localhost:8983/solr/new_core.
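The two addresses differ only in the "#/" that the admin UI puts before the core name. As a small sketch (the helper name admin_url_to_core_url is made up for illustration), the indexing URL can be derived from the address shown in the browser:

```shell
#!/bin/sh
# Turn a Solr admin UI address into the core URL that nutch solrindex expects.
# admin_url_to_core_url is a hypothetical helper name.
admin_url_to_core_url() {
    # Drop the "#/" before the core name and anything after the core name
    # (for example a trailing "/query" from an admin sub-page).
    printf '%s\n' "$1" | sed 's|/#/\([^/]*\).*|/\1|'
}

# Example, reusing the command from above:
#   SOLR_URL=$(admin_url_to_core_url 'http://localhost:8983/solr/#/new_core')
#   nutch solrindex "$SOLR_URL" crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/20160829173924 -filter -normalize
```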

The new_core that follows solr in the URL is the newly configured core, which is set up here:
11:27:54.jpg
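A core can also be created from the command line instead of the admin page. The sketch below wraps Solr's bin/solr create command; the SOLR_DIR variable and the create_core function name are assumptions for illustration, and the path assumes the Solr distribution was unpacked into solr-6.2.0 in the current directory.

```shell
#!/bin/sh
# Assumed location of the unpacked Solr distribution; adjust as needed.
SOLR_DIR="${SOLR_DIR:-solr-6.2.0}"

# Create a core (standalone mode) or collection (cloud mode) by name.
# create_core is a hypothetical wrapper around "bin/solr create -c <name>".
create_core() {
    core_name="${1:-new_core}"
    "$SOLR_DIR/bin/solr" create -c "$core_name"
}

# Example (Solr must already be running):
#   create_core new_core
```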

Error 2:

The cause is that a required field is not configured in schema.xml (managed-schema in this setup).

...
java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/new_core: ERROR: [doc=http://nutch.apache.org/] unknown field 'content'
    at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/new_core: ERROR: [doc=http://nutch.apache.org/] unknown field 'content'
    at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:575)
    at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:241)
    at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:230)
    at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1220)
    at org.apache.nutch.indexwriter.solr.SolrIndexWriter.push(SolrIndexWriter.java:209)
    ...

Add the field named in the error message as shown here. The corresponding file is solr-6.2.0/server/solr/new_core/conf/managed-schema

11:26:48.jpg
[Screenshot: configuration added to managed-schema]
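As an alternative to editing the file by hand (which is what was done above), a field can also be added through Solr's Schema API. This is only a sketch: the helper names add_field_payload and add_field are made up, and the default field type text_general is an assumption, so use a type that is actually defined in your core's managed-schema.

```shell
#!/bin/sh
# Build the JSON body for a Schema API "add-field" request.
# The default type text_general is an assumption, not taken from the
# original setup; pass a second argument to override it.
add_field_payload() {
    field_name="$1"
    field_type="${2:-text_general}"
    printf '{"add-field":{"name":"%s","type":"%s","indexed":true,"stored":true}}\n' \
        "$field_name" "$field_type"
}

# POST the payload to the core's schema endpoint.
# Usage: add_field <core-url> <field-name> [field-type]
add_field() {
    core_url="$1"
    shift
    curl -s -X POST -H 'Content-type:application/json' \
        --data-binary "$(add_field_payload "$@")" \
        "$core_url/schema"
}

# Example for the field named in the error above (Solr must be running):
#   add_field http://localhost:8983/solr/new_core content
```

If a later run reports another unknown field, the same call can be repeated with that field's name.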

