Apache Nutch with Solr on Debian 9 or Ubuntu 17.04

About one year ago I wrote an article for Linux Magazin about a personal search engine:

* Suchmaschinenserver im Eigenbau (Ausgabe 02/2016)
* Go Find It (Issue #186 / May 2016)

In that article I used Ubuntu 14.04 and Nutch 1.9. Now Nutch 1.13 is available, so it is time for a short update.

First the bad news: Ubuntu 17.04 and Debian 9 both ship version 3.6.2 of Solr, which is not compatible with Nutch 1.13.
If you want to use the latest version of Nutch, you have to install Solr by hand.

With Solr 3.6.2 and Nutch 1.13 you get an error like this:

org.apache.solr.common.SolrException.log org.apache.solr.common.SolrException: Invalid UTF-8 middle byte 0xe0 (at char #1, byte #-1)

If you came here because you googled this error message, the short answer is: your “indexer-solr” plugin is not compatible with your Solr version. You can try to use the “indexer-solr” plugin from Nutch 1.12 with Nutch 1.13, but at the moment I don’t know whether this solution causes any other problems.
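
To confirm that a failed indexing run really is this incompatibility, you can search the relevant log for the message above. The following is only a sketch: the function name has_utf8_indexer_error is made up for this example, and the log path in the usage comment is an assumption, so pass whichever Tomcat or Nutch log your system writes.

```shell
# Hypothetical helper: succeeds if the given log file contains the
# "Invalid UTF-8 middle byte" SolrException quoted above.
# Returns 2 if the log file is missing or unreadable.
has_utf8_indexer_error() {
    log_file=$1
    [ -r "$log_file" ] || return 2
    grep -q 'SolrException: Invalid UTF-8 middle byte' "$log_file"
}

# Usage (the path is an assumption, adjust it to your system):
# if has_utf8_indexer_error /var/log/tomcat8/catalina.out; then
#     echo "indexer-solr plugin and Solr version do not match"
# fi
```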

The best way for me at the moment is the following quick setup, which combines the packaged Solr 3.6.2 with Nutch 1.12:

Install Solr

apt-get install solr-tomcat

Download and install Nutch

wget http://archive.apache.org/dist/nutch/1.12/apache-nutch-1.12-bin.tar.gz
tar xvf apache-nutch-1.12-bin.tar.gz
mv apache-nutch-1.12 /opt/
ln -s /opt/apache-nutch-1.12 /opt/nutch
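
If you expect to switch Nutch versions later, the unpack-and-link steps can be wrapped in a small function. This is a sketch under assumptions: install_nutch and its arguments are hypothetical, and it assumes the archive contains a single top-level directory such as apache-nutch-1.12.

```shell
# Hypothetical helper: unpack a Nutch binary tarball below a prefix and
# point <prefix>/nutch at the unpacked version.
install_nutch() {
    tarball=$1                  # e.g. apache-nutch-1.12-bin.tar.gz
    prefix=${2:-/opt}           # install prefix, /opt as in the steps above
    # Name of the top-level directory inside the archive
    top_dir=$(tar tzf "$tarball" | head -n 1 | cut -d/ -f1) || return 1
    [ -n "$top_dir" ] || return 1
    tar xzf "$tarball" -C "$prefix" || return 1
    # -n replaces an existing symlink instead of creating a link inside it
    ln -sfn "$prefix/$top_dir" "$prefix/nutch"
}

# Usage: install_nutch apache-nutch-1.12-bin.tar.gz /opt
```

Because of ln -sfn the function can be run again with a different tarball, and /opt/nutch then points at the newly unpacked version.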

Configure Solr for Nutch

mv /etc/solr/conf/schema.xml /etc/solr/conf/schema.xml.orig
cp /opt/nutch/conf/schema.xml /etc/solr/conf/schema.xml

/etc/init.d/tomcat8 restart

Note: The “content” field now has stored="true" by default, so it doesn’t need to be changed.
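
Running the two schema commands a second time would overwrite the saved original with the Nutch schema. A guarded variant could look like the sketch below. The function name use_nutch_schema is hypothetical, and the default paths are simply the ones used above.

```shell
# Hypothetical helper: replace Solr's schema.xml with the one shipped by
# Nutch, keeping the first original as schema.xml.orig.
use_nutch_schema() {
    nutch_conf=${1:-/opt/nutch/conf}
    solr_conf=${2:-/etc/solr/conf}
    [ -f "$nutch_conf/schema.xml" ] || return 1
    # Back up only once, so a second run does not overwrite the real original
    if [ -f "$solr_conf/schema.xml" ] && [ ! -e "$solr_conf/schema.xml.orig" ]; then
        mv "$solr_conf/schema.xml" "$solr_conf/schema.xml.orig" || return 1
    fi
    cp "$nutch_conf/schema.xml" "$solr_conf/schema.xml"
}

# Usage: use_nutch_schema && /etc/init.d/tomcat8 restart
```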

Configure Nutch

Edit or create “/opt/nutch/conf/nutch-site.xml” with the following content:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

  <property>
    <name>http.agent.name</name>
    <value>My Search Agent</value>
  </property>

  <property>
    <name>file.content.ignored</name>
    <value>false</value>
  </property>

  <property>
    <name>db.update.purge.404</name>
    <value>true</value>
  </property>

  <property>
    <name>indexer.max.title.length</name>
    <value>150</value>
  </property>

  <property>
    <name>fetcher.server.delay</name>
    <value>0.0</value>
  </property>

  <property>
    <name>solr.server.url</name>
    <value>http://localhost:8080/solr/</value>
  </property>

  <property>
    <name>indexingfilter.order</name>
    <value>indexer-solr</value>
  </property>

</configuration>
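
To double-check what ended up in the file, for example the Solr URL, a quick lookup with awk is enough. This is only a sketch: nutch_prop is a hypothetical helper, and it assumes that each name and value sits on its own line, as in the file above. It is not a real XML parser.

```shell
# Hypothetical helper: print the <value> that follows a given <name> in a
# Hadoop-style configuration file such as nutch-site.xml.
# Assumes <name> and <value> are each on their own line.
nutch_prop() {
    awk -v key="$1" '
        index($0, "<name>" key "</name>") { found = 1; next }
        found && /<value>/ {
            sub(/^.*<value>/, "")
            sub(/<\/value>.*$/, "")
            print
            exit
        }
    ' "$2"
}

# Usage: nutch_prop solr.server.url /opt/nutch/conf/nutch-site.xml
```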

Example Setup

Now I create a simple example setup for the crawler. The following setup allows the crawler to browse only my blog and not to follow any “external” links.

Edit “/opt/nutch/conf/regex-urlfilter.txt” and add the following line:

+^(http|https)://www\.mogilowski\.net

At the end of the file I changed “+.” to “-.” to deny all other URLs.
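
The rule file is read from top to bottom: the first matching pattern decides whether a URL is accepted (+) or rejected (-), and a URL that matches nothing is rejected. The sketch below imitates that logic with grep so a few URLs can be tried before starting a crawl. It is an approximation: url_allowed is a hypothetical helper, and grep -E uses POSIX regular expressions, so Java-only syntax in the stock rule file (for example “(?i)”) is not handled.

```shell
# Hypothetical helper: report whether a URL would pass a rule file in
# regex-urlfilter.txt style. The first matching rule decides; a URL
# that matches no rule is rejected.
url_allowed() {
    url=$1
    rules=$2
    while IFS= read -r rule; do
        case $rule in
            ''|'#'*) continue ;;             # skip blank lines and comments
        esac
        sign=${rule%"${rule#?}"}             # first character: + or -
        pattern=${rule#?}                    # the regular expression itself
        if printf '%s\n' "$url" | grep -Eq -- "$pattern"; then
            [ "$sign" = "+" ]
            return
        fi
    done < "$rules"
    return 1
}

# Usage:
# url_allowed "http://www.mogilowski.net/" /opt/nutch/conf/regex-urlfilter.txt && echo allowed
```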

Create directories and seed

Now I prepare the directories and create a seed file with the start URL.

mkdir /opt/nutch/IntranetCrawler
mkdir /opt/nutch/urls
echo "http://www.mogilowski.net" > /opt/nutch/urls/seed.txt

Let’s go

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/
/opt/nutch/bin/crawl --index /opt/nutch/urls/ /opt/nutch/IntranetCrawler/ 1
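
The three arguments after --index are the seed directory, the crawl directory and the number of rounds. If you run the crawl regularly, a thin wrapper that checks the preconditions first can save some head-scratching. This is a sketch only: run_crawl and the NUTCH_HOME variable are assumptions made for this example and are not part of Nutch.

```shell
# Hypothetical wrapper around the crawl call above.
# NUTCH_HOME defaults to /opt/nutch, the first argument is the number of rounds.
run_crawl() {
    rounds=${1:-1}
    nutch_home=${NUTCH_HOME:-/opt/nutch}
    # Fail early with a readable message
    [ -n "$JAVA_HOME" ] || { echo "JAVA_HOME is not set" >&2; return 1; }
    [ -s "$nutch_home/urls/seed.txt" ] || { echo "seed file is missing or empty" >&2; return 1; }
    "$nutch_home/bin/crawl" --index "$nutch_home/urls/" "$nutch_home/IntranetCrawler/" "$rounds"
}

# Usage: run_crawl 2
```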

Check the results

Go to “http://YOUR_SERVER:8080/solr/” and run a select query in the Solr admin with “*:*”. You should get the results of the first run as XML.
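
The same query also works from the command line through Solr's select handler. The sketch below only builds the request URL. solr_select_url is a hypothetical helper, and the query is not URL-encoded, so it is meant for simple queries such as *:* only.

```shell
# Hypothetical helper: build the URL of a Solr select request.
solr_select_url() {
    base=${1%/}                 # Solr base URL, one trailing slash removed
    query=${2:-"*:*"}           # *:* matches every document
    rows=${3:-10}               # number of results to return
    printf '%s/select?q=%s&rows=%s&wt=xml\n' "$base" "$query" "$rows"
}

# Usage: curl "$(solr_select_url http://localhost:8080/solr/)"
```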

Notes

For more information about the Nutch config or how to access the Solr data, take a look at:

* Suchmaschinenserver im Eigenbau (Ausgabe 02/2016)
* Go Find It (Issue #186 / May 2016)

Security!

By default Solr is publicly accessible (read/write) without a password on all IPs!
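
A quick way to see whether this applies to your machine is to look at where the Tomcat port is bound. The sketch below reads the output of “ss -ltn” and checks port 8080. listens_publicly is a hypothetical helper, and it only looks at the bind address, so a firewall in front of the host is not taken into account.

```shell
# Hypothetical helper: read `ss -ltn` output on stdin and succeed if the
# given port is bound to anything other than the loopback interface.
listens_publicly() {
    awk -v port="$1" '
        NR > 1 {
            addr = $4                          # "Local Address:Port" column
            n = split(addr, parts, ":")
            if (parts[n] != port) next
            host = substr(addr, 1, length(addr) - length(parts[n]) - 1)
            if (host != "127.0.0.1" && host != "[::1]") exposed = 1
        }
        END { exit (exposed ? 0 : 1) }
    '
}

# Usage:
# ss -ltn | listens_publicly 8080 && echo "port 8080 is bound to a non-loopback address"
```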
