Crawling private pages using Nutch


Updated 8 years ago by Adam Ziubrzynski.


New Member Posts: 11 Join Date: 13/08/08

Hi,

We currently have Nutch and Solr set up to crawl and index our public-facing websites.

We now want to create sites for internal use, crawl them, and use Solr for search in the same way. The difference is that these pages will be private rather than public.

I'm having trouble getting Nutch to authenticate so that the pages can be fetched and parsed during crawling.

I've created a user account for the purposes of crawling that only has view permissions on the pages.

I've set everything up in the same manner as for our other sites, except that I've tried to use the credential options in the Nutch configuration, as detailed here.

It looks like the crawler is redirected to the site's login page, where indexing stops even though credentials are provided.
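
For anyone who wants to reproduce this outside Nutch, here is a minimal Python sketch (not part of our setup) that requests a private page without following redirects and reports how the server responds. As far as I understand, the Nutch credential options only come into play when the server answers with an HTTP authentication challenge (a 401 with a `WWW-Authenticate` header), so a plain redirect would explain what I'm seeing. The URL and the names `classify` and `probe` are hypothetical and only for illustration.

```python
import urllib.error
import urllib.request


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from following redirects so the raw status stays visible."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # returning None makes urllib raise HTTPError instead


def classify(status, headers):
    """Label a response given its status code and lower-cased header names."""
    if status == 401 and "www-authenticate" in headers:
        return "http-auth-challenge"  # server-level HTTP auth is being offered
    if status in (301, 302, 303, 307, 308):
        return "redirect"             # e.g. a bounce to a form-based login page
    if 200 <= status < 300:
        return "no-auth-required"
    return "other"


def probe(url):
    """Fetch url once, without credentials, and classify the response."""
    opener = urllib.request.build_opener(NoRedirect)
    try:
        resp = opener.open(url, timeout=10)
        status, raw = resp.status, resp.headers
    except urllib.error.HTTPError as err:
        status, raw = err.code, err.headers
    headers = {name.lower(): value for name, value in raw.items()}
    return classify(status, headers), headers.get("location")


if __name__ == "__main__":
    # Hypothetical address of one of the private pages.
    print(probe("http://localhost:8080/web/internal/home"))
```

If this reports `redirect` with a location pointing at the login page, the server never issues a challenge that the crawler's credentials could answer.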

Has anyone done this before? Is there anything I might be missing?

Does Liferay support HTTP authentication, or do I need to configure an alternative authentication scheme?

Any help would be appreciated.

Thanks,