During a recent FAST implementation, I ran into another interesting issue with crawling internet sites, which Microsoft has confirmed to be a product (SharePoint 2010) issue. The details of my findings are given below.
In SharePoint Server 2010, you receive the following error when you try to crawl a web site (non-SharePoint) that has both anonymous and NTLM authentication enabled:
“Access is denied. Verify that either the Default Content Access Account has access to this repository, or add a crawl rule to crawl this repository. If the repository being crawled is a SharePoint repository, verify that the account you are using has “Full Read” permissions on the SharePoint Web Application being crawled.”
This can occur whenever you have ever set a default content access account. In that scenario, Search incorrectly picks up the default account, assumes the site to crawl requires NTLM, fails to crawl the site using NTLM, and does not even fall back to anonymous access.
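To illustrate the behavior that is missing, here is a minimal, hypothetical sketch (not SharePoint code) of how a crawler could fall back to anonymous access when an authenticated request is rejected; the `fetch` callable and the status-code convention are assumptions for illustration only:

```python
# Hypothetical sketch of the auth fallback that SharePoint's crawler does NOT perform.
# `fetch` stands in for an HTTP GET against the start address; it returns a status code.

def crawl_with_fallback(fetch, ntlm_credentials):
    """Try NTLM first; on 401, retry anonymously instead of failing outright."""
    status = fetch(auth=ntlm_credentials)   # crawler assumes NTLM is required
    if status == 401:                       # access denied with the default account
        status = fetch(auth=None)           # fall back to an anonymous request
    return status
```

SharePoint 2010 stops at the first step: once the default content access account is set, the NTLM attempt fails and no anonymous retry is made.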
The MS Product Group has confirmed this to be a product issue. However, we were told that since the changes required to fix it are substantial, they will consider fixing it in a future release of the product.
The following two workarounds are available; either of them will resolve the issue:
I. Remove the search service application and provision a new one (I know it's a pain, especially if you have already set up hundreds of crawl rules, etc.). After the new search service application is provisioned, DO NOT set or change the default content access account. (Unfortunately, there is no public interface that can be used to remove the default content access account once it is set.) Even retyping the password for the same default content access account will set the authentication type to NTLM as explained above and will trigger the issue.
II. Create a crawl rule for the start address of the site and configure it to use a cookie.
- Create a dummy text file, e.g. dummy_cooki.txt, containing some dummy text, such as: Test: testing
- Create a crawl rule configured to crawl using a cookie, and point it at the cookie file created in the previous step
- For the error page, specify anything (e.g. Error.aspx); it is only a filler and is not actually used
The idea is to configure SharePoint to crawl the site using a cookie (to some degree, this can be interpreted as anonymous access). As long as the target site does not reject the extra dummy cookie (many web sites don't specifically inspect cookies other than those they set themselves), the crawl can succeed.
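To see why this works at the HTTP level, here is a small illustrative Python sketch (not how SharePoint implements it) that reads a dummy cookie file in the "Name: value" form shown above and attaches it to a request as a Cookie header; the helper names are my own and the parsing is an assumption based on the steps described:

```python
import urllib.request

def cookie_header_from_file(path):
    """Build a Cookie header value from a dummy cookie file in 'Name: value' form,
    e.g. a file containing the single line 'Test: testing'."""
    with open(path) as f:
        name, _, value = f.readline().partition(":")
    return f"{name.strip()}={value.strip()}"

def build_request(url, cookie_path):
    """Attach the dummy cookie to the request. A site that ignores cookies it
    did not set itself will serve the page as if the request were anonymous."""
    req = urllib.request.Request(url)
    req.add_header("Cookie", cookie_header_from_file(cookie_path))
    return req
```

The target server simply sees an ordinary anonymous request carrying one unrecognized cookie, which most sites ignore.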
Have a nice Day !!!