Monday, June 29, 2015

BCS Models - Part 8: Crawl Results

This post is part of an eight-part series describing the process I have followed to create a BCS model capable of indexing well over 10 million items.

Series Index:
BCS Models - Part 1: Target Database
BCS Models - Part 2: Initial BCS External Content Types
BCS Models - Part 3: Crawl Results
BCS Models - Part 4: Bigger Database
BCS Models - Part 5: The Bigger Database
BCS Models - Part 6: How to eat this elephant?
BCS Models - Part 7: Changes to the BCS Model to support segmented crawl
BCS Models - Part 8: Crawl Results <-- You are here
This series of blog posts reaches its culmination here: proof that the BCS SQL Connector can be used to crawl extremely large LOB datasets in a way that doesn't kill the crawler.

I'm using the BCS Model created in BCS Models - Part 7: Changes to the BCS Model to support segmented crawl to crawl a database built from an extract of the March 2015 data export of StackOverflow.StackExchange.com.

In order to have the crawls run quickly I've limited my segment size to 1000 rows and will crawl only the first few segments to show the mechanics of the process.  Once I've done that, I'll let my little single-server farm attempt to index the entire corpus.
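To make the segment boundaries concrete, here's a minimal sketch of the kind of segment-enumeration stored procedure a model like the one from Part 7 calls.  The procedure and table names (GetPostSegments, dbo.Posts) and the arithmetic are illustrative assumptions, not the exact procedures from my model:

-- Hypothetical segment enumerator (Finder): one row per 1000-row slice of Posts.
-- The crawler enumerates these rows first; each one becomes a "folder" link to follow.
create procedure dbo.GetPostSegments
as
begin
    set nocount on;
    select distinct (Id - 1) / 1000 as SegmentId
      from dbo.Posts;
end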

I'm going to use both SQL Server Profiler to capture the stored procedure invocations and ULSViewer to capture the c73i events.
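Profiler is the quickest way to watch the calls, but if you'd rather not run Profiler, an Extended Events session will capture the same stored procedure invocations.  A minimal sketch; the session name, file name, and database name (StackOverflowExtract) are placeholders for whatever your extract database is called:

-- Capture completed RPC calls (the BCS stored procedure invocations) against the extract DB.
CREATE EVENT SESSION BcsCrawlProcs ON SERVER
ADD EVENT sqlserver.rpc_completed(
    ACTION(sqlserver.sql_text)
    WHERE sqlserver.database_name = N'StackOverflowExtract')
ADD TARGET package0.event_file(SET filename = N'BcsCrawlProcs.xel');
GO
ALTER EVENT SESSION BcsCrawlProcs ON SERVER STATE = START;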

Here are the results of my Full Crawl (limited to the first 5 segments of each of the segment entities):
In this trace we see that, within six hundredths of a second, all seven of the segments are enumerated.  About a minute later we see the seven child Entities being crawled via their AssociationNavigator, and finally SharePoint queries three specific posts.  This is where giving the MethodInstance a meaningful name is helpful.
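For reference, the child-entity calls showing up in the trace would be shaped roughly like the two procedures below.  Again, the names (GetPostsBySegment, GetPostById, dbo.Posts) and columns are assumptions for illustration; the point is that the AssociationNavigator is bounded by a segment and the SpecificFinder by a single key:

-- Hypothetical AssociationNavigator: returns only the rows belonging to one segment,
-- so no single call has to stream the whole table.
create procedure dbo.GetPostsBySegment @SegmentId int
as
begin
    set nocount on;
    select Id, Title, CreationDate, LastEditDate
      from dbo.Posts
     where (Id - 1) / 1000 = @SegmentId;
end
go

-- Hypothetical SpecificFinder: the three single-post lookups at the end of the trace
-- would be calls shaped like this one.
create procedure dbo.GetPostById @Id int
as
begin
    set nocount on;
    select Id, Title, CreationDate, LastEditDate
      from dbo.Posts
     where Id = @Id;
end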

In the crawl log we see that we indexed 35,043 items.


The Crawl Queue Health Report shows how fast the links from the segment accumulated.


Now to investigate the impact of the segments on the crawler.  In Crushing the 1-million-item-limit myth with .NET Search Connector [BDC], Brian Pendergast shows that we can see event id dw3a if we enable VerboseEx tracing on the search crawler categories.

Looking in the MSSCrawlURL table of the crawl store database with a series of queries:
-- Query 1: the root of the crawl (the start address itself)
select *
  from MSSCrawlURL
 where ParentDocID = -1

-- Query 2: the links generated by enumerating the seven segment entities (children of the root)
select cu.*
  from MSSCrawlURL cu
  join MSSCrawlURL cu2 on cu2.DocID = cu.ParentDocID
 where cu2.ParentDocID = -1

-- Query 3: the segments enumerated from the first segment entity
select cu.*
  from MSSCrawlURL cu
  join (
        select top 1 cu.*
          from MSSCrawlURL cu
          join MSSCrawlURL cu2 on cu2.DocID = cu.ParentDocID
         where cu2.ParentDocID = -1
       ) a on a.DocID = cu.ParentDocID

-- Query 4: the number of items enumerated in each segment of the first Entity
select cu.ParentDocID, count(cu.DocID)
  from MSSCrawlURL cu
  join (
        select cu.*
          from MSSCrawlURL cu
          join (
                select top 1 cu.*
                  from MSSCrawlURL cu
                  join MSSCrawlURL cu2 on cu2.DocID = cu.ParentDocID
                 where cu2.ParentDocID = -1
               ) a on a.DocID = cu.ParentDocID
       ) b on cu.ParentDocID = b.DocID
 group by cu.ParentDocID

-- Query 5: the items crawled from the first segment, with their crawl status
select cu.DocID, cu.ParentDocID, cur.DisplayURL, cur.ErrorLevel, cur.ErrorDesc
  from MSSCrawlURL cu
  join (
        select top 1 cu.*
          from MSSCrawlURL cu
          join (
                select top 1 cu.*
                  from MSSCrawlURL cu
                  join MSSCrawlURL cu2 on cu2.DocID = cu.ParentDocID
                 where cu2.ParentDocID = -1
               ) a on a.DocID = cu.ParentDocID
       ) b on cu.ParentDocID = b.DocID
  join MSSCrawlURLReport cur on cur.URLID = cu.DocID
I get result sets like the following:
  1. The first result set is the root of the crawl.
  2. The second result set is the records generated by enumerating the seven segment entities (Finder).
  3. The third result set shows the first 5 segments enumerated from the first segment entity.
  4. The fourth result set shows the number of items enumerated in each of the segments for the first Entity.
  5. The fifth result set shows the first few of the 1000 items crawled.
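The same hierarchy can also be walked in one pass with a recursive CTE, which makes a handy sanity check on the nesting above.  This is a sketch that relies only on the two columns the queries above already use (DocID and ParentDocID); your crawl store schema may vary by version:

-- Count crawl-store links at each depth below the crawl root (ParentDocID = -1).
;with links as (
    select DocID, ParentDocID, 0 as Depth
      from MSSCrawlURL
     where ParentDocID = -1
    union all
    select cu.DocID, cu.ParentDocID, l.Depth + 1
      from MSSCrawlURL cu
      join links l on cu.ParentDocID = l.DocID
)
select Depth, count(*) as Links
  from links
 group by Depth
 order by Depth;

On this crawl the depths should correspond to the levels described above: depth 0 is the start address, depth 1 the seven segment entities, depth 2 the segments, and depth 3 the individual items.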

I ran a full crawl with VerboseEx tracing enabled on all trace categories with *crawl* in the name.  The ULSViewer shots below are filtered to event ids c73i and dw3a.

Here's the first link being written (dw3a) that the whole crawl seems to hang off of.  This record has the -1 SourceDocID:

This is followed by seven more links being written, one for each of the Segment Entities.  The highlighted line shows a SourceDocID of 1; all seven of these do.

This is followed by the seven Segment Entities' MethodInstances being invoked.

These are followed by more link inserts, for the Entities in each segment.

And the ULS continues in similar fashion until I stopped the crawl and it cleaned itself up.  The Crawl Queue Chart looks like this:

So here we see that the crawl identified nearly 8 million links in less than three hours on my small VM while it was also crawling the items.  With an appropriately sized farm, I'm confident this approach would crawl all of the items without ever issuing a query that returned more than 100,000 rows, and without first accumulating all of the rows for each Entity before inserting them into the temp table for further processing.
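One way to convince yourself of the "no oversized result sets" claim is to check the row count per segment directly against the source database.  A sketch, again assuming a dbo.Posts table; the 100,000-row segment width is a placeholder for whatever width the model uses:

-- With a fixed segment width, no AssociationNavigator call can return more rows
-- than the width itself, so no single query brings back an oversized result set.
declare @SegmentWidth int = 100000;

select top 10 (Id - 1) / @SegmentWidth as SegmentId, count(*) as RowsInSegment
  from dbo.Posts
 group by (Id - 1) / @SegmentWidth
 order by count(*) desc;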

I believe this demonstrates that we can configure BCS to crawl large SQL Server data sets without having to write code to replace the delivered SQL Server Connector. 
