Monday, June 29, 2015

BCS Models - Part 8: Crawl Results

This post is part of an eight-part series describing the process I have followed to create a BCS model capable of indexing well over 10 million items.

Series Index:
BCS Models - Part 1: Target Database
BCS Models - Part 2: Initial BCS External Content Types
BCS Models - Part 3: Crawl Results
BCS Models - Part 4: Bigger Database
BCS Models - Part 5: The Bigger Database
BCS Models - Part 6: How to eat this elephant?
BCS Models - Part 7: Changes to the BCS Model to support segmented crawl
BCS Models - Part 8: Crawl Results <-- You are here
This series of blog posts reaches its culmination here: the proof that the BCS SQL Connector can be used to crawl extremely large LOB datasets in a way that doesn't kill the crawler.

I'm using the BCS Model created in BCS Models - Part 7: Changes to the BCS Model to support segmented crawl to crawl a database created from an extract from the March 2015 export from StackOverflow.StackExchange.com.

To have the crawls run quickly, I've limited my segment size to 1000 rows and will be crawling the first few segments to show the mechanics of the process.  Once I've done that I'll try letting my little single-server farm attempt to index the entire corpus.
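
To make the mechanics concrete, here is a minimal Python sketch of the kind of segmentation involved, assuming segments are formed by taking fixed-size runs of rows in ID order. The real work is done by the stored procedures from Part 6; `Segment` and `build_segments` are hypothetical names, and only the segmentNumber, lowerID and upperID fields mirror what the Segment Entities in the model return.

```python
from typing import Iterable, List, NamedTuple


class Segment(NamedTuple):
    segment_number: int  # mirrors segmentNumber in the model
    lower_id: int        # mirrors lowerID
    upper_id: int        # mirrors upperID


def build_segments(ids: Iterable[int], segment_size: int = 1000) -> List[Segment]:
    """Split the IDs into consecutive runs of at most segment_size rows."""
    ordered = sorted(ids)
    segments = []
    for start in range(0, len(ordered), segment_size):
        chunk = ordered[start:start + segment_size]
        # Bounds come from the rows themselves, so gaps in the ID
        # sequence do not produce short or empty segments.
        segments.append(Segment(len(segments) + 1, chunk[0], chunk[-1]))
    return segments


if __name__ == "__main__":
    # Hypothetical ID list with gaps: 3500 odd-numbered IDs.
    for segment in build_segments(range(1, 7001, 2)):
        print(segment)
```

Bounding each segment by row count rather than by a fixed ID range keeps every enumeration at or below the chosen size even when IDs are sparse, which is the property the crawl depends on.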

I'm going to use both SQL Server Profiler to capture the stored procedure invocations and ULSViewer to capture the c73i events.

Here are the results of my Full Crawl (limited to the first 5 segments on each of the segment entities):
In this trace we see that within 6/100ths of a second all seven of the segment entities are being enumerated.  About a minute later we see the seven child Entities being crawled via their AssociationNavigator, and finally SharePoint needs to query three specific posts.  This is where giving the MethodInstance a meaningful name is helpful.

In the crawl log we see that we indexed 35,043 items.


The Crawl Queue Health Report shows how fast the links from the segment accumulated.


Now to investigate the impact of the segments on the crawler.  In Crushing the 1-million-item-limit myth with .NET Search Connector [BDC], Brian Pendergast shows that we can use event ID dw3a if we enable VerboseEx tracing on the search crawler categories.

Looking at the MSSCrawlURL table in the crawl database with a series of queries:
-- 1. Root of the crawl: the only row with no parent.
select * from MSSCrawlURL where ParentDocID = -1

-- 2. Children of the root: one row per segment entity.
select cu.*
  from MSSCrawlURL cu
  join MSSCrawlURL cu2 on cu2.DocID = cu.ParentDocID
 where cu2.ParentDocID = -1

-- 3. Segments enumerated from one segment entity
--    (top 1 without an order by, so whichever SQL Server returns first).
select cu.*
  from MSSCrawlURL cu
  join (
        select top 1 cu.*
          from MSSCrawlURL cu
          join MSSCrawlURL cu2 on cu2.DocID = cu.ParentDocID
         where cu2.ParentDocID = -1
       ) a on a.DocID = cu.ParentDocID

-- 4. Number of items enumerated in each segment of that segment entity.
select cu.ParentDocID, count(cu.DocID)
  from MSSCrawlURL cu
  join (
        select cu.*
          from MSSCrawlURL cu
          join (
                select top 1 cu.*
                  from MSSCrawlURL cu
                  join MSSCrawlURL cu2 on cu2.DocID = cu.ParentDocID
                 where cu2.ParentDocID = -1
               ) a on a.DocID = cu.ParentDocID
       ) b on cu.ParentDocID = b.DocID
 group by cu.ParentDocID

-- 5. Items crawled from one segment of that entity, with their crawl status.
select cu.DocID, cu.ParentDocID, cur.DisplayURL, cur.ErrorLevel, cur.ErrorDesc
  from MSSCrawlURL cu
  join (
        select top 1 cu.*
          from MSSCrawlURL cu
          join (
                select top 1 cu.*
                  from MSSCrawlURL cu
                  join MSSCrawlURL cu2 on cu2.DocID = cu.ParentDocID
                 where cu2.ParentDocID = -1
               ) a on a.DocID = cu.ParentDocID
       ) b on cu.ParentDocID = b.DocID
  join MSSCrawlURLReport cur on cur.URLID = cu.DocID
I get results like:
  1. The first result set is the root of the crawl.
  2. The second result set is the records generated by enumerating the seven segment entities (Finder).  
  3. The third result set is showing the first 5 segments to be enumerated from the first segment entity.
  4. The fourth result set is showing the number of items enumerated in each of the segments for the first Entity.
  5. The fifth result set is the first few of the 1000 items crawled.
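
The same walk down the ParentDocID hierarchy can be sketched outside SQL. The following Python is illustrative only: it assumes the crawl rows have been exported as (DocID, ParentDocID) pairs, `level_sizes` is a hypothetical helper, and the sample rows are made up to show the shape of the tree rather than taken from my crawl.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def children_by_parent(rows: Iterable[Tuple[int, int]]) -> Dict[int, List[int]]:
    """Group DocIDs under their ParentDocID."""
    tree = defaultdict(list)
    for doc_id, parent_doc_id in rows:
        tree[parent_doc_id].append(doc_id)
    return tree


def level_sizes(rows: Iterable[Tuple[int, int]], root_parent: int = -1) -> List[int]:
    """Count the links at each level, starting with rows whose parent is -1."""
    tree = children_by_parent(rows)
    sizes = []
    level = tree.get(root_parent, [])
    while level:
        sizes.append(len(level))
        # Move down one level: collect the children of every current node.
        level = [child for parent in level for child in tree.get(parent, [])]
    return sizes


# Made-up rows: one root, two segment entities, three segments, three items.
SAMPLE_ROWS = [
    (1, -1),
    (2, 1), (3, 1),
    (4, 2), (5, 2), (6, 3),
    (7, 4), (8, 4), (9, 4),
]

if __name__ == "__main__":
    print(level_sizes(SAMPLE_ROWS))
```

Each entry in the returned list corresponds to one of the levels the queries above step through: the root, the segment entities, the segments, and finally the items.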

I ran a full crawl and enabled VerboseEx tracing on all trace entries with *crawl* in the name.  The ULSViewer shots below are filtered to event IDs c73i and dw3a.

Here's the first link being written (dw3a) that the whole crawl seems to hang off of.  This record has the -1 SourceDocID:

This is followed by seven more links being written, one for each of the Segment Entities.  The line highlighted shows the SourceDocID of 1. All seven of these do.

This is followed by the seven Segment Entities' MethodInstances being invoked

These are followed by more link inserts, for the Entities in each Entity Segment

And the ULS continues in similar fashion until I stopped the crawl and it cleaned itself up.  The Crawl Queue Chart looks like this:

So here we see that the crawl identified nearly 8 million links in less than three hours on my small VM, while also crawling the items.  With an appropriately sized farm, I'm certain this would successfully crawl all the items without issuing any query that brings back more than 100,000 results, and it would insert the items into the temp table for further processing without first accumulating all of the rows for each Entity.

I believe this demonstrates that we can configure BCS to crawl large SQL Server data sets without having to write code to replace the delivered SQL Server Connector. 

Sunday, June 28, 2015

BCS Models - Part 7: Changes to the BCS Model to support segmented crawl

This post is part of an eight-part series describing the process I have followed to create a BCS model capable of indexing well over 10 million items.

Series Index:
BCS Models - Part 1: Target Database
BCS Models - Part 2: Initial BCS External Content Types
BCS Models - Part 3: Crawl Results
BCS Models - Part 4: Bigger Database
BCS Models - Part 5: The Bigger Database
BCS Models - Part 6: How to eat this elephant?
BCS Models - Part 7: Changes to the BCS Model to support segmented crawl <-- You are here
BCS Models - Part 8: Crawl Results

In BCS Models - Part 6: How to eat this elephant? I created a number of stored procedures to be used to segment the data so we can take smaller bites of our huge elephant of data, stackoverflow.

Here's a representative entity describing the QuestionSegment Entity.  There doesn't seem to be a good way to deal with the wide XML, so it's a bit ugly.  

<Entity Namespace="stackexchange.so" Version="1.0.0.0" EstimatedInstanceCount="10000" Name="QuestionSegment" DefaultDisplayName="QuestionSegment">
  <Properties>
    <Property Name="DefaultAction" Type="System.String">View Profile</Property>
  </Properties>
  <Identifiers>
    <Identifier TypeName="System.Int32" Name="segmentNumber" />
  </Identifiers>
  <Methods>
    <Method IsStatic="false" Name="usp_getQuestionSegmentsRead List">
      <Properties>
        <Property Name="BackEndObject" Type="System.String">usp_getQuestionSegments</Property>
        <Property Name="BackEndObjectType" Type="System.String">SqlServerRoutine</Property>
        <Property Name="RdbCommandText" Type="System.String">[dbo].[usp_getQuestionSegments]</Property>
        <Property Name="RdbCommandType" Type="System.Data.CommandType, System.Data, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089">StoredProcedure</Property>
        <Property Name="Schema" Type="System.String">dbo</Property>
      </Properties>
      <Parameters>
        <Parameter Direction="Return" Name="usp_getQuestionSegmentsRead List Return">
          <TypeDescriptor TypeName="System.Data.IDataReader, System.Data, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089" IsCollection="true" Name="usp_getQuestionSegmentsRead List Collection">
            <TypeDescriptors>
              <TypeDescriptor TypeName="System.Data.IDataRecord, System.Data, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089" Name="usp_getQuestionSegmentsRead ListElement">
                <TypeDescriptors>
                  <TypeDescriptor TypeName="System.Int32" ReadOnly="true" IdentifierName="segmentNumber" Name="segmentNumber" />
                  <TypeDescriptor TypeName="System.Int32" ReadOnly="true" Name="lowerID" />
                  <TypeDescriptor TypeName="System.Int32" ReadOnly="true" Name="upperID" />
                </TypeDescriptors>
              </TypeDescriptor>
            </TypeDescriptors>
          </TypeDescriptor>
        </Parameter>
      </Parameters>
      <MethodInstances>
        <MethodInstance Type="Finder" ReturnParameterName="usp_getQuestionSegmentsRead List Return" Default="true" Name="usp_getQuestionSegmentsRead List Instance">
          <Properties>
            <Property Name="RootFinder" Type="System.String"></Property>
            <Property Name="UseClientCachingForSearch" Type="System.String"></Property>
          </Properties>
        </MethodInstance>
      </MethodInstances>
    </Method>
    <Method IsStatic="false" Name="usp_getQuestionSegmentRead Item">
      <Properties>
        <Property Name="BackEndObject" Type="System.String">usp_getQuestionSegment</Property>
        <Property Name="BackEndObjectType" Type="System.String">SqlServerRoutine</Property>
        <Property Name="RdbCommandText" Type="System.String">[dbo].[usp_getQuestionSegment]</Property>
        <Property Name="RdbCommandType" Type="System.Data.CommandType, System.Data, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089">StoredProcedure</Property>
        <Property Name="Schema" Type="System.String">dbo</Property>
      </Properties>
      <Parameters>
        <Parameter Direction="In" Name="@segmentNumber">
          <TypeDescriptor TypeName="System.Int32" IdentifierName="segmentNumber" Name="@segmentNumber" />
        </Parameter>
        <Parameter Direction="Return" Name="usp_getQuestionSegmentRead Item Return">
          <TypeDescriptor TypeName="System.Data.IDataReader, System.Data, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089" IsCollection="true" Name="usp_getQuestionSegmentRead Item Collection">
            <TypeDescriptors>
              <TypeDescriptor TypeName="System.Data.IDataRecord, System.Data, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089" Name="usp_getQuestionSegmentRead ItemElement">
                <TypeDescriptors>
                  <TypeDescriptor TypeName="System.Int32" ReadOnly="true" IdentifierName="segmentNumber"
                     Name="segmentNumber" />
                  <TypeDescriptor TypeName="System.Int32" ReadOnly="true" Name="lowerID" />
                  <TypeDescriptor TypeName="System.Int32" ReadOnly="true" Name="upperID" />
                  <TypeDescriptor TypeName="System.Int64" ReadOnly="true" Name="DeletedCount" />
                </TypeDescriptors>
              </TypeDescriptor>
            </TypeDescriptors>
          </TypeDescriptor>
        </Parameter>
      </Parameters>
      <MethodInstances>
        <MethodInstance Type="SpecificFinder" ReturnParameterName="usp_getQuestionSegmentRead Item Return"
          ReturnTypeDescriptorPath="usp_getQuestionSegmentRead Item Collection[0]" Default="true"
          Name="usp_getQuestionSegmentRead Item Instance">
          <Properties>
            <Property Name="DeletedCountField" Type="System.String">DeletedCount</Property>
            <Property Name="UseClientCachingForSearch" Type="System.String"></Property>
          </Properties>
        </MethodInstance>
      </MethodInstances>
    </Method>
  </Methods>
  <Actions>
    <Action Position="1" IsOpenedInNewWindow="false" Url="http://sp2013lab:80/sites/ECT/_bdc/stackexchange_so/QuestionSegment_2.aspx?segmentNumber={0}" ImageUrl="/_layouts/15/1033/images/viewprof.gif" Name="View Profile">
      <LocalizedDisplayNames>
        <LocalizedDisplayName LCID="1033">View Profile</LocalizedDisplayName>
      </LocalizedDisplayNames>
      <Properties>
        <Property Name="IsTaskpaneAction" Type="System.Boolean">true</Property>
        <Property Name="Office Version" Type="System.String">15</Property>
      </Properties>
      <ActionParameters>
        <ActionParameter Index="0" Name="segmentNumber[0]">
          <Properties>
            <Property Name="IdOrdinal" Type="System.Byte">0</Property>
          </Properties>
        </ActionParameter>
      </ActionParameters>
    </Action>
  </Actions>
</Entity>



I've highlighted three lines.  The first is the RootFinder property on the Finder MethodInstance.  This tells the crawler to start crawling here.  Each of the Segment Entities will have this, causing the crawl to go after all of them at the same time.

The second highlighted line is the TypeDescriptor for DeletedCount, which is a System.Int64.  This field supports incremental crawls by telling the crawler how many deleted rows there are.  The data sources I've used never really deleted any data, so I've always made my supporting SQL return a 0 cast as a BigInt.  The BigInt is required by BCS.

The third highlight also supports incremental crawls: it tells BCS which of the fields is the DeletedCountField.

Here's the Question Entity:
<Entity Namespace="stackexchange.so" Version="1.1.0.0" EstimatedInstanceCount="10000" Name="Question" DefaultDisplayName="Question">
  <Properties>
    <Property Name="DefaultAction" Type="System.String">View Profile</Property>
  </Properties>
  <Identifiers>
    <Identifier TypeName="System.Int32" Name="ID" />
  </Identifiers>
  <Methods>
    <Method IsStatic="false" Name="usp_GetQuestionsBySegment AssociationNavigator">
      <Properties>
        <Property Name="BackEndObject" Type="System.String">usp_GetQuestionsBySegment</Property>
        <Property Name="BackEndObjectType" Type="System.String">SqlServerRoutine</Property>
        <Property Name="RdbCommandText" Type="System.String">usp_GetQuestionsBySegment</Property>
        <Property Name="RdbCommandType" Type="System.Data.CommandType, System.Data, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089">StoredProcedure</Property>
        <Property Name="Schema" Type="System.String">dbo</Property>
      </Properties>
      <FilterDescriptors>
        <FilterDescriptor Type="Input" Name="LastCrawlTime">
          <Properties>
            <Property Name="CrawlStartTime" Type="System.String"></Property>
          </Properties>
        </FilterDescriptor>
      </FilterDescriptors>
      <Parameters>
        <Parameter Direction="In" Name="@segmentNumber">
          <TypeDescriptor TypeName="System.Int32" IdentifierName="segmentNumber"
           IdentifierEntityName="QuestionSegment" IdentifierEntityNamespace="stackexchange.so"
           ForeignIdentifierAssociationName="usp_GetQuestionsBySegment AssociationNavigator Instance"
           Name="@segmentNumber" />
        </Parameter>
        <Parameter Direction="In" Name="@lastRunDate">
          <TypeDescriptor TypeName="System.DateTime" AssociatedFilter="LastCrawlTime" Name="lastModifiedTime">
            <Interpretation>
              <NormalizeDateTime LobDateTimeMode="Local" />
            </Interpretation>
          </TypeDescriptor>
        </Parameter>
        <Parameter Direction="Return" Name="usp_GetQuestionsBySegment Return">
          <TypeDescriptor TypeName="System.Data.IDataReader, System.Data, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089" IsCollection="true" Name="uv_AllQuestionsRead List">
            <TypeDescriptors>
              <TypeDescriptor TypeName="System.Data.IDataRecord, System.Data, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089" Name="uv_AllQuestionsRead ListElement">
                <TypeDescriptors>
                  <TypeDescriptor TypeName="System.Int32" ReadOnly="true" IdentifierName="ID" Name="ID" />
                  <TypeDescriptor TypeName="System.Nullable`1[[System.Int32, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]]" Name="ParentId" />
                  <TypeDescriptor TypeName="System.Nullable`1[[System.Int32, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]]" Name="AnswerCount" />
                  <TypeDescriptor TypeName="System.String" Name="Body">
                    <Interpretation>
                      <NormalizeString FromLOB="NormalizeToNull" ToLOB="NormalizeToEmptyString" />
                    </Interpretation>
                  </TypeDescriptor>
                  <TypeDescriptor TypeName="System.Nullable`1[[System.DateTime, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]]" Name="ClosedDate">
                    <Interpretation>
                      <NormalizeDateTime LobDateTimeMode="UTC" />
                    </Interpretation>
                  </TypeDescriptor>
                  <TypeDescriptor TypeName="System.Nullable`1[[System.Int32, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]]" Name="CommentCount" />
                  <TypeDescriptor TypeName="System.Nullable`1[[System.DateTime, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]]" Name="CommunityOwnedDate">
                    <Interpretation>
                      <NormalizeDateTime LobDateTimeMode="UTC" />
                    </Interpretation>
                  </TypeDescriptor>
                  <TypeDescriptor TypeName="System.DateTime" Name="CreationDate">
                    <Properties>
                      <Property Name="RequiredInForms" Type="System.Boolean">true</Property>
                    </Properties>
                    <Interpretation>
                      <NormalizeDateTime LobDateTimeMode="UTC" />
                    </Interpretation>
                  </TypeDescriptor>
                  <TypeDescriptor TypeName="System.Nullable`1[[System.Int32, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]]" Name="FavoriteCount" />
                  <TypeDescriptor TypeName="System.DateTime" Name="LastActivityDate">
                    <Properties>
                      <Property Name="RequiredInForms" Type="System.Boolean">true</Property>
                    </Properties>
                    <Interpretation>
                      <NormalizeDateTime LobDateTimeMode="UTC" />
                    </Interpretation>
                  </TypeDescriptor>
                  <TypeDescriptor TypeName="System.Nullable`1[[System.DateTime, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]]" Name="LastEditDate">
                    <Interpretation>
                      <NormalizeDateTime LobDateTimeMode="UTC" />
                    </Interpretation>
                  </TypeDescriptor>
                  <TypeDescriptor TypeName="System.String" Name="LastEditorDisplayName">
                    <Properties>
                      <Property Name="Size" Type="System.Int32">40</Property>
                    </Properties>
                    <Interpretation>
                      <NormalizeString FromLOB="NormalizeToNull" ToLOB="NormalizeToNull" />
                    </Interpretation>
                  </TypeDescriptor>
                  <TypeDescriptor TypeName="System.Int32" Name="Score">
                    <Properties>
                      <Property Name="RequiredInForms" Type="System.Boolean">true</Property>
                    </Properties>
                  </TypeDescriptor>
                  <TypeDescriptor TypeName="System.String" Name="Tags">
                    <Properties>
                      <Property Name="Size" Type="System.Int32">150</Property>
                    </Properties>
                    <Interpretation>
                      <NormalizeString FromLOB="NormalizeToNull" ToLOB="NormalizeToNull" />
                    </Interpretation>
                  </TypeDescriptor>
                  <TypeDescriptor TypeName="System.String" Name="Title">
                    <Properties>
                      <Property Name="Size" Type="System.Int32">250</Property>
                    </Properties>
                    <Interpretation>
                      <NormalizeString FromLOB="NormalizeToNull" ToLOB="NormalizeToNull" />
                    </Interpretation>
                  </TypeDescriptor>
                  <TypeDescriptor TypeName="System.Int32" Name="ViewCount">
                    <Properties>
                      <Property Name="RequiredInForms" Type="System.Boolean">true</Property>
                    </Properties>
                  </TypeDescriptor>
                  <TypeDescriptor TypeName="System.String" Name="DisplayName">
                    <Properties>
                      <Property Name="Size" Type="System.Int32">40</Property>
                    </Properties>
                    <Interpretation>
                      <NormalizeString FromLOB="NormalizeToNull" ToLOB="NormalizeToNull" />
                    </Interpretation>
                  </TypeDescriptor>
                  <TypeDescriptor TypeName="System.String" Name="PostType">
                    <Properties>
                      <Property Name="Size" Type="System.Int32">50</Property>
                    </Properties>
                    <Interpretation>
                      <NormalizeString FromLOB="NormalizeToNull" ToLOB="NormalizeToNull" />
                    </Interpretation>
                  </TypeDescriptor>
                  <TypeDescriptor TypeName="System.Int32" Name="segmentNumber" />
                </TypeDescriptors>
              </TypeDescriptor>
            </TypeDescriptors>
          </TypeDescriptor>
        </Parameter>
      </Parameters>
      <MethodInstances>
        <Association Name="usp_GetQuestionsBySegment AssociationNavigator Instance" Type="AssociationNavigator" ReturnParameterName="usp_GetQuestionsBySegment Return">
          <Properties>
            <Property Name="DirectoryLink" Type="System.String"></Property>
            <Property Name="ForeignFieldMappings" Type="System.String">
              &lt;?xml version="1.0" encoding="utf-16"?&gt;
              &lt;ForeignFieldMappings xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"&gt;
              &lt;ForeignFieldMappingsList&gt;
              &lt;ForeignFieldMapping ForeignIdentifierName="segmentNumber" ForeignIdentifierEntityName="QuestionSegment" ForeignIdentifierEntityNamespace="stackexchange.so" FieldName="segmentNumber" /&gt;
              &lt;/ForeignFieldMappingsList&gt;
              &lt;/ForeignFieldMappings&gt;
            </Property>
            <Property Name="LastModifiedTimeStampField" Type="System.String">LastEditDate</Property>
            <Property Name="UseClientCachingForSearch" Type="System.String"></Property>
          </Properties>
          <SourceEntity Namespace="stackexchange.so" Name="QuestionSegment" />
          <DestinationEntity Namespace="stackexchange.so" Name="Question" />
        </Association>
      </MethodInstances>
    </Method>
    <Method IsStatic="false" Name="usp_getPostByID Question">
      <Properties>
        <Property Name="BackEndObject" Type="System.String">usp_getPostByID</Property>
        <Property Name="BackEndObjectType" Type="System.String">SqlServerRoutine</Property>
        <Property Name="RdbCommandText" Type="System.String">[dbo].[usp_getPostByID]</Property>
        <Property Name="RdbCommandType" Type="System.Data.CommandType, System.Data, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089">StoredProcedure</Property>
        <Property Name="Schema" Type="System.String">dbo</Property>
      </Properties>
      <Parameters>
        <Parameter Direction="In" Name="@postID">
          <TypeDescriptor TypeName="System.Nullable`1[[System.Int32, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]]" IdentifierName="ID" Name="@postID" />
        </Parameter>
        <Parameter Direction="Return" Name="usp_getPostByID">
          <TypeDescriptor TypeName="System.Data.IDataReader, System.Data, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089" IsCollection="true" Name="usp_getPostByID">
            <TypeDescriptors>
              <TypeDescriptor TypeName="System.Data.IDataRecord, System.Data, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089" Name="usp_getPostByIDElement">
                <TypeDescriptors>
                  <TypeDescriptor TypeName="System.Int32" ReadOnly="true" IdentifierName="ID" Name="ID" />
                  <TypeDescriptor TypeName="System.Nullable`1[[System.Int32, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]]" Name="ParentId" />
                  <TypeDescriptor TypeName="System.Nullable`1[[System.Int32, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]]" Name="AnswerCount" />
                  <TypeDescriptor TypeName="System.String" Name="Body">
                    <Interpretation>
                      <NormalizeString FromLOB="NormalizeToNull" ToLOB="NormalizeToEmptyString" />
                    </Interpretation>
                  </TypeDescriptor>
                  <TypeDescriptor TypeName="System.Nullable`1[[System.DateTime, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]]" Name="ClosedDate">
                    <Interpretation>
                      <NormalizeDateTime LobDateTimeMode="UTC" />
                    </Interpretation>
                  </TypeDescriptor>
                  <TypeDescriptor TypeName="System.Nullable`1[[System.Int32, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]]" Name="CommentCount" />
                  <TypeDescriptor TypeName="System.Nullable`1[[System.DateTime, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]]" Name="CommunityOwnedDate">
                    <Interpretation>
                      <NormalizeDateTime LobDateTimeMode="UTC" />
                    </Interpretation>
                  </TypeDescriptor>
                  <TypeDescriptor TypeName="System.DateTime" Name="CreationDate">
                    <Interpretation>
                      <NormalizeDateTime LobDateTimeMode="UTC" />
                    </Interpretation>
                  </TypeDescriptor>
                  <TypeDescriptor TypeName="System.Nullable`1[[System.Int32, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]]" Name="FavoriteCount" />
                  <TypeDescriptor TypeName="System.DateTime" Name="LastActivityDate">
                    <Interpretation>
                      <NormalizeDateTime LobDateTimeMode="UTC" />
                    </Interpretation>
                  </TypeDescriptor>
                  <TypeDescriptor TypeName="System.Nullable`1[[System.DateTime, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]]" Name="LastEditDate">
                    <Interpretation>
                      <NormalizeDateTime LobDateTimeMode="UTC" />
                    </Interpretation>
                  </TypeDescriptor>
                  <TypeDescriptor TypeName="System.String" Name="LastEditorDisplayName">
                    <Properties>
                      <Property Name="Size" Type="System.Int32">40</Property>
                    </Properties>
                    <Interpretation>
                      <NormalizeString FromLOB="NormalizeToNull" ToLOB="NormalizeToNull" />
                    </Interpretation>
                  </TypeDescriptor>
                  <TypeDescriptor TypeName="System.Int32" Name="Score" />
                  <TypeDescriptor TypeName="System.String" Name="Tags">
                    <Properties>
                      <Property Name="Size" Type="System.Int32">150</Property>
                    </Properties>
                    <Interpretation>
                      <NormalizeString FromLOB="NormalizeToNull" ToLOB="NormalizeToNull" />
                    </Interpretation>
                  </TypeDescriptor>
                  <TypeDescriptor TypeName="System.String" Name="Title">
                    <Properties>
                      <Property Name="Size" Type="System.Int32">250</Property>
                    </Properties>
                    <Interpretation>
                      <NormalizeString FromLOB="NormalizeToNull" ToLOB="NormalizeToNull" />
                    </Interpretation>
                  </TypeDescriptor>
                  <TypeDescriptor TypeName="System.Int32" Name="ViewCount" />
                  <TypeDescriptor TypeName="System.String" Name="DisplayName">
                    <Properties>
                      <Property Name="Size" Type="System.Int32">40</Property>
                    </Properties>
                    <Interpretation>
                      <NormalizeString FromLOB="NormalizeToNull" ToLOB="NormalizeToNull" />
                    </Interpretation>
                  </TypeDescriptor>
                  <TypeDescriptor TypeName="System.String" Name="PostType">
                    <Properties>
                      <Property Name="Size" Type="System.Int32">50</Property>
                    </Properties>
                    <Interpretation>
                      <NormalizeString FromLOB="NormalizeToNull" ToLOB="NormalizeToNull" />
                    </Interpretation>
                  </TypeDescriptor>
                  <TypeDescriptor TypeName="System.Nullable`1[[System.Int32, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]]" Name="AcceptedAnswerId" />
                </TypeDescriptors>
              </TypeDescriptor>
            </TypeDescriptors>
          </TypeDescriptor>
        </Parameter>
      </Parameters>
      <MethodInstances>
        <MethodInstance Type="SpecificFinder" ReturnParameterName="usp_getPostByID" ReturnTypeDescriptorPath="usp_getPostByID[0]" Default="true" Name="usp_getPostByID Question Instance">
          <Properties>
            <Property Name="LastDesignedOfficeItemType" Type="System.String">None</Property>
          </Properties>
        </MethodInstance>
      </MethodInstances>
    </Method>
  </Methods>
  <AssociationGroups>
    <AssociationGroup Name="QuestionSegment-Question">
      <AssociationReference AssociationName="usp_GetQuestionsBySegment AssociationNavigator Instance" Reverse="false" EntityNamespace="stackexchange.so" EntityName="Question" />
    </AssociationGroup>
  </AssociationGroups>
  <Actions>
    <Action Position="1" IsOpenedInNewWindow="false" Url="http://sp2013lab:80/sites/ECT/_bdc/stackexchange_so/Question_2.aspx?ID={0}" ImageUrl="/_layouts/15/1033/images/viewprof.gif" Name="View Profile">
      <LocalizedDisplayNames>
        <LocalizedDisplayName LCID="1033">View Profile</LocalizedDisplayName>
      </LocalizedDisplayNames>
      <Properties>
        <Property Name="IsTaskpaneAction" Type="System.Boolean">true</Property>
        <Property Name="Office Version" Type="System.String">15</Property>
      </Properties>
      <ActionParameters>
        <ActionParameter Index="0" Name="ID[0]">
          <Properties>
            <Property Name="IdOrdinal" Type="System.Byte">0</Property>
          </Properties>
        </ActionParameter>
      </ActionParameters>
    </Action>
  </Actions>

</Entity>

I'll enumerate the differences between this Entity Definition and the prior generation Entity Definition:

  • The Finder Method has been removed.
    • This ensures that the crawler won't crawl this entity directly, which would undercut all our good work to segment the data crawl.  Remember, the crawler looks for Entities to crawl that either have the RootFinder property on a Finder or have both a SpecificFinder and a Finder method defined.
  • The ChangedIdEnumerator and DeletedIdEnumerator methods have been removed.
    • Even if they are provided, the crawler won't call them.  
  • A new Association Method is defined to represent the Association from the QuestionSegment Entity to the Question entity.  The AssociationMethod has a property named DirectoryLink.
    • This is the whole purpose of the new model.  
    • The presence of the DirectoryLink causes the Crawler to treat the Source of the Association as a Directory or Container.
    • Each Container Enumeration is processed independently of other Container Enumerations.  This is what gives us the multiple, smaller result sets that enable the Crawler to use less memory and survive the encounter.
  • We have a Filter and Parameter on the new Association method
    • <FilterDescriptor Type="Input" Name="LastCrawlTime">
        <Properties>
          <Property Name="CrawlStartTime" Type="System.String"></Property>
        </Properties>
      </FilterDescriptor>
    • This is in support of the incremental crawl.  
    • The CrawlStartTime property causes SharePoint to provide the time at which the previous crawl of the current crawl type was performed, except for Full Crawls.  For those, I've seen either '1900-01-01 00:00:00' or '1899-12-31 18:00:00' passed into the filter.
      • The significance here is that the first Incremental Crawl will function like a Full Crawl in that the same CrawlStartTime value is passed in.
  • We have a new In Parameter specified
    • <Parameter Direction="In" Name="@lastRunDate">
        <TypeDescriptor TypeName="System.DateTime" AssociatedFilter="LastCrawlTime"
        Name="lastModifiedTime">
          <Interpretation>
            <NormalizeDateTime LobDateTimeMode="Local" />
          </Interpretation>
        </TypeDescriptor>
      </Parameter>
    • This is in support of the incremental crawl.  
    • This takes the filter value and associates it with the parameter, passing it to the backend where we can use it in our Stored Procedure to limit our results.
  • We have also defined the 'LastModifiedTimeStampField' property.
    • This enables the crawler to perform the incremental crawl even on the first incremental run.  It will compare this field's value to the records already present in the index.  Having it present lets the Crawler avoid replacing all of the data it read in the Full Crawl, which speeds up the process.
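
To illustrate how the two In parameters might work together, here is a hedged Python sketch of the filtering a stored procedure such as usp_GetQuestionsBySegment could perform. The procedure body is not shown in this post, so the logic (compare LastEditDate, falling back to CreationDate, against the supplied date) is an assumption, as are the function name, the segment bounds and the sample rows.

```python
from datetime import datetime
from typing import Dict, List, Tuple

# Date seen being passed to the filter when no earlier crawl time applies.
FULL_CRAWL_DATE = datetime(1900, 1, 1)

# Hypothetical segment table: segmentNumber -> (lowerID, upperID).
SEGMENTS: Dict[int, Tuple[int, int]] = {1: (1, 1000), 2: (1001, 2000)}

# Hypothetical rows standing in for the Posts data.
SAMPLE = [
    {"ID": 4, "CreationDate": datetime(2008, 7, 31), "LastEditDate": datetime(2015, 2, 1)},
    {"ID": 6, "CreationDate": datetime(2008, 7, 31), "LastEditDate": None},
    {"ID": 1500, "CreationDate": datetime(2008, 8, 5), "LastEditDate": None},
]


def questions_by_segment(rows: List[Dict], segments: Dict[int, Tuple[int, int]],
                         segment_number: int, last_run_date: datetime) -> List[Dict]:
    """Return the segment's rows that changed after last_run_date."""
    lower_id, upper_id = segments[segment_number]
    selected = []
    for row in rows:
        if not lower_id <= row["ID"] <= upper_id:
            continue  # row belongs to a different segment
        # Assumption: a row that was never edited counts as modified
        # at its creation time.
        modified = row["LastEditDate"] or row["CreationDate"]
        if modified > last_run_date:
            selected.append(row)
    return selected


if __name__ == "__main__":
    full = questions_by_segment(SAMPLE, SEGMENTS, 1, FULL_CRAWL_DATE)
    incremental = questions_by_segment(SAMPLE, SEGMENTS, 1, datetime(2015, 1, 1))
    print([row["ID"] for row in full], [row["ID"] for row in incremental])
```

With the 1900 date every row in the segment qualifies, which matches the observation that the first Incremental Crawl behaves like a Full Crawl; a later date narrows the result to the rows edited since.
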
The other Entities all follow this pattern.  The entire model is available for download here.

5/7/2017- Link changed to github