Page indexing
This feature implements indexing of Content by scraping the corresponding web page and indexing its content into the
configured search engine. It’s implemented for solr and elastic search engines.
Configuration
To enable this feature, set up the page indexing configuration:
netgen_ibexa_search_extra:
page_indexing:
enabled: true
sites:
site1:
tree_root_location_id: '%site1.locations.tree_root.id%'
language_siteaccess_map:
cro-HR: cro
eng-GB: eng
fields:
level_1:
- h1
level_2:
- h2
- h3
- div.short
level_3:
- div.item-short
allowed_content_types:
- ng_article
- ng_frontpage
host: "%env(PAGE_INDEXING_HOST)%"
site2:
tree_root_location_id: '%site2.locations.tree_root.id%'
language_siteaccess_map:
cro-HR: cro
eng-GB: eng
fields:
level_1:
- h1
level_2:
- h2
- h3
- div.short
level_3:
- div.item-short
allowed_content_types:
- ng_landing_page
host: "%env(PAGE_INDEXING_HOST)%"
To activate the feature, set the enabled parameter to true. Define the individual page sites under the sites
array parameter. The available options are:
tree_root_location_idrequiredThis is an integer defining the root Location of the site you are configuring.
language_siteaccess_mapoptionalDefines a map of language codes to siteaccess name. When indexing a Content item in a specific language, the URL for scraping will be generated for the configured siteaccess.
fieldsoptionalDefines a map of indexed fields. Each field contains an array of simplified CSS selectors that will be used to map the text content of the scraped page to the given field.
allowed_content_typesoptionalDefines a list ContentType identifiers for Content items that will be indexed as pages. This is a whitelist, meaning if you don’t configure anything here, nothing will be indexed as page.
hostoptionalIn case when it’s needed to scrape the page from host that is different from the one configured for the siteaccess, you can use this parameter.
For each Content item and language, the configuration will be processed by site sequentially until a match is found. That means you can use different configuration for a particular language under the same tree root.
Note
Note that source from which the text will be extracted must be delimited by XML comments
<!--begin page content--> and <!--end page content-->.
Indexing
This feature is intended for Content items whose pages are built not only from the Content item itself, but from other Content items as well. Typically, these are landing pages of various kind. That means there is not mechanism to reindex the page when one of the participating Content item updates. For that reason, you should to set up periodic reindexing of the pages using the provided command.
Since scraping might not be performant of successful in all cases, you should consider using asynchronous indexing
feature also provided by netgen/ibexa-search-extra.
Command
A command netgen-search-extra:index-pages is provided to reindex all Content items configured for page indexing.
The command is used to execute initial page indexing when the feature is first added to the project. It goes through
all content types specified in the configuration (allowed_content_types) and indexes all existing Content items
of the specified types through their pages:
bin/console netgen-search-extra:index-pages
The command also has an option content-ids, used to reindex only the given Content items by their IDs:
bin/console netgen-search-extra:index-pages --content-ids=24,42