Migrating (any) Site Content to Sitecore Content Hub One (2024)

August 25, 2023

Sergey Yatsenko

Sr. Director, Sitecore MVP

Introduction

This post describes some of my recent experiments with the Content Hub One JavaScript SDK. In this case, I needed to import all blogs written by XCentium employees over the last decade from an older website into Content Hub One and convert them to Markdown.

These blog posts were written over more than a decade, and their source content varied greatly, which made a simple scripted migration from one CMS to another very hard. Instead, I opted to screen-scrape the blog contents directly from the original website and convert them to Markdown, which is the source format of choice for the xBlog module that I created with Andre Moraes earlier this year.

This post, plus Updating Content Hub One Content with JavaScript SDK and a Little Help from AI, can be a good starting point for someone just starting with Content Hub One JavaScript APIs.

Implementation approach

The approach I took is fairly simple and comes down to the following steps:

List all blog post URLs from the source site (I chose to use sitemap.xml from the source site)
Iterate through the list of blog URLs and for each blog, perform the following steps:
- Read the contents of the blog page, extract text content and images from the page body
- Upload all images to Content Hub One, publish them, and retrieve their public links
- Extract page contents using the Puppeteer library for Node.js
- Convert HTML content to Markdown using Turndown
- Improve blog format with ChatGPT (e.g. re-arrange titles and paragraphs, format code sections, etc.)
- Save the prepared blog content to Content Hub One and move on to the next blog

This might sound like a lot of work, but I found most of these steps to require only a handful of lines of code.
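As an illustration of how small these steps can be, here is a minimal sketch of the first one: pulling blog URLs out of a sitemap.xml document. The function name `extractBlogUrls`, the sample sitemap, and the `/blog/` path filter are assumptions made for this example, not code from the actual migration.

```javascript
// Extract every <loc> URL from a sitemap.xml string and keep only blog posts.
// The "/blog/" filter is an assumption; adjust it to the source site's URLs.
function extractBlogUrls(sitemapXml, pathFilter = "/blog/") {
  const urls = [];
  const locPattern = /<loc>\s*([^<\s]+)\s*<\/loc>/g;
  for (const match of sitemapXml.matchAll(locPattern)) {
    // Undo the most common XML escape found in sitemap URLs
    const url = match[1].replace(/&amp;/g, "&");
    if (url.includes(pathFilter)) {
      urls.push(url);
    }
  }
  return urls;
}

// Hypothetical sample data for demonstration
const sampleSitemap = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/blog/first-post</loc></url>
  <url><loc> https://example.com/blog/second-post </loc></url>
</urlset>`;

console.log(extractBlogUrls(sampleSitemap));
```

In a real script, the XML string would come from the source site, for example via the global `fetch` available in Node.js 18 and later, followed by `response.text()`.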

The most challenging part in my case was extracting the actual content from many different content pages that looked similar to each other but were written in different formats and came from different content sources. I’ll skip that part because it is unique to each source website and may or may not be a challenge in your particular case.

Importing site content into Markdown

I found the Puppeteer library for Node.js to be a super helpful tool for screen-scraping the site contents. After installing the puppeteer npm package, it can be used like so to read page content:

const puppeteer = require("puppeteer");
//...
const browser = await puppeteer.launch();
const page = await browser.newPage();
//...
await page.goto(url, { waitUntil: "networkidle0" });
const pageHtml = await page.content();

Extracting the images

The code snippet below reads all images from the source HTML and saves them to an array, to be uploaded to Content Hub One later:

// `$` is assumed to be a jQuery-style API (such as cheerio);
// getFileName, pagesRoot, and fileName are defined elsewhere in the script.
const images = $(pageHtml).find("img");
const imageItems = [];
images.each((i, img) => {
  const imageUrl = $(img).attr("src");
  // Keep only images with an absolute http(s) URL
  if (imageUrl?.trim() && imageUrl.startsWith("http")) {
    const imageName = getFileName(imageUrl);
    const folderPath = `${pagesRoot}/${fileName}.images`;
    const imagePath = `${folderPath}/${imageName}`;
    imageItems.push({
      src: imageUrl,
      localFolder: folderPath,
      localPath: imagePath,
      alt: $(img).attr("alt"),
    });
  } else {
    // Drop images that cannot be migrated (relative or empty src)
    $(img).remove();
  }
});

The above code is just a high-level gist - I had to write a bit more code to ensure I’m only reading the blog content and ignoring all site headers, footers, promos, and other things unrelated to actual blog content.
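The `getFileName` helper called in the image snippet is not shown in this post. A minimal sketch of what such a helper could look like follows; the fallback name `image` and the character clean-up rule are assumptions for illustration only.

```javascript
// Hypothetical implementation of the getFileName helper used above:
// returns a file-system-friendly name derived from an image URL.
function getFileName(imageUrl) {
  // URL parsing drops the ?query and #hash parts for us
  const { pathname } = new URL(imageUrl);
  const lastSegment = pathname.split("/").filter(Boolean).pop() || "image";
  let decoded = lastSegment;
  try {
    decoded = decodeURIComponent(lastSegment); // e.g. "%20" becomes " "
  } catch {
    // Malformed escape sequence: keep the raw segment
  }
  // Replace anything other than letters, digits, "_", "." and "-"
  return decoded.replace(/[^\w.-]+/g, "-");
}

console.log(getFileName("https://example.com/media/My%20Photo.png?w=200#top"));
// prints "My-Photo.png"
```

Note that two different URLs can still end in the same file name, so a real migration may need to add a prefix or a counter to keep local paths unique.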

Now we need to convert the page to Markdown. For this, I opted to use Turndown, which can be installed with npm install turndown.

Converting clean HTML to Markdown is extremely simple with Turndown. Here are a few lines that might help you get started:

var TurndownService = require('turndown');
var turndownService = new TurndownService();
var markdown = turndownService.turndown(pageHtml);

Pushing Blog Content and Images to Content Hub One

I outlined the process of installing and setting up Content Hub One JavaScript client in this blog post: Updating Content Hub One Content with JavaScript SDK and a Little Help from AI, so I’ll skip those steps here.

Below is a code snippet that uploads and publishes a media item (an image or file) to Content Hub One. I'm using webcrypto to generate random UUIDs for new Content Hub items.

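import { webcrypto } from "node:crypto";
// MediaItem and FileUploadSource come from the Content Hub One SDK
// (client setup is covered in the post linked above).

export async function createMediaItem(client, fullPath, itemName, description) {
  const uuid = webcrypto.randomUUID();
  const item = new MediaItem(uuid, {
    name: itemName,
    description: description,
  });
  // Upload the local file as a new media item, then publish it
  await client.media.createAsync(item, new FileUploadSource(fullPath));
  console.log("created media item id: ", uuid);
  await client.media.publishAsync(uuid);
  console.log("published media item id: ", uuid);
}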

Next, I need to read those images back to get their public links, which are then injected into the blog content:

//...
const mediaItems = await getMediaItems(client);
const mediaItemsLookup = mediaItems.map((mediaItem) => {
  return {
    id: mediaItem.id,
    name: mediaItem.name,
    description: mediaItem.description,
    url: mediaItem.file?.uri,
  };
});
//...
export async function getMediaItems(client) {
  const allItems = [];
  let getMore = true;
  let pageNumber = 1;
  while (getMore) {
    const response = await client.media.getAsync(
      new ContentItemSearchRequest().withPageNumber(pageNumber++)
    );
    allItems.push(...response.data);
    getMore = response.totalCount > response.pageNumber * 20;
  }
  return allItems;
}
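The loop in `getMediaItems` relies on a hard-coded page size of 20. As a hedged alternative, the sketch below stops once the number of collected items reaches `totalCount`, so it does not depend on the page size. The names `getAllPages` and `makeFakeFetchPage` are hypothetical, and the stub stands in for the real client so the example can run on its own; only the `data` and `totalCount` response properties are taken from the snippet above.

```javascript
// Collect every page from a paged API.
// fetchPage(pageNumber) must resolve to { data: [...], totalCount: number }.
async function getAllPages(fetchPage) {
  const allItems = [];
  let pageNumber = 1;
  while (true) {
    const response = await fetchPage(pageNumber++);
    allItems.push(...response.data);
    // Stop when everything reported has been collected,
    // or when a page comes back empty (guards against an endless loop)
    if (response.data.length === 0 || allItems.length >= response.totalCount) {
      break;
    }
  }
  return allItems;
}

// Hypothetical stand-in for the real client, for demonstration only
function makeFakeFetchPage(items, pageSize) {
  return async (pageNumber) => {
    const start = (pageNumber - 1) * pageSize;
    return {
      data: items.slice(start, start + pageSize),
      totalCount: items.length,
    };
  };
}

getAllPages(makeFakeFetchPage(["a", "b", "c", "d", "e"], 2)).then((items) =>
  console.log(items)
);
```

With the real client, `fetchPage` would wrap the `client.media.getAsync(...)` call shown above.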

I will skip the code that massages the blog content and replaces the original image URLs with their new public links in Content Hub One, as this would be a bit much for a single blog post. So, moving on to the final step: creating the content item in Content Hub One:
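For readers who want a starting point for the URL replacement, here is a minimal sketch of one way it could be done on the Markdown output. It is not the code used in the migration: the function name `replaceImageUrls`, the `Map` of old-to-new URLs, and the sample URLs are all assumptions made for illustration.

```javascript
// Rewrite Markdown image links using a Map of original URL -> new public URL.
// Images whose URL is not in the map are left untouched.
function replaceImageUrls(markdown, urlMap) {
  // Matches ![alt](url) and ![alt](url "title")
  const imagePattern = /!\[([^\]]*)\]\((\S+?)(\s+"[^"]*")?\)/g;
  return markdown.replace(imagePattern, (whole, alt, url, title = "") => {
    const newUrl = urlMap.get(url);
    return newUrl ? `![${alt}](${newUrl}${title})` : whole;
  });
}

// Hypothetical sample data for demonstration
const urlMap = new Map([
  ["https://old.example.com/img/logo.png", "https://cdn.example.com/logo.png"],
]);
const sampleMarkdown =
  "Intro ![Logo](https://old.example.com/img/logo.png) and " +
  "![Other](https://elsewhere.example.com/x.png)";

console.log(replaceImageUrls(sampleMarkdown, urlMap));
```

The map itself could be built by pairing each entry in `imageItems` with the matching entry in `mediaItemsLookup`, for example by media item name.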

export async function createItem(
  client,
  contentTypeId,
  contentItem,
  publishItem
) {
  const result = await client.contentItems.createAsync(
    contentTypeId,
    contentItem
  );
  console.log("created content item id: ", result?.id);
  // Publish only when requested and the item was actually created
  if (result && result.id && publishItem) {
    await client.contentItems.publishAsync(result.id);
    console.log("published content item id: ", result.id);
  }
}

The ContentItem object represents a Content Hub One content item, which can be initialized with code like the snippet below. In this case, I have a number of custom string and reference fields added to my Blog Post type in the Content Hub One schema:

postSlug, postTitle, and postDescription are custom string fields
postTags and postAuthors are reference fields pointing to lists of blog tags and authors
heroImages and thumbnailImages are reference fields pointing to media items associated with a given blog post

const blogContentItem = new ContentItem(uuid, {
  name: blogSlugName,
  fields: {
    // String values are cut to 998 characters to stay within the field limit
    postSlug: new ShortTextField(
      blogSlugName ? blogSlugName.substring(0, 998) : ""
    ),
    postTitle: new ShortTextField(
      blogMetadataListingItem.titleText
        ? blogMetadataListingItem.titleText.substring(0, 998)
        : ""
    ),
    postDescription: new ShortTextField(
      blogMetadataListingItem.description
        ? blogMetadataListingItem.description.substring(0, 998)
        : ""
    ),
  },
});
blogContentItem.fields.postTags = new ReferenceField(tags);
blogContentItem.fields.postAuthors = new ReferenceField(authors);
// moment is the Moment.js date library, imported elsewhere in the script
const isoDate = moment(blogMetadataDetailsItem.date).toISOString();
blogContentItem.fields.createdDate = new DateTimeField(isoDate);
//...
blogContentItem.fields.heroImages = new MediaField([{ id: titleImage.id }]);
blogContentItem.fields.thumbnailImages = new MediaField([
  { id: thumbnailImage.id },
]);
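The "empty string if missing, otherwise cut to 998 characters" pattern appears three times in the snippet above. A small helper could fold it into one place; `toShortText` is a hypothetical name, and the 998 default simply mirrors the value used above.

```javascript
// Hypothetical helper: returns "" for missing values and trims long ones.
// The 998-character default mirrors the limit used in the snippet above.
function toShortText(value, maxLength = 998) {
  return value ? String(value).substring(0, maxLength) : "";
}

console.log(toShortText(undefined).length); // prints 0
console.log(toShortText("x".repeat(2000)).length); // prints 998
```

With it, a field assignment would read `postSlug: new ShortTextField(toShortText(blogSlugName))`.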

Useful Links

xBlog module
Updating Content Hub One Content with JavaScript SDK and a Little Help from AI
Getting started with Content Hub One JavaScript SDK