The Fubra Blog
Furnish.co.uk – how we automatically categorise content
From the outside, furnish.co.uk looks pretty simple: thousands of fabulous interiors products from many different stores, all easily searchable and nicely categorised.
But, in the background, there’s some seriously clever stuff going on. We automatically scrape stores’ websites (with their permission, of course) or scan their feeds at regular intervals. We then put this data through heavy processing to determine category, style, colour, materials, etc. We also do some leading-edge work to make our custom-developed search engine produce super-accurate results.
For me, the most important thing about furnish.co.uk is that when a user types “wooden coffee tables” into the search or navigates to coffee tables or table lamps, that’s precisely what they get. They don’t see a coffee jug under coffee tables. I know of no sites that do this particularly well.
So, I thought I’d write about how we achieve some of this stuff – to start with, how we automatically and accurately categorise the thousands of products into our category hierarchy; how we know a bed is a bed, a wall light is a wall light, etc. I’m not the main technical guy at furnish.co.uk – that’s Alan – so I’ll keep things at a relatively non-technical level. But, hopefully, you’ll find it pretty interesting.
Categories and clones
We’ve constructed a hierarchy of hundreds of categories, covering every interiors product for the home. We did this by analysing other sites that sell home interiors products (e.g. John Lewis, Graham and Green, Marks and Spencer, Heals) and then putting together something that we thought worked.
We decided to give users multiple ways to navigate through the category hierarchy to the same product, for ease of use. However, for simplicity, each product only actually has a single category. This works because we have ‘clone categories’, where a category can exist in multiple places. For example, Rooms -> Bedrooms -> Bedroom furniture -> Bedside tables shows the same items as Products -> Furniture -> Bedroom furniture -> Bedside tables. One is a clone of the other, which means the item itself only needs a single category.
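To make the clone idea concrete, here is a minimal Python sketch. It is not our actual implementation: the names `CLONES`, `ITEMS_BY_CATEGORY`, `resolve` and `items_for` are hypothetical, the item names are made up, and treating the Products path as the ‘real’ category is an assumption for the example.

```python
# Hypothetical sketch: a clone category is just a pointer to a real category.
REAL = "Products -> Furniture -> Bedroom furniture -> Bedside tables"
CLONE = "Rooms -> Bedrooms -> Bedroom furniture -> Bedside tables"

# Clone path -> real category path (assumption: Products is the real one).
CLONES = {CLONE: REAL}

# Each item is stored once, under its single real category.
ITEMS_BY_CATEGORY = {REAL: ["Oak bedside table", "White two-drawer bedside table"]}


def resolve(path):
    """Return the real category for a path, following a clone if needed."""
    return CLONES.get(path, path)


def items_for(path):
    """Return the items shown when a user navigates to the given path."""
    return ITEMS_BY_CATEGORY.get(resolve(path), [])
```

Both navigation routes then return the same list, even though every item is stored under only one category.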
Assigning categories
Based on the above, we automatically assign categories to items. For each item imported, we do the following:
- We pull out several fields from the item that we think will give us a hint as to its category, and we prioritise these fields. For example, we may think that the item name is most likely to give us a hint, followed by the description. The choice of fields and their priority varies by supplier; some suppliers actually have a categories field that we can use.
- Next, we take each of these fields in turn (highest priority first) and attempt to determine the category. We do this using a huge synonyms library that we’ve painstakingly put together from scratch, where each category has a set of associated synonyms. You can see a screenshot from our back-end system below. Each synonym is also prioritised: the system finds all synonyms contained within the field that it’s analysing, then chooses the one with the highest priority. If the field contains multiple synonyms with equal priority, it uses the one that comes first in the field. There are also negative synonyms, i.e. words that rule a category out: an item cannot be in that category if it contains any of them.
- If there are no matches, it moves on to the next field within the item to see if that contains a synonym.
- This is repeated for all items being imported. On rare occasions, no synonyms are found and an item ends up with no category. Such items are not published.
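The steps above can be sketched in a few lines of Python. This is a simplified illustration, not our production code: the `Category` class, the `categorise` function, the example categories and their priorities are all hypothetical. It also makes two assumptions the description doesn't spell out: synonyms are matched as plain lower-case substrings, and negative synonyms are checked against the field currently being analysed.

```python
from dataclasses import dataclass, field


@dataclass
class Category:
    name: str
    synonyms: dict  # synonym -> priority (higher number wins)
    negatives: list = field(default_factory=list)  # words that rule this category out


def categorise(item, field_order, categories):
    """Return a category name for the item, or None if nothing matches."""
    for field_name in field_order:  # highest-priority field first
        text = (item.get(field_name) or "").lower()
        candidates = []
        for cat in categories:
            # Negative synonyms: skip the category if the field contains one.
            if any(neg in text for neg in cat.negatives):
                continue
            for synonym, priority in cat.synonyms.items():
                position = text.find(synonym)
                if position != -1:
                    # Sort key: highest priority first, then earliest position.
                    candidates.append((-priority, position, cat.name))
        if candidates:
            return min(candidates)[2]
        # No match in this field: fall through to the next one.
    return None  # uncategorised items would not be published


# Made-up example data.
CATEGORIES = [
    Category("Coffee tables", {"coffee table": 10}),
    Category("Kitchenware", {"coffee": 5, "jug": 8}),
    Category("Table lamps", {"table lamp": 10}, negatives=["bulb"]),
]
FIELDS = ["name", "description"]

table = {"name": "Oak coffee table", "description": "Solid oak."}
jug = {"name": "Model X200", "description": "Glass coffee jug"}
bulb = {"name": "Spare bulb for table lamp"}
```

With this data, `categorise(table, FIELDS, CATEGORIES)` picks "Coffee tables" because "coffee table" outranks the lower-priority "coffee". The jug has no match in its name, so it is categorised from its description. The spare bulb is left without a category because "bulb" is a negative synonym for "Table lamps".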
We’ve found that our processing and synonyms library are good enough that a new supplier coming on board tends to get 95% of its products categorised correctly straight away. We then manually tweak the synonyms to ensure 100% accuracy for the supplier.
All pretty elaborate, but the result is clear and accurate navigation. We also use some of this information for the search index, but more on that in a future post.
Sitemap Validation
Introduction
Hello, my name is Alex Buell. I am profoundly deaf and work as a Linux
system administrator within the Fubra infosphere. I usually spend most
of my time working on open source projects, giving extra value back to
the community in the form of tools that allow us to do our job.
What are sitemaps?
Sitemaps provide a way for webmasters (people who run
websites) to give out information about the content on their websites.
Search engines (e.g. www.google.co.uk) crawl through
websites to build up indexes that allow people to search for the things
they are interested in.
Howto: Setup a Mac Mini as a BGP Router
Thinking Differently… An update on our Mac Mini Routers at LINX
We have been quiet for a while on the subject of the Mac Minis we installed into LINX at Telehouse several months ago…
You may remember the previous article. Basically, we are using a pair of Mac Mini computers to connect our hosting platform to the LINX Internet exchange in London.
Open Source Windows Applications
Who said there was no such thing as a free lunch? In my experience, some of the best Windows programs are in fact the ones that are completely free (WinSCP, PuTTY, Firefox, etc.), which is why I like this idea so much. The OpenCD is a collection of high-quality free and open source software designed to run on Windows. If you don’t feel brave enough to make the switch to Linux for your main system, then this is a good introduction to open source software that will run on your current Windows machine.
Freeware CD burning
Do you find it a pain having to pay for a CD burning program like Nero when you just want to rip or burn an ISO file? Well, I’ve found the answer… Check out ISO Recorder 2 Beta.

