Exploring a new data source

Article

•

May 17, 2023

Exploring a new dataset with the open source business dataset from Bigpicture.io

Rich Gibson, Engineer, Hexvarium

May 27, 2023

Exploring the open source business dataset from Bigpicture.io

Our job at Hexvarium is to use data to help our customers to make better decisions about building fiber networks. That means that anytime we see a new datasource we want to pull it into our tools and take a look.

On May 17th, 2023 BIGPICTURE announced their release of the World’s largest open source company dataset

I don’t know the channels by which our CEO learned of this release, but he promptly put a message on our developers Slack channel asking if we could pull it for the US, saying it ’looks interesting.' I have an unfortunate tendency to take statements of desire, like this, as commands, and so I promptly downloaded the data. And I thought that perhaps our process of loading and exploring new datasets might be of more general interest.

A link to download the data and the schema is in their blog post announcing the availability of the data. Once the data is downloaded, and uncompressed, it is ready to load into Google’s BigQuery

Loading BigPicture data Google BigQuery

The ‘obvious’ way to load data into BigQuery is to use the –autodetect flag. Unfortunately, BigQuery treats ZIP Codes and Census Blocks as integers, thus dropping leading zeroes.

A little trick is to make a copy of the data file with just the header and a few lines of data. Load that file with –autodetect, then download the autodetected schema and edit it manually, and use the edited schema to load the full data file.

Make a copy of the file and edit it manually to be just a few rows long. Or you can use the ‘head’ tool from a Linux or OSX command line.

head -10 companies-dataset-2023-02-ckgENv.csv > small.csv

Upload the small file with –autodetect

bq load --source_format=CSV --skip_leading_rows=1  --autodetect wip.small  small.csv

Retrieve the schema from BigQuery

bq show --schema --format=prettyjson wip.small > bigpicture.schema

In this case the autodetect created schema is correct! So we can load the full file into BigQuery.

bq load --source_format=CSV --skip_leading_rows=1 wip.companies-dataset-2023-02 companies-dataset-2023-02-ckgENv.csv bigpicture.schema

Exploring the data.

A query to show the distribution of the data by country.

select country_code, count(country_code) as cnt 
from `wip.companies-dataset-2023-02`
group by country_code
order by country_code;

The top ten countries represented in the data

country_code	cnt
US	4059985
GB	1232403
IN	885302
FR	756411
BR	664235
DE	467501
NL	414433
ES	410372
CA	389535

Full results of the count by country

The free dataset has relatively few fields, but their paid dataset, which they describe in the blog post has a number of additional fields.

About Hexvarium

Hexvarium is a broadband service provider using proprietary Data Science to accurately identify, deliver, and connect profitable networks across all geographies. Building sustainable broadband networks is a complex puzzle of continually changing factors. Yet all other designers, engineers, and operators deploy capital using static, antiquated methods and knowledge of each community. Hexvarium’s approach delivers sustainable networks, even in the most challenging circumstances. For more information, please visit hexvarium.com.

Media Contact:

Jennifer Spoerri

Gallagher PR

jennifer@gallagherpr.com

(415) 577-0171