Exploring a new data source
Article
•
May 17, 2023

Exploring a new dataset with the open source business dataset from Bigpicture.io
Rich Gibson, Engineer, Hexvarium
May 27, 2023
Exploring the open source business dataset from Bigpicture.io
Our job at Hexvarium is to use data to help our customers to make better decisions about building fiber networks. That means that anytime we see a new datasource we want to pull it into our tools and take a look.
On May 17th, 2023 BIGPICTURE announced their release of the World’s largest open source company dataset
I don’t know the channels by which our CEO learned of this release, but he promptly put a message on our developers Slack channel asking if we could pull it for the US, saying it ’looks interesting.' I have an unfortunate tendency to take statements of desire, like this, as commands, and so I promptly downloaded the data. And I thought that perhaps our process of loading and exploring new datasets might be of more general interest.
A link to download the data and the schema is in their blog post announcing the availability of the data. Once the data is downloaded, and uncompressed, it is ready to load into Google’s BigQuery
Loading BigPicture data Google BigQuery
The ‘obvious’ way to load data into BigQuery is to use the –autodetect flag. Unfortunately, BigQuery treats ZIP Codes and Census Blocks as integers, thus dropping leading zeroes.
A little trick is to make a copy of the data file with just the header and a few lines of data. Load that file with –autodetect, then download the autodetected schema and edit it manually, and use the edited schema to load the full data file.
Make a copy of the file and edit it manually to be just a few rows long. Or you can use the ‘head’ tool from a Linux or OSX command line.
head -10 companies-dataset-2023-02-ckgENv.csv > small.csv
Upload the small file with –autodetect
bq load --source_format=CSV --skip_leading_rows=1 --autodetect wip.small small.csv
Retrieve the schema from BigQuery
bq show --schema --format=prettyjson wip.small > bigpicture.schema
In this case the autodetect created schema is correct! So we can load the full file into BigQuery.
bq load --source_format=CSV --skip_leading_rows=1 wip.companies-dataset-2023-02 companies-dataset-2023-02-ckgENv.csv bigpicture.schema
Exploring the data.
A query to show the distribution of the data by country.
select country_code, count(country_code) as cnt
from `wip.companies-dataset-2023-02`
group by country_code
order by country_code;
The top ten countries represented in the data
country_code | cnt |
---|---|
US | 4059985 |
GB | 1232403 |
IN | 885302 |
FR | 756411 |
BR | 664235 |
DE | 467501 |
NL | 414433 |
ES | 410372 |
CA | 389535 |
Full results of the count by country
The free dataset has relatively few fields, but their paid dataset, which they describe in the blog post has a number of additional fields.
About Hexvarium
Hexvarium is a broadband service provider using proprietary Data Science to accurately identify, deliver, and connect profitable networks across all geographies. Building sustainable broadband networks is a complex puzzle of continually changing factors. Yet all other designers, engineers, and operators deploy capital using static, antiquated methods and knowledge of each community. Hexvarium’s approach delivers sustainable networks, even in the most challenging circumstances. For more information, please visit hexvarium.com.
Media Contact:
Jennifer Spoerri
Gallagher PR