- Marek Vavrovic
Power Query - Extracting web data using URL parts - Building functions to fetch web data.
Updated: Oct 22, 2021
Web source: Database index (planecrashinfo.com)
Goal is to extract the accident details for each year and date. That means get a access to level 3 data.
Bring the Accident Details page in.
Go to: Transform > Pivot > Advanced options > Don't Aggregate
When data has been pivoted you can decide which columns (15) you want to keep or remove. You can also do data cleaning transformations. I will not demonstrate that, my goal is just focus on extracting the data.
URL : http://www.planecrashinfo.com/database.htm
Keep just column with years, remove bottom rows (8 empty rows)
By now I have the access to level 1 data. Next task is to get access to level 2. Through level 2 I will be able to reach on data in level 3.
URL : http://www.planecrashinfo.com/1920/1920.htm
Get Level 2 data. Remove all the columns and keep just one, dates. I will use these data to build a function to unlock data in level 3.
After I have removed all the other columns I transform this dataset into a function.
Source = Web.Page(Web.Contents("http://www.planecrashinfo.com/"&Year&"/"&Year&".htm")),
As the URL is a text, Year variable must be a text too.
In step 4 I extracted LEVEL 2 data. I will return them back here using tblHelp function.
For year 1920 there are 51 date records in that table. I will add an index to each date.
We add an Index to the Date column.
=>Add custom column
Expand IndexingCustomColumn column.
Remove unnecessary columns.
Now we have this key to get to access to Level 3 data, AccidentDetails.
I convert this dataset into a function.
Source = Web.Page(Web.Contents("http://www.planecrashinfo.com/"&year&"/"&year&"-"&index&".htm"))
This function needs 2 arguments, which will be supplied form AirAcciddents table.
Add Custom column, use AccidentDetails function.
Expand the column names from the FinalTable and bring them in.
Now I have all the accident details in one table.
Conclusion: It takes 5 hours to Power BI Desktop to load all the rows. MS Excel can load these data in 1 hour.