• Marek Vavrovic

Power Query - Extracting web data using URL parts - Building functions to fetch web data.

Updated: Oct 22

Web source: Database index (planecrashinfo.com)


Goal is to extract the accident details for each year and date. That means get a access to level 3 data.

Step 1

Bring the Accident Details page in.


Step 2

Pivot

Go to: Transform > Pivot > Advanced options > Don't Aggregate


When data has been pivoted you can decide which columns (15) you want to keep or remove. You can also do data cleaning transformations. I will not demonstrate that, my goal is just focus on extracting the data.

Step 3

Get years.

URL : http://www.planecrashinfo.com/database.htm


Unpivot Columns

Keep just column with years, remove bottom rows (8 empty rows)

By now I have the access to level 1 data. Next task is to get access to level 2. Through level 2 I will be able to reach on data in level 3.

URL : http://www.planecrashinfo.com/1920/1920.htm


Step 4

Get Level 2 data. Remove all the columns and keep just one, dates. I will use these data to build a function to unlock data in level 3.


Step 5

After I have removed all the other columns I transform this dataset into a function.

Step 6


(Year)=>

let

Source = Web.Page(Web.Contents("http://www.planecrashinfo.com/"&Year&"/"&Year&".htm")),


As the URL is a text, Year variable must be a text too.

Step 7

In step 4 I extracted LEVEL 2 data. I will return them back here using tblHelp function.

Step 8

For year 1920 there are 51 date records in that table. I will add an index to each date.

Step 9


We add an Index to the Date column.

=>Add custom column


=Table.AddIndexColumn([Custom],"Index",1,1)

Expand IndexingCustomColumn column.

Remove unnecessary columns.

Now we have this key to get to access to Level 3 data, AccidentDetails.


Step 10

I convert this dataset into a function.

(year, index)=>

let

Source = Web.Page(Web.Contents("http://www.planecrashinfo.com/"&year&"/"&year&"-"&index&".htm"))


This function needs 2 arguments, which will be supplied form AirAcciddents table.

Step 11

Add Custom column, use AccidentDetails function.

Expand the column names from the FinalTable and bring them in.

Now I have all the accident details in one table.


Conclusion: It takes 5 hours to Power BI Desktop to load all the rows. MS Excel can load these data in 1 hour.





11 views0 comments

Recent Posts

See All

Index

Information functions Filter functions Table manipulation functions Grouping and summarizing Time intelligence functions OPENINGBALANCEYEAR OPENINGBALANCEYEAR(<Expression>,<Dates> [,<Filter>][,<YearEn