How to parse HTML table using jsoup?

What I would do in your case is first create a class for your machine with all the appropriate attributes. Then, using jsoup, I would extract the data into an ArrayList of those objects and use ordinary Java logic to get what I need from the list.

I am skipping the class definition (since it is not the issue here) and will call the class Machine.
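
For completeness, here is a minimal sketch of what such a class might look like. The fields are assumptions inferred from the setters and getters used in this answer (cluster name, IP, host name, status); your table may need more.

```java
// Hypothetical Machine bean. The field names are assumptions based on the
// accessors used in this answer, not a definitive model of the table.
class Machine {
    private String clusterName;
    private String ip;
    private String hostName;
    private String status;

    public String getClusterName() { return clusterName; }
    public void setClusterName(String clusterName) { this.clusterName = clusterName; }

    public String getIp() { return ip; }
    public void setIp(String ip) { this.ip = ip; }

    public String getHostName() { return hostName; }
    public void setHostName(String hostName) { this.hostName = hostName; }

    public String getStatus() { return status; }
    public void setStatus(String status) { this.status = status; }
}
```

All fields are plain strings because jsoup's text() returns the cell content as a String; convert to other types only where you actually need them.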

Then using Jsoup I would get the row data like this:

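ArrayList<Machine> list = new ArrayList<>();
Document doc = Jsoup.parse(url, 3000); // url must be a java.net.URL; 3000 is the timeout in milliseconds
for (Element table : doc.select("table")) { //this will work if your doc contains only one table element
  for (Element row : table.select("tr")) {
    Elements tds = row.select("td");
    if (tds.size() < 8) {
      continue; // skip rows without enough <td> cells, e.g. a header row built from <th>
    }
    Machine tmp = new Machine();
    tmp.setClusterName(tds.get(3).text());
    tmp.setIp(tds.get(4).text());
    tmp.setHostName(tds.get(5).text()); // assuming column 5 holds the host name
    tmp.setStatus(tds.get(7).text());
    //.... and so on for the rest of attributes
    list.add(tmp);
  }
}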

Then use a loop to get the values you need from the list:

for(Machine x:list){
  if(x.getStatus().equalsIgnoreCase("up")){
    //machine with UP status found
    System.out.println("The Machine with up status is:"+x.getHostName());
  }
}

That's all. Please note that this code is untested and may contain syntax errors, as I wrote it directly in this editor and not in an IDE.


Yes, it is possible with jsoup. First, select the table, then select its <tr> tags to get the rows. Start from the second row (index 1), since the first row contains only the column names. For each row, select the <td> cells and read the ones you need by index. In your case, indexes 7 (Status) and 5 (Host Name) are the important ones. Check whether the status equals down and, if it does, add the host name to a list. That's all.

ArrayList<String> downServers = new ArrayList<>();
Element table = doc.select("table").get(0); //select the first table.
Elements rows = table.select("tr");

for (int i = 1; i < rows.size(); i++) { //first row is the col names so skip it.
    Element row = rows.get(i);
    Elements cols = row.select("td");

    if (cols.get(7).text().equals("down")) {
        downServers.add(cols.get(5).text());
    }
}

Update: When you find the cluster name Titan, you can run another loop over the following rows for as long as their cluster name is empty.

Edit: I changed the while loop to a do-while loop.

    ArrayList<String> downServers = new ArrayList<>();
    Element table = doc.select("table").get(0); //select the first table.
    Elements rows = table.select("tr");

    for (int i = 1; i < rows.size(); i++) { //first row is the col names so skip it.
        Element row = rows.get(i);
        Elements cols = row.select("td");

        if (cols.get(3).text().equals("Titan")) {
            if (cols.get(7).text().equals("down"))
                downServers.add(cols.get(5).text());

            int start = i; //index of the row that names the cluster.
            do {
                if (i == rows.size() - 1)
                    break; //no more rows to look at.
                i++;
                row = rows.get(i);
                cols = row.select("td");
                if (cols.get(7).text().equals("down") && cols.get(3).text().equals("")) {
                    downServers.add(cols.get(5).text());
                }
            }
            while (cols.get(3).text().equals(""));

            //step back only if we stopped on a row that names a cluster, so the
            //outer loop re-examines it (e.g. two Titan names consecutively).
            //without this check a Titan entry in the last row would loop forever.
            if (i > start && !cols.get(3).text().equals(""))
                i--;
        }
    }

The downServers list will then contain the host names of the down servers.
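
The same grouping can also be expressed without an inner loop by remembering the last non-empty cluster name while walking the rows once. This is a different technique from the do-while above, shown only as a sketch: it assumes each row has already been reduced to three strings (cluster name, host name, status), and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class ClusterDownHosts {
    // Each row is {clusterName, hostName, status}. clusterName is "" on rows
    // that belong to the cluster named by an earlier row (an assumption about
    // the table layout described in the update above).
    static List<String> downHosts(List<String[]> rows, String cluster) {
        List<String> down = new ArrayList<>();
        String current = ""; // last non-empty cluster name seen so far
        for (String[] row : rows) {
            if (!row[0].isEmpty()) {
                current = row[0]; // this row starts a new cluster group
            }
            if (current.equals(cluster) && row[2].equals("down")) {
                down.add(row[1]);
            }
        }
        return down;
    }
}
```

With jsoup you would build each String[] from cols.get(3).text(), cols.get(5).text() and cols.get(7).text() and then call downHosts(rows, "Titan").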


Below is a generic function that extracts an HTML table into a list of maps, one map per row, keyed by column header.

Pass it the document and tableOrder, the zero-based index of the table you want on the page.

The function will not return accurate data if the table uses rowspan or colspan.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public static List<Map<String,String>> parseTable(Document doc, int tableOrder) {
    Element table = doc.select("table").get(tableOrder); // tableOrder is zero-based
    Elements rows = table.select("tr");
    Elements first = rows.get(0).select("th,td"); // the first row supplies the column names

    List<String> headers = new ArrayList<String>();
    for(Element header : first)
        headers.add(header.text());

    List<Map<String,String>> listMap = new ArrayList<Map<String,String>>();
    for(int row=1;row<rows.size();row++) {
        Elements colVals = rows.get(row).select("th,td");
        // ignore cells beyond the last header so headers.get() cannot go out of range
        int colCount = Math.min(colVals.size(), headers.size());

        Map<String,String> tuple = new HashMap<String,String>();
        for(int col=0;col<colCount;col++)
            tuple.put(headers.get(col), colVals.get(col).text());
        listMap.add(tuple);
    }
    return listMap;
}
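
Because each returned map is keyed by header text, finding the down hosts becomes a plain filter over the list. The sketch below assumes the header cells read exactly "Status" and "Host Name" (the column names mentioned in the earlier answer); the class name, helper methods and sample host names are made up for illustration, and the rows are built by hand to stand in for the output of parseTable.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DownHostFilter {
    // Returns the "Host Name" of every row whose "Status" matches, ignoring case.
    static List<String> hostsWithStatus(List<Map<String, String>> rows, String status) {
        List<String> hosts = new ArrayList<>();
        for (Map<String, String> row : rows) {
            String rowStatus = row.get("Status"); // null if the column is missing
            if (rowStatus != null && rowStatus.trim().equalsIgnoreCase(status)) {
                hosts.add(row.get("Host Name"));
            }
        }
        return hosts;
    }

    // Builds one sample row shaped like a map returned by parseTable.
    static Map<String, String> row(String hostName, String status) {
        Map<String, String> tuple = new HashMap<>();
        tuple.put("Host Name", hostName);
        tuple.put("Status", status);
        return tuple;
    }

    public static void main(String[] args) {
        List<Map<String, String>> rows = new ArrayList<>();
        rows.add(row("web-01", "up"));
        rows.add(row("web-02", "down"));
        System.out.println(hostsWithStatus(rows, "down")); // prints [web-02]
    }
}
```

In real use you would pass the list returned by parseTable(doc, 0) instead of the hand-built rows.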