/2026/05/31/app-surface-data-collection-notes

Notes on App-Surface Data Collection

Some product data is easy to fetch from a web endpoint. Some of it only exists on the app screen. The interesting design problem is not just “how do I scrape it”, but how to build a small system that can combine both surfaces without becoming fragile.

That mobile-only part changes the architecture. If the signal is only visible in the app, the collector needs real Android workers. ADB becomes the control plane: open the app, navigate to the product, dump the UI, parse the screen, and report the result back into the same product data model.

For a shopping or content platform, a useful internal tool usually needs three kinds of information:

  • stable identifiers, like item IDs and canonical URLs
  • structured public data, like title, shop, price, sales text, and SKU variants
  • app-only signals, like badges, tags, ranking hints, recent cart activity, or visible social proof

These do not arrive through one clean API. The design has to accept that reality and make the collection path observable, retryable, and boring enough to operate.

1. Normalize Input Before Doing Anything Expensive

The first boundary is input normalization.

Users paste messy things: short links, long links, copied share text, deep links, or just an item ID. The rest of the system should not care. It should receive one canonical product identity.

function resolveProductInput(input):
    text = trim(input)

    if text contains direct goods id:
        return { itemId: goodsIdFrom(text), source: "direct" }

    urls = extractUrls(text)

    for url in urls:
        if url contains goods id:
            return { itemId: goodsIdFrom(url), source: "url", resolvedUrl: url }

        if url is short link:
            finalUrl = followRedirect(url)
            if finalUrl contains goods id:
                return { itemId: goodsIdFrom(finalUrl), source: "short_link", resolvedUrl: finalUrl }

    raise "no product id found"

This keeps every downstream operation simple. The web collector, app collector, product table, query history, and metric jobs all speak in item IDs.

I like this pattern because it moves chaos to the edge. The user can paste almost anything, but inside the system there is one durable key.

2. Split Web Data From App Detail Data

The second design choice is separating cheap web data from expensive app detail data.

Web data is usually faster and safer:

  • product title
  • sales text
  • shop name
  • shop URL
  • location
  • SKU variants
  • price fields

App UI collection is slower and more operationally fragile:

  • it needs a real Android device or emulator
  • it depends on current UI structure
  • it can fail because of loading states, login state, app updates, or network conditions
  • it consumes a scarce device slot

So the default flow should be layered:

function addProduct(input, includeDetail):
    resolved = resolveProductInput(input)
    product = upsertProduct(resolved.itemId)

    product.webStatus = "running"
    product = refreshWebSnapshot(product)

    if includeDetail:
        enqueueAppDetailTask(product)
        product.detailStatus = "pending"

    return product

This gives the user a useful record quickly. App detail collection becomes an optional enrichment step instead of a blocker for the whole product.

3. Treat Devices as a Pool, Not a Constant

Remote Android collection should not assume one perfect device. The reason for having Android workers at all is that some information is only exposed through the mobile app surface. Once a phone becomes part of the data path, it should be modeled as infrastructure, not as a hidden script dependency.

A better model is a small device pool. Each device has metadata:

  • name
  • ADB serial
  • phone IP
  • reverse SSH port
  • app package
  • online state
  • busy or idle state
  • last successful collection time

Then task dispatch can be conservative:

function collectMobileOnlySignals(product):
    device = selectIdleDevice(product.preferredDeviceId)
    if device is null:
        enqueue(product)
        return "pending"

    with adbSession(device.serial):
        openProductInApp(product.itemId)
        xml = dumpUi()
        tags = parseVisibleTags(xml)
        saveEvidence(product, xml, tags, device)

function selectIdleDevice(preferredDeviceId = null):
    statuses = adbDevices()
    busy = queryRunningTaskDeviceIds()
    devices = activeDevicesOrderedById()

    if preferredDeviceId exists:
        device = find(devices, preferredDeviceId)
        if device is online and device.id not in busy:
            return device
        return null

    for device in devices:
        if device.id in busy:
            continue
        if statuses[device.serial] == "device":
            return device

    return null

This matters because the device is not just an implementation detail. It is a scarce worker. If two tasks use the same phone at the same time, the UI state becomes non-deterministic.
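The `adbDevices()` step can be sketched in Python by parsing the text that `adb devices` prints: a header line followed by one serial and state per line, where the state `device` means online and usable. This is a simplified version that tracks devices by serial only. The pool list and the busy set are assumed to come from your own storage.

```python
import subprocess
from typing import Dict, List, Optional, Set


def parse_adb_devices(output: str) -> Dict[str, str]:
    """Map serial -> state ("device", "offline", "unauthorized", ...)."""
    states: Dict[str, str] = {}
    for line in output.splitlines():
        # Skip blanks, the "List of devices attached" header, and daemon notices.
        if not line.strip() or line.startswith(("List of devices", "*")):
            continue
        parts = line.split()
        if len(parts) >= 2:
            states[parts[0]] = parts[1]
    return states


def adb_device_states() -> Dict[str, str]:
    """Run `adb devices` and parse its output (requires adb on PATH)."""
    result = subprocess.run(["adb", "devices"], capture_output=True,
                            text=True, check=True, timeout=10)
    return parse_adb_devices(result.stdout)


def select_idle_device(pool: List[str], states: Dict[str, str],
                       busy: Set[str], preferred: Optional[str] = None) -> Optional[str]:
    """Pick the first online, non-busy serial; a preferred serial is strict."""
    def usable(serial: str) -> bool:
        return serial in pool and states.get(serial) == "device" and serial not in busy

    if preferred is not None:
        return preferred if usable(preferred) else None
    return next((s for s in pool if usable(s)), None)
```

Keeping the parser separate from the `subprocess` call means the selection rules can be tested with canned output, without a phone attached.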

4. Make App Collection a Queue

Once devices are treated as workers, app detail collection naturally becomes a queue.

The task record should store:

  • item ID
  • original input and canonical URL
  • status: pending, running, success, failed
  • assigned device
  • API key or user that requested it
  • started and finished timestamps
  • extracted tags
  • raw UI nodes
  • error message
  • elapsed time

The dispatcher can stay simple:

function dispatchPendingTasks():
    tasks = oldestPendingTasks(limit = 10)

    for task in tasks:
        device = selectIdleDevice(task.preferredDeviceId)
        if device is null:
            continue

        task.status = "running"
        task.deviceId = device.id
        task.startedAt = now()
        save(task)

        # assumed to start in the background, so the loop keeps dispatching
        # and the device stays busy until the task is saved as finished
        runTask(task.id)

function runTask(taskId):
    task = loadTask(taskId)
    device = loadDevice(task.deviceId)

    try:
        result = collectFromApp(task.itemId, device.serial)
        task.status = "success"
        task.tags = result.tags
        task.rawItems = result.items
        task.elapsedMs = result.elapsedMs
        task.finishedAt = now()
        markProductDetailSuccess(task.productId)
        # saveMetrics reads product.id and product.itemId, so pass the record
        saveMetrics(loadProduct(task.productId), result.tags)
        markDeviceSeen(device)
    catch error:
        task.status = "failed"
        task.error = error.message
        task.finishedAt = now()
        markProductDetailFailed(task.productId)

    save(task)
    # the device is idle again, so give the next pending task a chance
    dispatchPendingTasks()

The point is not to build a large job system too early. The point is to make the state explicit enough that failures are visible and retryable.
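One way to keep that state explicit is a single table with guarded status transitions. The sketch below uses Python's built-in `sqlite3`. The table and column names are illustrative, not a schema from a real system. The `AND status = 'pending'` guard plus the `rowcount` check is what prevents two dispatchers from claiming the same task.

```python
import sqlite3
import time
from typing import Optional


def open_queue(path: str = ":memory:") -> sqlite3.Connection:
    db = sqlite3.connect(path)
    db.execute("""
        CREATE TABLE IF NOT EXISTS tasks (
            id INTEGER PRIMARY KEY,
            item_id TEXT NOT NULL,
            status TEXT NOT NULL DEFAULT 'pending',
            device_id TEXT,
            started_at REAL,
            finished_at REAL,
            error TEXT
        )""")
    return db


def enqueue(db: sqlite3.Connection, item_id: str) -> int:
    with db:  # commits on success, rolls back on exception
        return db.execute("INSERT INTO tasks (item_id) VALUES (?)", (item_id,)).lastrowid


def claim_next_task(db: sqlite3.Connection, device_id: str) -> Optional[int]:
    """Move the oldest pending task to running; return its id, or None."""
    with db:
        row = db.execute(
            "SELECT id FROM tasks WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        cur = db.execute(
            "UPDATE tasks SET status = 'running', device_id = ?, started_at = ? "
            "WHERE id = ? AND status = 'pending'",
            (device_id, time.time(), row[0]),
        )
        # rowcount is 0 if another dispatcher claimed the row first
        return row[0] if cur.rowcount == 1 else None


def finish_task(db: sqlite3.Connection, task_id: int, error: Optional[str] = None) -> None:
    """Record success, or failure with its error message, for a running task."""
    status = "success" if error is None else "failed"
    with db:
        db.execute(
            "UPDATE tasks SET status = ?, error = ?, finished_at = ? "
            "WHERE id = ? AND status = 'running'",
            (status, error, time.time(), task_id),
        )
```

Retrying a failed task is then just another guarded update back to pending, and the history of what failed and why stays queryable.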

5. Parse the App Screen as Evidence

App UI collection is different from API collection. The screen is the data source.

A practical approach is:

  1. open the product deep link
  2. wait for the screen to settle
  3. dump the UI XML
  4. scan text and content descriptions
  5. filter likely tag strings
  6. use each node's on-screen bounds to ignore irrelevant regions
  7. swipe horizontally when tags overflow
  8. deduplicate while preserving order

Pseudocode:

function collectFromApp(itemId, serial):
    startedAt = now()
    openDeepLink(serial, "app://goods_detail/" + itemId)
    waitForSettledScreen()

    screen = getScreenSize(serial)
    allItems = []
    seen = set()

    for round in 0..maxSwipes:
        xml = dumpUiXml(serial)

        if pageShowsLoadError(xml):
            raise "product page load failed"

        items = extractCandidateTags(xml, screen.height)
        foundNew = false

        for item in items:
            allItems.append(item)
            if item.text not in seen:
                seen.add(item.text)
                foundNew = true

        # stop when a swipe reveals nothing new (this also covers an empty screen)
        if not foundNew:
            break

        swipeTagArea(serial, y = medianY(items))
        waitForSettledScreen()

    uniqueItems = uniqueByText(allItems)
    return {
        tags: texts of uniqueItems,
        items: allItems,
        elapsedMs: now() - startedAt
    }

This is intentionally heuristic. UI extraction is not a clean contract. The best design is to preserve enough raw evidence to debug bad matches later.

That is why storing raw UI node information is useful. It turns “the collector missed something” into a problem that can be inspected.
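The extraction step can be sketched against the XML that a UI Automator dump produces, where each `node` element carries `text`, `content-desc`, and a `bounds` attribute of the form `[x1,y1][x2,y2]`. The region thresholds and the length cutoff below are guesses for illustration, not tuned values.

```python
import re
import xml.etree.ElementTree as ET
from typing import List, NamedTuple

BOUNDS_RE = re.compile(r"\[(-?\d+),(-?\d+)\]\[(-?\d+),(-?\d+)\]")


class UiItem(NamedTuple):
    text: str
    center_y: int


def extract_candidate_tags(xml_text: str, screen_height: int,
                           top: float = 0.15, bottom: float = 0.85,
                           max_len: int = 20) -> List[UiItem]:
    """Collect short visible strings whose vertical center is mid-screen."""
    items: List[UiItem] = []
    for node in ET.fromstring(xml_text).iter("node"):
        # Prefer visible text, fall back to the accessibility description.
        text = (node.get("text") or node.get("content-desc") or "").strip()
        match = BOUNDS_RE.fullmatch(node.get("bounds", ""))
        if not text or match is None or len(text) > max_len:
            continue
        center_y = (int(match.group(2)) + int(match.group(4))) // 2
        if top * screen_height <= center_y <= bottom * screen_height:
            items.append(UiItem(text, center_y))
    return items


def unique_by_text(items: List[UiItem]) -> List[UiItem]:
    """Deduplicate by text while preserving first-seen order."""
    seen = set()
    unique: List[UiItem] = []
    for item in items:
        if item.text not in seen:
            seen.add(item.text)
            unique.append(item)
    return unique
```

Because the function takes the XML as a string, every stored dump doubles as a regression fixture: when the app changes its layout, the old evidence shows exactly which filter stopped matching.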

6. Do Not Store Only Final Tags

Final tags are convenient, but they are not enough.

For analysis, I want three layers:

  • the product record
  • the query or collection history
  • extracted metrics over time

The product record answers “what is this item now?” The query history answers “what happened during collection?” The metric table answers “how did a signal change?”

For example, a tag like 24小时内123人加购 ("123 people added this to cart in the last 24 hours") is both display text and structured data. It should be saved as raw text, but it can also become a metric:

patterns = [
    ("24h_cart", /24小时内(\d+)\+?人加购/),   # "N+ people added to cart within 24 hours"
    ("7d_fav", /近7天新增(\d+)\+?人收藏/)     # "N+ new people favorited in the last 7 days"
]

function saveMetrics(product, tags):
    for tag in tags:
        for dimension, pattern in patterns:
            if pattern matches tag:
                insertMetric(product.id, product.itemId, dimension, capturedNumber)

This makes the tool more than a scraper. It becomes a small time-series surface for product signals.
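The same two patterns in runnable Python, as a small sketch. It returns rows instead of inserting them, so the storage layer stays out of the picture. The row shape `(dimension, value, raw_text)` is my own choice for illustration.

```python
import re
from typing import Iterable, List, Tuple

PATTERNS = [
    ("24h_cart", re.compile(r"24小时内(\d+)\+?人加购")),
    ("7d_fav", re.compile(r"近7天新增(\d+)\+?人收藏")),
]


def extract_metrics(tags: Iterable[str]) -> List[Tuple[str, int, str]]:
    """Return (dimension, value, raw_text) for every tag that matches a pattern."""
    rows: List[Tuple[str, int, str]] = []
    for tag in tags:
        for dimension, pattern in PATTERNS:
            match = pattern.search(tag)
            if match:
                rows.append((dimension, int(match.group(1)), tag))
    return rows
```

Note that the optional `+` sits outside the capture group, so a tag such as 近7天新增50+人收藏 is stored as 50. The `+` presumably marks a rounded-down count, which is one more reason to keep the raw text next to the number.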

7. Keep the Console Operational

The frontend should feel like an operations console, not a marketing dashboard.

The important screens are practical:

  • product list with filters
  • detail enrichment status
  • latest query result
  • device status and busy state
  • API key management
  • marketplace collection progress
  • metric summary and history

The UI should show state instead of hiding it. The states pending, running, success, and failed are not backend details. They are the product.

When collection depends on a remote phone, silence is expensive. A user needs to know whether the device is offline, busy, waiting, collecting, or failed.

8. Keep Secrets and Permissions Narrow

Even a small internal tool needs a security shape.

The simple version is enough:

  • raw API keys are shown once
  • only SHA-256 hashes are stored
  • keys can expire
  • keys can be revoked
  • device management and key management are separate permissions

Pseudocode:

function createKey(name, expiresAt, permissions):
    raw = "sk-" + secureRandom()

    record = {
        name,
        keyPrefix: firstVisiblePart(raw),
        keyHash: sha256(raw),
        expiresAt,
        permissions,
        status: "active"
    }

    save(record)
    return raw

function verifyKey(raw):
    if raw does not start with "sk-":
        return null

    record = findByHash(sha256(raw))

    if record is missing or revoked or expired:
        return null

    record.lastUsedAt = now()
    save(record)
    return record

This keeps operational access simple without storing reusable secrets in plain text.
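A runnable sketch of the same idea with Python's standard library: `secrets` for generation and `hashlib` for the stored hash. The in-memory dict stands in for a real key table, and the field names are illustrative.

```python
import hashlib
import secrets
import time
from typing import Dict, List, Optional

KEYS: Dict[str, dict] = {}  # in-memory stand-in for the key table, indexed by hash


def _hash(raw: str) -> str:
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def create_key(name: str, expires_at: Optional[float], permissions: List[str]) -> str:
    """Store only the hash and a short prefix; return the raw key exactly once."""
    raw = "sk-" + secrets.token_urlsafe(32)
    KEYS[_hash(raw)] = {
        "name": name,
        "key_prefix": raw[:7],  # enough to recognize a key in the console
        "expires_at": expires_at,
        "permissions": permissions,
        "status": "active",
        "last_used_at": None,
    }
    return raw


def verify_key(raw: str, now: Optional[float] = None) -> Optional[dict]:
    """Return the key record if the key is known, active, and not expired."""
    now = time.time() if now is None else now
    if not raw.startswith("sk-"):
        return None
    record = KEYS.get(_hash(raw))
    if record is None or record["status"] != "active":
        return None
    if record["expires_at"] is not None and record["expires_at"] <= now:
        return None
    record["last_used_at"] = now
    return record
```

An unsalted SHA-256 is reasonable here only because the keys are long random strings. It would not be an acceptable way to store user-chosen passwords.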

Current Rule

For app-surface data collection, my rule is:

normalize early, collect in layers, store evidence, extract metrics late

The useful system is not the one that perfectly scrapes every screen once. It is the one that can explain what it tried, what it found, which worker did it, why it failed, and which signals are worth tracking over time.